Logs Drilldown Compatibility

This track measures compatibility with the Grafana Logs Drilldown app, not generic Loki clients.

Scope

  • Grafana datasource resource endpoints consumed by the app
  • Service selection, service-detail log volume, fields, labels, and field values
  • Log frame expectations that affect labels, level coloring, and field visibility

CI And Score

  • Workflow: compat-drilldown.yaml
  • Score test: TestDrilldownTrackScore
  • Runtime coverage: pinned Grafana runtime plus current-family and previous-family Grafana smoke runs on PRs, with the fuller Grafana matrix kept for scheduled/manual runs
  • Version matrix: source-contract checks across the current Drilldown family and one family behind

The Drilldown matrix is also a moving window. We support the current app family and one family behind, with the contract list sliding forward as upstream releases move. We do not keep an open-ended tail of older app families.

Version Matrix

Grafana runtime profiles

| Grafana version | Coverage path | Version-specific focus |
| --- | --- | --- |
| 13.0.1 | PR/main pinned runtime + scheduled/manual runtime matrix | Full Drilldown runtime score; current pinned build; React 19 |
| 12.4.2 | PR/main previous-family smoke + scheduled/manual runtime matrix | datasource catalog, base Drilldown resource contracts, explicit 2.x runtime-family assertions |
| 12.4.1 | Scheduled and manual runtime matrix | datasource catalog, base Drilldown resource contracts |
| 11.6.6 | Scheduled and manual runtime matrix | datasource catalog, base Drilldown resource contracts, explicit 1.x runtime-family assertions |

Logs Drilldown app versions

| Logs Drilldown version | Coverage path | Version-specific focus |
| --- | --- | --- |
| 2.0.4 | PR/main pinned runtime + scheduled/manual contract matrix | Current pinned contract; patterns tab requires patterns-autodetect as Grafana default |
| 2.0.3 | Scheduled and manual contract matrix | detected_level coloring, service-detail panels, patterns |
| 2.0.2 | Scheduled and manual contract matrix | detected_level coloring, service-detail panels |
| 2.0.1 | Scheduled and manual contract matrix | detected_level coloring, service-detail panels |
| 2.0.0 | Scheduled and manual contract matrix | detected_level coloring, service-detail panels |
| 1.0.41 | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| 1.0.40 | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| 1.0.39 | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| 1.0.38 | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| 1.0.37 | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| 1.0.36 | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| 1.0.35 | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| 1.0.34 | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |

Runtime Detection And Version Coupling

Proxy-side Drilldown detection is based on deterministic request signals:

  • X-Query-Tags: Source=grafana-lokiexplore-app identifies Drilldown-origin resource calls
  • User-Agent: Grafana/<version> provides the Grafana runtime version family

Important limit:

  • exact Drilldown app semver is not emitted on the datasource HTTP request path by default

Because of that, version-specific behavior should be gated by:

  1. explicit request source tag,
  2. Grafana runtime family (11.x, 12.x, 13.x),
  3. compatibility matrix contract version bands (1.0.x, 2.0.x), validated in CI.
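As an illustration of the first two gates, the sketch below reads the two request signals and maps the Grafana major version to the contract style the runtime matrix asserts for it (11.x to 1.x-style, 12.x and 13.x to 2.x-style). The function names and the returned strings are hypothetical and are not taken from the proxy source; the third gate (matrix contract bands) lives in CI and is not modelled here.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// drilldownSourceTag is the X-Query-Tags value listed above.
const drilldownSourceTag = "Source=grafana-lokiexplore-app"

// grafanaMajor extracts the major version from a "Grafana/<version>" User-Agent.
// It returns false when the header does not have that shape.
func grafanaMajor(userAgent string) (int, bool) {
	if !strings.HasPrefix(userAgent, "Grafana/") {
		return 0, false
	}
	version := strings.TrimPrefix(userAgent, "Grafana/")
	major, err := strconv.Atoi(strings.SplitN(version, ".", 2)[0])
	if err != nil {
		return 0, false
	}
	return major, true
}

// contractStyle is a hypothetical helper: it returns "1.x-style" or
// "2.x-style" for Drilldown-tagged requests from a known Grafana family,
// and "" when either signal is missing or the family is unknown.
func contractStyle(queryTags, userAgent string) string {
	if !strings.Contains(queryTags, drilldownSourceTag) {
		return ""
	}
	major, ok := grafanaMajor(userAgent)
	if !ok {
		return ""
	}
	switch major {
	case 11:
		return "1.x-style"
	case 12, 13:
		return "2.x-style"
	}
	return ""
}

func main() {
	h := http.Header{}
	h.Set("X-Query-Tags", "Source=grafana-lokiexplore-app")
	h.Set("User-Agent", "Grafana/12.4.2")
	fmt.Println(contractStyle(h.Get("X-Query-Tags"), h.Get("User-Agent"))) // prints "2.x-style"
}
```

Returning an empty result for unknown families keeps the sketch from guessing, which matches the intent of not depending on app build strings.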

Field Histogram Series Cap

Grafana Logs Drilldown renders per-field histograms by sending sum by (field) (count_over_time({...}|json|drop __error__,__error_details__|field!="" [...])) queries to query_range. For high-cardinality fields (trace_id, span_id, session_id) over long time ranges, an unfiltered VictoriaLogs (VL) response can reach 50 MB or more (300k+ unique values). The proxy bounds this through two complementary mechanisms:

  1. Primary path: /select/logsql/hits top-N (introduced 2026-06). Every stats-compat shape routes through VL's /hits endpoint, which natively computes the top drilldownHitsFieldsLimit (20) field values per request and returns a remainder bucket. The proxy emits one Loki series per top-N value and drops the remainder. VL enforces the cap server-side, so the proxy never reads the full unbounded result.
  2. Legacy stats fallback, used only when /hits fails (older VL versions, parse errors, or unique-value numeric fields where every value lands in the remainder bucket). The proxy appends as _c | sort by (_c desc) | limit 500 to the VL stats query, pushing the 500-series cap into VL per time bucket. A two-phase fallback (single-bucket Phase 1 + filtered Phase 2) handles the high-cardinality cases that overflow the direct path.
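The primary path's "one series per top-N value, remainder dropped" step can be sketched as follows. The hitsSeries and lokiSeries types are simplified stand-ins invented for this example, and treating an entry with empty fields as the remainder bucket is an assumption based on the description above; the real response decoding in the proxy may differ.

```go
package main

import "fmt"

// drilldownHitsFieldsLimit mirrors the cap of 20 named above.
const drilldownHitsFieldsLimit = 20

// hitsSeries is a simplified stand-in for one entry of a /hits response:
// the grouping fields plus parallel timestamp and value slices.
type hitsSeries struct {
	Fields     map[string]string
	Timestamps []string
	Values     []int64
}

// lokiSeries is a simplified stand-in for one Loki matrix series.
type lokiSeries struct {
	Metric map[string]string
	Points map[string]int64 // timestamp -> count
}

// topNSeries skips the remainder bucket (entries with no fields) and
// emits at most limit named series.
func topNSeries(hits []hitsSeries, limit int) []lokiSeries {
	out := make([]lokiSeries, 0, limit)
	for _, h := range hits {
		if len(h.Fields) == 0 {
			continue // remainder bucket: dropped before emission
		}
		if len(out) == limit {
			break
		}
		points := make(map[string]int64, len(h.Timestamps))
		for i, ts := range h.Timestamps {
			if i < len(h.Values) {
				points[ts] = h.Values[i]
			}
		}
		out = append(out, lokiSeries{Metric: h.Fields, Points: points})
	}
	return out
}

func main() {
	hits := []hitsSeries{
		{Fields: map[string]string{"level": "error"}, Timestamps: []string{"t0", "t1"}, Values: []int64{3, 5}},
		{Fields: map[string]string{}, Timestamps: []string{"t0", "t1"}, Values: []int64{40, 41}}, // remainder
	}
	for _, s := range topNSeries(hits, drilldownHitsFieldsLimit) {
		fmt.Println(s.Metric, s.Points)
	}
}
```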

As of 2026-06 the routing is source-agnostic: Explore, Drilldown, dashboard panels, and direct API clients all reach the /hits fast path. The historical X-Query-Tags: Source=grafana-lokiexplore-app gate was removed because Explore was hitting the unbounded direct stats path and running out of memory on ranges of 24h or more for high-cardinality fields. The Long-Range Histograms section below documents the related leftover-chunk suppression, which depends on Grafana-client detection.

Long-Range Histograms And Grafana querySplitting

For ranges ≥ 24h, Grafana's Loki datasource splits every metric range query into 24h chunks before sending them to the proxy. The split logic lives in public/app/plugins/datasource/loki/metricTimeSplitting.ts -> splitTimeRange() and fires for any client built on the Loki datasource — Drilldown, Explore, and dashboard panels. The proxy must produce a chunk response shape that Grafana's in-browser mergeFrames + closestIdx + splice algorithm (public/app/plugins/datasource/loki/mergeResponses.ts) can glue back into a coherent timeline. This section pins the contract.

What Grafana sends

splitTimeRange(start, end, step, oneDayMs) produces:

  1. floor(range / aligned_day) chunks, each of length aligned_day - step and aligned to a step boundary.
  2. One residual chunk of (range mod aligned_day) length, which is often smaller than step (e.g. for an exactly-24h request the residual is 0–120 s wide when step is 120 s; for 25 h it is ~1 h).

Chunks are dispatched newest-first because runSplitGroupedQueries recurses with partition[totalRequests - 1] first.
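The chunk arithmetic above can be checked with a small calculation. This is a simplified model, not a port of splitTimeRange: chunkPlan is a hypothetical helper, aligned_day is assumed to be 24h rounded down to a multiple of step, and start-time alignment is ignored (which is why a real exactly-24h request can leave a residual anywhere between 0 and one step rather than exactly 0).

```go
package main

import "fmt"

const oneDaySec = 24 * 60 * 60

// chunkPlan returns the number of full chunks and the residual length in
// seconds for a query of rangeSec seconds at stepSec resolution.
// Assumption: aligned_day is one day rounded down to a multiple of step.
func chunkPlan(rangeSec, stepSec int64) (fullChunks, residualSec int64) {
	if stepSec <= 0 {
		return 0, rangeSec
	}
	alignedDay := oneDaySec - oneDaySec%stepSec
	if alignedDay <= 0 {
		// step is a day or longer: nothing to split in this model
		return 0, rangeSec
	}
	return rangeSec / alignedDay, rangeSec % alignedDay
}

func main() {
	for _, hours := range []int64{24, 25, 48, 168} {
		full, residual := chunkPlan(hours*3600, 120)
		fmt.Printf("%dh: %d full chunk(s), residual %ds\n", hours, full, residual)
	}
}
```

For a 25h range at step 120 s the model gives one full chunk and a 3600 s residual, matching the "~1 h" leftover described above.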

What the proxy must return per chunk

Each chunk sub-request must produce a Loki matrix with:

  • A single shared timestamp axis whose first and last buckets fall on step boundaries inside the chunk's start..end window.
  • ≤ drilldownHitsFieldsLimit (currently 20) named series, with all top-N values spread across the chunk's full axis. One-bucket series, where every top-N value lands on the same timestamp, will be stacked by mergeFrames into a tall right-edge spike — the 2026-06 incident root cause.

How the proxy enforces this

  • proxyStatsQueryRangeDrilldown routes every count_over_time stats-compat query through the /hits path regardless of range (the historical 6h "hybrid threshold" is gone). Mixing /hits with stats_query_range | limit 500 across chunks produced disjoint series sets that mergeFrames unioned into a 500-series block at the chunk boundary.
  • proxyStatsQueryRangeDrilldownHits returns an empty Loki matrix (with response header X-Proxy-Drilldown-Path: hits-leftover-suppressed) when isGrafanaSourcedRequest(r) is true AND end - start ≤ 2 × step. The residual leftover from splitTimeRange falls into this bucket and the chart loses ≤ 1 step of width on the right edge instead of showing a spike.
  • isGrafanaSourcedRequest detects ANY Grafana client: X-Query-Tags: Source=grafana-… (Drilldown / Explore), User-Agent: Grafana/X.Y.Z (dashboard panels), and any X-Grafana-* header (backend-routed Explore). The suppression therefore covers every Grafana surface, not only Drilldown.
  • The bare-integer step Grafana sends (e.g. step=120) is normalised to a VL duration (120s) before reaching /hits. Without the suffix VL's parser rejects the request and the entire /hits fast path silently falls back to legacy stats.
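The suppression rule and the step normalisation described in the list above can be sketched as below. The helper names are hypothetical and deliberately differ from isGrafanaSourcedRequest and proxyStatsQueryRangeDrilldownHits; header matching is simplified to the three signals listed, and headers are passed as a plain map so that an http.Header value can be supplied directly.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// normalizeStep turns a bare-integer step such as "120" into "120s".
// Values that already carry a unit (for example "2m") pass through unchanged.
func normalizeStep(step string) string {
	step = strings.TrimSpace(step)
	if _, err := strconv.ParseInt(step, 10, 64); err == nil {
		return step + "s"
	}
	return step
}

// looksGrafanaSourced approximates the three signals listed above:
// an X-Query-Tags value containing "Source=grafana-", a "Grafana/" User-Agent,
// or any X-Grafana-* header. Header names are compared case-insensitively.
func looksGrafanaSourced(h map[string][]string) bool {
	for name, values := range h {
		lower := strings.ToLower(name)
		if strings.HasPrefix(lower, "x-grafana-") {
			return true
		}
		for _, v := range values {
			switch lower {
			case "x-query-tags":
				if strings.Contains(strings.ToLower(v), "source=grafana-") {
					return true
				}
			case "user-agent":
				if strings.HasPrefix(v, "Grafana/") {
					return true
				}
			}
		}
	}
	return false
}

// shouldSuppressLeftover applies the rule from the list above:
// Grafana-sourced AND end - start <= 2 * step (all values in seconds).
func shouldSuppressLeftover(h map[string][]string, startSec, endSec, stepSec int64) bool {
	if stepSec <= 0 {
		return false
	}
	return looksGrafanaSourced(h) && endSec-startSec <= 2*stepSec
}

func main() {
	grafana := map[string][]string{"User-Agent": {"Grafana/13.0.1"}}
	fmt.Println(normalizeStep("120"))                          // 120s
	fmt.Println(shouldSuppressLeftover(grafana, 0, 120, 120))  // true: residual leftover
	fmt.Println(shouldSuppressLeftover(grafana, 0, 3600, 120)) // false: a genuine 1h query
}
```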

Acceptable irreducible bump (25h–47h cases)

For ranges like 25h the proxy cannot safely suppress the 1h leftover chunk because users genuinely query 1h ranges in Drilldown / Explore. The merged frame can therefore show a modest right-edge bump (≤ 65% of nonzero buckets fall into the rightmost bin), because the leftover chunk computes its top-N from that chunk alone. This is the price of preserving short-range queries; the e2e thresholds reflect the trade-off and document it explicitly.

Do Not Regress

The following invariants are pinned by internal/proxy/drilldown_regression_lock_test.go and test/e2e-compat/drilldown_chunked_merge_lock_test.go. Each is named TestLock_* / TestE2ELock_* and breaking any of them fails CI:

  1. Routing is source-agnostic for stats-compat shapes: Explore, Drilldown, dashboard panels, and raw API clients all reach the /hits fast path.
  2. /hits runs for every range (no hybrid threshold gate).
  3. Leftover chunks (end - start ≤ 2 × step) from ANY Grafana source are suppressed and the response carries X-Proxy-Drilldown-Path: hits-leftover-suppressed.
  4. Step normalisation appends s to bare-integer Grafana steps.
  5. The /hits remainder bucket (fields:{}) is dropped before emission.
  6. Every emitted series shares the same timestamp axis (the chart cannot render distributed bars otherwise).
  7. extractStreamSelectorOnly handles both Loki-bracketed ({namespace="prod"}) and VL-native (namespace:="prod") selectors.
  8. fieldHasExistenceFilter matches both unquoted (pod:!"") and quoted ("k8s.pod.name":!"") forms.
  9. Stats fallback (stats_query_range) only fires when /hits actually fails — never as a parallel call that overwrites a successful /hits result.
  10. maxStatsQueryRangeBytes and drilldownHitsFieldsLimit stay within their pinned windows (see the lock tests for the exact bounds).
  11. Grafana mergeFrames simulation on the live VL stack must produce a merged frame whose rightmost bin holds < 65 % of nonzero timestamps for 24h, 25h, 2d, and 7d ranges, for both Drilldown and dashboard sources.
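Invariants 6 and 11 lend themselves to a small check. The sketch below is not the lock-test code: the series type is invented for the example, and "share of nonzero timestamps in the rightmost bin" is interpreted here as the fraction of nonzero points, across all series, that sit on the largest timestamp. The actual tests may define the ratio differently.

```go
package main

import "fmt"

// series is a simplified merged-frame series: parallel timestamps and values.
type series struct {
	Name       string
	Timestamps []int64
	Values     []float64
}

// sharesAxis reports whether every series uses exactly the same timestamps.
func sharesAxis(all []series) bool {
	if len(all) == 0 {
		return true
	}
	ref := all[0].Timestamps
	for _, s := range all[1:] {
		if len(s.Timestamps) != len(ref) {
			return false
		}
		for i := range ref {
			if s.Timestamps[i] != ref[i] {
				return false
			}
		}
	}
	return true
}

// rightmostShare returns the fraction of nonzero points that fall on the
// largest timestamp seen in any series (0 when there are no nonzero points).
func rightmostShare(all []series) float64 {
	var maxTS int64
	seen := false
	for _, s := range all {
		for _, ts := range s.Timestamps {
			if !seen || ts > maxTS {
				maxTS, seen = ts, true
			}
		}
	}
	nonzero, atEdge := 0, 0
	for _, s := range all {
		for i, ts := range s.Timestamps {
			if i >= len(s.Values) || s.Values[i] == 0 {
				continue
			}
			nonzero++
			if ts == maxTS {
				atEdge++
			}
		}
	}
	if nonzero == 0 {
		return 0
	}
	return float64(atEdge) / float64(nonzero)
}

func main() {
	merged := []series{
		{Name: "error", Timestamps: []int64{0, 120, 240}, Values: []float64{1, 2, 3}},
		{Name: "info", Timestamps: []int64{0, 120, 240}, Values: []float64{0, 1, 1}},
	}
	fmt.Println(sharesAxis(merged), rightmostShare(merged) < 0.65)
}
```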

Drilldown Capability Profiles

| Drilldown version family | Capability profile | Proxy handling focus |
| --- | --- | --- |
| 2.0.x | drilldown-v2 | detected-level defaults, modern service-detail scenes, patterns and field-value drill flows |
| 1.0.x | drilldown-v1 | legacy service buckets, filtered detected-fields path, prior labels/field rendering behavior |

These profiles are matrix-level compatibility profiles (contract and CI guidance). Runtime request handling must stay Loki-compatible and should not depend on guessed app build strings.

Known Issues

Drilldown 2.0.4: Patterns Tab Initialization

Drilldown 2.0.4 contains a bug in subscribeToLokiConfig() where void 0 === null (always false) prevents re-enabling the Patterns tab after it was disabled. Concretely:

  • If the Grafana default datasource has pattern_ingester_enabled=false in /loki/api/v1/drilldown-limits, $patternsData is set to null.
  • Switching to a datasource where pattern_ingester_enabled=true does NOT re-show the tab because the void 0 === null guard treats null as "already initialized".

Workaround: Configure the patterns-autodetect proxy variant as the Grafana default datasource. Since pattern_ingester_enabled=true is returned on first load, $patternsData is correctly initialized and the Patterns tab appears.

In the e2e-compat compose stack, loki-vl-proxy-patterns-autodetect is set as isDefault: true in grafana-datasources.yaml for this reason.

Release Watchlist

Potential next family move:

  • current: 2.0.x (pinned: 2.0.4)
  • next expected family to evaluate: 2.1.x (then 3.0.x when released)

Promotion criteria for a new family:

  1. add versions to matrix manifest,
  2. verify TestDrilldownTrackScore and TestDrilldown_RuntimeFamilyContracts on pinned + smoke runtimes,
  3. confirm no regressions in patterns, labels/fields, and service detail flows.

Contracts We Enforce

  • index/volume must expose real service_name buckets
  • index/volume_range must expose non-empty detected_level series names
  • detected_fields must show parsed fields like method, path, status, duration_ms
  • detected_fields must not leak indexed labels like app, cluster, or namespace
  • detected_fields must suppress high-cardinality terminal timestamp fields (timestamp_end, observed_timestamp_end) so Drilldown field discovery does not trigger expensive backend stats paths that can flap into intermittent no-data responses
  • In hybrid field mode, detected_fields may expose both native dotted fields and translated aliases such as service.name and service_name
  • labels and label/{name}/values should stay stream-shaped; they should prefer VictoriaLogs stream metadata endpoints and only fall back to generic field endpoints for older backend versions
  • detected_fields, detected_labels, and detected_field/{name}/values should prefer native VictoriaLogs metadata lookups where they map cleanly, then fall back to bounded raw-log sampling for parsed and derived fields
  • Alias resolution must keep exact native matches working, allow unique translated aliases to resolve automatically, and avoid silently choosing the wrong native field when multiple dotted names collapse to the same Loki-safe alias
  • Label-value resources for additional filters such as cluster must return real values
  • Unknown label and detected-field lookups should keep a success payload shape instead of flipping into transport errors
  • patterns must return non-empty grouped pattern payloads with sample buckets for Drilldown
  • Multi-tenant Drilldown queries with repeated var-levels=detected_level|=|... selections must stay valid and return logs instead of backend parse errors
  • (v1.17.1) When detected_level is synthesized in metric results, the raw level label is removed from those same results to prevent Drilldown from showing both labels simultaneously — the include button for detected_level values must work correctly without level duplication
  • (v1.17.1) Nested JSON objects in the log body (e.g., service={"name":"api-gateway"}) must be excluded from the detected_fields field breakdown; exposing them previously broke the field breakdown view when users clicked on such a field
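The alias-resolution contract in the list above (exact native match first, unique alias second, no silent pick on collisions) can be sketched as follows. The dot-to-underscore translation is an assumption drawn from the service.name / service_name example; lokiSafeAlias and resolveField are hypothetical names, and the proxy's real translation rules may cover more characters.

```go
package main

import (
	"fmt"
	"strings"
)

// lokiSafeAlias translates a native dotted field name into a Loki-safe alias.
// Assumption: only dots are replaced, as in service.name -> service_name.
func lokiSafeAlias(native string) string {
	return strings.ReplaceAll(native, ".", "_")
}

// resolveField maps a requested name to a native field.
// An exact native match always wins; otherwise the name resolves only when
// exactly one native field translates to it. Ambiguous aliases return false
// rather than choosing one candidate silently.
func resolveField(requested string, nativeFields []string) (string, bool) {
	for _, f := range nativeFields {
		if f == requested {
			return f, true
		}
	}
	var candidates []string
	for _, f := range nativeFields {
		if lokiSafeAlias(f) == requested {
			candidates = append(candidates, f)
		}
	}
	if len(candidates) == 1 {
		return candidates[0], true
	}
	return "", false
}

func main() {
	fields := []string{"service.name", "k8s.pod.name", "k8s.pod_name"}
	fmt.Println(resolveField("service_name", fields)) // service.name true
	fmt.Println(resolveField("k8s_pod_name", fields)) // ambiguous: empty string, false
}
```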

Edge Cases Covered

  • Mixed parser query path: | json ... | logfmt | drop __error__, __error_details__
  • Labels object parsing in returned log frames
  • App-level field suppression for detected_level, level, and level_extracted
  • High-cardinality terminal timestamp keys (timestamp_end, observed_timestamp_end) are excluded from Drilldown detected-field responses while regular parsed fields stay visible
  • 1.x service-selection buckets, detected-fields filtering, and labels field parsing stay explicit in the source-contract checks
  • 2.x detected-level default columns, field-values breakdown scenes, and additional label-tab wiring stay explicit in the source-contract checks
  • Grafana runtime 11.x explicitly asserts 1.x-style service buckets, filtered detected fields, and extra label values at runtime
  • Grafana runtime 12.x explicitly asserts 2.x-style detected-level series, field-value breakdowns, and extra label values at runtime
  • Grafana runtime 13.x uses the same 2.x-style contract as 12.x — added to RuntimeFamilyContracts in v1.15.0
  • Service-detail field breakdowns and additional label filters
  • Multi-tenant Drilldown log views filtered by cluster plus multiple selected detected_level values
  • Multi-tenant Grafana resource calls with __tenant_id__!~... and __tenant_id__="missing" keep the correct narrowed or empty-success behavior
  • Native field-value discovery for indexed metadata such as service.name, with parser-stage stripping before the backend lookup
  • Fallback scanning for parsed-only fields such as method when no safe native metadata path exists
  • Patterns grouping across repeated request shapes
  • (v1.17.1) detected_level/level metric deduplication: raw level label removed from metric results when detected_level is synthesized, fixing the Drilldown include button for detected_level filter selections
  • (v1.17.1) Nested JSON object field suppression: service={"name":"..."} and similar compound body fields are excluded from the field breakdown to prevent broken Drilldown field-click behavior
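The nested-object suppression from the last item can be illustrated with a short sketch. It assumes the log body is a single JSON object and that only object-valued keys are excluded; visibleFields is a hypothetical helper, and the proxy's actual field detection (logfmt handling, sampling, and how other compound values are treated) is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// visibleFields returns the sorted top-level keys of a JSON log body whose
// values are not nested objects, so compound fields such as
// service={"name":"api-gateway"} stay out of the field breakdown.
func visibleFields(body string) ([]string, error) {
	var parsed map[string]any
	if err := json.Unmarshal([]byte(body), &parsed); err != nil {
		return nil, err
	}
	fields := make([]string, 0, len(parsed))
	for key, value := range parsed {
		if _, isObject := value.(map[string]any); isObject {
			continue // nested object: excluded from the breakdown
		}
		fields = append(fields, key)
	}
	sort.Strings(fields)
	return fields, nil
}

func main() {
	body := `{"method":"GET","status":200,"service":{"name":"api-gateway"}}`
	fields, err := visibleFields(body)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(fields) // [method status]
}
```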