Logs Drilldown Compatibility
This track measures compatibility with the Grafana Logs Drilldown app, not generic Loki clients.
Scope​
- Grafana datasource resource endpoints consumed by the app
- Service selection, service-detail log volume, fields, labels, and field values
- Log frame expectations that affect labels, level coloring, and field visibility
CI And Score​
- Workflow: `compat-drilldown.yaml`
- Score test: `TestDrilldownTrackScore`
- Runtime coverage: pinned Grafana runtime plus current-family and previous-family Grafana smoke on PRs, with the fuller Grafana matrix kept for scheduled/manual runs
- Version matrix: source-contract checks across the current Drilldown family and one family behind
The Drilldown matrix is also a moving window. We support the current app family and one family behind, with the contract list sliding forward as upstream releases move. We do not keep an open-ended tail of older app families.
Version Matrix​
Grafana runtime profiles​
| Grafana version | Coverage path | Version-specific focus |
|---|---|---|
| `13.0.1` | PR/main pinned runtime + scheduled/manual runtime matrix | Full Drilldown runtime score; current pinned build; React 19 |
| `12.4.2` | PR/main previous-family smoke + scheduled/manual runtime matrix | datasource catalog, base Drilldown resource contracts, explicit 2.x runtime-family assertions |
| `12.4.1` | Scheduled and manual runtime matrix | datasource catalog, base Drilldown resource contracts |
| `11.6.6` | Scheduled and manual runtime matrix | datasource catalog, base Drilldown resource contracts, explicit 1.x runtime-family assertions |
Logs Drilldown app versions​
| Logs Drilldown version | Coverage path | Version-specific focus |
|---|---|---|
| `2.0.4` | PR/main pinned runtime + scheduled/manual contract matrix | Current pinned contract; patterns tab requires patterns-autodetect as Grafana default |
| `2.0.3` | Scheduled and manual contract matrix | detected_level coloring, service-detail panels, patterns |
| `2.0.2` | Scheduled and manual contract matrix | detected_level coloring, service-detail panels |
| `2.0.1` | Scheduled and manual contract matrix | detected_level coloring, service-detail panels |
| `2.0.0` | Scheduled and manual contract matrix | detected_level coloring, service-detail panels |
| `1.0.41` | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| `1.0.40` | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| `1.0.39` | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| `1.0.38` | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| `1.0.37` | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| `1.0.36` | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| `1.0.35` | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
| `1.0.34` | Scheduled and manual contract matrix | Service buckets, detected-fields filtering, labels field parsing |
Runtime Detection And Version Coupling​
Proxy-side Drilldown detection is based on deterministic request signals:
- `X-Query-Tags: Source=grafana-lokiexplore-app` identifies Drilldown-origin resource calls
- `User-Agent: Grafana/<version>` provides the Grafana runtime version family
Important limit:
- exact Drilldown app semver is not emitted on the datasource HTTP request path by default
Because of that, version-specific behavior should be gated by:
- explicit request source tag,
- Grafana runtime family (`12.x`, `13.x`),
- compatibility matrix contract version bands (`1.0.x`, `2.0.x`), validated in CI.
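A minimal sketch of reading the two request signals, assuming the header values are passed in as plain strings. `isDrilldownSource` and `grafanaFamily` are hypothetical helper names, not the proxy's actual functions, and the substring match on `X-Query-Tags` is an assumption about how the tag could be checked.

```go
package main

import "strings"

// drilldownSourceTag is the X-Query-Tags value this document names for
// Drilldown-origin resource calls.
const drilldownSourceTag = "Source=grafana-lokiexplore-app"

// isDrilldownSource reports whether an X-Query-Tags header value carries the
// Drilldown source tag. Hypothetical helper; substring matching is assumed.
func isDrilldownSource(queryTags string) bool {
	return strings.Contains(queryTags, drilldownSourceTag)
}

// grafanaFamily extracts the runtime family ("12.x", "13.x", ...) from a
// User-Agent value such as "Grafana/12.4.2". It returns "" when no Grafana
// version with a numeric major component is present.
func grafanaFamily(userAgent string) string {
	const prefix = "Grafana/"
	i := strings.Index(userAgent, prefix)
	if i < 0 {
		return ""
	}
	version := userAgent[i+len(prefix):]
	major, _, found := strings.Cut(version, ".")
	if !found || major == "" {
		return ""
	}
	for _, c := range major {
		if c < '0' || c > '9' {
			return ""
		}
	}
	return major + ".x"
}
```

Neither helper tries to recover the Drilldown app semver, which matches the limit stated above: only the source tag and the Grafana family are available on the request path.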
Field Histogram Series Cap​
Grafana Logs Drilldown renders per-field histograms by sending `sum by (field) (count_over_time({...}|json|drop __error__,__error_details__|field!="" [...]))` queries to `query_range`. For high-cardinality fields (`trace_id`, `span_id`, `session_id`) over long time ranges, an unfiltered VictoriaLogs (VL) response can exceed 50 MB (300k+ unique values). The proxy currently bounds this through two complementary mechanisms:
- Primary path, `/select/logsql/hits` top-N (introduced 2026-06): every stats-compat shape routes through VL's `/hits` endpoint, which natively computes top-`drilldownHitsFieldsLimit` (20) field values per request and returns a remainder bucket. The proxy emits one Loki series per top-N value (the remainder is dropped). The cap is enforced server-side by VL, so the proxy never reads the full unbounded result.
- Legacy stats fallback (only when `/hits` fails, i.e. older VL versions, parse errors, or unique-value numeric fields where every value lands in the remainder bucket): the proxy appends `as _c | sort by (_c desc) | limit 500` to the VL stats query, pushing the 500-series cap into VL per time bucket. A two-phase fallback (single-bucket Phase 1 + filtered Phase 2) handles the high-cardinality cases that overflow the direct path.
As of 2026-06 the routing is source-agnostic: Explore, Drilldown, dashboard
panels, and direct API clients all reach the `/hits` fast path. The historical
`X-Query-Tags: Source=grafana-lokiexplore-app` gate was removed because Explore
was hitting the unbounded direct stats path and OOMing at 24h+ for
high-cardinality fields. The Long-Range Histograms section below documents the
related leftover-chunk suppression that depends on Grafana-client detection.
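The emission step of the primary path can be sketched as below. This is an illustration under assumptions, not the proxy's code: `hitsGroup`, `lokiSeries`, `sample`, and `toLokiSeries` are hypothetical names, and the group shape (a field map plus parallel timestamp and count slices) is a simplified model of a `/hits` response.

```go
package main

// hitsGroup is a simplified model of one group in a /hits response:
// the grouped field values plus parallel timestamp and count slices.
type hitsGroup struct {
	Fields     map[string]string
	Timestamps []string
	Values     []int64
}

// sample is one point of a Loki-style series.
type sample struct {
	Timestamp string
	Value     int64
}

// lokiSeries is a simplified Loki matrix series: a label set plus samples.
type lokiSeries struct {
	Metric  map[string]string
	Samples []sample
}

// toLokiSeries emits one series per top-N group and drops the remainder
// bucket, which is identified here by an empty field map.
func toLokiSeries(groups []hitsGroup) []lokiSeries {
	out := make([]lokiSeries, 0, len(groups))
	for _, g := range groups {
		if len(g.Fields) == 0 {
			continue // remainder bucket: not emitted
		}
		n := len(g.Timestamps)
		if len(g.Values) < n {
			n = len(g.Values)
		}
		s := lokiSeries{Metric: g.Fields, Samples: make([]sample, 0, n)}
		for i := 0; i < n; i++ {
			s.Samples = append(s.Samples, sample{Timestamp: g.Timestamps[i], Value: g.Values[i]})
		}
		out = append(out, s)
	}
	return out
}
```

Because the backend has already capped the number of groups, the conversion only ever walks the top-N groups plus one remainder entry.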
Long-Range Histograms And Grafana querySplitting​
For ranges ≥ 24h, Grafana's Loki datasource splits every metric range query into
24h chunks before sending them to the proxy. The split logic lives in
`public/app/plugins/datasource/loki/metricTimeSplitting.ts` -> `splitTimeRange()` and
fires for any client built on the Loki datasource: Drilldown, Explore, and
dashboard panels. The proxy must produce a chunk response shape that Grafana's
in-browser `mergeFrames` + `closestIdx` + `splice` algorithm
(`public/app/plugins/datasource/loki/mergeResponses.ts`) can glue back into a
coherent timeline. This section pins the contract.
What Grafana sends​
`splitTimeRange(start, end, step, oneDayMs)` produces:
- `floor(range / aligned_day)` chunks of `aligned_day - step` length (aligned to step boundary).
- One residual chunk of `(range mod aligned_day)` length, which is often smaller than `step` (e.g. for an exactly-24h request the residual is 0–120 s wide when step is 120 s; for 25 h it is ~1 h).

Chunks are dispatched newest-first because `runSplitGroupedQueries` recurses with `partition[totalRequests - 1]` first.
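The chunk arithmetic above can be modelled with a few integer operations. The sketch below is not Grafana's `splitTimeRange`; it only reproduces the counts and lengths described in this section, with all values in seconds. `chunkPlan` and `planChunks` are hypothetical names, and alignment of the start timestamp (which is what makes the exactly-24h residual vary between 0 and one step) is deliberately not modelled.

```go
package main

// chunkPlan summarises the split described above for one range query.
// All durations are in seconds.
type chunkPlan struct {
	AlignedDaySec int64 // 24h rounded down to a multiple of step
	FullChunks    int64 // floor(range / aligned_day)
	FullChunkSec  int64 // aligned_day - step
	ResidualSec   int64 // range mod aligned_day
}

// planChunks computes the plan for a query of rangeSec length with the
// given step and day length. Invalid input yields a zero plan.
func planChunks(rangeSec, stepSec, daySec int64) chunkPlan {
	if rangeSec <= 0 || stepSec <= 0 || daySec < stepSec {
		return chunkPlan{}
	}
	aligned := (daySec / stepSec) * stepSec
	return chunkPlan{
		AlignedDaySec: aligned,
		FullChunks:    rangeSec / aligned,
		FullChunkSec:  aligned - stepSec,
		ResidualSec:   rangeSec % aligned,
	}
}
```

For a 25h range with a 120 s step this gives one full chunk of 86280 s and a 3600 s residual, which is the ~1h leftover mentioned above.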
What the proxy must return per chunk​
Each chunk sub-request must produce a Loki matrix with:
- A single shared timestamp axis whose first and last buckets fall on step boundaries inside the chunk's `start..end` window.
- ≤ `drilldownHitsFieldsLimit` (currently 20) named series, with all top-N values spread across the chunk's full axis. One-bucket series, where every top-N value lands on the same timestamp, will be stacked by `mergeFrames` into a tall right-edge spike, which was the root cause of the 2026-06 incident.
How the proxy enforces this​
- `proxyStatsQueryRangeDrilldown` routes every `count_over_time` stats-compat query through the `/hits` path regardless of range (the historical 6h "hybrid threshold" is gone). Mixing `/hits` with `stats_query_range | limit 500` across chunks produced disjoint series sets that `mergeFrames` unioned into a 500-series block at the chunk boundary.
- `proxyStatsQueryRangeDrilldownHits` returns an empty Loki matrix (with response header `X-Proxy-Drilldown-Path: hits-leftover-suppressed`) when `isGrafanaSourcedRequest(r)` is true AND `end - start ≤ 2 × step`. The residual leftover from `splitTimeRange` falls into this bucket and the chart loses ≤ 1 step of width on the right edge instead of showing a spike.
- `isGrafanaSourcedRequest` detects ANY Grafana client: `X-Query-Tags: Source=grafana-…` (Drilldown / Explore), `User-Agent: Grafana/X.Y.Z` (dashboard panels), and any `X-Grafana-*` header (backend-routed Explore). The suppression therefore covers every Grafana surface, not only Drilldown.
- The bare-integer step Grafana sends (e.g. `step=120`) is normalised to a VL duration (`120s`) before reaching `/hits`. Without the suffix, VL's parser rejects the request and the entire `/hits` fast path silently falls back to legacy stats.
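The three checks above (Grafana-client detection, leftover suppression, step normalisation) can be sketched as follows. The function names are hypothetical stand-ins rather than the proxy's implementations, times are assumed to be in seconds, and the exact matching rules (substring and prefix checks) are assumptions.

```go
package main

import (
	"net/http"
	"strconv"
	"strings"
)

// isGrafanaSourced reports whether any of the three signals named above is
// present: a Grafana source tag, a Grafana User-Agent, or an X-Grafana-* header.
func isGrafanaSourced(h http.Header) bool {
	if strings.Contains(h.Get("X-Query-Tags"), "Source=grafana-") {
		return true
	}
	if strings.HasPrefix(h.Get("User-Agent"), "Grafana/") {
		return true
	}
	for name := range h {
		if strings.HasPrefix(strings.ToLower(name), "x-grafana-") {
			return true
		}
	}
	return false
}

// isLeftoverChunk implements the `end - start <= 2 * step` condition.
func isLeftoverChunk(startSec, endSec, stepSec int64) bool {
	if stepSec <= 0 {
		return false
	}
	return endSec-startSec <= 2*stepSec
}

// shouldSuppressLeftover combines both conditions: only Grafana-sourced
// leftover chunks get the empty-matrix response.
func shouldSuppressLeftover(h http.Header, startSec, endSec, stepSec int64) bool {
	return isGrafanaSourced(h) && isLeftoverChunk(startSec, endSec, stepSec)
}

// normalizeStep appends "s" to a bare-integer step ("120" -> "120s").
// Values that already carry a unit, or are not integers, pass through unchanged.
func normalizeStep(step string) string {
	step = strings.TrimSpace(step)
	if _, err := strconv.ParseInt(step, 10, 64); err == nil {
		return step + "s"
	}
	return step
}
```

Keeping the source check and the width check separate makes the 25h case explicit: a 1h chunk fails `isLeftoverChunk` at a 120 s step, so it is never suppressed regardless of source.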
Acceptable irreducible bump (25h–47h cases)​
For ranges like 25h the proxy cannot safely suppress the 1h leftover chunk because users genuinely query 1h ranges in Drilldown / Explore. The merged frame can therefore show a modest right-edge bump (≤ 65% of nonzero buckets fall into the rightmost bin) from the leftover chunk's chunk-only top-N. This is the price of preserving short-range queries; e2e thresholds reflect the trade-off and document it explicitly.
Do Not Regress​
The following invariants are pinned by `internal/proxy/drilldown_regression_lock_test.go`
and `test/e2e-compat/drilldown_chunked_merge_lock_test.go`. Each is named
`TestLock_*` / `TestE2ELock_*` and breaking any of them fails CI:
- Routing is source-agnostic for stats-compat shapes: Explore, Drilldown, dashboard panels, and raw API clients all reach the `/hits` fast paths.
- `/hits` runs for every range (no hybrid threshold gate).
- Leftover chunks (`end - start ≤ 2 × step`) from ANY Grafana source are suppressed and the response carries `X-Proxy-Drilldown-Path: hits-leftover-suppressed`.
- Step normalisation appends `s` to bare-integer Grafana steps.
- The `/hits` remainder bucket (`fields:{}`) is dropped before emission.
- Every emitted series shares the same timestamp axis (the chart cannot render distributed bars otherwise).
- `extractStreamSelectorOnly` handles both Loki-bracketed (`{namespace="prod"}`) and VL-native (`namespace:="prod"`) selectors.
- `fieldHasExistenceFilter` matches both unquoted (`pod:!""`) and quoted (`"k8s.pod.name":!""`) forms.
- Stats fallback (`stats_query_range`) only fires when `/hits` actually fails, never as a parallel call that overwrites a successful `/hits` result.
- `maxStatsQueryRangeBytes` and `drilldownHitsFieldsLimit` stay within their pinned windows (see the lock tests for the exact bounds).
- Grafana `mergeFrames` simulation on the live VL stack must produce a merged frame whose rightmost bin holds < 65 % of nonzero timestamps for 24h, 25h, 2d, and 7d ranges, for both Drilldown and dashboard sources.
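The right-edge threshold in the last invariant can be illustrated with a small metric. This is one plausible reading of "rightmost bin holds < 65 % of nonzero timestamps", not the lock test itself: `rightmostShare` is a hypothetical function that assumes every series shares one timestamp axis and counts nonzero samples across all series.

```go
package main

// rightmostShare returns the fraction of nonzero samples that sit in the
// last bucket of a merged frame. Each inner slice is one series on a shared
// axis. It returns 0 when the frame has no nonzero samples.
func rightmostShare(series [][]float64) float64 {
	var nonzero, rightmost int
	for _, values := range series {
		last := len(values) - 1
		for i, v := range values {
			if v == 0 {
				continue
			}
			nonzero++
			if i == last {
				rightmost++
			}
		}
	}
	if nonzero == 0 {
		return 0
	}
	return float64(rightmost) / float64(nonzero)
}
```

A frame in which every series has a single nonzero point at the right edge scores 1.0, which is the spike shape the invariants above are meant to rule out.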
Drilldown Capability Profiles​
| Drilldown version family | Capability profile | Proxy handling focus |
|---|---|---|
| `2.0.x` | `drilldown-v2` | detected-level defaults, modern service-detail scenes, patterns and field-value drill flows |
| `1.0.x` | `drilldown-v1` | legacy service buckets, filtered detected-fields path, prior labels/field rendering behavior |
These profiles are matrix-level compatibility profiles (contract and CI guidance). Runtime request handling must stay Loki-compatible and should not depend on guessed app build strings.
Known Issues​
Drilldown 2.0.4: Patterns Tab Initialization​
Drilldown 2.0.4 contains a bug in `subscribeToLokiConfig()` where `void 0 === null` (always false) prevents re-enabling the Patterns tab after it was disabled. Concretely:
- If the Grafana default datasource has `pattern_ingester_enabled=false` in `/loki/api/v1/drilldown-limits`, `$patternsData` is set to `null`.
- Switching to a datasource where `pattern_ingester_enabled=true` does NOT re-show the tab because the `void 0 === null` guard treats `null` as "already initialized".

Workaround: Configure the `patterns-autodetect` proxy variant as the Grafana default datasource. Since `pattern_ingester_enabled=true` is returned on first load, `$patternsData` is correctly initialized and the Patterns tab appears.
In the e2e-compat compose stack, `loki-vl-proxy-patterns-autodetect` is set as `isDefault: true` in `grafana-datasources.yaml` for this reason.
Release Watchlist​
Potential next family move:
- current: `2.0.x` (pinned: `2.0.4`)
- next expected family to evaluate: `2.1.x` (then `3.0.x` when released)

Promotion criteria for a new family:
- add versions to matrix manifest,
- verify `TestDrilldownTrackScore` and `TestDrilldown_RuntimeFamilyContracts` on pinned + smoke runtimes,
- confirm no regressions in patterns, labels/fields, and service detail flows.
Contracts We Enforce​
- `index/volume` must expose real `service_name` buckets
- `index/volume_range` must expose non-empty `detected_level` series names
- `detected_fields` must show parsed fields like `method`, `path`, `status`, `duration_ms`
- `detected_fields` must not leak indexed labels like `app`, `cluster`, or `namespace`
- `detected_fields` must suppress high-cardinality terminal timestamp fields (`timestamp_end`, `observed_timestamp_end`) so Drilldown field discovery does not trigger expensive backend stats paths that can flap into intermittent no-data responses
- In hybrid field mode, `detected_fields` may expose both native dotted fields and translated aliases such as `service.name` and `service_name`
- `labels` and `label/{name}/values` should stay stream-shaped; they should prefer VictoriaLogs stream metadata endpoints and only fall back to generic field endpoints for older backend versions
- `detected_fields`, `detected_labels`, and `detected_field/{name}/values` should prefer native VictoriaLogs metadata lookups where they map cleanly, then fall back to bounded raw-log sampling for parsed and derived fields
- Alias resolution must keep exact native matches working, allow unique translated aliases to resolve automatically, and avoid silently choosing the wrong native field when multiple dotted names collapse to the same Loki-safe alias
- Label-value resources for additional filters such as `cluster` must return real values
- Unknown label and detected-field lookups should keep a success payload shape instead of flipping into transport errors
- `patterns` must return non-empty grouped pattern payloads with sample buckets for Drilldown
- Multi-tenant Drilldown queries with repeated `var-levels=detected_level|=|...` selections must stay valid and return logs instead of backend parse errors
- (v1.17.1) When `detected_level` is synthesized in metric results, the raw `level` label is removed from those same results to prevent Drilldown from showing both labels simultaneously; the include button for `detected_level` values must work correctly without `level` duplication
- (v1.17.1) Nested JSON objects in the log body (e.g., `service={"name":"api-gateway"}`) must be excluded from the `detected_fields` field breakdown; exposing them previously broke the field breakdown view when users clicked on such a field
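The alias-resolution contract in the list above can be sketched as a three-way decision: exact match, unique alias, or ambiguous. `lokiSafeAlias` and `resolveField` are hypothetical names, and the translation rule (replace every character outside `[A-Za-z0-9_]` with `_`) is an assumption about what "Loki-safe" means here.

```go
package main

import "strings"

// lokiSafeAlias translates a native field name into a Loki-style label name
// by replacing every character outside [A-Za-z0-9_] with '_'.
func lokiSafeAlias(name string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '_':
			return r
		}
		return '_'
	}, name)
}

// resolveField maps a requested name onto a native field.
// Exact native matches win; otherwise the name resolves only when exactly
// one native field translates to it. Ambiguous or unknown names return false.
func resolveField(requested string, nativeFields []string) (string, bool) {
	for _, f := range nativeFields {
		if f == requested {
			return f, true
		}
	}
	match, count := "", 0
	for _, f := range nativeFields {
		if lokiSafeAlias(f) == requested {
			match = f
			count++
		}
	}
	if count == 1 {
		return match, true
	}
	return "", false
}
```

Returning `false` for the ambiguous case leaves the caller free to keep a success-shaped empty payload instead of guessing a native field.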
Edge Cases Covered​
- Mixed parser query path: `| json ... | logfmt | drop __error__, __error_details__`
- Labels object parsing in returned log frames
- App-level field suppression for `detected_level`, `level`, and `level_extracted`
- High-cardinality terminal timestamp keys (`timestamp_end`, `observed_timestamp_end`) are excluded from Drilldown detected-field responses while regular parsed fields stay visible
- `1.x` service-selection buckets, detected-fields filtering, and labels field parsing stay explicit in the source-contract checks
- `2.x` detected-level default columns, field-values breakdown scenes, and additional label-tab wiring stay explicit in the source-contract checks
- Grafana runtime `11.x` explicitly asserts `1.x`-style service buckets, filtered detected fields, and extra label values at runtime
- Grafana runtime `12.x` explicitly asserts `2.x`-style detected-level series, field-value breakdowns, and extra label values at runtime
- Grafana runtime `13.x` uses the same `2.x`-style contract as `12.x` (added to `RuntimeFamilyContracts` in v1.15.0)
- Service-detail field breakdowns and additional label filters
- Multi-tenant Drilldown log views filtered by `cluster` plus multiple selected `detected_level` values
- Multi-tenant Grafana resource calls with `__tenant_id__!~...` and `__tenant_id__="missing"` keep the correct narrowed or empty-success behavior
- Native field-value discovery for indexed metadata such as `service.name`, with parser-stage stripping before the backend lookup
- Fallback scanning for parsed-only fields such as `method` when no safe native metadata path exists
- Patterns grouping across repeated request shapes
- (v1.17.1) `detected_level`/`level` metric deduplication: raw `level` label removed from metric results when `detected_level` is synthesized, fixing the Drilldown include button for `detected_level` filter selections
- (v1.17.1) Nested JSON object field suppression: `service={"name":"..."}` and similar compound body fields are excluded from the field breakdown to prevent broken Drilldown field-click behavior
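The two v1.17.1 behaviours in the list above can be sketched as small filters. Both helpers are hypothetical: `dropRawLevel` models the label deduplication on a single metric label set, and `isNestedJSONObject` models the nested-object check as "value parses as a JSON object", which is an assumption about how such fields could be recognised.

```go
package main

import (
	"encoding/json"
	"strings"
)

// dropRawLevel returns a copy of the label set without the raw "level" label
// when "detected_level" is present. Label sets without "detected_level" are
// copied unchanged.
func dropRawLevel(labels map[string]string) map[string]string {
	_, hasDetected := labels["detected_level"]
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if hasDetected && k == "level" {
			continue
		}
		out[k] = v
	}
	return out
}

// isNestedJSONObject reports whether a field value is itself a JSON object,
// such as {"name":"api-gateway"}. Such fields would be excluded from the
// detected-fields breakdown.
func isNestedJSONObject(value string) bool {
	trimmed := strings.TrimSpace(value)
	if !strings.HasPrefix(trimmed, "{") {
		return false
	}
	return json.Valid([]byte(trimmed))
}
```

Copying the label set instead of mutating it keeps the original result available for non-Drilldown consumers that still expect the raw `level` label.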