Honest TLDR: measured 2026-06-05 on the included e2e stack

Real numbers from bench/drilldown-vs-loki.sh against the live test/e2e-compat compose stack. This is the long-form companion to the README one-glance comparison; the README has the headline ratios, this page has the per-query breakdown, methodology, and tuning we tried.

Workload + dataset

Workload: 14 minutes of Grafana Logs Drilldown traffic against namespace=prod. 36 cold queries at 30 m / 1 h / 2 h / 3 h / 6 h / 24 h ranges, covering the six call shapes Drilldown actually emits (sum by pod, sum by trace_id, sum by k8s_pod_name, sum by service_version, /detected_fields, /labels). Each query sent with the real Grafana headers (X-Query-Tags: Source=grafana-lokiexplore-app, User-Agent: Grafana/11.5.0) so Loki's own partial-results carve-out is active.
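
As a rough illustration of what one such cold query looks like on the wire, the Go sketch below builds a query_range request carrying the two Grafana headers named above. The helper name newDrilldownRequest, the base URL, and the step value are assumptions made for this example; the bench script may construct its requests differently.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// newDrilldownRequest builds a Loki query_range request that carries the
// headers Grafana Logs Drilldown sends. baseURL is a hypothetical endpoint;
// startNs and endNs are Unix timestamps in nanoseconds.
func newDrilldownRequest(baseURL, logQL string, startNs, endNs int64) (*http.Request, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, err
	}
	u.Path = "/loki/api/v1/query_range"

	q := url.Values{}
	q.Set("query", logQL)
	q.Set("start", strconv.FormatInt(startNs, 10))
	q.Set("end", strconv.FormatInt(endNs, 10))
	q.Set("step", "120") // assumed step in seconds, not taken from the bench
	u.RawQuery = q.Encode()

	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}
	// The two headers that mark the request as Drilldown traffic.
	req.Header.Set("X-Query-Tags", "Source=grafana-lokiexplore-app")
	req.Header.Set("User-Agent", "Grafana/11.5.0")
	return req, nil
}

func main() {
	const endNs int64 = 1_780_000_000_000_000_000
	const startNs = endNs - 30*60*1_000_000_000 // 30 m range
	req, err := newDrilldownRequest("http://localhost:3100",
		`sum by (pod) (count_over_time({namespace="prod", pod!=""}[2m]))`,
		startNs, endNs)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path, req.Header.Get("X-Query-Tags"))
}
```

Pointing the same request at the proxy instead of Loki only requires changing the base URL, which is what makes a like-for-like comparison possible.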

Dataset: 8 M log entries across 15 services over 7 days. namespace=prod carries ~5 000 active pods, ~50 000 unique trace_id values per hour, plus the usual app / level / app_kind / k8s.* labels. Host: Apple M5 Pro, 64 GB system RAM, Docker Desktop allocated 17.3 GiB.

Important caveat about Loki's container budget. The first runs of this bench gave Loki the historical 8 GiB container limit, and Loki OOM-looped continuously: 35 restarts in one session, 0 successful queries. That was a setup artifact: 8 GiB is too little for a 7-day dataset with 6.7 GiB of stored bigParts (Loki just idling consumed 7.6 GiB). After bumping the limit to 12 GiB with GOMEMLIMIT=10GiB and GOGC=80, Loki actually serves queries. All numbers below are from the fair (12 GiB) run, committed at test/e2e-compat/results/drilldown-vs-loki-fair-*.tsv.

One-glance comparison (full table)

| Metric | Loki direct | VictoriaLogs | Proxy | vmauth (optional gateway) |
| --- | --- | --- | --- | --- |
| Cold queries returning real data | 9 / 36 (25 %) | n/a (proxy-fronted) | 35 / 36 (97 %) | – |
| Cold queries silently returning empty | 13 / 36 ← blank panels in Grafana | – | 0 | – |
| Cold queries returning HTTP 500 / 502 | 9 / 36 | – | 1 (known VL parser-pipe limit on trace_id 6 h) | – |
| Cold queries that timed out (60 s) | 5 / 36 | – | 0 | – |
| Median cold latency on shared-success queries (pod 30 m – 2 h, labels 30 m – 24 h) | 9 ms – 6 302 ms | – | 8 ms – 436 ms (10 – 25× faster on metric paths; tied on labels) | – |
| Total CPU·s consumed across the 17.5-min bench | 2 712.6 cpu·s (45 cpu·min) | 117.2 cpu·s | 5.2 cpu·s (~0.09 cpu·min) | 15.1 cpu·s |
| Avg CPU during the bench | 2.58 cores | 0.11 cores | 0.005 cores | 0.014 cores |
| Peak CPU during the bench | 14.44 cores | 6.99 cores | 0.14 cores | 0.90 cores |
| Steady-state RSS (process_resident_memory_bytes) | 8 466 MiB idle, 10 445 MiB peak | 459 MiB idle, 2 129 MiB peak | 38 MiB idle, ~44 MiB peak | 55 MiB idle, ~60 MiB peak |
| Go heap (go_memstats_heap_inuse_bytes) | – | – | 43 MiB | 62 MiB |
| Cache contents observed via /metrics | – | – | L1: 10 objects / 1 976 bytes (most queries had unique timestamps → cache miss) | request-buffer pool, drops to idle between bursts |

Headline ratios. Loki burned ~520× more CPU than the proxy (2 712 vs 5.2 cpu·s) and ~240× more peak RSS (10 445 vs 44 MiB), while serving roughly a quarter as many queries successfully (9 vs 35). Combined proxy + VL + vmauth: ~140 cpu·s and ~2.2 GiB peak (steady-state ≈ 550 MiB) for 97 % of the workload.

Methodology: RSS comes from each program's own /metrics, not docker stats; CPU is integrated from the docker-stats stream. The two RSS sources can disagree by more than an order of magnitude. docker stats MemUsage includes Go's sys_bytes (mmap regions reserved from the OS but not actively dirtied); for a Go program it can show 2.4 GiB while the actual working set is 60 MiB. The RSS column above is process_resident_memory_bytes from each program's own /metrics, which reflects real anonymous RSS. CPU numbers are integrated from the docker-stats stream captured during the bench (test/e2e-compat/results/docker-stats-fair-*.tsv); process_cpu_seconds_total from each program's /metrics corroborates them within ~10 %.
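
The two measurements described above can be sketched in Go as follows. This is a minimal illustration, not the bench's actual tooling: gaugeValue, cpuSample, and cpuSeconds are hypothetical names, the parser only handles unlabelled metrics in the Prometheus text format, and the CPU integration assumes trapezoidal integration over docker-stats samples where 100 % equals one full core.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// gaugeValue scans Prometheus text-format output for an unlabelled metric
// (for example process_resident_memory_bytes) and returns its value.
func gaugeValue(metricsText, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(metricsText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and # HELP / # TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != name {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return 0, false
		}
		return v, true
	}
	return 0, false
}

// cpuSample is one row of a docker-stats capture.
type cpuSample struct {
	atSeconds float64 // seconds since the bench started
	percent   float64 // CPU %, where 100 means one full core
}

// cpuSeconds integrates CPU usage over time with the trapezoid rule.
func cpuSeconds(samples []cpuSample) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		dt := samples[i].atSeconds - samples[i-1].atSeconds
		avgCores := (samples[i].percent + samples[i-1].percent) / 200
		total += avgCores * dt
	}
	return total
}

func main() {
	metrics := "# TYPE process_resident_memory_bytes gauge\n" +
		"process_resident_memory_bytes 4.0108032e+07\n"
	rss, ok := gaugeValue(metrics, "process_resident_memory_bytes")
	fmt.Printf("rss=%.1f MiB found=%v\n", rss/(1<<20), ok)

	samples := []cpuSample{{0, 100}, {10, 100}, {20, 300}}
	fmt.Printf("cpu=%.1f cpu-seconds\n", cpuSeconds(samples))
}
```

Dividing the integrated cpu-seconds by the capture duration gives the average-cores figure, which is how a total and an average in the same table can be cross-checked against each other.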

Outcomes: 36 cold queries each (6 query shapes × 6 ranges from 30 m to 24 h)

| Outcome | Loki direct | Proxy |
| --- | --- | --- |
| 200 OK with data | 9 | 35 |
| 200 OK but empty result (silent fail) | 13 ← Grafana renders blank panel | 0 |
| HTTP 500 / 502 | 9 | 1 (known VL parser-pipe limit on trace_id 6 h) |
| Timeout / no response | 5 | 0 |

The proxy serves 35 of the 36 queries Drilldown actually emits; Loki direct serves 9. The most user-hostile failure mode is the silent-empty bucket: Loki returns HTTP 200 with result:[] and Grafana shows a blank panel with no error, so the operator assumes there is no data when in fact Loki gave up.
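
To make the four outcome buckets concrete, here is a hedged Go sketch of how a bench harness could tell a silent-empty response apart from a real one. The classify function and its bucket names are hypothetical, and the struct only models the data.result array of a query_range response; endpoints such as /labels and /detected_fields use different response shapes and would need their own emptiness check.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryRangeResponse models only the fields needed to detect an empty result.
type queryRangeResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []json.RawMessage `json:"result"`
	} `json:"data"`
}

// classify sorts one response into an outcome bucket. A 200 with an empty
// result array is the "silent fail" case that renders as a blank panel.
func classify(httpStatus int, body []byte, timedOut bool) string {
	switch {
	case timedOut:
		return "timeout"
	case httpStatus >= 500:
		return "http_5xx"
	case httpStatus != 200:
		return "http_other"
	}
	var resp queryRangeResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "unparseable"
	}
	if len(resp.Data.Result) == 0 {
		return "silent_empty"
	}
	return "ok_with_data"
}

func main() {
	empty := []byte(`{"status":"success","data":{"resultType":"matrix","result":[]}}`)
	fmt.Println(classify(200, empty, false))
	fmt.Println(classify(500, nil, false))
}
```

Checking the body and not just the status code is the important part: a harness that counts every HTTP 200 as a success would report 22 of 36 for Loki instead of 9.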

Side-by-side cold latency where both succeed with real data

| Query | Range | Loki | Proxy | Speedup |
| --- | --- | --- | --- | --- |
| sum by (pod) | 30 m | 6 302 ms (9 274 series) | 436 ms (5 000 series via /hits) | 14.5× |
| sum by (pod) | 1 h | 5 412 ms (17 342 series) | 223 ms (5 000 series) | 24.3× |
| sum by (pod) | 2 h | 3 502 ms (35 388 series) | 354 ms (5 000 series) | 9.9× |
| /loki/api/v1/labels | 30 m – 24 h | 9 – 27 ms | 8 – 31 ms | parity |

Loki returns the full unbounded series set on pod (9 k – 35 k unique values); the proxy routes through VL's /select/logsql/hits and returns top-N + remainder. Both render the same chart shape in Grafana; the proxy uses an order of magnitude less wall-time and bandwidth.
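
The "top-N + remainder" shape can be illustrated with a small Go sketch. This is not how VictoriaLogs implements /select/logsql/hits; topNWithRemainder and seriesCount are hypothetical names, and the function simply shows the idea of keeping the N largest groups and folding everything else into one remainder total.

```go
package main

import (
	"fmt"
	"sort"
)

// seriesCount is one group (for example one pod) and its hit count.
type seriesCount struct {
	Label string
	Hits  int
}

// topNWithRemainder returns the n groups with the most hits, ordered by
// descending hits (ties broken by label), plus the summed hits of the rest.
func topNWithRemainder(counts map[string]int, n int) ([]seriesCount, int) {
	all := make([]seriesCount, 0, len(counts))
	for label, hits := range counts {
		all = append(all, seriesCount{Label: label, Hits: hits})
	}
	sort.Slice(all, func(i, j int) bool {
		if all[i].Hits != all[j].Hits {
			return all[i].Hits > all[j].Hits
		}
		return all[i].Label < all[j].Label
	})
	if n < 0 {
		n = 0
	}
	if n > len(all) {
		n = len(all)
	}
	remainder := 0
	for _, s := range all[n:] {
		remainder += s.Hits
	}
	return all[:n], remainder
}

func main() {
	counts := map[string]int{"pod-a": 5, "pod-b": 3, "pod-c": 2, "pod-d": 1}
	top, rest := topNWithRemainder(counts, 2)
	fmt.Println(top, rest)
}
```

The response size is bounded by N regardless of how many unique values exist, which is why the proxy's series count stays flat while Loki's grows with the range.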

Where only the proxy returns a usable answer

These are the queries Grafana Logs Drilldown emits when a user clicks a field on namespace=prod β€” and where Loki direct silently breaks the UX:

  • sum by (trace_id) (...) 30 m – 24 h → Loki HTTP 500 at every range; proxy 406 ms – 1 232 ms (except 6 h, where the proxy hits the known VL parser-pipe limit).
  • sum by (service_version) (...) 1 h – 24 h → Loki returns HTTP 200 + 0 series (silent); proxy 76 – 240 ms with 100 versions.
  • sum by (k8s_pod_name) (...) 30 m – 24 h → Loki silent empty; proxy 116 – 257 ms with 2 956 – 4 994 series.
  • /loki/api/v1/detected_fields 1 h – 24 h → Loki silent empty or 500; proxy 36 – 323 ms with the OTel field map populated.
  • sum by (pod) (...) 24 h → Loki HTTP 500 (too_many_series); proxy 549 ms (16-series chart via /hits top-N).

Resource consumption (14 min bench window)

| Container | Peak CPU | Peak RSS | Notes |
| --- | --- | --- | --- |
| e2e-loki | 1 444 % (≈ 14 cores) | 10 445 MiB | within 12 GiB budget, no OOM |
| e2e-victorialogs | 699 % (≈ 7 cores) | 2 129 MiB | served everything the proxy asked for |
| e2e-proxy | 14 % (≈ 0.1 cores) | 44 MiB | negligible |
| e2e-proxy-vmauth | 90 % (≈ 0.9 cores) | 2 254 MiB | cache layer |

Combined VL + proxy stack: ~7 cores and ~2.2 GiB to serve 35/36 queries. Loki standalone: ~14 cores and ~10.2 GiB to serve 9/36. Roughly half the CPU and one-fifth the RAM to serve about four times as many queries.

Proxy heap behaviour: why it spikes, and what we did about it

During the bench the proxy's process RSS climbed from a 38 MiB idle baseline to a transient peak of ~1.4 GiB under unbounded load. A pprof snapshot (test/e2e-compat/results/pprof/heap-*.pb.gz) attributed 96 MB (61 %) of live heap to bytes.growSlice from compatCacheMiddleware → CompressionHandlerWithOptions: the response buffer being grown to hold the gzipped cache entry for each large query_range response (5 000 series × ~25 KB ≈ 5 MB per response, ~20 concurrent in flight). Cumulative allocations since process start were 33.2 GB, with 23.4 GB (70 %) in fastjson.(*cache).getValue and 3.99 GB in the same compression growSlice path.

The post-fix proxy (commit 2340928) pools both buffers and pre-sizes the hits-matrix slices:

  • EncodeResponseBody uses a pooled bytes.Buffer with a 4 MiB cap-trim, which kills the per-cache-write allocation cascade.
  • compressedResponseWriter.buf switched from value to pooled *bytes.Buffer with an explicit acquire/release lifecycle, which caps the per-request hold-buffer cost.
  • buildHitsRangeMetricMatrix pre-sizes the per-series values slice to the actual bucket count instead of starting at cap 16, which eliminates the 1.31 GB cumulative growSlice cascade for that one function.
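
The pooling and pre-sizing techniques in the list above follow a common Go pattern, sketched below. This is an illustration under stated assumptions, not the proxy's code: acquireBuffer, releaseBuffer, encodeBody, decodeBody, and newSeriesValues are hypothetical names, and the only detail taken from the text is the 4 MiB cap-trim threshold.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"sync"
)

// maxPooledCap is the cap-trim threshold: larger buffers are dropped
// instead of pooled so one huge response cannot pin memory forever.
const maxPooledCap = 4 << 20 // 4 MiB

var bufPool = sync.Pool{New: func() interface{} { return new(bytes.Buffer) }}

// acquireBuffer returns an empty buffer, reusing a pooled one if available.
func acquireBuffer() *bytes.Buffer {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	return buf
}

// releaseBuffer returns buf to the pool and reports whether it was kept.
func releaseBuffer(buf *bytes.Buffer) bool {
	if buf.Cap() > maxPooledCap {
		return false // oversized: let the GC reclaim it
	}
	buf.Reset()
	bufPool.Put(buf)
	return true
}

// encodeBody gzips payload through a pooled buffer and copies the result
// out, so the caller never holds a reference into pooled memory.
func encodeBody(payload []byte) ([]byte, error) {
	buf := acquireBuffer()
	defer releaseBuffer(buf)
	zw := gzip.NewWriter(buf)
	if _, err := zw.Write(payload); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}

// decodeBody reverses encodeBody; it exists only to check the round trip.
func decodeBody(compressed []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// newSeriesValues pre-sizes a slice to the known bucket count so appends
// never trigger a grow-and-copy cycle.
func newSeriesValues(bucketCount int) []float64 {
	return make([]float64, 0, bucketCount)
}

func main() {
	out, err := encodeBody([]byte("example response body"))
	if err != nil {
		panic(err)
	}
	back, err := decodeBody(out)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(back), cap(newSeriesValues(90)))
}
```

The copy at the end of encodeBody is the design choice worth noting: returning buf.Bytes() directly would hand out memory that the next request overwrites once the buffer goes back to the pool.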

Both pools are guarded by heap-bounded regression tests:

  • TestEncodeResponseBody_PoolKeepsHeapBounded: 1 000 sequential 288 KiB encodes, asserts heap delta < 32 MiB (current run measures -0.28 MiB; the pool freed memory mid-test).
  • TestEncodeResponseBody_PoolStableUnderConcurrency: 32 × 100 concurrent, asserts < 128 MiB.
  • TestCompressedResponseWriter_HeapStableUnderRepeatedRequests: 1 000 handler calls, asserts < 32 MiB.
  • TestBuildHitsRangeMetricMatrix_HeapBoundedAcrossManyCalls: 100 × 20-series builds, asserts < 16 MiB.

Plus an e2e lock: TestE2ELock_ProxyHeapBoundedUnderDrilldownLoad drives 30 concurrent workers × 60 s of Drilldown traffic against the live proxy and asserts heap_inuse < 500 MiB and process_resident_memory_bytes < 800 MiB. Any future PR that removes a pool or replaces it with per-request allocation fails CI with a named test pointing back at this work.
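
A heap-bounded assertion of the kind described above can be sketched with the standard runtime package. The helper names heapInuseMiB and heapDeltaMiB are hypothetical and the real tests may measure differently; the sketch assumes the bound is checked on HeapInuse after a forced GC, and it returns a signed delta because the measurement can come out negative, as the -0.28 MiB figure shows.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapInuseMiB forces a GC so garbage from earlier work is not counted,
// then reports the in-use heap in MiB.
func heapInuseMiB() float64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return float64(m.HeapInuse) / (1 << 20)
}

// heapDeltaMiB runs work and returns how much the in-use heap changed.
// The result is signed: it is negative if memory was freed during the run.
func heapDeltaMiB(work func()) float64 {
	before := heapInuseMiB()
	work()
	return heapInuseMiB() - before
}

func main() {
	delta := heapDeltaMiB(func() {
		// 1 000 sequential 288 KiB allocations that are not retained.
		for i := 0; i < 1000; i++ {
			buf := make([]byte, 288<<10)
			buf[0] = byte(i)
		}
	})
	fmt.Printf("heap delta: %.2f MiB (bound: 32 MiB)\n", delta)
}
```

In a real test the work function would call the code under test, and the assertion would fail the build when the delta crosses the bound, which is what turns a one-off fix into a regression guard.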

What we did not fix in this round: fastjson's 23 GB cumulative is intrinsic to its per-Parse cache reset pattern; the existing three fj.ParserPool instances (statsQRFJPool, statsTranslateFJPool, vlFJParserPool) already amortize Parser allocation, but the cache rebuilds per call. Reducing it further requires replacing the parser, which is a deeper refactor for a separate PR. Also pending in a separate PR: NDJSON line-by-line translation (eliminates the residual buffer-then-translate pattern) and a size-threshold cache skip (Drilldown queries with unique timestamps barely cache-hit, so skipping cache for >2 MiB responses saves a pure-overhead buffering pass).

What this is and isn't

  • This is honest. Real numbers from real queries on the live stack. Setup, raw data, and reproduction command are in test/e2e-compat/results/ and bench/drilldown-vs-loki.sh.
  • It is NOT "Loki is always broken." Loki holds up on the labels endpoint (tied with the proxy) and on pod at short ranges. It legitimately works for many workloads, just not for the high-cardinality per-stream aggregations Drilldown emits.
  • Loki's structural issue. Loki's read path requires materializing one series per unique label combination per step bucket. For sum by (FIELD) (count_over_time({namespace="prod", FIELD!=""}[2m])) on a 5 000-pod namespace, the working set blows past max_query_series and any chunk-store budget before the result is ready. No Loki config we tested (max_query_series up to 1 M, cardinality_limit up to 1 M, max_query_parallelism up to 256, GOMEMLIMIT up to 20 GiB) made these succeed at full cardinality; it's algorithmic, not a tuning gap. The proxy bypasses it by routing through VL's columnar /select/logsql/hits which computes top-N server-side.
  • Errors are converted, not leaked. When VL legitimately fails (parser-pipe row scan limit on shapes like |json|trace_id!="" at long ranges), the proxy converts the upstream 4xx/5xx into Loki's HTTP 200 + warnings:[...] partial-results envelope, the same shape Loki itself emits for X-Query-Tags: Source=grafana-lokiexplore-app traffic. Grafana renders a warning badge, not an error toast. Non-Grafana clients (curl, internal tooling) still see the real upstream error. Locked by TestLock_VLErrorsConvertedToPartialResults (5 subtests).
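
The error-conversion rule in the last bullet can be sketched as a pure function. Everything here is an assumption for illustration: convertUpstreamError is a hypothetical name, the envelope fields beyond status and warnings are a guess at a plausible matrix-shaped body, and the warning text format is invented. Only the trigger condition (the X-Query-Tags value) and the 200 + warnings outcome come from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// drilldownTag is the X-Query-Tags value that marks Drilldown traffic.
const drilldownTag = "Source=grafana-lokiexplore-app"

// partialResults is an assumed shape for the 200 + warnings envelope.
type partialResults struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string        `json:"resultType"`
		Result     []interface{} `json:"result"`
	} `json:"data"`
	Warnings []string `json:"warnings"`
}

// convertUpstreamError rewrites an upstream 4xx/5xx into a 200 with a
// warning, but only for Drilldown traffic. Other clients get the original
// status and body unchanged.
func convertUpstreamError(queryTags string, upstreamStatus int, upstreamBody string) (int, string) {
	if upstreamStatus < 400 || !strings.Contains(queryTags, drilldownTag) {
		return upstreamStatus, upstreamBody
	}
	var env partialResults
	env.Status = "success"
	env.Data.ResultType = "matrix"
	env.Data.Result = []interface{}{} // non-nil so it encodes as [] not null
	env.Warnings = []string{fmt.Sprintf("upstream %d: %s", upstreamStatus, upstreamBody)}
	out, err := json.Marshal(env)
	if err != nil {
		return upstreamStatus, upstreamBody // fall back to the real error
	}
	return 200, string(out)
}

func main() {
	status, body := convertUpstreamError(drilldownTag, 500, "row scan limit")
	fmt.Println(status, body)
	status, body = convertUpstreamError("", 500, "row scan limit")
	fmt.Println(status, body)
}
```

Keying the conversion on the request header keeps the behaviour honest for curl and internal tooling, which continue to see the upstream failure as it happened.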

Reproduce: ./bench/drilldown-vs-loki.sh --ranges=30m,1h,2h,3h,6h,24h --queries=pod,trace_id,k8s_pod_name,service_version,detected_fields,labels. All numbers above are cold-path (cache miss, first request).