Honest TLDR β measured 2026-06-05 on the included e2e stack
Real numbers from bench/drilldown-vs-loki.sh against the live test/e2e-compat compose stack. This is the long-form companion to the README one-glance comparison; the README has the headline ratios, this page has the per-query breakdown, methodology, and tuning we tried.
Workload + datasetβ
Workload: 14 minutes of Grafana Logs Drilldown traffic against namespace=prod. 36 cold queries at 30 m / 1 h / 2 h / 3 h / 6 h / 24 h ranges, covering the six call shapes Drilldown actually emits (sum by pod, sum by trace_id, sum by k8s_pod_name, sum by service_version, /detected_fields, /labels). Each query sent with the real Grafana headers (X-Query-Tags: Source=grafana-lokiexplore-app, User-Agent: Grafana/11.5.0) so Loki's own partial-results carve-out is active.
Dataset: 8 M log entries across 15 services over 7 days. namespace=prod carries ~5 000 active pods, ~50 000 unique trace_id values per hour, plus the usual app / level / app_kind / k8s.* labels. Host: Apple M5 Pro, 64 GB system RAM, Docker Desktop allocated 17.3 GiB.
Important caveat about Loki's container budget. The first runs of this bench gave Loki the historical 8 GiB container limit and Loki OOM-looped continuously β 35 restarts in one session, 0 successful queries. That was a setup artifact: 8 GiB is too little for a 7-day dataset with 6.7 GiB of stored bigParts (Loki just idling consumed 7.6 GiB). After bumping to 12 GiB with GOMEMLIMIT=10GiB + GOGC=80 Loki actually serves queries. All numbers below are from the fair (12 GiB) run β committed at test/e2e-compat/results/drilldown-vs-loki-fair-*.tsv.
One-glance comparison (full table)β
| Loki direct | VictoriaLogs | Proxy | vmauth (optional gateway) | |
|---|---|---|---|---|
| Cold queries returning real data | 9 / 36 (25 %) | n/a (proxy-fronted) | 35 / 36 (97 %) | β |
| Cold queries silently returning empty | 13 / 36 β blank panels in Grafana | β | 0 | β |
| Cold queries returning HTTP 500 / 502 | 9 / 36 | β | 1 (known VL parser-pipe limit on trace_id 6 h) | β |
| Cold queries that timed out (60 s) | 5 / 36 | β | 0 | β |
Median cold latency on shared-success queries (pod 30 m β 2 h, labels 30 m β 24 h) | 9 ms β 6 302 ms | β | 8 ms β 436 ms (10 β 25Γ faster on metric paths; tied on labels) | β |
| Total CPUΒ·s consumed across the 17.5-min bench | 2 712.6 cpuΒ·s (45 cpuΒ·min) | 117.2 cpuΒ·s | 5.2 cpuΒ·s (~0.09 cpuΒ·min) | 15.1 cpuΒ·s |
| Avg CPU during the bench | 2.58 cores | 0.11 cores | 0.005 cores | 0.014 cores |
| Peak CPU during the bench | 14.44 cores | 6.99 cores | 0.14 cores | 0.90 cores |
Steady-state RSS (process_resident_memory_bytes) | 8 466 MiB idle, 10 445 MiB peak | 459 MiB idle, 2 129 MiB peak | 38 MiB idle, ~44 MiB peak | 55 MiB idle, ~60 MiB peak |
Go heap (go_memstats_heap_inuse_bytes) | β | β | 43 MiB | 62 MiB |
Cache contents observed via /metrics | β | β | L1: 10 objects / 1 976 bytes (most queries had unique timestamps β cache miss) | request-buffer pool, drops to idle between bursts |
Headline ratios. Loki burned ~520Γ more CPU than the proxy (2 712 vs 5.2 cpuΒ·s) and ~190Γ more peak RSS (10 445 vs 44 MiB) β to serve fewer than a quarter of the successful queries. Combined proxy + VL + vmauth: ~140 cpuΒ·s and ~2.2 GiB peak (steady-state β 550 MiB) for 97 % of the workload.
Methodology β both numbers come from each program's own
/metrics, not docker stats. The two RSS sources can disagree by orders of magnitude.docker stats MemUsageincludes Go'ssys_bytes(mmap regions reserved from the OS but not actively dirtied) β for a Go program that can show 2.4 GiB while the actual working set is 60 MiB. The RSS column above isprocess_resident_memory_bytesfrom each program's own/metrics, which reflects real anonymous RSS. CPU numbers are integrated from the docker-stats stream captured during the bench (test/e2e-compat/results/docker-stats-fair-*.tsv) βprocess_cpu_seconds_totalfrom each program's/metricscorroborates within ~10 %.
Outcomes β 36 cold queries each (6 query shapes Γ 6 ranges from 30 m to 24 h)β
| Outcome | Loki direct | Proxy |
|---|---|---|
| 200 OK with data | 9 | 35 |
| 200 OK but empty result (silent fail) | 13 β Grafana renders blank panel | 0 |
| HTTP 500 / 502 | 9 | 1 (known VL parser-pipe limit on trace_id 6 h) |
| Timeout / no response | 5 | 0 |
The proxy serves 35 / 36 of Drilldown's actual call shapes; Loki direct serves 9 / 36. The most user-hostile failure mode is the silent-empty bucket β Loki returns HTTP 200 + result:[] and Grafana shows a blank panel with no error, so the operator assumes there's no data when in fact Loki gave up.
Side-by-side cold latency where both succeed with real dataβ
| Query | Range | Loki | Proxy | Speedup |
|---|---|---|---|---|
sum by (pod) | 30 m | 6 302 ms (9 274 series) | 436 ms (5 000 series via /hits) | 14.5Γ |
sum by (pod) | 1 h | 5 412 ms (17 342 series) | 223 ms (5 000 series) | 24.3Γ |
sum by (pod) | 2 h | 3 502 ms (35 388 series) | 354 ms (5 000 series) | 9.9Γ |
/loki/api/v1/labels | 30 m β 24 h | 9 β 27 ms | 8 β 31 ms | parity |
Loki returns the full unbounded series set on pod (9 k β 35 k unique values); the proxy routes through VL's /select/logsql/hits and returns top-N + remainder. Both render the same chart shape in Grafana; the proxy uses an order of magnitude less wall-time and bandwidth.
Where only the proxy returns a usable answerβ
These are the queries Grafana Logs Drilldown emits when a user clicks a field on namespace=prod β and where Loki direct silently breaks the UX:
sum by (trace_id) (...)30 m β 24 h β Loki HTTP 500 every range; proxy 406 ms β 1 232 ms.sum by (service_version) (...)1 h β 24 h β Loki returnsHTTP 200+ 0 series (silent); proxy 76 β 240 ms with 100 versions.sum by (k8s_pod_name) (...)30 m β 24 h β Loki silent empty; proxy 116 β 257 ms with 2 956 β 4 994 series./loki/api/v1/detected_fields1 h β 24 h β Loki silent empty or 500; proxy 36 β 323 ms with the OTel field map populated.sum by (pod) (...)24 h β Loki HTTP 500 (too_many_series); proxy 549 ms (16-series chart via /hits top-N).
Resource consumption (14 min bench window)β
| Container | Peak CPU | Peak RSS | Notes |
|---|---|---|---|
e2e-loki | 1 444 % (β 14 cores) | 10 445 MiB | within 12 GiB budget, no OOM |
e2e-victorialogs | 699 % (β 7 cores) | 2 129 MiB | served everything the proxy asked for |
e2e-proxy | 14 % (β 0.1 cores) | 44 MiB | negligible |
e2e-proxy-vmauth | 90 % (β 0.9 cores) | 2 254 MiB | cache layer |
Combined VL + proxy stack: ~7 cores and ~2.2 GiB to serve 35/36 queries. Loki standalone: 14 cores and 10.4 GiB to serve 9/36. Roughly half the CPU and one-fifth the RAM to serve four times more of the workload.
Proxy heap behaviour β why it spikes, and what we did about itβ
During the bench the proxy's process RSS climbed from a 38 MiB idle baseline to a transient peak of ~1.4 GiB under unbounded load. A pprof snapshot (test/e2e-compat/results/pprof/heap-*.pb.gz) attributed 96 MB (61 %) of live heap to bytes.growSlice from compatCacheMiddleware β CompressionHandlerWithOptions β the response buffer being grown to hold the gzipped cache entry for each large query_range response (5 000 series Γ ~25 KB β 5 MB per response, ~20 concurrent in flight). Cumulative allocations since process start were 33.2 GB, with 23.4 GB (70 %) in fastjson.(*cache).getValue and 3.99 GB in the same compression growSlice path.
The post-fix proxy (commit 2340928) pools both buffers:
EncodeResponseBodyuses a pooledbytes.Bufferwith a 4 MiB cap-trim β kills the per-cache-write allocation cascade.compressedResponseWriter.bufswitched from value to pooled*bytes.Bufferwith an explicit acquire/release lifecycle β caps the per-request hold-buffer cost.buildHitsRangeMetricMatrixpre-sizes the per-seriesvaluesslice to the actual bucket count instead of starting at cap 16 β eliminates the 1.31 GB cumulative growSlice cascade for that one function.
Both pools are guarded by heap-bounded regression tests: TestEncodeResponseBody_PoolKeepsHeapBounded (1 000 sequential 288 KiB encodes, asserts heap delta < 32 MiB; current run measures -0.28 MiB β pool freed memory mid-test), TestEncodeResponseBody_PoolStableUnderConcurrency (32 Γ 100 concurrent, asserts < 128 MiB), TestCompressedResponseWriter_HeapStableUnderRepeatedRequests (1 000 handler calls, < 32 MiB), TestBuildHitsRangeMetricMatrix_HeapBoundedAcrossManyCalls (100 Γ 20-series builds, < 16 MiB). Plus an e2e lock: TestE2ELock_ProxyHeapBoundedUnderDrilldownLoad drives 30 concurrent workers Γ 60 s of Drilldown traffic against the live proxy and asserts heap_inuse < 500 MiB and process_resident_memory_bytes < 800 MiB. Any future PR that removes a pool or replaces it with per-request allocation fails CI with a named test pointing back at this work.
What we did not fix in this round: fastjson's 23 GB cumulative is intrinsic to its per-Parse cache reset pattern; the existing three fj.ParserPool instances (statsQRFJPool, statsTranslateFJPool, vlFJParserPool) already amortize Parser allocation, but the cache rebuilds per call. Reducing it further requires replacing the parser β a deeper refactor, separate PR. Also pending separate PR: NDJSON line-by-line translation (eliminates the residual buffer-then-translate pattern) and a size-threshold cache skip (Drilldown queries with unique timestamps barely cache-hit, so skipping cache for >2 MiB responses saves a pure-overhead buffering pass).
What this is and isn'tβ
- This is honest. Real numbers from real queries on the live stack. Setup, raw data, and reproduction command are in
test/e2e-compat/results/andbench/drilldown-vs-loki.sh. - It is NOT "Loki is always broken." Loki holds up on the
labelsendpoint (tied with the proxy) and onpodat short ranges. It legitimately works for many workloads β just not for the high-cardinality per-stream aggregations Drilldown emits. - Loki's structural issue. Loki's read path requires materializing one series per unique label combination per step bucket. For
sum by (FIELD) (count_over_time({namespace="prod", FIELD!=""}[2m]))on a 5 000-pod namespace, the working set blows pastmax_query_seriesand any chunk-store budget before the result is ready. No Loki config we tested (max_query_series up to 1 M, cardinality_limit up to 1 M, max_query_parallelism up to 256, GOMEMLIMIT up to 20 GiB) made these succeed at full cardinality β it's algorithmic, not a tuning gap. The proxy bypasses it by routing through VL's columnar/select/logsql/hitswhich computes top-N server-side. - Errors are converted, not leaked. When VL legitimately fails (parser-pipe row scan limit on shapes like
|json|trace_id!=""at long ranges), the proxy converts the upstream 4xx/5xx into Loki'sHTTP 200 + warnings:[...]partial-results envelope β the same shape Loki itself emits forX-Query-Tags: Source=grafana-lokiexplore-apptraffic. Grafana renders a warning badge, not an error toast. Non-Grafana clients (curl, internal tooling) still see the real upstream error. Locked byTestLock_VLErrorsConvertedToPartialResults(5 subtests).
Reproduce: ./bench/drilldown-vs-loki.sh --ranges=30m,1h,2h,3h,6h,24h --queries=pod,trace_id,k8s_pod_name,service_version,detected_fields,labels. All numbers above are cold-path (cache miss, first request).