Benchmarks

Hardware: Apple M5 Pro, 18 cores, 64 GB RAM, macOS 26.4.1, Go 1.26.4 darwin/arm64, Docker Desktop 29.4.0 (17.3 GiB allocated to Docker).

Stack: Loki 3.6.x, VictoriaLogs v1.50.0, loki-vl-proxy latest. ~8 M log entries across 15 services, 7-day window.

VictoriaLogs flags: -defaultParallelReaders=8 -fs.maxConcurrency=64 -memory.allowedPercent=80 -search.maxConcurrentRequests=100 -search.maxQueueDuration=60s. See VictoriaLogs tuning for rationale.

Loki flags: querier.max_concurrent=16, max_query_parallelism=64, result + chunk caching enabled.

Label Metadata Performance

Label endpoints (/loki/api/v1/labels, /loki/api/v1/label/{name}/values) are the first calls Grafana makes when opening Explore or a dashboard. Their latency determines perceived "snappiness" and the completeness of the label picker.

How the proxy makes label fetches fast and accurate

Progressive two-stage fetch — when a request arrives for a wide time range (e.g. 7d) the proxy returns an initial response from a 1h VL scan immediately, then triggers a background goroutine that fetches the full user-requested range. The second request for the same window gets the complete historical label set from cache (sub-ms). This means:

First request: fast (~200 µs proxy overhead + one VL round-trip against 1h of data)
Second request: complete and fast (cache hit, sub-ms)

Labels do change over time (services added, removed, renamed across deployments), so the full-range background fetch ensures historical labels are not silently omitted.
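
The two-stage behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the proxy's actual code: `labelCache`, `fetchFunc`, and keying the cache by range in hours are all hypothetical simplifications, and concurrent misses for the same range would each start their own background fill here.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc stands in for a VictoriaLogs label query over the last `hours`
// hours. It is a hypothetical hook, not the proxy's real client.
type fetchFunc func(hours int) []string

// labelCache is a toy in-memory cache keyed by the requested range in hours.
type labelCache struct {
	mu      sync.Mutex
	entries map[int][]string
	wg      sync.WaitGroup // tracks background fills so callers can wait
}

func newLabelCache() *labelCache {
	return &labelCache{entries: make(map[int][]string)}
}

// labels serves from cache when it can. On a miss it answers from a 1 h scan
// and, for wider ranges, fills the cache with the full range in the background.
func (c *labelCache) labels(hours int, fetch fetchFunc) []string {
	c.mu.Lock()
	cached, ok := c.entries[hours]
	c.mu.Unlock()
	if ok {
		return cached
	}

	quick := fetch(1) // synchronous stage: always a 1 h scan
	if hours <= 1 {
		c.store(hours, quick) // the quick scan already covers the request
		return quick
	}

	c.wg.Add(1)
	go func() { // background stage: full user-requested range
		defer c.wg.Done()
		c.store(hours, fetch(hours))
	}()
	return quick
}

func (c *labelCache) store(hours int, labels []string) {
	c.mu.Lock()
	c.entries[hours] = labels
	c.mu.Unlock()
}

func main() {
	fetch := func(hours int) []string {
		if hours <= 1 {
			return []string{"app", "namespace"}
		}
		return []string{"app", "namespace", "legacy_service"}
	}
	c := newLabelCache()
	first := c.labels(168, fetch) // 7 d request, answered from 1 h of data
	c.wg.Wait()                   // only for the demo; a proxy would not block here
	second := c.labels(168, fetch)
	fmt.Println(len(first), len(second))
}
```

In the demo the first 7 d request sees only the labels present in the last hour, while the second sees the extra historical label once the background fill has landed.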

Time-bucketed cache keys — Grafana's time picker slides by seconds between dashboard refreshes. The proxy quantises start/end timestamps to fixed bucket boundaries before building the cache key:

User-selected interval	Bucket size	Effect
≤ 6 h	5 minutes	Refreshes every 30 s collapse to the same key
6 h – 48 h	1 hour	Intra-hour drift collapses to one entry
> 48 h	6 hours	7-day queries share one cache entry for up to 6 hours
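
A minimal sketch of the quantisation in the table above, assuming timestamps are handled as Unix seconds and are floored to the bucket boundary. The function names and the key layout are invented for the example and may not match the proxy's own.

```go
package main

import "fmt"

const (
	minute = int64(60)
	hour   = 60 * minute
)

// bucketSeconds maps the user-selected interval (in seconds) to a bucket
// size, following the table above.
func bucketSeconds(interval int64) int64 {
	switch {
	case interval <= 6*hour:
		return 5 * minute
	case interval <= 48*hour:
		return hour
	default:
		return 6 * hour
	}
}

// labelCacheKey floors start and end (Unix seconds) to the bucket boundary.
// The key layout is made up for this example.
func labelCacheKey(start, end int64) string {
	b := bucketSeconds(end - start)
	return fmt.Sprintf("labels:%d:%d", start-start%b, end-end%b)
}

func main() {
	// Two refreshes 30 s apart over a 1 h window produce the same key.
	fmt.Println(labelCacheKey(1_700_000_000, 1_700_003_600))
	fmt.Println(labelCacheKey(1_700_000_030, 1_700_003_630))
}
```

Both calls in `main` print `labels:1699999800:1700003400`, so the second refresh is a cache hit even though its raw timestamps differ.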

Startup warmup — on boot the proxy serves immediately from the disk-backed cache (fast start), then waits for VictoriaLogs to become healthy and refreshes any entries that are expired or close to expiry. The first dashboard load after a deployment is a cache hit.

Disk persistence — label cache entries are written to disk (SetLocalAndDiskWithTTL). On proxy restart the disk entries load immediately so there is no cold-start penalty even before VL warmup completes.

Periodic keep-warm loop — a background goroutine runs every 90 seconds and refreshes label cache entries for all four standard Grafana presets (Last 1h / 6h / 24h / 7d) before their 2-minute TTL expires. This keeps the cache hot even with no user queries.

Background stale refresh on hits — when a cached entry is served but has less than ~30% of its TTL remaining, the proxy automatically triggers a background full-range refresh so the next request sees fresher data.
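
The stale-refresh check can be illustrated with a small sketch. The 2-minute TTL and the ~30% threshold come from the text above; treating the threshold as exactly 0.30 and the function name `needsRefresh` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

const (
	labelTTL        = 2 * time.Minute // TTL quoted above
	refreshFraction = 0.30            // "~30%" in the text; exact value assumed
)

// needsRefresh reports whether a served entry is close enough to expiry that
// a background refresh should be started.
func needsRefresh(remaining, ttl time.Duration) bool {
	return float64(remaining) < refreshFraction*float64(ttl)
}

func main() {
	for _, age := range []time.Duration{30 * time.Second, 90 * time.Second} {
		remaining := labelTTL - age
		fmt.Printf("age=%s remaining=%s refresh=%v\n",
			age, remaining, needsRefresh(remaining, labelTTL))
	}
}
```

Under these assumptions the threshold is 36 s, so an entry that is 90 s old (30 s remaining) qualifies for a refresh while one that is 30 s old does not.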

Measured latency (proxy overhead against a local VL mock)

Time range	First request	Second request	VL scan on first
1 h	~200 µs	~5 µs (cache hit)	1 h
6 h	~200 µs	~5 µs (cache hit)	1 h (sync) → 6 h (background)
12 h	~200 µs	~5 µs (cache hit)	1 h (sync) → 12 h (background)
24 h	~200 µs	~5 µs (cache hit)	1 h (sync) → 24 h (background)
2 d	~200 µs	~5 µs (cache hit)	1 h (sync) → 2 d (background)
7 d	~200 µs	~5 µs (cache hit)	1 h (sync) → 7 d (background)

Numbers above are proxy-only overhead measured with a zero-latency in-process VL mock (go test -bench BenchmarkLabels_). In production, add your actual VL round-trip (~50–300 ms on first request; sub-ms on cache hit).

Against a real VictoriaLogs instance (manual measurement, 15 services, 8 M entries):

Time range	First request	Second request
1 h (pre-warmed at startup)	sub-ms (cache hit)	sub-ms
7 d (sync: 1h VL scan + background: 7d scan)	~300 ms	sub-ms with full historical data

Running the label perf tests

# Correctness tests (run in normal CI)
go test ./internal/proxy/ -run 'TestPerf_Labels_' -v

# Cold and warm benchmarks
go test ./internal/proxy/ -bench 'BenchmarkLabels_' -benchmem -count=3

Running Benchmarks

# Warm-cache run (standard — proxy cache pre-warmed at benchmark concurrency)
loki-bench \
  --proxy=http://localhost:3100 \
  --loki=http://localhost:3200 \
  --vl-direct=http://localhost:9428 \
  --workloads=small,heavy,long_range,compute \
  --clients=10,50,100 \
  --duration=30s \
  --warmup=5s \
  --jitter=2h

# Unique-windows run (cache + coalescer both defeated — raw proxy overhead)
loki-bench \
  --proxy=http://localhost:3100 \
  --loki=http://localhost:3200 \
  --vl-direct=http://localhost:9428 \
  --workloads=small,heavy,long_range,compute \
  --clients=10,50,100 \
  --duration=30s \
  --unique-windows

Workload definitions

Workload	What it covers
small	Label queries, series, `query_range` 1–5 min, instant queries — Grafana label browser / small panel refreshes
heavy	Complex LogQL with pipelines (`json`, `line_format`, `label_format`), filters, `label_filter` — content search and alerting queries
long_range	6 h–72 h `query_range`, `rate`/`count`/`bytes_rate` over long windows, metadata over long windows — Drilldown/Explore historical analysis
compute	Metric aggregations (`sum by`, `rate`, `count_over_time`, `quantile_over_time`, `topk`) — dashboard panels showing metrics derived from logs

Warm cache — production steady state

Proxy cache is pre-warmed at the same concurrency as the measurement. Repeated queries are served from L1 memory without touching VictoriaLogs. This is the operating mode for Grafana dashboards auto-refreshing every 30 s.

Throughput (req/s)

Workload	Concurrency	Loki	Proxy warm	VL native	Proxy / Loki
small	10	2,011	15,626	2,342	7.8×
small	50	2,297	24,958	4,484	10.9×
small	100	2,290	27,513	4,380	12.0×
heavy	10	407	5,944	1,064	14.6×
heavy	50	302	6,403	1,203	21.2×
heavy	100	162	7,134	1,245	44.1×
long_range	10	8	157	88	18.7×
long_range	50	11	201	87	18.0×
long_range	100	16	220	95	14.0×
compute	10	2,803	11,162	1,554	4.0×
compute	50	2,233	13,484	1,588	6.0×
compute	100	1,611	16,456	1,465	10.2×

P50 latency

Workload	Concurrency	Loki	Proxy warm	VL native
small	10	4 ms	587 µs	2 ms
small	50	20 ms	1 ms	5 ms
small	100	42 ms	3 ms	8 ms
heavy	10	4 ms	1 ms	5 ms
heavy	50	22 ms	6 ms	27 ms
heavy	100	3 ms†	12 ms	66 ms
long_range	10	481 ms	1 ms	102 ms
long_range	50	4,211 ms	1 ms	403 ms
long_range	100	4,902 ms	1 ms	771 ms
compute	10	1 ms	675 µs	6 ms
compute	50	6 ms	2 ms	24 ms
compute	100	4 ms	4 ms	57 ms

† Loki heavy c=100: P50=3 ms is misleading — Loki was saturated (P90=1,818 ms, P99=6,950 ms).

P90 latency

Workload	Concurrency	Loki	Proxy warm	VL native
small	10	10 ms	917 µs	8 ms
small	50	37 ms	2 ms	27 ms
small	100	71 ms	4 ms	61 ms
heavy	10	81 ms	3 ms	15 ms
heavy	50	471 ms	11 ms	59 ms
heavy	100	1,818 ms	18 ms	106 ms
long_range	10	3,872 ms	299 ms	202 ms
long_range	50	12,868 ms	1,491 ms	1,317 ms
long_range	100	70,700 ms	2,788 ms	2,386 ms
compute	10	11 ms	1 ms	9 ms
compute	50	71 ms	4 ms	61 ms
compute	100	243 ms	6 ms	107 ms

P99 latency

Workload	Concurrency	Loki	Proxy warm	VL native
small	10	18 ms	1 ms	29 ms
small	50	53 ms	4 ms	68 ms
small	100	92 ms	7 ms	175 ms
heavy	10	128 ms	6 ms	58 ms
heavy	50	981 ms	25 ms	261 ms
heavy	100	6,950 ms	45 ms	306 ms
long_range	10	4,923 ms	751 ms	252 ms
long_range	50	19,586 ms	3,189 ms	1,861 ms
long_range	100	89,306 ms	5,917 ms	3,325 ms
compute	10	21 ms	3 ms	14 ms
compute	50	145 ms	6 ms	127 ms
compute	100	521 ms	11 ms	207 ms

CPU consumed (cpu·s over 30 s window)

Workload	Concurrency	Loki	Proxy only	Proxy + VL	Ratio vs Loki
small	10	330.6	0.08	0.81	408× less
small	50	415.4	0.15	2.75	151× less
small	100	415.3	0.16	5.31	78× less
heavy	10	320.9	0.05	0.91	355× less
heavy	50	306.3	0.07	1.68	182× less
heavy	100	96.8	0.06	1.19	81× less
long_range	10	43.9	0.18	32.6	1.35× less
long_range	50	62.1	0.29	50.2	1.23× less
long_range	100	438.3	0.30	50.6	8.66× less
compute	10	315.5	0.08	10.4	30× less
compute	50	399.9	0.15	58.0	6.9× less
compute	100	379.7	0.18	63.7	6.0× less

The proxy process itself consumes negligible CPU — the gains come from VL being more efficient than Loki for the same queries, amplified by the cache eliminating most backend calls entirely.

RSS memory (MB, peak during 30 s window)

Workload	Concurrency	Loki	Proxy	Proxy + VL	Ratio vs Loki
small	10	1,910	484	726	2.6× less
small	50	2,159	454	800	2.7× less
small	100	2,215	429	782	2.8× less
heavy	10	2,082	418	640	3.3× less
heavy	50	2,269	364	584	3.9× less
heavy	100	1,650	371	636	2.6× less
long_range	10	1,957	1,072	1,373	1.4× less
long_range	50	1,737	768	1,754	~parity
long_range	100	2,004	1,252	2,082	~parity
compute	10	2,340	353	733	3.2× less
compute	50	2,437	362	766	3.2× less
compute	100	2,317	370	720	3.2× less

Long-range memory parity (c=50, c=100) reflects the GOMEMLIMIT=2 GiB fix applied before this run. Without the limit, proxy RSS reached 4,466 MB at c=100 for long-range; with it, VL can scan the same 7-day windows within a bounded footprint.

Cold cache, unique queries — honest worst case

Every worker gets a distinct non-overlapping time window. This defeats both the singleflight coalescer and the response cache. What remains is raw proxy overhead: LogQL→LogsQL translation (2.7–7.2 µs depending on complexity) + HTTP proxying + response shaping.

Throughput (req/s)

Workload	Concurrency	Loki	Proxy cold	VL native	Proxy / Loki
small	10	1,080	1,201	2,957	1.11×
small	50	1,369	1,343	3,637	0.98× (parity)
heavy	10	133	179	829	1.34×
heavy	50	193†	182	859	1.47× on delivered‡
long_range	10	9	19	82	2.06×
long_range	50	9	19	84	2.05×
long_range	100	13	24	85	1.86×
compute	10	2,281	352	1,462	0.15×
compute	50	1,633	336	1,455	0.21×
compute	100	899	366	1,431	0.41×

† Loki heavy c=50: 35.63% error rate — saturated under unique-window load. Successful throughput: ~124 req/s.
‡ 182 proxy req/s (0 errors) vs ~124 Loki successful req/s = 1.47× on delivered traffic.
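
The delivered-throughput figure in the footnotes is plain arithmetic, shown here as a short sketch. The helper name `delivered` is invented for the example. Note that the 1.47× in the table divides by the rounded ~124 req/s; with the unrounded value the ratio is closer to 1.465.

```go
package main

import "fmt"

// delivered returns successful requests per second given a raw rate and an
// error rate expressed as a fraction (0.3563 for 35.63%).
func delivered(reqPerSec, errorRate float64) float64 {
	return reqPerSec * (1 - errorRate)
}

func main() {
	loki := delivered(193, 0.3563) // Loki heavy c=50
	proxy := delivered(182, 0)     // proxy heavy c=50, zero errors
	fmt.Printf("Loki delivered: %.0f req/s\n", loki)
	fmt.Printf("Proxy / Loki on delivered traffic: %.3fx\n", proxy/loki)
}
```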

P50 latency (unique-windows)

Workload	Concurrency	Loki	Proxy cold	VL native
small	10	7 ms	4 ms	1 ms
small	50	30 ms	14 ms	5 ms
heavy	10	22 ms	21 ms	6 ms
heavy	50	5 ms†	128 ms	41 ms
long_range	10	464 ms	39 ms	99 ms
long_range	50	5,041 ms	2,060 ms	392 ms
long_range	100	4,276 ms	3,068 ms	826 ms
compute	10	1 ms	10 ms	6 ms
compute	50	4 ms	107 ms	28 ms
compute	100	6 ms	261 ms	63 ms

CPU consumed (unique-windows, cpu·s over 30 s)

Workload	Concurrency	Loki	Proxy only	Proxy + VL	Ratio vs Loki
small	100	412.1	0.21	212.8	1.9× less
heavy	10	251.9	0.23	230.4	1.1× less
heavy	50	258.4	0.30	257.7	~parity
heavy	100	308.2	0.29	251.3	1.2× less
long_range	10	61.8	0.04	170.2	0.36× (VL scans more in parallel)
long_range	50	72.9	0.04	188.2	0.39×
long_range	100	180.7	0.06	213.5	0.85×
compute	10	351.5	0.15	229.7	1.5× less
compute	50	377.8	0.20	274.7	1.4× less
compute	100	326.6	0.18	278.0	1.2× less

RSS memory (unique-windows, MB)

Workload	Concurrency	Loki	Proxy	Proxy + VL
small	100	2,380	633	1,204
heavy	10	2,044	886	1,236
heavy	50	2,214	910	1,284
heavy	100	2,456	979	1,386
long_range	10	2,174	880	1,754
long_range	50	1,923	768	1,754
long_range	100	2,004	1,252	2,082
compute	10	2,471	843	1,214
compute	50	2,463	821	1,356
compute	100	2,131	671	993

Cold overhead by workload type

What determines proxy performance when cache and coalescer provide no help:

Small (metadata): Proxy beats Loki at c=10 (1,201 vs 1,080 req/s, 1.11×) and reaches parity at c=50 (1,343 vs 1,369 req/s). VL native is ~2.7× faster than Loki (2,957 req/s at c=10); the proxy's extra HTTP hop and envelope conversion account for the gap between proxy and raw VL. To reach this cold-path parity, the windowing NDJSON parser was ported to fastjson, which eliminates the per-entry map[string]interface{} allocation. With any cache warmth, parity turns into a clear lead (12× at c=100 warm).

Heavy (pipeline queries): Proxy cold outperforms Loki at both measured concurrency levels. At c=10: 179 vs 133 req/s (1.34× faster). At c=50: Loki saturates with 35.63% errors (successful throughput ~124 req/s) while the proxy handles 182 req/s with zero errors — 1.47× more successful traffic delivered. The fastjson NDJSON parser (no map[string]interface{} per entry) and background pattern autodetect (offloaded from the request critical path) were the key cold-path improvements; total proxy CPU dropped 22.8%. VL native remains ~4–5× faster than the proxy; the remaining gap is the network round-trip for each sub-window request.

Long-range (6 h–72 h windows): Proxy is 1.86–2.06× faster than Loki even cold. VL's parallel window fetching within the proxy — splitting long ranges into parallel 1 h sub-windows — completes before Loki can scan its chunk store sequentially. This advantage is structural and does not require cache.
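
A rough sketch of the sub-window splitting described above, assuming half-open windows in Unix seconds and one goroutine per window. `window`, `splitWindows`, and `fetchParallel` are hypothetical names; the real proxy may align windows differently and very likely bounds its concurrency.

```go
package main

import (
	"fmt"
	"sync"
)

// window is a half-open time range [start, end) in Unix seconds.
type window struct{ start, end int64 }

// splitWindows cuts [start, end) into consecutive windows of at most step
// seconds. The last window is shortened to fit.
func splitWindows(start, end, step int64) []window {
	if step <= 0 || end <= start {
		return nil
	}
	var out []window
	for s := start; s < end; s += step {
		e := s + step
		if e > end {
			e = end
		}
		out = append(out, window{s, e})
	}
	return out
}

// fetchParallel runs fetch for every window concurrently and returns the
// results in window order.
func fetchParallel(ws []window, fetch func(window) int) []int {
	results := make([]int, len(ws))
	var wg sync.WaitGroup
	for i, w := range ws {
		wg.Add(1)
		go func(i int, w window) {
			defer wg.Done()
			results[i] = fetch(w) // each goroutine writes its own slot
		}(i, w)
	}
	wg.Wait()
	return results
}

func main() {
	ws := splitWindows(0, 6*3600, 3600) // a 6 h range in 1 h sub-windows
	counts := fetchParallel(ws, func(w window) int {
		return int(w.end-w.start) / 60 // stand-in for a per-window result
	})
	total := 0
	for _, c := range counts {
		total += c
	}
	fmt.Println(len(ws), total)
}
```

Because every sub-window is independent, wall-clock time is bounded by the slowest window rather than the sum of all windows, which is the structural advantage the paragraph refers to.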

Compute (metric aggregations): The stats_query_range fast path routes sum by (...) (count_over_time/rate({...}[W])) and bytes_over_time/bytes_rate queries directly to VL's pre-aggregated Prometheus buckets, eliminating the raw NDJSON log scan that was 39% CPU in cold pprof. This delivered the headline improvement in this PR: heavy cold throughput 44→126 req/s (c=10, +2.9×) and 33→139 req/s (c=100, +4.2×). A follow-up round of pprof-guided allocation fixes (pooled fastjson scratch buffers, zero-alloc label map serialization, pre-computed stream keys, direct byte-building replacing json.Marshal reflection) eliminated ~20 GB of per-request allocations and raised cold rate/topk throughput from ~40 req/s to 210 req/s (+5.25× on the loki-bench compute workload). For complex aggregations without a VL-native equivalent (quantile_over_time, topk, sum by with pipeline stages), the proxy still decomposes the query into N parallel sub-window fetches and aggregates locally. With warm cache (24 h TTL on historical windows), all compute queries hit cache on repeat and the structural overhead disappears.

AST-typed translation (v1.35.0+): The LogQL→LogsQL translation layer was migrated from fragile fmt.Sprintf string assembly to a typed logsql.PipeStats/PipeMath/PipeFilter AST. This does not change throughput (translation is 2.7–7.2 µs vs 100–500 ms VL query time), but restores correctness for binary metric queries (sum(rate(...)) / sum(rate(...)), rate(...) * 100, sum(...) + sum(...)) that were previously silently erroring because the generated | math alias:=expr form was rejected by VictoriaLogs (which expects | math expr as alias). These queries now work and are included in the compute and heavy workload numbers above.
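
To illustrate why a typed pipe helps, here is a minimal sketch of a math pipe that can only render the `math expr as alias` form. The `pipeMath` struct and `render` helper are hypothetical stand-ins; the fields of the proxy's real `logsql.PipeMath` are not shown in this document and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// pipeMath is a hypothetical stand-in for a typed math pipe.
type pipeMath struct {
	Expr  string // e.g. "errors / total * 100"
	Alias string // result field name
}

// String renders the form VictoriaLogs accepts: math <expr> as <alias>.
func (p pipeMath) String() string {
	return "math " + p.Expr + " as " + p.Alias
}

// render joins a filter and its pipes with the LogsQL pipe separator.
func render(filter string, pipes ...fmt.Stringer) string {
	parts := []string{filter}
	for _, p := range pipes {
		parts = append(parts, p.String())
	}
	return strings.Join(parts, " | ")
}

func main() {
	q := render(`{namespace="prod"}`, pipeMath{Expr: "errors / total * 100", Alias: "error_pct"})
	fmt.Println(q)
}
```

With the serialization living in one `String()` method, a malformed `alias:=expr` string can no longer be assembled by accident at individual call sites.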

Drilldown / Explore — real Grafana queries vs Loki direct

The numbers above came from synthetic benchmark workloads. This section replays the exact queries Grafana Drilldown and Explore emit (sum by (pod) (count_over_time({namespace="prod",pod!=""}[2m])), trace_id variants, service_version, detected_fields) at the time ranges that exercise the chunked-merge path (1 h / 6 h / 24 h / 2 d / 7 d), measured against the live e2e-compat compose stack: Loki 3.6 with max_query_series=1M, 8 GiB heap, result + chunk caching enabled; proxy fronted by vmauth (the same path Grafana actually uses).

Methodology: cold = first request (cache miss), warm = same request replayed 200 ms later. cold_status=500 / blank means Loki returned too_many_series or query-timeout. Run with ./bench/drilldown-vs-loki.sh.

`sum by (pod) (count_over_time({namespace="prod",pod!=""}[2m]))`

Range	Loki cold	Loki warm	Loki status	Proxy cold	Proxy warm	Proxy series
1 h	1 360 ms	13 ms	timeout / no body	239 ms	20 ms	5 000
6 h	27 509 ms	1 240 ms	500 (too_many_series)	901 ms	14 ms	4 989
24 h	26 870 ms	2 312 ms	500 (too_many_series)	535 ms	13 ms	16 (`/hits` top-N)
2 d	21 132 ms	9 ms	timeout / no body	1 027 ms	10 ms	16
7 d	22 727 ms	12 ms	timeout / no body	3 844 ms	12 ms	16

Cold-path speedup: 6× (1 h) → 50× (24 h) → 6× (7 d). At 6 h+ Loki returns HTTP 500 because the high-cardinality pod!="" selector blows the series cap; the proxy routes those through /select/logsql/hits and returns a real top-N chart.

`sum by (trace_id) (count_over_time({namespace="prod"}|json|drop __error__,__error_details__|trace_id!=""[2m]))`

Range	Loki cold	Proxy cold	Proxy series
1 h	26 106 ms (timeout)	523 ms	4 992
6 h	20 568 ms (timeout)	2 883 ms ⚠️	0 (502 — VL cap exceeded on parser-direct path; tracked below)
24 h	22 100 ms (500)	1 260 ms	8
2 d	29 408 ms (500)	3 598 ms	8
7 d	26 451 ms (500)	4 483 ms ⚠️	0 (same 502 path)

trace_id parser-direct at 6 h / 7 d returns 502 because the per-trace cardinality on this dataset exceeds the VL parser-pipe limit. This is a documented known limit of the |json|trace_id!="" parser-direct shape: in the dataset used here each trace_id appears in only one log line, so the parser exhausts VL's row-scan budget. Queries that route through /hits (the default for all label fields and most parser fields) return top-N successfully. Loki returns either timeout or too_many_series for every range.

`sum by (service_version) (count_over_time({namespace="prod",service_version!=""}[2m]))`

Range	Loki cold	Loki series	Proxy cold	Proxy series
1 h	25 723 ms (500)	-	61 ms	100
6 h	28 824 ms (500)	-	205 ms	100
24 h	24 ms (200)	0 (empty result)	202 ms	14
2 d	224 ms (200)	0 (empty result)	315 ms	15
7 d	43 ms (200)	0 (empty result)	1 381 ms	14

At 24 h+ Loki returns HTTP 200 with empty data for service_version — silently — because the cardinality reduces below its sample threshold and Loki gives up before stream selection completes. The proxy returns the real top-N every time.

`/loki/api/v1/detected_fields`

Range	Loki cold	Loki body	Proxy cold	Proxy body
1 h	3 035 ms	3 885 B (empty fields)	58 ms	7 772 B (full field set)
6 h	8 ms	no body / timeout	148 ms	8 178 B
24 h	26 981 ms	500 (too_many_series)	249 ms	8 088 B

detected_fields is the call Drilldown makes first to populate the field picker — it determines whether the panel even renders. On Loki at 24 h+ it 500s; on the proxy it returns the full OTel field map in 250 ms.

Container resource consumption (peak during bench run)

Captured with docker stats e2e-loki e2e-victorialogs e2e-proxy e2e-proxy-vmauth --no-trunc for the duration of the bench. Loki was given 8 GiB heap, 16 CPUs, and result + chunk caching; VL + proxy together ran in 4 GiB / 8 CPUs.

Container	Peak CPU	Peak RSS	Outcome
`e2e-loki`	1 580% (15.8 cores)	7.9 GiB / 8 GiB	OOM-near; 500s on every 6 h+ pod query, 500 on 24 h `detected_fields`
`e2e-victorialogs`	410% (4.1 cores)	1.4 GiB / 4 GiB	Steady; no errors
`e2e-proxy`	90% (0.9 cores)	180 MiB	Steady; cache hit rate 78% by end of run
`e2e-proxy-vmauth`	22% (0.2 cores)	35 MiB	Steady

What the numbers mean. Loki consumed roughly 18× the CPU and 44× the RSS of the proxy on the same workload while returning errors or empty results for the majority of Drilldown / Explore queries. The proxy + VL stack together used ~5 cores and 1.6 GiB to serve the entire query set with full results. This is the gap that motivated the project: Drilldown's call pattern (Grafana 24 h querySplitting, high-cardinality field!="" selectors, parser-heavy queries) is essentially incompatible with Loki's stream-store model at production volume, and the proxy bridges it by routing through VL's columnar storage.

For continuous tracking, run bench/drilldown-vs-loki.sh against any stack and diff successive runs — output is TSV so it diffs cleanly.

Short-range fairness re-run (30 m – 24 h, Loki at 12 GiB)

The 1 h – 7 d numbers above were captured with Loki at the historical 8 GiB container limit. On a 7-day dataset (6.7 GiB of stored bigParts) that proved too small: Loki OOM-looped continuously (35 restarts in one session), so cold latencies were dominated by recovery time rather than query work. We re-ran the bench at 30 m – 24 h ranges with Loki bumped to 12 GiB container, GOMEMLIMIT=10GiB, GOGC=80 so the comparison reflects Loki doing real work, not Loki recovering from kernel SIGKILL.

Outcome distribution (36 cold queries: 6 shapes × 6 ranges)

Outcome	Loki	Proxy
200 OK with data	9	35
200 OK but empty result (silent fail)	13	0
HTTP 500 / 502	9	1 (known VL parser-pipe limit on `trace_id` 6 h)
Timeout	5	0

Where both succeed with real data (cold path)

Query	Range	Loki	Proxy	Speedup
`sum by (pod)`	30 m	6 302 ms (9 274 series)	436 ms (5 000 series via /hits)	14.5×
`sum by (pod)`	1 h	5 412 ms (17 342 series)	223 ms (5 000)	24.3×
`sum by (pod)`	2 h	3 502 ms (35 388 series)	354 ms (5 000)	9.9×
`/loki/api/v1/labels`	30 m – 24 h	9 – 27 ms	8 – 31 ms	parity

Where only the proxy returns usable data

Query	Range	Loki outcome	Proxy cold
`sum by (pod)`	3 h	timeout (60 s)	578 ms (4 997 series)
`sum by (pod)`	6 h	HTTP 500	22 075 ms (4 986 series) ⚠️
`sum by (pod)`	24 h	HTTP 500	549 ms (16 series, /hits top-N)
`sum by (trace_id)`	30 m – 24 h	HTTP 500 every range	406 – 1 232 ms
`sum by (trace_id)`	3 h	timeout (60 s)	44 670 ms (4 996 series) ⚠️
`sum by (service_version)`	1 h – 24 h	200 empty (silent)	76 – 240 ms (100 versions)
`sum by (service_version)`	30 m / 6 h	HTTP 500	71 / 244 ms
`sum by (k8s_pod_name)`	30 m – 24 h	200 empty every range	116 – 257 ms
`/loki/api/v1/detected_fields`	1 h	timeout	36 ms
`/loki/api/v1/detected_fields`	2 h – 24 h	200 empty	54 – 323 ms

Two notable proxy outliers (the ⚠️ rows above):

pod 6 h took 22 s. The bench alternates Loki and proxy calls; that 22 s landed in a window where Loki was using about 14 cores and 10 GiB, and the host (17 GiB Docker Desktop allocation, several other containers running) couldn't get cycles to the proxy. Outside that contention window the proxy's 6 h pod query is sub-second.
trace_id 3 h took 44 s. Same root cause: Loki was thrashing during that exact wall-clock window. The proxy path for this query is structurally a /hits call, which normally takes 0.5 – 2 s for ~5 000 unique trace_ids.

Both are artifacts of running the two targets back-to-back on a constrained host. In a proxy-only run (no Loki competing for CPU) we would expect them to fall back to their usual range: sub-second for pod, 0.5 – 2 s for trace_id. They're called out here because we don't want to silently filter outliers; the bench script captures everything.

Container resources (14 min fair-bench window)

Container	Peak CPU	Peak RSS	Avg CPU	Notes
`e2e-loki`	1 444 % (≈ 14 cores)	10 445 MiB	258 %	within 12 GiB, no OOM
`e2e-victorialogs`	699 % (≈ 7 cores)	2 129 MiB	11 %	steady
`e2e-proxy`	14 %	44 MiB	0.5 %	negligible
`e2e-proxy-vmauth`	90 %	2 254 MiB	1.4 %	cache layer

To serve its 9 / 36 successful queries, Loki used ≈ 14 cores and 10 GiB. The proxy + VL together served 35 / 36 with ≈ 7 cores and 2.2 GiB peak: roughly half the CPU and one-fifth the RAM for four times the successful query coverage.

What Loki tuning we tried and what it didn't fix

To rule out config gaps before publishing these numbers, we tested raising max_query_series to 1 M, cardinality_limit to 1 M, max_query_parallelism to 256, tsdb_max_query_parallelism to 512, per-shard byte cap to 300 MB, chunk + result cache sizes to 2 GiB / 1 GiB, query_timeout to 10 m, and Loki container memory up to 24 GiB with GOMEMLIMIT=20GiB. None of those changes converted any silent-empty or HTTP 500 outcome into a successful response. The bottleneck is algorithmic: sum by (FIELD) (count_over_time({...}[w])) requires Loki to materialize one Prometheus series per unique field value per step bucket; with pod!="" on a 5 000-pod namespace over a 24 h window that working set blows past max_query_series (or memory) before the chunk store finishes reading. There is no Loki config that changes this — the proxy bypasses it by routing through VL's columnar /select/logsql/hits which computes top-N server-side without per-stream materialization.

This is the gap that motivates the project: Drilldown's call pattern is structurally incompatible with Loki's read path at production cardinality, and the proxy bridges it. The numbers above are what you get when you give Loki adequate RAM and still ask it to do something it cannot.

VictoriaLogs tuning

Long-range columnar scans are I/O-bound and goroutine-heavy. The default -defaultParallelReaders=2×CPU (36 on 18 cores) creates 3,600 goroutines at c=100, causing context-switch overhead that degrades throughput for all workloads. Reducing to 8 cuts goroutines to ~800 and improves small query throughput dramatically without harming long-range scan capacity.

Flag	Value	Effect
`-defaultParallelReaders`	`8`	Critical: limit goroutines per query
`-fs.maxConcurrency`	`64`	Cap concurrent file ops
`-memory.allowedPercent`	`80`	Increase block cache budget (default 60%)
`-search.maxConcurrentRequests`	`100`	Allow high bench concurrency
`-search.maxQueueDuration`	`60s`	Queue rather than reject excess requests
`-search.maxQueryDuration`	`60s`	Cancel queries that run longer than 60 s (bounds runaway scans)
`-blockcache.missesBeforeCaching`	`1`	Cache from first miss (default 2)
`-internStringCacheExpireDuration`	`15m`	Reduce GC pressure on label intern cache

These flags are already applied in test/e2e-compat/docker-compose.yml. In production, the proxy cache further reduces effective VL concurrency — only cache-miss requests reach VL, so real VL concurrency is far lower than the client-facing rate.
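
The goroutine counts quoted for `-defaultParallelReaders` follow from a simple model: readers per query multiplied by concurrent queries. The sketch below reproduces that arithmetic only; the helper name is invented and the model ignores any other goroutines VL runs.

```go
package main

import "fmt"

// scanGoroutines estimates reader goroutines as readers per query times
// concurrent queries, the model used in the tuning rationale above.
func scanGoroutines(readersPerQuery, concurrentQueries int) int {
	return readersPerQuery * concurrentQueries
}

func main() {
	const cores, clients = 18, 100
	fmt.Println(scanGoroutines(2*cores, clients)) // default: 2×CPU readers
	fmt.Println(scanGoroutines(8, clients))       // tuned: 8 readers
}
```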

Per-request proxy overhead (microbenchmarks)

Operation	Latency	Allocs	Bytes/op
Labels (cache hit)	2.0 µs	25	6.6 KB
QueryRange (cache hit)	118 µs	600	142 KB
LogQL→LogsQL translation (selector)	2.7 µs	18	836 B
LogQL→LogsQL translation (rate/sum)	4.9 µs	48	2.1 KB
LogQL→LogsQL translation (binary rate/rate)	7.2 µs	76	3.7 KB
LogQL→LogsQL translation (ip() filter, v1.45+)	4.4 µs	36	1.4 KB
`PipeMath.String()` (AST serialization)	55 ns	3	80 B
`PipeStats.String()` (AST serialization)	70 ns	5	136 B
VL NDJSON → Loki streams (100 lines)	170 µs	3,118	70 KB
wrapAsLokiResponse	2.8 µs	58	2.6 KB

Translation overhead is 2.7–7.2 µs depending on query complexity. For a typical heavy-workload query taking 100–500 ms against VL, translation is at most about 0.007% of wall-clock time (7.2 µs / 100 ms ≈ 0.0072%). The AST-typed path (PipeMath, PipeStats) adds no overhead versus the previous string-concat approach.

# Run microbenchmarks
go test ./internal/proxy/ -bench . -benchmem -run "^$" -count=3
go test ./internal/translator/ -bench . -benchmem -run "^$" -count=3
go test ./internal/cache/ -bench . -benchmem -run "^$" -count=3
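
The share of wall-clock time spent in translation can be checked with a few lines. This sketch uses the slowest translation from the table (7.2 µs) against the 100 ms and 500 ms ends of the quoted VL query range; `sharePercent` is a name made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// sharePercent returns part as a percentage of total.
func sharePercent(part, total time.Duration) float64 {
	return 100 * float64(part) / float64(total)
}

func main() {
	translation := 7200 * time.Nanosecond // 7.2 µs, slowest translation measured
	for _, q := range []time.Duration{100 * time.Millisecond, 500 * time.Millisecond} {
		fmt.Printf("%s query: %.4f%%\n", q, sharePercent(translation, q))
	}
}
```

This works out to about 0.0072% for a 100 ms query and about 0.0014% for a 500 ms query.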

Measuring eviction pressure

# Eviction rate — non-zero means L1 is too small
rate(loki_vl_proxy_cache_evictions_total[5m])

# Hit rate — below 50% means cache too small or workload not cacheable
rate(loki_vl_proxy_cache_hits_total[5m])
/
(rate(loki_vl_proxy_cache_hits_total[5m]) + rate(loki_vl_proxy_cache_misses_total[5m]))

# Per-replica cache RSS — should stay below -cache-max
process_resident_memory_bytes{job="loki-vl-proxy"}

Rule of thumb: L1 size = (unique active queries per hour) × (average response size).
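
As a worked example of the rule of thumb, the sketch below multiplies a query count by an average response size. The inputs (2,000 unique queries per hour, 150 KiB average response) are made-up illustration values, not measurements from this benchmark.

```go
package main

import "fmt"

// l1SizeBytes applies the rule of thumb:
// unique active queries per hour × average response size.
func l1SizeBytes(uniqueQueriesPerHour, avgResponseBytes int64) int64 {
	return uniqueQueriesPerHour * avgResponseBytes
}

func main() {
	size := l1SizeBytes(2000, 150*1024) // hypothetical workload
	fmt.Printf("%.0f MiB\n", float64(size)/(1024*1024))
}
```

For these inputs the estimate is roughly 293 MiB; if the eviction-rate query above stays non-zero at that size, the estimate was too low for the workload.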
