Benchmarks

Hardware: Apple M5 Pro, 18 cores, 64 GB RAM, macOS 26.4.1, Go 1.26.4 darwin/arm64, Docker Desktop 29.4.0 (17.3 GiB allocated to Docker).

Stack: Loki 3.6.x, VictoriaLogs v1.50.0, loki-vl-proxy latest. ~8 M log entries across 15 services, 7-day window.

VictoriaLogs flags: -defaultParallelReaders=8 -fs.maxConcurrency=64 -memory.allowedPercent=80 -search.maxConcurrentRequests=100 -search.maxQueueDuration=60s. See VictoriaLogs tuning for rationale.

Loki flags: querier.max_concurrent=16, max_query_parallelism=64, result + chunk caching enabled.

Label Metadata Performance

Label endpoints (/loki/api/v1/labels, /loki/api/v1/label/{name}/values) are the first calls Grafana makes when opening Explore or a dashboard. Their latency determines perceived "snappiness" and the completeness of the label picker.

How the proxy makes label fetches fast and accurate

Progressive two-stage fetch: when a request arrives for a wide time range (e.g. 7d), the proxy immediately returns an initial response from a 1h VL scan, then starts a background goroutine that fetches the full user-requested range. Once that background fetch has completed, the next request for the same window gets the complete historical label set from cache (sub-ms). This means:

  • First request: fast (~200 µs proxy overhead + one VL round-trip against 1h of data)
  • Second request: complete and fast (cache hit, sub-ms)

Labels do change over time (services added, removed, renamed across deployments), so the full-range background fetch ensures historical labels are not silently omitted.

Time-bucketed cache keys — Grafana's time picker slides by seconds between dashboard refreshes. The proxy quantises start/end timestamps to fixed bucket boundaries before building the cache key:

| User-selected interval | Bucket size | Effect |
| --- | --- | --- |
| ≤ 6 h | 5 minutes | Refreshes every 30 s collapse to the same key |
| 6 h – 48 h | 1 hour | Intra-hour drift collapses to one entry |
| > 48 h | 6 hours | 7-day queries share one cache entry all day |

Startup warmup — on boot the proxy serves immediately from the disk-backed cache (fast start), then waits for VictoriaLogs to become healthy and refreshes any entries that are expired or close to expiry. The first dashboard load after a deployment is a cache hit.

Disk persistence — label cache entries are written to disk (SetLocalAndDiskWithTTL). On proxy restart the disk entries load immediately so there is no cold-start penalty even before VL warmup completes.

Periodic keep-warm loop — a background goroutine runs every 90 seconds and refreshes label cache entries for all four standard Grafana presets (Last 1h / 6h / 24h / 7d) before their 2-minute TTL expires. This keeps the cache hot even with no user queries.

Background stale refresh on hits — when a cached entry is served but has less than ~30% of its TTL remaining, the proxy automatically triggers a background full-range refresh so the next request sees fresher data.

Measured latency (proxy overhead against a local VL mock)

| Time range | First request | Second request | VL scan on first |
| --- | --- | --- | --- |
| 1 h | ~200 µs | ~5 µs (cache hit) | 1 h |
| 6 h | ~200 µs | ~5 µs (cache hit) | 1 h (sync) → 6 h (background) |
| 12 h | ~200 µs | ~5 µs (cache hit) | 1 h (sync) → 12 h (background) |
| 24 h | ~200 µs | ~5 µs (cache hit) | 1 h (sync) → 24 h (background) |
| 2 d | ~200 µs | ~5 µs (cache hit) | 1 h (sync) → 2 d (background) |
| 7 d | ~200 µs | ~5 µs (cache hit) | 1 h (sync) → 7 d (background) |

Numbers above are proxy-only overhead measured with a zero-latency in-process VL mock (go test -bench BenchmarkLabels_). In production, add your actual VL round-trip (~50–300 ms on first request; sub-ms on cache hit).

Against a real VictoriaLogs instance (manual measurement, 15 services, 8 M entries):

| Time range | First request | Second request |
| --- | --- | --- |
| 1 h (pre-warmed at startup) | sub-ms (cache hit) | sub-ms |
| 7 d (sync: 1h VL scan + background: 7d scan) | ~300 ms | sub-ms with full historical data |

Running the label perf tests

# Correctness tests (run in normal CI)
go test ./internal/proxy/ -run 'TestPerf_Labels_' -v

# Cold and warm benchmarks
go test ./internal/proxy/ -bench 'BenchmarkLabels_' -benchmem -count=3

Running Benchmarks

# Warm-cache run (standard — proxy cache pre-warmed at benchmark concurrency)
loki-bench \
--proxy=http://localhost:3100 \
--loki=http://localhost:3200 \
--vl-direct=http://localhost:9428 \
--workloads=small,heavy,long_range,compute \
--clients=10,50,100 \
--duration=30s \
--warmup=5s \
--jitter=2h

# Unique-windows run (cache + coalescer both defeated — raw proxy overhead)
loki-bench \
--proxy=http://localhost:3100 \
--loki=http://localhost:3200 \
--vl-direct=http://localhost:9428 \
--workloads=small,heavy,long_range,compute \
--clients=10,50,100 \
--duration=30s \
--unique-windows

Workload definitions

| Workload | What it covers |
| --- | --- |
| small | Label queries, series, query_range 1–5 min, instant queries (Grafana label browser / small panel refreshes) |
| heavy | Complex LogQL with pipelines (json, line_format, label_format), filters, label_filter (content search and alerting queries) |
| long_range | 6 h–72 h query_range, rate/count/bytes_rate over long windows, metadata over long windows (Drilldown/Explore historical analysis) |
| compute | Metric aggregations: sum by, rate, count_over_time, quantile_over_time, topk (dashboard panels showing metrics derived from logs) |

Warm cache — production steady state

Proxy cache is pre-warmed at the same concurrency as the measurement. Repeated queries are served from L1 memory without touching VictoriaLogs. This is the operating mode for Grafana dashboards auto-refreshing every 30 s.

Throughput (req/s)

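| Workload | Concurrency | Loki | Proxy warm | VL native | Proxy / Loki |
| --- | --- | --- | --- | --- | --- |
| small | 10 | 2,011 | 15,626 | 2,342 | 7.8× |
| small | 50 | 2,297 | 24,958 | 4,484 | 10.9× |
| small | 100 | 2,290 | 27,513 | 4,380 | 12.0× |
| heavy | 10 | 407 | 5,944 | 1,064 | 14.6× |
| heavy | 50 | 302 | 6,403 | 1,203 | 21.2× |
| heavy | 100 | 162 | 7,134 | 1,245 | 44.1× |
| long_range | 10 | 8 | 157 | 88 | 18.7× |
| long_range | 50 | 11 | 201 | 87 | 18.0× |
| long_range | 100 | 16 | 220 | 95 | 14.0× |
| compute | 10 | 2,803 | 11,162 | 1,554 | 4.0× |
| compute | 50 | 2,233 | 13,484 | 1,588 | 6.0× |
| compute | 100 | 1,611 | 16,456 | 1,465 | 10.2× |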

P50 latency

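| Workload | Concurrency | Loki | Proxy warm | VL native |
| --- | --- | --- | --- | --- |
| small | 10 | 4 ms | 587 µs | 2 ms |
| small | 50 | 20 ms | 1 ms | 5 ms |
| small | 100 | 42 ms | 3 ms | 8 ms |
| heavy | 10 | 4 ms | 1 ms | 5 ms |
| heavy | 50 | 22 ms | 6 ms | 27 ms |
| heavy | 100 | 3 ms† | 12 ms | 66 ms |
| long_range | 10 | 481 ms | 1 ms | 102 ms |
| long_range | 50 | 4,211 ms | 1 ms | 403 ms |
| long_range | 100 | 4,902 ms | 1 ms | 771 ms |
| compute | 10 | 1 ms | 675 µs | 6 ms |
| compute | 50 | 6 ms | 2 ms | 24 ms |
| compute | 100 | 4 ms | 4 ms | 57 ms |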

† Loki heavy c=100: P50=3 ms is misleading — Loki was saturated (P90=1,818 ms, P99=6,950 ms).

P90 latency

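| Workload | Concurrency | Loki | Proxy warm | VL native |
| --- | --- | --- | --- | --- |
| small | 10 | 10 ms | 917 µs | 8 ms |
| small | 50 | 37 ms | 2 ms | 27 ms |
| small | 100 | 71 ms | 4 ms | 61 ms |
| heavy | 10 | 81 ms | 3 ms | 15 ms |
| heavy | 50 | 471 ms | 11 ms | 59 ms |
| heavy | 100 | 1,818 ms | 18 ms | 106 ms |
| long_range | 10 | 3,872 ms | 299 ms | 202 ms |
| long_range | 50 | 12,868 ms | 1,491 ms | 1,317 ms |
| long_range | 100 | 70,700 ms | 2,788 ms | 2,386 ms |
| compute | 10 | 11 ms | 1 ms | 9 ms |
| compute | 50 | 71 ms | 4 ms | 61 ms |
| compute | 100 | 243 ms | 6 ms | 107 ms |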

P99 latency

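| Workload | Concurrency | Loki | Proxy warm | VL native |
| --- | --- | --- | --- | --- |
| small | 10 | 18 ms | 1 ms | 29 ms |
| small | 50 | 53 ms | 4 ms | 68 ms |
| small | 100 | 92 ms | 7 ms | 175 ms |
| heavy | 10 | 128 ms | 6 ms | 58 ms |
| heavy | 50 | 981 ms | 25 ms | 261 ms |
| heavy | 100 | 6,950 ms | 45 ms | 306 ms |
| long_range | 10 | 4,923 ms | 751 ms | 252 ms |
| long_range | 50 | 19,586 ms | 3,189 ms | 1,861 ms |
| long_range | 100 | 89,306 ms | 5,917 ms | 3,325 ms |
| compute | 10 | 21 ms | 3 ms | 14 ms |
| compute | 50 | 145 ms | 6 ms | 127 ms |
| compute | 100 | 521 ms | 11 ms | 207 ms |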

CPU consumed (cpu·s over 30 s window)

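| Workload | Concurrency | Loki | Proxy only | Proxy + VL | Ratio vs Loki |
| --- | --- | --- | --- | --- | --- |
| small | 10 | 330.6 | 0.08 | 0.81 | 408× less |
| small | 50 | 415.4 | 0.15 | 2.75 | 151× less |
| small | 100 | 415.3 | 0.16 | 5.31 | 78× less |
| heavy | 10 | 320.9 | 0.05 | 0.91 | 355× less |
| heavy | 50 | 306.3 | 0.07 | 1.68 | 182× less |
| heavy | 100 | 96.8 | 0.06 | 1.19 | 81× less |
| long_range | 10 | 43.9 | 0.18 | 32.6 | 1.35× less |
| long_range | 50 | 62.1 | 0.29 | 50.2 | 1.23× less |
| long_range | 100 | 438.3 | 0.30 | 50.6 | 8.66× less |
| compute | 10 | 315.5 | 0.08 | 10.4 | 30× less |
| compute | 50 | 399.9 | 0.15 | 58.0 | 6.9× less |
| compute | 100 | 379.7 | 0.18 | 63.7 | 6.0× less |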

The proxy process itself consumes negligible CPU — the gains come from VL being more efficient than Loki for the same queries, amplified by the cache eliminating most backend calls entirely.

RSS memory (MB, peak during 30 s window)

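| Workload | Concurrency | Loki | Proxy | Proxy + VL | Ratio vs Loki |
| --- | --- | --- | --- | --- | --- |
| small | 10 | 1,910 | 484 | 726 | 2.6× less |
| small | 50 | 2,159 | 454 | 800 | 2.7× less |
| small | 100 | 2,215 | 429 | 782 | 2.8× less |
| heavy | 10 | 2,082 | 418 | 640 | 3.3× less |
| heavy | 50 | 2,269 | 364 | 584 | 3.9× less |
| heavy | 100 | 1,650 | 371 | 636 | 2.6× less |
| long_range | 10 | 1,957 | 1,072 | 1,373 | 1.4× less |
| long_range | 50 | 1,737 | 768 | 1,754 | ~parity |
| long_range | 100 | 2,004 | 1,252 | 2,082 | ~parity |
| compute | 10 | 2,340 | 353 | 733 | 3.2× less |
| compute | 50 | 2,437 | 362 | 766 | 3.2× less |
| compute | 100 | 2,317 | 370 | 720 | 3.2× less |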

Long-range memory parity (c=50, c=100) reflects the GOMEMLIMIT=2GiB fix applied before this run. Without the limit, proxy RSS reached 4,466 MB at c=100 for long-range; with it, VL can scan the same 7-day windows within a bounded footprint.


Cold cache, unique queries — honest worst case

Every worker gets a distinct non-overlapping time window. This defeats both the singleflight coalescer and the response cache. What remains is raw proxy overhead: LogQL→LogsQL translation (2.7–7.2 µs depending on complexity) + HTTP proxying + response shaping.

Throughput (req/s)

| Workload | Concurrency | Loki | Proxy cold | VL native | Proxy / Loki |
| --- | --- | --- | --- | --- | --- |
| small | 10 | 1,080 | 1,201 | 2,957 | 1.11× |
| small | 50 | 1,369 | 1,343 | 3,637 | 0.98× (parity) |
| heavy | 10 | 133 | 179 | 829 | 1.34× |
| heavy | 50 | 193† | 182 | 859 | 1.47× on delivered‡ |
| long_range | 10 | 9 | 19 | 82 | 2.06× |
| long_range | 50 | 9 | 19 | 84 | 2.05× |
| long_range | 100 | 13 | 24 | 85 | 1.86× |
| compute | 10 | 2,281 | 352 | 1,462 | 0.15× |
| compute | 50 | 1,633 | 336 | 1,455 | 0.21× |
| compute | 100 | 899 | 366 | 1,431 | 0.41× |

† Loki heavy c=50: 35.63% error rate (saturated under unique-window load). Successful throughput: ~124 req/s.
‡ 182 proxy req/s (0 errors) vs ~124 Loki successful req/s = 1.47× on delivered traffic.
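
The "delivered traffic" figure in the footnotes is the raw request rate scaled by the success fraction. A small worked sketch (the function name is hypothetical; the inputs are the Loki and proxy heavy c=50 numbers above):

```go
package main

import "fmt"

// deliveredRate returns successful requests per second,
// given a raw request rate and the fraction of requests that errored.
func deliveredRate(rawRate, errorFraction float64) float64 {
	return rawRate * (1 - errorFraction)
}

func main() {
	loki := deliveredRate(193, 0.3563) // Loki heavy c=50, 35.63% errors
	proxy := deliveredRate(182, 0)     // proxy heavy c=50, zero errors
	fmt.Printf("loki=%.1f req/s proxy=%.1f req/s ratio=%.3f\n", loki, proxy, proxy/loki)
}
```

The unrounded ratio comes out marginally below the quoted 1.47×, which is computed from the rounded ~124 req/s.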

P50 latency (unique-windows)

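| Workload | Concurrency | Loki | Proxy cold | VL native |
| --- | --- | --- | --- | --- |
| small | 10 | 7 ms | 4 ms | 1 ms |
| small | 50 | 30 ms | 14 ms | 5 ms |
| heavy | 10 | 22 ms | 21 ms | 6 ms |
| heavy | 50 | 5 ms† | 128 ms | 41 ms |
| long_range | 10 | 464 ms | 39 ms | 99 ms |
| long_range | 50 | 5,041 ms | 2,060 ms | 392 ms |
| long_range | 100 | 4,276 ms | 3,068 ms | 826 ms |
| compute | 10 | 1 ms | 10 ms | 6 ms |
| compute | 50 | 4 ms | 107 ms | 28 ms |
| compute | 100 | 6 ms | 261 ms | 63 ms |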

CPU consumed (unique-windows, cpu·s over 30 s)

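| Workload | Concurrency | Loki | Proxy only | Proxy + VL | Ratio vs Loki |
| --- | --- | --- | --- | --- | --- |
| small | 100 | 412.1 | 0.21 | 212.8 | 1.9× less |
| heavy | 10 | 251.9 | 0.23 | 230.4 | 1.1× less |
| heavy | 50 | 258.4 | 0.30 | 257.7 | ~parity |
| heavy | 100 | 308.2 | 0.29 | 251.3 | 1.2× less |
| long_range | 10 | 61.8 | 0.04 | 170.2 | 0.36× (VL scans more in parallel) |
| long_range | 50 | 72.9 | 0.04 | 188.2 | 0.39× |
| long_range | 100 | 180.7 | 0.06 | 213.5 | 0.85× |
| compute | 10 | 351.5 | 0.15 | 229.7 | 1.5× less |
| compute | 50 | 377.8 | 0.20 | 274.7 | 1.4× less |
| compute | 100 | 326.6 | 0.18 | 278.0 | 1.2× less |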

RSS memory (unique-windows, MB)

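| Workload | Concurrency | Loki | Proxy | Proxy + VL |
| --- | --- | --- | --- | --- |
| small | 100 | 2,380 | 633 | 1,204 |
| heavy | 10 | 2,044 | 886 | 1,236 |
| heavy | 50 | 2,214 | 910 | 1,284 |
| heavy | 100 | 2,456 | 979 | 1,386 |
| long_range | 10 | 2,174 | 880 | 1,754 |
| long_range | 50 | 1,923 | 768 | 1,754 |
| long_range | 100 | 2,004 | 1,252 | 2,082 |
| compute | 10 | 2,471 | 843 | 1,214 |
| compute | 50 | 2,463 | 821 | 1,356 |
| compute | 100 | 2,131 | 671 | 993 |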

Cold overhead by workload type

What determines proxy performance when cache and coalescer provide no help:

Small (metadata): Proxy beats Loki at c=10 (1,201 vs 1,080 req/s, 1.11×) and reaches parity at c=50 (1,343 vs 1,369 req/s). VL native is ~2.7× faster than Loki (2,957 req/s at c=10); the proxy's extra HTTP hop and envelope conversion is the gap between proxy and raw VL. The windowing NDJSON parser was ported to fastjson (eliminating map[string]interface{} allocation per entry) to achieve this cold-path parity. With any cache warmth, this reverses strongly (12× warm).

Heavy (pipeline queries): Proxy cold outperforms Loki at both measured concurrency levels. At c=10: 179 vs 133 req/s (1.34× faster). At c=50: Loki saturates with 35.63% errors (successful throughput ~124 req/s) while the proxy handles 182 req/s with zero errors — 1.47× more successful traffic delivered. The fastjson NDJSON parser (no map[string]interface{} per entry) and background pattern autodetect (offloaded from the request critical path) were the key cold-path improvements; total proxy CPU dropped 22.8%. VL native remains ~4–5× faster than the proxy; the remaining gap is the network round-trip for each sub-window request.

Long-range (6 h–72 h windows): Proxy is 1.86–2.06× faster than Loki even cold. VL's parallel window fetching within the proxy — splitting long ranges into parallel 1 h sub-windows — completes before Loki can scan its chunk store sequentially. This advantage is structural and does not require cache.

Compute (metric aggregations): The stats_query_range fast path routes sum by (...) (count_over_time/rate({...}[W])) and bytes_over_time/bytes_rate queries directly to VL's pre-aggregated Prometheus buckets, eliminating the raw NDJSON log scan that was 39% CPU in cold pprof. This delivered the headline improvement in this PR: heavy cold throughput 44→126 req/s (c=10, +2.9×) and 33→139 req/s (c=100, +4.2×). A follow-up round of pprof-guided allocation fixes (pooled fastjson scratch buffers, zero-alloc label map serialization, pre-computed stream keys, direct byte-building replacing json.Marshal reflection) eliminated ~20 GB of per-request allocations and raised cold rate/topk throughput from ~40 req/s to 210 req/s (+5.25× on the loki-bench compute workload). For complex aggregations without a VL-native equivalent (quantile_over_time, topk, sum by with pipeline stages), the proxy still decomposes the query into N parallel sub-window fetches and aggregates locally. With warm cache (24 h TTL on historical windows), all compute queries hit cache on repeat and the structural overhead disappears.

AST-typed translation (v1.35.0+): The LogQL→LogsQL translation layer was migrated from fragile fmt.Sprintf string assembly to a typed logsql.PipeStats/PipeMath/PipeFilter AST. This does not change throughput (translation is 2.7–7.2 µs vs 100–500 ms VL query time), but restores correctness for binary metric queries (sum(rate(...)) / sum(rate(...)), rate(...) * 100, sum(...) + sum(...)) that were previously silently erroring because the generated | math alias:=expr form was rejected by VictoriaLogs (which expects | math expr as alias). These queries now work and are included in the compute and heavy workload numbers above.
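
To illustrate why a typed node avoids the alias-ordering bug, here is a minimal sketch. `mathPipe` and its fields are hypothetical stand-ins, not the project's `logsql.PipeMath` type; only the `| math expr as alias` output shape is taken from the description above.

```go
package main

import "fmt"

// mathPipe is a hypothetical stand-in for a typed math pipe node.
type mathPipe struct {
	Expr  string // arithmetic over previously computed stats fields
	Alias string // name of the result field
}

// String renders the pipe in the "| math expr as alias" form,
// so callers cannot accidentally emit "alias:=expr".
func (p mathPipe) String() string {
	return fmt.Sprintf("| math %s as %s", p.Expr, p.Alias)
}

func main() {
	p := mathPipe{Expr: "errors / total", Alias: "ratio"}
	fmt.Println(p.String()) // | math errors / total as ratio
}
```

Because the serialization lives in one method, every binary metric query shares the same, already-validated output format.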


Drilldown / Explore — real Grafana queries vs Loki direct

The numbers above came from synthetic benchmark workloads. This section replays the exact queries Grafana Drilldown and Explore emit (sum by (pod) (count_over_time({namespace="prod",pod!=""}[2m])), trace_id variants, service_version, detected_fields) at the time ranges that exercise the chunked-merge path (1 h / 6 h / 24 h / 2 d / 7 d), measured against the live e2e-compat compose stack: Loki 3.6 with max_query_series=1M, 8 GiB heap, result + chunk caching enabled; proxy fronted by vmauth (the same path Grafana actually uses).

Methodology: cold = first request (cache miss), warm = same request replayed 200 ms later. cold_status=500 / blank means Loki returned too_many_series or query-timeout. Run with ./bench/drilldown-vs-loki.sh.

sum by (pod) (count_over_time({namespace="prod",pod!=""}[2m]))

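| Range | Loki cold | Loki warm | Loki status | Proxy cold | Proxy warm | Proxy series |
| --- | --- | --- | --- | --- | --- | --- |
| 1 h | 1 360 ms | 13 ms | timeout / no body | 239 ms | 20 ms | 5 000 |
| 6 h | 27 509 ms | 1 240 ms | 500 (too_many_series) | 901 ms | 14 ms | 4 989 |
| 24 h | 26 870 ms | 2 312 ms | 500 (too_many_series) | 535 ms | 13 ms | 16 (/hits top-N) |
| 2 d | 21 132 ms | 9 ms | timeout / no body | 1 027 ms | 10 ms | 16 |
| 7 d | 22 727 ms | 12 ms | timeout / no body | 3 844 ms | 12 ms | 16 |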

Cold-path speedup: 6× (1 h) → 50× (24 h) → 6× (7 d). At 6 h and 24 h Loki returns HTTP 500 because the high-cardinality pod!="" selector blows the series cap, and at 1 h, 2 d, and 7 d it times out or returns no body; the proxy routes these queries through /select/logsql/hits and returns a real top-N chart.

sum by (trace_id) (count_over_time({namespace="prod"}|json|drop __error__,__error_details__|trace_id!=""[2m]))

| Range | Loki cold | Proxy cold | Proxy series |
| --- | --- | --- | --- |
| 1 h | 26 106 ms (timeout) | 523 ms | 4 992 |
| 6 h | 20 568 ms (timeout) | 2 883 ms ⚠️ | 0 (502: VL cap exceeded on parser-direct path; tracked below) |
| 24 h | 22 100 ms (500) | 1 260 ms | 8 |
| 2 d | 29 408 ms (500) | 3 598 ms | 8 |
| 7 d | 26 451 ms (500) | 4 483 ms ⚠️ | 0 (same 502 path) |

The trace_id parser-direct path returns 502 at 6 h and 7 d because the per-trace cardinality of this dataset exceeds the VL parser-pipe limit. This is a known, documented limit of the |json|trace_id!="" parser-direct shape: in the dataset used here each trace_id appears in only one log line, so the parser exhausts VL's row-scan budget. The chunked-merge path, and queries that route through /hits (the default for all label fields and most parser fields), return top-N successfully. Loki returns either a timeout or too_many_series for every range.

sum by (service_version) (count_over_time({namespace="prod",service_version!=""}[2m]))

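| Range | Loki cold | Loki series | Proxy cold | Proxy series |
| --- | --- | --- | --- | --- |
| 1 h | 25 723 ms (500) | - | 61 ms | 100 |
| 6 h | 28 824 ms (500) | - | 205 ms | 100 |
| 24 h | 24 ms (200) | 0 (empty result) | 202 ms | 14 |
| 2 d | 224 ms (200) | 0 (empty result) | 315 ms | 15 |
| 7 d | 43 ms (200) | 0 (empty result) | 1 381 ms | 14 |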

At 24 h+ Loki returns HTTP 200 with empty data for service_version — silently — because the cardinality reduces below its sample threshold and Loki gives up before stream selection completes. The proxy returns the real top-N every time.

/loki/api/v1/detected_fields

| Range | Loki cold | Loki body | Proxy cold | Proxy body |
| --- | --- | --- | --- | --- |
| 1 h | 3 035 ms | 3 885 B (empty fields) | 58 ms | 7 772 B (full field set) |
| 6 h | 8 ms | no body / timeout | 148 ms | 8 178 B |
| 24 h | 26 981 ms | 500 (too_many_series) | 249 ms | 8 088 B |

detected_fields is the call Drilldown makes first to populate the field picker, so it determines whether the panel even renders. On Loki it returns HTTP 500 at 24 h; on the proxy it returns the full OTel field map in about 250 ms.

Container resource consumption (peak during bench run)

Captured with docker stats e2e-loki e2e-victorialogs e2e-proxy e2e-proxy-vmauth --no-trunc for the duration of the bench. Loki was given 8 GiB heap, 16 CPUs, and result + chunk caching; VL + proxy together ran in 4 GiB / 8 CPUs.

| Container | Peak CPU | Peak RSS | Outcome |
| --- | --- | --- | --- |
| e2e-loki | 1 580% (15.8 cores) | 7.9 GiB / 8 GiB | OOM-near; 500s on every 6 h+ pod query, 500 on 24 h detected_fields |
| e2e-victorialogs | 410% (4.1 cores) | 1.4 GiB / 4 GiB | Steady; no errors |
| e2e-proxy | 90% (0.9 cores) | 180 MiB | Steady; cache hit rate 78% by end of run |
| e2e-proxy-vmauth | 22% (0.2 cores) | 35 MiB | Steady |

What the numbers mean. Loki consumed roughly 18× the CPU and 44× the RSS of the proxy on the same workload while returning errors or empty results for the majority of Drilldown / Explore queries. The proxy + VL stack together used ~5 cores and 1.6 GiB to serve the entire query set with full results. This is the gap that motivated the project: Drilldown's call pattern (Grafana 24 h querySplitting, high-cardinality field!="" selectors, parser-heavy queries) is essentially incompatible with Loki's stream-store model at production volume, and the proxy bridges it by routing through VL's columnar storage.

For continuous tracking, run bench/drilldown-vs-loki.sh against any stack and diff successive runs — output is TSV so it diffs cleanly.

Short-range fairness re-run (30 m – 24 h, Loki at 12 GiB)

The 1 h – 7 d numbers above were captured with Loki at the historical 8 GiB container limit. On a 7-day dataset (6.7 GiB of stored bigParts) that proved too small: Loki OOM-looped continuously (35 restarts in one session), so cold latencies were dominated by recovery time rather than query work. We re-ran the bench at 30 m – 24 h ranges with Loki bumped to 12 GiB container, GOMEMLIMIT=10GiB, GOGC=80 so the comparison reflects Loki doing real work, not Loki recovering from kernel SIGKILL.

Outcome distribution (36 cold queries: 6 shapes × 6 ranges)

| Outcome | Loki | Proxy |
| --- | --- | --- |
| 200 OK with data | 9 | 35 |
| 200 OK but empty result (silent fail) | 13 | 0 |
| HTTP 500 / 502 | 9 | 1 (known VL parser-pipe limit on trace_id 6 h) |
| Timeout | 5 | 0 |

Where both succeed with real data (cold path)

| Query | Range | Loki | Proxy | Speedup |
| --- | --- | --- | --- | --- |
| sum by (pod) | 30 m | 6 302 ms (9 274 series) | 436 ms (5 000 series via /hits) | 14.5× |
| sum by (pod) | 1 h | 5 412 ms (17 342 series) | 223 ms (5 000) | 24.3× |
| sum by (pod) | 2 h | 3 502 ms (35 388 series) | 354 ms (5 000) | 9.9× |
| /loki/api/v1/labels | 30 m – 24 h | 9 – 27 ms | 8 – 31 ms | parity |

Where only the proxy returns usable data

| Query | Range | Loki outcome | Proxy cold |
| --- | --- | --- | --- |
| sum by (pod) | 3 h | timeout (60 s) | 578 ms (4 997 series) |
| sum by (pod) | 6 h | HTTP 500 | 22 075 ms (4 986 series) ⚠️ |
| sum by (pod) | 24 h | HTTP 500 | 549 ms (16 series, /hits top-N) |
| sum by (trace_id) | 30 m – 24 h | HTTP 500 every range | 406 – 1 232 ms |
| sum by (trace_id) | 3 h | timeout (60 s) | 44 670 ms (4 996 series) ⚠️ |
| sum by (service_version) | 1 h – 24 h | 200 empty (silent) | 76 – 240 ms (100 versions) |
| sum by (service_version) | 30 m / 6 h | HTTP 500 | 71 / 244 ms |
| sum by (k8s_pod_name) | 30 m – 24 h | 200 empty every range | 116 – 257 ms |
| /loki/api/v1/detected_fields | 1 h | timeout | 36 ms |
| /loki/api/v1/detected_fields | 2 h – 24 h | 200 empty | 54 – 323 ms |

Two notable proxy outliers (the ⚠️ rows above):

  • pod 6 h took 22 s. The bench alternates Loki and proxy calls, and this request landed in a window where Loki was using about 14 cores and 10 GiB of memory, so the host (17 GiB Docker Desktop allocation, several other containers running) could not give the proxy enough CPU time. Outside that contention window the proxy's 6 h pod query is sub-second.
  • trace_id 3 h took 44 s. Same root cause: Loki was thrashing during that exact wall-clock window. The proxy path for this query is structurally a /hits call, which normally takes 0.5 – 2 s for ~5 000 unique trace_ids.

These two outliers are artifacts of running both targets back-to-back on a constrained host. In a proxy-only run (no Loki competing for CPU) we would expect both to drop back to their usual latencies (sub-second for the pod query, 0.5 – 2 s for trace_id). They are called out here because we do not want to silently filter outliers; the bench script captures everything.

Container resources (14 min fair-bench window)

| Container | Peak CPU | Peak RSS | Avg CPU | Notes |
| --- | --- | --- | --- | --- |
| e2e-loki | 1 444 % (≈ 14 cores) | 10 445 MiB | 258 % | within 12 GiB, no OOM |
| e2e-victorialogs | 699 % (≈ 7 cores) | 2 129 MiB | 11 % | steady |
| e2e-proxy | 14 % | 44 MiB | 0.5 % | negligible |
| e2e-proxy-vmauth | 90 % | 2 254 MiB | 1.4 % | cache layer |

Loki used ≈ 14 cores and 10 GiB to serve 9 / 36 queries successfully. The proxy + VL together served 35 / 36 with ≈ 7 cores and 2.2 GiB peak: roughly half the CPU and one-fifth the RAM for about four times the successful query coverage.

What Loki tuning we tried and what it didn't fix

To rule out config gaps before publishing these numbers, we tested raising max_query_series to 1 M, cardinality_limit to 1 M, max_query_parallelism to 256, tsdb_max_query_parallelism to 512, per-shard byte cap to 300 MB, chunk + result cache sizes to 2 GiB / 1 GiB, query_timeout to 10 m, and Loki container memory up to 24 GiB with GOMEMLIMIT=20GiB. None of those changes converted any silent-empty or HTTP 500 outcome into a successful response. The bottleneck is algorithmic: sum by (FIELD) (count_over_time({...}[w])) requires Loki to materialize one Prometheus series per unique field value per step bucket; with pod!="" on a 5 000-pod namespace over a 24 h window that working set blows past max_query_series (or memory) before the chunk store finishes reading. There is no Loki config that changes this — the proxy bypasses it by routing through VL's columnar /select/logsql/hits which computes top-N server-side without per-stream materialization.

This is the gap that motivates the project: Drilldown's call pattern is structurally incompatible with Loki's read path at production cardinality, and the proxy bridges it. The numbers above are what you get when you give Loki adequate RAM and still ask it to do something it cannot.


VictoriaLogs tuning

Long-range columnar scans are I/O-bound and goroutine-heavy. The default -defaultParallelReaders=2×CPU (36 on 18 cores) creates 3,600 goroutines at c=100, causing context-switch overhead that degrades throughput for all workloads. Reducing to 8 cuts goroutines to ~800 and improves small query throughput dramatically without harming long-range scan capacity.

| Flag | Value | Effect |
| --- | --- | --- |
| -defaultParallelReaders | 8 | Critical: limit goroutines per query |
| -fs.maxConcurrency | 64 | Cap concurrent file ops |
| -memory.allowedPercent | 80 | Increase block cache budget (default 60%) |
| -search.maxConcurrentRequests | 100 | Allow high bench concurrency |
| -search.maxQueueDuration | 60s | Queue rather than reject excess requests |
| -search.maxQueryDuration | 60s | Cancel scans that exceed memory budget |
| -blockcache.missesBeforeCaching | 1 | Cache from first miss (default 2) |
| -internStringCacheExpireDuration | 15m | Reduce GC pressure on label intern cache |

These flags are already applied in test/e2e-compat/docker-compose.yml. In production, the proxy cache further reduces effective VL concurrency — only cache-miss requests reach VL, so real VL concurrency is far lower than the client-facing rate.
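
The goroutine counts quoted at the start of this section follow from a simple product of readers per query and concurrent queries. A worked sketch (the function name is hypothetical, and real goroutine counts also include per-request and background goroutines not modelled here):

```go
package main

import "fmt"

// scanGoroutines estimates reader goroutines as
// parallel readers per query × concurrent queries.
func scanGoroutines(parallelReaders, concurrentQueries int) int {
	return parallelReaders * concurrentQueries
}

func main() {
	const cores, clients = 18, 100
	fmt.Println(scanGoroutines(2*cores, clients)) // default 2×CPU readers: 3600
	fmt.Println(scanGoroutines(8, clients))       // tuned to 8 readers: 800
}
```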


Per-request proxy overhead (microbenchmarks)

| Operation | Latency | Allocs | Bytes/op |
| --- | --- | --- | --- |
| Labels (cache hit) | 2.0 µs | 25 | 6.6 KB |
| QueryRange (cache hit) | 118 µs | 600 | 142 KB |
| LogQL→LogsQL translation (selector) | 2.7 µs | 18 | 836 B |
| LogQL→LogsQL translation (rate/sum) | 4.9 µs | 48 | 2.1 KB |
| LogQL→LogsQL translation (binary rate/rate) | 7.2 µs | 76 | 3.7 KB |
| LogQL→LogsQL translation (ip() filter, v1.45+) | 4.4 µs | 36 | 1.4 KB |
| PipeMath.String() (AST serialization) | 55 ns | 3 | 80 B |
| PipeStats.String() (AST serialization) | 70 ns | 5 | 136 B |
| VL NDJSON → Loki streams (100 lines) | 170 µs | 3,118 | 70 KB |
| wrapAsLokiResponse | 2.8 µs | 58 | 2.6 KB |

Translation overhead is 2.7–7.2 µs depending on query complexity. For a typical heavy-workload query taking 100–500 ms against VL, translation is under 0.01% of wall-clock time (7.2 µs / 100 ms ≈ 0.007% in the worst case). The AST-typed path (PipeMath, PipeStats) adds no overhead versus the previous string-concat approach.

# Run microbenchmarks
go test ./internal/proxy/ -bench . -benchmem -run "^$" -count=3
go test ./internal/translator/ -bench . -benchmem -run "^$" -count=3
go test ./internal/cache/ -bench . -benchmem -run "^$" -count=3

Measuring eviction pressure

# Eviction rate — non-zero means L1 is too small
rate(loki_vl_proxy_cache_evictions_total[5m])

# Hit rate — below 50% means cache too small or workload not cacheable
rate(loki_vl_proxy_cache_hits_total[5m])
/
(rate(loki_vl_proxy_cache_hits_total[5m]) + rate(loki_vl_proxy_cache_misses_total[5m]))

# Per-replica cache RSS — should stay below -cache-max
process_resident_memory_bytes{job="loki-vl-proxy"}

Rule of thumb: L1 size = (unique active queries per hour) × (average response size).
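
As a worked instance of the rule of thumb, the sketch below sizes L1 for a hypothetical workload. The query count and average response size are made-up inputs, not measurements from this benchmark.

```go
package main

import "fmt"

// l1SizeBytes applies the rule of thumb:
// unique active queries per hour × average response size.
func l1SizeBytes(uniqueQueriesPerHour, avgResponseBytes int64) int64 {
	return uniqueQueriesPerHour * avgResponseBytes
}

func main() {
	// Hypothetical workload: 2,000 unique queries per hour, 150 KiB average response.
	size := l1SizeBytes(2000, 150*1024)
	fmt.Printf("%.0f MiB\n", float64(size)/(1024*1024)) // 293 MiB
}
```

In practice you would round the result up and then watch the eviction-rate query above to confirm the estimate.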