Performance
Architecture Optimizations
Lock-Free Hot Paths
At high concurrency (100+ concurrent requests), mutex contention becomes the dominant source of latency. The proxy eliminates locks from the request hot path in several subsystems:
| Component | Before | After | Impact |
|---|---|---|---|
| Circuit breaker | sync.Mutex on every Allow()/RecordSuccess() | Atomic state machine; mutex only on state transitions | Zero contention in closed (healthy) state |
| Metrics histograms | sync.RWMutex on observe() and scrape | Atomic CAS loops for all counters | Scrape no longer blocks request recording |
| Rate limiter | Single global sync.Mutex + map | sync.Map + per-bucket sync.Mutex | Eliminates convoy effect across tenants |
| Query fingerprinting | crypto/sha256 | hash/maphash | ~30x faster hashing |
| Structured logging | slog.JSONHandler internal mutex (52% of contention at c=100) | Async buffered handler with 8192-slot channel | Writer goroutine decouples logs from request path |
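The "atomic CAS loops" row can be illustrated with a small sketch. This is not the proxy's histogram code: the `atomicFloat` and `histogram` types below are hypothetical names, and the bucket layout is an assumption chosen only to show how an observation can be recorded without taking a lock.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
)

// atomicFloat accumulates a float64 without a mutex by storing its IEEE 754
// bit pattern in a uint64 and retrying a compare-and-swap until it succeeds.
type atomicFloat struct {
	bits atomic.Uint64
}

// Add adds delta to the stored value using a CAS loop.
func (f *atomicFloat) Add(delta float64) {
	for {
		old := f.bits.Load()
		next := math.Float64bits(math.Float64frombits(old) + delta)
		if f.bits.CompareAndSwap(old, next) {
			return
		}
	}
}

// Load returns the current value.
func (f *atomicFloat) Load() float64 {
	return math.Float64frombits(f.bits.Load())
}

// histogram is a hypothetical lock-free histogram: bucket counts are plain
// atomic counters and the running sum uses the CAS loop above.
type histogram struct {
	bounds []float64       // upper bounds, ascending
	counts []atomic.Uint64 // one per bound, plus a final overflow bucket
	sum    atomicFloat
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]atomic.Uint64, len(bounds)+1)}
}

// Observe records one value without taking any lock.
func (h *histogram) Observe(v float64) {
	i := 0
	for i < len(h.bounds) && v > h.bounds[i] {
		i++
	}
	h.counts[i].Add(1)
	h.sum.Add(v)
}

func main() {
	h := newHistogram([]float64{0.01, 0.1, 1})
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				h.Observe(0.5)
			}
		}()
	}
	wg.Wait()
	// A scrape reads the same atomics and never blocks writers.
	fmt.Println(h.counts[2].Load(), h.sum.Load())
}
```

The trade-off of this style is that a scrape can see a count and a sum that are a few observations apart, which is usually acceptable for metrics.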
Connection Pool Tuning
The proxy's HTTP transport is tuned for high-concurrency single-backend proxying:
| Setting | Value | Why |
|---|---|---|
| MaxIdleConns | 256 | Total idle connection pool size |
| MaxIdleConnsPerHost | 256 | All slots for VL (single backend) |
| MaxConnsPerHost | 0 (unlimited) | No artificial cap on concurrent VL connections |
| IdleConnTimeout | 90s | Reuse warm connections |
| ResponseHeaderTimeout | 120s | Allow slow VL queries to complete |
| DisableCompression | false | Allow negotiated upstream compression |
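Go's default MaxIdleConnsPerHost=2 causes ephemeral port exhaustion at >50 concurrent requests: connections beyond the two idle slots are closed after each request, and the closed sockets linger in TIME_WAIT while new ones keep being opened. With the tuning above, the proxy handles 200+ concurrent requests with 0 errors.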
Compression Path
The proxy now uses compression on multiple read-path hops:
| Hop | Current behavior |
|---|---|
| Client -> proxy response | Frontend stays on gzip/identity; the chart pins -response-compression=gzip with -response-compression-min-bytes=1024 for broad Loki/Grafana compatibility |
| Proxy -> peer-cache owner | /_cache/get prefers zstd, then gzip, then identity for payloads >=1KiB |
| Proxy -> VictoriaLogs | -backend-compression=auto advertises zstd, gzip; the proxy decodes either safely before translation or passthrough |
| Disk cache | gzip-compressed value storage |
| OTLP push | none, gzip, or zstd |
Important: current VictoriaLogs docs clearly describe HTTP response compression, but not a guaranteed zstd query-response contract on the normal select path. The proxy can accept zstd from upstream when the backend provides it, but stock VictoriaLogs deployments may still return gzip or identity today.
Verification note: against Grafana 12.4.2, the datasource proxy path
advertised Accept-Encoding: deflate, gzip, not zstd, in local
verification. That means the safe deployment default is frontend gzip,
and -response-compression=auto now keeps that same gzip path enabled for
legacy configs instead of maintaining a separate client-facing zstd mode.
The Tier0 compatibility cache now stores the canonical identity body and lazily adds compressed variants on hot hits. That removes repeated gzip/zstd work from the hot cached read path without forcing every cached response to occupy multiple encodings up front.
klauspost/compress (v1.27.0): The proxy uses the klauspost/compress library instead of the standard compress/gzip, providing 2.5–3× faster gzip decode throughput. This matters for responses from remote VL backends where gzip decompression is on the critical path.
Window cache zstd (v1.27.0): Split-window query results are compressed with zstd before storing in the in-memory cache, achieving 5–15× size reduction compared to raw NDJSON. This allows the window cache to hold significantly more windows within the same memory budget.
Loopback backend optimization: When VL runs on the same host (localhost/127.0.0.1/::1), the proxy detects this and requests uncompressed responses (identity encoding). This eliminates 25–35% of decompress CPU overhead for co-located deployments.
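A minimal sketch of the loopback check, assuming the backend is configured as a URL. The helper names `isLoopbackBackend` and `acceptEncodingFor` are hypothetical, and the proxy's actual detection may differ.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// isLoopbackBackend reports whether the backend URL points at this host.
func isLoopbackBackend(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := u.Hostname() // strips the port and any IPv6 brackets
	if strings.EqualFold(host, "localhost") {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// acceptEncodingFor picks the Accept-Encoding value to send upstream:
// identity for co-located backends, compressed encodings otherwise.
func acceptEncodingFor(rawURL string) string {
	if isLoopbackBackend(rawURL) {
		return "identity"
	}
	return "zstd, gzip"
}

func main() {
	for _, backend := range []string{
		"http://localhost:9428",
		"http://[::1]:9428",
		"http://victorialogs.example:9428",
	} {
		fmt.Printf("%s -> %s\n", backend, acceptEncodingFor(backend))
	}
}
```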
NDJSON Parsing Optimization
VL returns log results as newline-delimited JSON. The proxy converts this to Loki's streams format. Key optimizations:
| Technique | Effect |
|---|---|
| Byte scanning (not strings.Split) | Avoids copying entire body to string |
| sync.Pool for JSON entry maps | Reuses maps across NDJSON lines, reducing GC pressure |
| Pre-allocated slice capacity | Estimates line count from body size |
| Nanosecond timestamp via strconv.FormatInt | Avoids fmt.Sprintf allocation |
Result: 49% memory reduction and 9.6% faster parsing vs original implementation.
Request Coalescing
When multiple clients send identical queries simultaneously, only 1 request reaches VL. Others wait for the shared result. Keys include tenant ID to prevent cross-tenant data sharing.
Cache Topology
The proxy now has two cache layers in front of the backend execution path:
| Layer | Scope | Primary goal |
|---|---|---|
| Tier0 compatibility-edge cache | Final Loki-shaped GET responses on safe read endpoints | Bypass translation and backend work for hot repeated reads |
| L1/L2/L3 cache stack | Local memory, optional disk, optional peer fleet reuse | Reduce backend load and share results across replicas |
Tier0 is intentionally small and bounded as a percentage of the primary L1 memory budget. The deeper L1/L2/L3 stack still handles broader reuse, peer sharing, and persistent cache warming.
What This Means For Operators
The simplest way to understand the cache stack is by operational outcome:
| Layer | Plain-English role | What it buys you |
|---|---|---|
| Tier0 | Fast answer cache at the Loki-compatible frontend | Repeated Grafana reads can return before most proxy logic runs |
| L1 memory | Hot cache inside the local process | Best-case latency for repeated dashboards and Explore refreshes |
| L2 disk | Persistent local cache | Useful cache survives beyond RAM pressure and supports larger working sets |
| L3 peer cache | Fleet-wide cache reuse between replicas | One warm pod can make the rest of the fleet faster and cheaper; peer-first startup warmup means only 1 instance per label window hits VL on rolling restart regardless of fleet size |
This is the difference between "it speaks Loki" and "it feels like Loki at runtime." The project is designed to preserve Loki-compatible UX while reducing repeated backend work aggressively.
Drilldown Metadata Performance
Grafana Logs Drilldown and the label browser make multiple metadata calls per page load. The proxy optimises these calls to eliminate the two main sources of latency: slow backend endpoints and cache key drift.
Endpoint Selection
VictoriaLogs exposes two field-name endpoints with very different cost profiles:
| Endpoint | Index scanned | Typical latency (1h window) | Typical latency (24h window) |
|---|---|---|---|
| stream_field_names | Stream index (per-stream) | ~30ms | 5–10s |
| field_names | All-fields index (per-doc) | ~30ms | ~30ms |
stream_field_names scales linearly with data volume because it scans per-stream metadata. At 24h with a busy installation it regularly takes 5–10 seconds. field_names uses a separate all-fields index and is consistently ~30ms regardless of time range.
Service-name detection (/loki/api/v1/label/service_name/values, drilldown entrypoint): previously called stream_field_names 4× in series (~5s cold). Now calls field_names once (~29ms) plus stream_field_values per candidate field.
Background label refresh (refreshLabelsCacheAsync): previously sent the full user-requested range to stream_field_names: a 6h Grafana window produced a 5s background call, 24h produced a 10s call, causing CPU spikes visible as Grafana slowness. Now uses field_names for the full-range background path (consistently ~30ms), keeping stream_field_names (capped at 1h) for the synchronous labels endpoint where strict stream-only semantics matter.
Cache Key Stability
Grafana's time picker uses a sliding end-timestamp that drifts a few seconds between requests. Before this fix, each click produced a unique cache key, defeating the 30s TTL and ensuring every click hit the backend. The fix floors the end timestamp to 30-second intervals so all requests within the same logical window share one cache entry.
Measured improvement (e2e-compat, ~8M log entries, 1h window):
| Path | Before | After (cold) | After (warm) |
|---|---|---|---|
| service_name/values | ~5,028ms (4 × stream_field_names) | 235ms | 12ms |
| Speedup | n/a | 21× | ~420× |
Wide-range background refresh: eliminated 5–10s stream_field_names calls for 6h/12h/24h windows. All background metadata calls now complete in <100ms regardless of the Grafana time picker selection.
VictoriaLogs Native Stats Offloading
The proxy routes as much metric aggregation as possible to VL's native /select/logsql/stats_query_range endpoint, which returns a Prometheus-compatible matrix directly. This eliminates the most expensive proxy path: fetching all raw log lines and aggregating them in-process.
VL Native Stats API: Feature Timeline
| Feature | Added in VL | Release date | Notes |
|---|---|---|---|
| /select/logsql/stats_query | v0.29.0 | 2024-09-08 | Instant stats, Prometheus vector format |
| /select/logsql/stats_query_range | v0.29.0 | 2024-09-08 | Range stats, Prometheus matrix format; used by Grafana |
| count, sum, avg, min, max, quantile, stddev | v0.29.0 | 2024-09-08 | In stats pipe |
| rate() and rate_sum() | v1.2.0 | 2024-12-06 | Native per-second rate; no proxy aggregation needed |
| rate_sum(prefix*) prefix wildcard | v1.25.0 | 2025-07-07 | |
The proxy's minimum tracked capability profile is vl-v1.30-plus. All features above are available on every backend the proxy supports, so no version gating is required for stats offloading.
LogQL → VL Native Stats Mapping
| LogQL metric | VL LogsQL (stats_query_range) | Condition |
|---|---|---|
| count_over_time({...}[r]) by (l) | {...} \| stats by (l) count() | always |
| rate({...}[r]) by (l) | {...} \| stats by (l) rate() | range == step |
| bytes_over_time({...}[r]) by (l) | {...} \| stats by (l) sum_len(_msg) | always |
| sum_over_time({...}\|unwrap f [r]) by (l) | {...} \| <parser> \| stats by (l) sum(f) | always |
| avg_over_time({...}\|unwrap f [r]) by (l) | {...} \| <parser> \| stats by (l) avg(f) | always |
| quantile_over_time(q, {...}\|unwrap f [r]) by (l) | {...} \| <parser> \| stats by (l) quantile(q, f) | always |
| max_over_time({...}\|unwrap f [r]) by (l) | {...} \| <parser> \| stats by (l) max(f) | always |
| min_over_time({...}\|unwrap f [r]) by (l) | {...} \| <parser> \| stats by (l) min(f) | always |
| stddev_over_time({...}\|unwrap f [r]) by (l) | {...} \| <parser> \| stats by (l) stddev(f) | always |
| bytes_rate({...}[r]) | {...} \| stats by (l) sum_len(_msg) as __lvp_inner \| math __lvp_inner/r_s as __lvp_rate | range == step |
| rate_counter({...}\|unwrap f [r]) | n/a | manual path (monotonic counter reset detection required) |
rate() / bytes_rate() condition (range == step): VL's stats pipe uses tumbling windows. LogQL's rate() and bytes_rate() compute over a sliding r window. These are semantically identical when r == step, which is Grafana's default with $__interval. The proxy shifts the query start back by r and trims the pre-start bucket to match Loki's first evaluation point. Fixed ranges that differ from the step keep the proxy sliding-window path.
Parser stage (| json / | logfmt) + range metric: Even when range == step, if the translated VL query contains an unpack_json or unpack_logfmt stage, the fast stats path is disabled. VL's unpack pipes do not model Loki's __error__ filtering (Loki excludes lines that fail to parse; VL may include them). These queries fall back to the manual log-fetch path.
Query Routing Decision
The proxy chooses between native VL stats and the manual log-fetch path based on:
LogQL metric query
│
├─ has parser stage (| json, | logfmt, etc.)?
│  ├─ NO → native VL stats (translator emits | stats …)
│  └─ YES + no post-parser filter + function = rate/count/bytes + range == step
│     BUT translated query has unpack_json/unpack_logfmt → manual path
│     (VL unpack semantics differ from Loki __error__ filtering)
│     function = rate_counter → always manual
│
└─ outer aggregation (sum/avg/topk)?
   ├─ topk/bottomk/sort → native VL stats + post-filter in proxy
   └─ sum/avg/min/max by (labels) → folded into | stats by (labels) …
Why this matters: the manual path fetches every matching log line from VL (/select/logsql/query), buffers them in proxy RAM, and computes rates/sums per step window. A 30-minute query_range at step=60s with 1 M logs/min fetches 30 M lines; the same query via stats_query_range returns 30 aggregated data points. The performance ratio scales with log volume.
Query-Range Tuning (Long-Range Efficiency)
Default tuning pattern:
- split long query_range requests into fixed windows (commonly 1h)
- avoid caching near-now windows (or keep TTL very short)
- keep historical window cache TTL longer for reuse
- use adaptive bounded parallel fetch to improve range latency under healthy backend conditions
Recommended operator workflow:
- Start with conservative adaptive bounds (for example min 2, max 8).
- Observe backend latency/error EWMA and window cache hit ratio.
- Increase max parallel only when backend latency/error stays stable.
- Increase history TTL together with disk cache budget, not independently.
Key tuning signals:
- loki_vl_proxy_window_cache_hit_total
- loki_vl_proxy_window_cache_miss_total
- loki_vl_proxy_window_fetch_seconds
- loki_vl_proxy_window_merge_seconds
- loki_vl_proxy_window_prefilter_attempt_total
- loki_vl_proxy_window_prefilter_error_total
- loki_vl_proxy_window_prefilter_kept_total
- loki_vl_proxy_window_prefilter_skipped_total
- loki_vl_proxy_window_prefilter_duration_seconds
- loki_vl_proxy_window_adaptive_parallel_current
- loki_vl_proxy_window_adaptive_latency_ewma_seconds
- loki_vl_proxy_window_adaptive_error_ewma
Capacity approximation for history window caching:
required_disk_bytes ~= unique_windows_per_day * avg_window_bytes * ttl_days
Always cap disk cache explicitly with -disk-cache-max-bytes for predictable retention and node usage.
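A worked instance of the approximation, using made-up inputs (5,000 unique windows per day, 200 KiB per window, 7-day TTL). The function name and the numbers are illustrative only; substitute values observed from your own metrics.

```go
package main

import "fmt"

// requiredDiskBytes applies the approximation from the formula above:
// unique_windows_per_day * avg_window_bytes * ttl_days.
func requiredDiskBytes(uniqueWindowsPerDay, avgWindowBytes, ttlDays int64) int64 {
	return uniqueWindowsPerDay * avgWindowBytes * ttlDays
}

func main() {
	// Hypothetical workload: 5,000 windows/day, 200 KiB each, kept 7 days.
	need := requiredDiskBytes(5000, 200*1024, 7)
	fmt.Println(need) // 7168000000
}
```

A result of this size (roughly 7.2 GB) would suggest setting -disk-cache-max-bytes somewhat above it, or shortening the TTL if the node cannot spare that much disk.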
Benchmark Results
Measured on Apple M3 Max (14 cores), Go 1.26.2, -benchmem.
Long-Range Phase Program Benchmarks (1h split windows)
Command:
go test ./internal/proxy -run '^$' -bench 'BenchmarkQueryRangeWindowing_(NoPrefilter|WithPrefilter|StreamAwareBatching_Off|StreamAwareBatching_On)$' -benchmem -benchtime=10x
| Benchmark | ns/op | hits_calls/op | query_calls/op | max_inflight | allocs/op | Key result |
|---|---|---|---|---|---|---|
| NoPrefilter | 39,525,225 | 0 | 49 | n/a | 11,582 | baseline full fanout |
| WithPrefilter | 11,672,412 | 48 | 9 | n/a | 10,346 | ~81.6% fewer backend query calls |
| StreamAwareBatching_Off | 17,048,771 | n/a | n/a | 4 | 9,490 | higher concurrency under load |
| StreamAwareBatching_On | 62,808,025 | n/a | n/a | 1 | 9,656 | lower concurrency spikes, more stable backend pressure |
Notes:
- Prefiltering is the largest direct backend-load reduction lever for sparse long ranges.
- Stream-aware batching intentionally trades raw synthetic throughput for lower backend saturation risk and fewer breaker cascades on real 2d/7d traffic.
Per-Request Latency
| Operation | Latency | Allocs | Bytes/op |
|---|---|---|---|
| Labels (cache hit) | 2.0 us | 25 | 6.6 KB |
| QueryRange (cache hit) | 118 us | 600 | 142 KB |
| wrapAsLokiResponse | 2.8 us | 58 | 2.6 KB |
| VL NDJSON to Loki streams (100 lines) | 170 us | 3118 | 70 KB |
| LogQL translation | ~5 us | ~20 | ~2 KB |
Cache Story In One Table
These are the numbers that matter most when you want to judge the value of the cache stack rather than the implementation details:
| Path | Slow path | Fast path | What it means |
|---|---|---|---|
| query_range | 4.58 ms cold miss with delayed backend | 0.64-0.67 us warm cache hit | Repeated dashboards stop behaving like backend-bound requests |
| detected_field_values | 2.76 ms without Tier0 | 0.71 us with Tier0 | Drilldown metadata becomes effectively instant after warm-up |
| L1 memory cache | full handler/backend path | 45 ns hit | Local hot cache is essentially free |
| L2 disk cache | backend refill | 0.45 us uncompressed read, 3.9 us compressed read | Persistent cache is still cheap enough for hot-path reuse |
| L3 peer cache | backend or owner re-fetch | 52 ns warm shadow-copy hit | A warm 3-node fleet can reuse results instead of refetching them |
Tier0 is most valuable on metadata-style and Drilldown-style endpoints where the proxy still has meaningful compatibility work to skip. On query_range, the deeper primary cache is already so effective that Tier0 mostly preserves that win rather than multiplying it.
Throughput
| Scenario | Concurrency | Throughput | Avg Latency | Errors |
|---|---|---|---|---|
| Cache hit (labels) | 100 | 175,726 req/s | 6 us | 0 |
| No cache, 10 concurrent | 10 | 9,823 req/s | 102 us | 0 |
| No cache, 50 concurrent | 50 | 17,791 req/s | 56 us | 0 |
| No cache, 200 concurrent | 200 | 33,659 req/s | 30 us | 0 |
| Cache miss (1ms backend) | 50 | 12,976 req/s | 80 us | 0 |
Resource Usage at Scale
| Load (req/s) | CPU (est.) | Memory | Notes |
|---|---|---|---|
| 100 | <1% | ~10 MB | Idle, mostly cache hits |
| 1,000 | ~8% | ~20 MB | Mixed cache hit/miss |
| 10,000 | ~30% | ~50 MB | Backend-bound |
| 30,000+ | ~100% | ~100 MB | CPU-bound, scale horizontally |
Memory Stability
Under sustained load (10K requests, no cache):
- Total allocation: ~70 KB/request (GC reclaims between requests)
- Live heap growth: <1 MB (no leak)
- GC handles ~200 cycles per 10K requests
1000-line NDJSON body (700 bytes/line, 700 KB input): 1.2 MB allocated total.
Test Coverage
| Test | What it verifies |
|---|---|
| TestOptimization_VLLogsToLokiStreams_* (7 tests) | Correctness of byte-scanned NDJSON parser |
| BenchmarkProxy_Series_CompatCacheHit / BenchmarkProxy_Series_NoCompatCache | Tier0 hit-path cost versus the uncached route-execution path |
| TestPeerCache_ThreePeers_ShadowCopiesAvoidRepeatedOwnerFetches | 3-node fleet reuses one owner fetch per non-owner after warm-up |
| BenchmarkPeerCache_ThreePeers_ShadowCopyHit | Warm non-owner reads stay local after the first peer shadow copy |
| TestOptimization_SyncPool_NoStateLeak | Pool doesn't leak labels between invocations |
| TestOptimization_SyncPool_ConcurrentSafety | 50 goroutines x 100 iterations, correct results |
| TestOptimization_ConnectionPool_HighConcurrency | 200 concurrent, 0 errors (port exhaustion regression) |
| TestOptimization_FormatVLStep | VL step format conversion (8 cases) |
| TestOptimization_VLLogsToLokiStreams_ValidJSON | 100-line output produces valid parseable JSON |
| TestOptimization_NoMemoryLeak_SustainedLoad | <200 KB/req allocation after 10K requests |
| TestOptimization_LargeBody_GCPressure | 1000-line body within allocation budget |
| TestLoad_NoCache_ScalingProfile | 3 concurrency tiers, 0 errors |
| TestLoad_WithCache_ScalingProfile | Same tiers with cache enabled |
Running Benchmarks
# All proxy benchmarks
go test ./internal/proxy/ -bench . -benchmem -run "^$" -count=3
# Focus on the new Tier0 path
go test ./internal/proxy/ -bench 'BenchmarkProxy_Series_(CompatCacheHit|NoCompatCache)$' -benchmem -run "^$" -count=3
# Focus on fleet cache warm shadow-copy behavior
go test ./internal/cache/ -bench 'BenchmarkPeerCache_ThreePeers_ShadowCopyHit$' -benchmem -run "^$" -count=3
# Load tests
go test ./internal/proxy/ -run "TestLoad" -v -timeout=120s
# Optimization regression tests
go test ./internal/proxy/ -run "TestOptimization" -v
# CPU profile
go test ./internal/proxy/ -bench BenchmarkVLLogsToLokiStreams -cpuprofile=cpu.prof
go tool pprof cpu.prof
# Memory profile
go test ./internal/proxy/ -bench BenchmarkVLLogsToLokiStreams -memprofile=mem.prof
go tool pprof mem.prof
Go Runtime Tuning
GOMEMLIMIT
The Helm chart now injects GOMEMLIMIT at runtime. Resolution order:
- goMemLimit when explicitly set
- otherwise goMemLimitPercent of resources.limits.memory
In percentage mode, the chart computes bytes and exports that numeric value as GOMEMLIMIT.
# Default: 85% of memory limit
goMemLimitPercent: 85 # 256Mi * 85% => GOMEMLIMIT=228170137 (bytes)
# Override with explicit value
goMemLimit: "500MiB" # ignores goMemLimitPercent
Supported memory-limit units for percentage mode are integer quantities with Ki|Mi|Gi|Ti|Pi|Ei|K|M|G|T|P|E. If resources.limits.memory is missing or unsupported, the chart does not inject a computed GOMEMLIMIT.
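The percentage-mode arithmetic can be sketched in Go as below. The chart itself is a Helm template, so this is only an illustration of the calculation: `goMemLimitBytes` is a hypothetical helper, the unit table treats Ki/Mi/Gi/... as powers of 1024 and K/M/G/... as powers of 1000 (the usual Kubernetes quantity convention), and the rounding may differ from what the chart emits.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// units lists the suffixes named above with their assumed multipliers.
var units = []struct {
	suffix string
	mult   int64
}{
	{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30},
	{"Ti", 1 << 40}, {"Pi", 1 << 50}, {"Ei", 1 << 60},
	{"K", 1e3}, {"M", 1e6}, {"G", 1e9},
	{"T", 1e12}, {"P", 1e15}, {"E", 1e18},
}

// goMemLimitBytes returns percent% of an integer memory quantity in bytes.
// It returns false for missing or unsupported quantities, in which case no
// GOMEMLIMIT would be computed. Overflow for very large limits is not
// handled in this sketch.
func goMemLimitBytes(limit string, percent int64) (int64, bool) {
	for _, u := range units {
		if !strings.HasSuffix(limit, u.suffix) {
			continue
		}
		n, err := strconv.ParseInt(strings.TrimSuffix(limit, u.suffix), 10, 64)
		if err != nil || n < 0 {
			return 0, false
		}
		return n * u.mult * percent / 100, true
	}
	return 0, false
}

func main() {
	if b, ok := goMemLimitBytes("256Mi", 85); ok {
		fmt.Printf("GOMEMLIMIT=%d\n", b)
	}
	if _, ok := goMemLimitBytes("1.5Gi", 85); !ok {
		fmt.Println("unsupported quantity: no GOMEMLIMIT computed")
	}
}
```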
GOGC
GOGC=200 is the proxy default (set via -go-gc-percent=200), halving GC frequency at the cost of higher peak RSS versus Go's default of 100. Set -go-gc-percent=100 to restore Go's default, or lower it for tighter memory control.
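For readers unfamiliar with how a GC percentage is applied from inside a Go program, here is a minimal sketch using `runtime/debug.SetGCPercent`. The flag name mirrors the one above; how the proxy actually wires it up is an assumption, and `applyGCPercent` is a hypothetical helper.

```go
package main

import (
	"flag"
	"fmt"
	"runtime/debug"
)

// applyGCPercent sets the GC target percentage and returns the previous
// setting, exactly as debug.SetGCPercent does.
func applyGCPercent(percent int) int {
	return debug.SetGCPercent(percent)
}

func main() {
	gcPercent := flag.Int("go-gc-percent", 200, "GC target percentage applied at startup")
	flag.Parse()

	prev := applyGCPercent(*gcPercent)
	fmt.Printf("GC percent changed from %d to %d\n", prev, *gcPercent)
}
```

A value of 200 lets the heap grow to roughly three times the live set before the next collection instead of twice, which is where the lower GC frequency and higher peak RSS come from.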
Go Runtime Metrics Exposed
The /metrics endpoint exposes Go runtime and GC statistics:
go_memstats_alloc_bytes # current heap allocation
go_memstats_sys_bytes # total OS memory
go_memstats_heap_inuse_bytes # heap in use
go_memstats_heap_idle_bytes # heap idle
go_goroutines # active goroutines
go_gc_cycles_total # completed GC cycles
CI Integration
The bench job in .github/workflows/ci.yaml runs all benchmarks and load tests on every push. It:
- Runs benchmarks 3x for stability
- Runs load tests at all concurrency tiers
- Fails the build if load tests produce errors (regression gate)
- Uploads results as CI artifacts for historical tracking
The next CI step is to add compose-backed e2e cache/fleet smoke runs for pull requests and post-merge main builds, so Tier0 behavior and 3-node peer-cache gains are validated against the full Grafana + proxy + VictoriaLogs stack rather than only unit/load environments.
For TestLoad_HighConcurrency_MemoryStability, the throughput expectation is >10k req/s in local environments and >5k req/s on shared CI runners (CI=true) to reduce race-mode noise while still catching major regressions.