Fleet Cache Architecture
Overview
The fleet cache enables multiple Loki-VL-proxy replicas to share cached data with minimal network overhead. Each key lives on exactly one peer (the owner, determined by consistent hashing). Non-owner peers fetch from the owner on local miss and keep short-lived shadow copies. With owner write-through enabled (default), non-owner pods also push eligible long-TTL writes to the owner shard so hot traffic pinned to one pod still warms the full fleet.
Request Flow
Cache Hit (Local): 0 Hops
Cache Hit (Peer): 1 Hop
Cache Miss (VL Fetch)
Non-Owner Miss + Owner Write-Through (default)
TTL Preservation
Shadow copies use the owner's remaining TTL, not a fresh default.
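The TTL rule can be sketched as follows. This is a minimal illustration rather than the proxy's actual code: the function name shadowTTL is hypothetical, and the 5 s floor is taken from the MinUsableTTL value listed under Design Decisions.

```go
package main

import (
	"fmt"
	"time"
)

// minUsableTTL mirrors the MinUsableTTL=5s value mentioned in this document.
const minUsableTTL = 5 * time.Second

// shadowTTL returns the TTL a non-owner gives its shadow copy.
// ownerRemaining is the remaining TTL reported by the owner (hypothetical input).
// ok is false when the entry is too close to expiry to be worth copying.
func shadowTTL(ownerRemaining time.Duration) (ttl time.Duration, ok bool) {
	if ownerRemaining <= minUsableTTL {
		return 0, false // treat as a miss; the caller refreshes instead
	}
	// Preserve, never extend: the shadow expires when the owner's copy does.
	return ownerRemaining, true
}

func main() {
	for _, rem := range []time.Duration{55 * time.Second, 3 * time.Second} {
		ttl, ok := shadowTTL(rem)
		fmt.Printf("owner remaining=%v -> shadow ttl=%v usable=%v\n", rem, ttl, ok)
	}
}
```

Because the shadow never outlives the owner's entry, a non-owner cannot serve data past its original expiry.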
Consistent Hash Ring
Keys map to peers deterministically, so no communication is needed.
150 virtual nodes per peer ensure even distribution:
- 2 peers → ~50/50 split
- 3 peers → ~33/33/33 split
- Adding a peer moves ~1/N keys (minimal rebalancing)
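A consistent hash ring with virtual nodes can be sketched like this. The hash function (SHA-256 truncated to 64 bits), the "peer#i" virtual-node naming, and all type and function names are assumptions made for illustration; the source only states that each peer gets 150 virtual nodes.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

const virtualNodes = 150 // per peer, as described above

type ring struct {
	points []uint64          // sorted virtual-node positions
	owner  map[uint64]string // position -> peer address
}

// hash64 maps a string to a ring position (illustrative choice of hash).
func hash64(s string) uint64 {
	sum := sha256.Sum256([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

func newRing(peers []string) *ring {
	r := &ring{owner: make(map[uint64]string)}
	for _, p := range peers {
		for i := 0; i < virtualNodes; i++ {
			h := hash64(fmt.Sprintf("%s#%d", p, i))
			r.points = append(r.points, h)
			r.owner[h] = p
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// ownerOf returns the peer at the first virtual node clockwise from the key's hash.
func (r *ring) ownerOf(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hash64(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

// keyName generates synthetic keys for the demonstration.
func keyName(i int) string { return fmt.Sprintf("key-%d", i) }

// movedKeys counts how many of n synthetic keys change owner between two rings.
func movedKeys(a, b *ring, n int) int {
	moved := 0
	for i := 0; i < n; i++ {
		if a.ownerOf(keyName(i)) != b.ownerOf(keyName(i)) {
			moved++
		}
	}
	return moved
}

func main() {
	three := newRing([]string{"10.0.0.1:3100", "10.0.0.2:3100", "10.0.0.3:3100"})
	four := newRing([]string{"10.0.0.1:3100", "10.0.0.2:3100", "10.0.0.3:3100", "10.0.0.4:3100"})
	fmt.Println("owner of labels:1h ->", three.ownerOf("labels:1h"))
	fmt.Printf("keys moved after adding a 4th peer: %d of 10000\n", movedKeys(three, four, 10000))
}
```

Every replica that sees the same peer list builds the same ring, which is why ownership needs no coordination. When a peer joins, only the keys it takes over change owner.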
Circuit Breaker
A per-peer circuit breaker prevents cascading failures.
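The source does not list the breaker's thresholds, so the following is only a sketch of the general pattern (open after consecutive failures, allow a probe after a cooldown). The failure threshold of 3, the 10 s cooldown, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// breaker is a minimal circuit breaker; a fleet would keep one per peer address.
// failureThreshold and cooldown are illustrative values, not the proxy's settings.
type breaker struct {
	mu               sync.Mutex
	failureThreshold int
	cooldown         time.Duration
	failures         int
	openedAt         time.Time
}

// allow reports whether a request to the peer may be attempted at time now.
func (b *breaker) allow(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.failures < b.failureThreshold {
		return true // closed: traffic flows normally
	}
	// Open: reject until the cooldown has elapsed, then let a probe through.
	return now.Sub(b.openedAt) >= b.cooldown
}

// record feeds the outcome of a peer call back into the breaker.
func (b *breaker) record(success bool, now time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if success {
		b.failures = 0 // any success closes the breaker
		return
	}
	b.failures++
	if b.failures >= b.failureThreshold {
		b.openedAt = now // (re)open and restart the cooldown
	}
}

// simulate runs a fixed scenario and returns the allow() decision at each step.
func simulate() []bool {
	b := &breaker{failureThreshold: 3, cooldown: 10 * time.Second}
	t0 := time.Unix(0, 0)
	var out []bool
	out = append(out, b.allow(t0)) // closed
	for i := 0; i < 3; i++ {
		b.record(false, t0)
	}
	out = append(out, b.allow(t0.Add(time.Second)))    // open
	out = append(out, b.allow(t0.Add(11*time.Second))) // cooldown elapsed, probe allowed
	b.record(true, t0.Add(11*time.Second))
	out = append(out, b.allow(t0.Add(12*time.Second))) // closed again
	return out
}

func main() {
	fmt.Println(simulate())
}
```

While a peer's breaker is open, callers skip that peer and fall through to the next source instead of waiting on timeouts.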
Peer Discovery
Four discovery modes are supported. The three dynamic modes (dns, srv, http) share the same mechanics: a background goroutine re-runs discovery every DiscoveryInterval (default 15 s) and atomically rebuilds the consistent hash ring. Peers that disappear from the source are removed automatically, and new peers are added as they appear, with no restart required. The static mode parses its list once at startup and never refreshes it.
dns: Kubernetes headless service (A records)
-peer-discovery=dns
-peer-dns=loki-vl-proxy-headless.monitoring.svc.cluster.local
net.LookupHost resolves the headless service name to the IP addresses of the pods behind it. In Kubernetes, a headless service DNS entry only includes pods that pass their readiness probe. Unhealthy pods fall out of DNS within one TTL (typically 5-30 s), so the peer ring automatically excludes them.
Use when:
- Running in Kubernetes with an HPA-managed Deployment or StatefulSet
- You want readiness-probe gating to control peer inclusion automatically
- Pods share a stable DNS subdomain (stable pod IPs are not required)
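A DNS discovery step might look like the sketch below. net.LookupHost and net.JoinHostPort are standard library calls; the function names, the choice to sort the result, and the choice to drop the instance's own address are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// peersFromHosts turns resolved IPs into sorted host:port peers.
// Dropping self here is an assumption; the real proxy may keep it in the ring.
func peersFromHosts(hosts []string, port, self string) []string {
	peers := make([]string, 0, len(hosts))
	for _, h := range hosts {
		addr := net.JoinHostPort(h, port) // also brackets IPv6 addresses
		if addr != self {
			peers = append(peers, addr)
		}
	}
	sort.Strings(peers) // stable order makes ring rebuilds comparable
	return peers
}

// discoverDNS resolves the headless service name once.
// A discovery loop would call this every DiscoveryInterval.
func discoverDNS(name, port, self string) ([]string, error) {
	hosts, err := net.LookupHost(name)
	if err != nil {
		return nil, err
	}
	return peersFromHosts(hosts, port, self), nil
}

func main() {
	// Sample addresses stand in for a real lookup so the example runs anywhere.
	hosts := []string{"10.0.0.2", "10.0.0.1", "10.0.0.3"}
	fmt.Println(peersFromHosts(hosts, "3100", "10.0.0.3:3100"))
}
```

Because A records carry no port, the port has to come from configuration (the -peer-port flag in this mode).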
srv: DNS SRV records
-peer-discovery=srv
-peer-srv=_loki-vl-proxy._tcp.loki-vl-proxy-headless.monitoring.svc.cluster.local
net.LookupSRV resolves a full SRV record (_service._proto.domain) and extracts host+port from each record. SRV records embed the port number, so no -peer-port flag is needed, and each SRV record can point to a different port if required.
In Kubernetes, StatefulSet headless services publish SRV records per pod (e.g., _http._tcp.proxy-headless.ns.svc.cluster.local). These carry the same readiness gating as A records.
Outside Kubernetes, any DNS server that publishes SRV records works:
- HAProxy DNS resolver
- CoreDNS with custom zone
- Consul DNS (_service._tcp.service.consul)
- Manual BIND zone file
The SRV name format is _service._proto.domain. All three segments are required, and the service and protocol segments must start with _.
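The name format and the host+port extraction can be sketched as below. net.LookupSRV takes the service, protocol, and domain as separate arguments and builds the _service._proto.domain query itself; the helper names and the sample records are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strconv"
	"strings"
)

// splitSRVName splits "_service._proto.domain" into the parts net.LookupSRV expects.
func splitSRVName(full string) (service, proto, domain string, err error) {
	parts := strings.SplitN(full, ".", 3)
	if len(parts) != 3 || parts[2] == "" ||
		!strings.HasPrefix(parts[0], "_") || !strings.HasPrefix(parts[1], "_") {
		return "", "", "", errors.New("SRV name must look like _service._proto.domain")
	}
	return strings.TrimPrefix(parts[0], "_"), strings.TrimPrefix(parts[1], "_"), parts[2], nil
}

// peersFromSRV converts SRV answers into host:port strings.
func peersFromSRV(records []*net.SRV) []string {
	peers := make([]string, 0, len(records))
	for _, r := range records {
		host := strings.TrimSuffix(r.Target, ".") // DNS targets end with a dot
		peers = append(peers, net.JoinHostPort(host, strconv.Itoa(int(r.Port))))
	}
	return peers
}

// discoverSRV performs the actual lookup; it is not called in main so the
// example runs without cluster DNS.
func discoverSRV(full string) ([]string, error) {
	service, proto, domain, err := splitSRVName(full)
	if err != nil {
		return nil, err
	}
	_, records, err := net.LookupSRV(service, proto, domain)
	if err != nil {
		return nil, err
	}
	return peersFromSRV(records), nil
}

// sampleRecords stands in for a DNS answer (illustrative hostnames).
func sampleRecords() []*net.SRV {
	return []*net.SRV{
		{Target: "proxy-0.proxy-headless.ns.svc.cluster.local.", Port: 3100},
		{Target: "proxy-1.proxy-headless.ns.svc.cluster.local.", Port: 3101},
	}
}

func main() {
	fmt.Println(splitSRVName("_loki-vl-proxy._tcp.loki-vl-proxy-headless.monitoring.svc.cluster.local"))
	fmt.Println(peersFromSRV(sampleRecords()))
}
```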
http: HTTP JSON endpoint
-peer-discovery=http
-peer-http-url=http://consul:8500/v1/catalog/service/loki-vl-proxy
The proxy fetches the configured URL every DiscoveryInterval. The response body is parsed as JSON and must resolve to a list of host:port peers. Four response formats are supported and auto-detected:
| Format | Example |
|---|---|
| Simple array | ["10.0.0.1:3100","10.0.0.2:3100"] |
| Object with peers key | {"peers":["10.0.0.1:3100"]} |
| Prometheus HTTP SD | [{"targets":["10.0.0.1:3100"],"labels":{}}] |
| Consul catalog | [{"ServiceAddress":"10.0.0.1","ServicePort":3100}] |
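One way to auto-detect the four formats is to try each shape in turn. This is a sketch under the assumption that detection is purely structural; the function name parsePeers and the detection order are not taken from the proxy's source.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net"
	"strconv"
)

// parsePeers accepts any of the four documented response shapes and
// returns a flat list of host:port strings.
func parsePeers(body []byte) ([]string, error) {
	// Object with a "peers" key.
	var obj struct {
		Peers []string `json:"peers"`
	}
	if err := json.Unmarshal(body, &obj); err == nil && obj.Peers != nil {
		return obj.Peers, nil
	}
	// Simple array of strings.
	var simple []string
	if err := json.Unmarshal(body, &simple); err == nil {
		return simple, nil
	}
	// Array of objects: Prometheus HTTP SD groups or Consul catalog entries.
	var items []map[string]json.RawMessage
	if err := json.Unmarshal(body, &items); err != nil {
		return nil, fmt.Errorf("unsupported peer list format: %w", err)
	}
	var peers []string
	for _, item := range items {
		if raw, ok := item["targets"]; ok {
			var targets []string
			if err := json.Unmarshal(raw, &targets); err != nil {
				return nil, err
			}
			peers = append(peers, targets...)
			continue
		}
		var addr string
		var port int
		if json.Unmarshal(item["ServiceAddress"], &addr) != nil ||
			json.Unmarshal(item["ServicePort"], &port) != nil {
			return nil, errors.New("unrecognised peer entry")
		}
		peers = append(peers, net.JoinHostPort(addr, strconv.Itoa(port)))
	}
	return peers, nil
}

func main() {
	for _, body := range []string{
		`["10.0.0.1:3100","10.0.0.2:3100"]`,
		`{"peers":["10.0.0.1:3100"]}`,
		`[{"targets":["10.0.0.1:3100"],"labels":{}}]`,
		`[{"ServiceAddress":"10.0.0.1","ServicePort":3100}]`,
	} {
		peers, err := parsePeers([]byte(body))
		fmt.Println(peers, err)
	}
}
```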
Consul example (includes health check filtering):
-peer-http-url=http://localhost:8500/v1/health/service/loki-vl-proxy?passing=true
Consul's ?passing=true parameter returns only healthy instances, which is equivalent to Kubernetes readiness gating.
Prometheus HTTP SD example (custom endpoint):
-peer-http-url=http://my-registry/sd/loki-vl-proxy
Nomad example:
-peer-http-url=http://nomad:4646/v1/service/loki-vl-proxy
Use when:
- Running outside Kubernetes (VMs, bare metal, Nomad, Docker Swarm)
- Already using Consul or another service registry
- You have a custom service registry or health-checked load balancer
- You want fine-grained control over which instances participate in the peer ring
How add/remove works: The HTTP endpoint is the authoritative source. On each discovery tick the proxy fetches the URL and passes the result to updatePeers(). If an instance disappears from the response (failed health check, deregistered, shut down), it is removed from the hash ring within one DiscoveryInterval. There is no explicit register/deregister on the proxy side; that is handled by whatever manages the registry (Consul agent, health check cron, deployment tooling).
static: Fixed peer list
-peer-discovery=static
-peer-static=10.0.0.1:3100,10.0.0.2:3100,10.0.0.3:3100
The peer list is parsed once at startup and never refreshed. Useful for small, fixed fleets where the topology does not change without a restart. Requires manual update (redeploy) when peers are added or removed.
Peer discovery comparison
| Mode | Readiness gating | Dynamic add/remove | Works outside k8s | Port in config |
|---|---|---|---|---|
| dns | Yes (k8s headless) | Yes (every 15 s) | Partly (requires headless-style DNS) | Yes (-peer-port) |
| srv | Yes (k8s or Consul DNS) | Yes (every 15 s) | Yes | No (embedded in SRV) |
| http | Yes (endpoint controls list) | Yes (every 15 s) | Yes | Yes (in response) |
| static | No | No (restart required) | Yes | Yes (in flag) |
Diagnostic endpoint
GET /_cache/peers returns the current known peer list as JSON:
{"peers":["10.0.0.1:3100","10.0.0.2:3100"],"self":"10.0.0.3:3100","count":2}
This reflects the ring at the moment of the request and is useful for verifying that discovery is working correctly.
AZ-Aware Peer Selection
When a proxy instance needs to fetch a key from a peer (startup warmup, L1 miss, read-ahead), it picks the peer with the highest remaining TTL by default, which maximises cache freshness. With AZ-aware selection enabled, it applies a two-tier preference:
- Same-AZ peers with fresh data: lowest latency and no cross-AZ transfer cost.
- Any peer with fresh data: fallback when no same-AZ peer has the key.
This reduces cross-AZ data-transfer costs and latency in multi-AZ cloud deployments without sacrificing correctness. If no peer has the key, the proxy falls through to VictoriaLogs as usual.
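The two-tier preference can be sketched as a single pass over candidate peers. The candidate struct and pickPeer function are hypothetical; a TTL of zero stands for "peer has no usable copy".

```go
package main

import "fmt"

// candidate describes one peer's answer for a key (hypothetical shape).
type candidate struct {
	Addr  string
	AZ    string
	TTLMs int64 // remaining TTL in ms; 0 means no usable copy
}

// pickPeer prefers the freshest same-AZ peer, then the freshest peer overall.
// An empty selfAZ disables the AZ tier. It returns "" when no peer has the key.
func pickPeer(selfAZ string, cands []candidate) string {
	best, bestSameAZ := -1, -1
	for i, c := range cands {
		if c.TTLMs <= 0 {
			continue
		}
		if best == -1 || c.TTLMs > cands[best].TTLMs {
			best = i
		}
		if selfAZ != "" && c.AZ == selfAZ &&
			(bestSameAZ == -1 || c.TTLMs > cands[bestSameAZ].TTLMs) {
			bestSameAZ = i
		}
	}
	switch {
	case bestSameAZ != -1:
		return cands[bestSameAZ].Addr
	case best != -1:
		return cands[best].Addr
	}
	return "" // caller falls through to VictoriaLogs
}

func main() {
	cands := []candidate{
		{Addr: "10.0.0.1:3100", AZ: "us-east-1a", TTLMs: 20000},
		{Addr: "10.0.0.2:3100", AZ: "us-east-1b", TTLMs: 55000},
	}
	fmt.Println("AZ-aware:", pickPeer("us-east-1a", cands))
	fmt.Println("TTL only:", pickPeer("", cands))
}
```

Note the trade-off: with AZ preference on, a same-AZ peer wins even when a remote peer holds a fresher copy.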
How to configure
Flag:
-peer-self-az=us-east-1a
Set this to the availability zone of the current instance. When empty (the default), AZ preference is disabled and peers are selected purely by TTL freshness.
Helm, explicit AZ:
peerCache:
  enabled: true
  selfAZ: "us-east-1a"
Helm, automatic detection from pod topology label (recommended for Kubernetes):
peerCache:
  enabled: true
  # selfAZ: "" # empty = auto-detect from topologyLabel (default)
  topologyLabel: "topology.kubernetes.io/zone" # default; matches standard k8s node topology label
# Ensure pods carry the topology zone label so the downward API can read it:
podLabels:
  topology.kubernetes.io/zone: "us-east-1a"
The chart injects a PEER_SELF_AZ env var sourced from metadata.labels['topology.kubernetes.io/zone'] via the Kubernetes Downward API and passes it as -peer-self-az=$(PEER_SELF_AZ). If the pod does not carry the label (e.g., it was not set via podLabels or a platform webhook), the env var resolves to "" and AZ preference is silently disabled.
Kubernetes platforms that populate the topology label automatically:
| Platform | How to enable |
|---|---|
| Karpenter | Add the label to NodePool.spec.template.metadata.labels |
| GKE Autopilot | Use cloud.google.com/gke-nodepool or set podLabels in Helm values |
| EKS with Karpenter or managed node groups | NodePool.spec.template.metadata.labels: {topology.kubernetes.io/zone: <zone>} |
| Cluster with a node label syncer / mutating webhook | Labels propagated automatically |
HTTP SD, AZ from discovery labels:
When using -peer-discovery=http with a Prometheus HTTP SD endpoint, the proxy extracts AZ from the target group's labels.az or labels.availability_zone field automatically, so no extra flag is needed:
[
  {
    "targets": ["10.0.0.1:3100", "10.0.0.2:3100"],
    "labels": {"az": "us-east-1a", "env": "prod"}
  },
  {
    "targets": ["10.0.0.3:3100"],
    "labels": {"az": "us-east-1b"}
  }
]
Each target's AZ is stored at discovery refresh time and used during peer selection. No configuration beyond the SD labels is required.
Configuration Examples
# Kubernetes: DNS discovery via headless service (single-AZ or no AZ preference)
./loki-vl-proxy \
  -peer-self=$(hostname -i):3100 \
  -peer-discovery=dns \
  -peer-dns=loki-vl-proxy-headless.monitoring.svc.cluster.local
# Kubernetes: DNS discovery with AZ-aware peer selection
./loki-vl-proxy \
  -peer-self=$(hostname -i):3100 \
  -peer-self-az=us-east-1a \
  -peer-discovery=dns \
  -peer-dns=loki-vl-proxy-headless.monitoring.svc.cluster.local
# Kubernetes: SRV discovery (StatefulSet with headless service)
./loki-vl-proxy \
  -peer-self=$(hostname -i):3100 \
  -peer-discovery=srv \
  -peer-srv=_loki-vl-proxy._tcp.loki-vl-proxy-headless.monitoring.svc.cluster.local
# Consul (health-checked, works outside k8s); the URL is quoted so the shell does not glob the "?"
./loki-vl-proxy \
  -peer-self=$(hostname -i):3100 \
  -peer-discovery=http \
  -peer-http-url='http://localhost:8500/v1/health/service/loki-vl-proxy?passing=true'
# Prometheus HTTP SD with AZ labels (AZ extracted automatically from labels.az)
./loki-vl-proxy \
  -peer-self=$(hostname -i):3100 \
  -peer-self-az=us-east-1a \
  -peer-discovery=http \
  -peer-http-url=http://my-registry/sd/loki-vl-proxy
# Static peer list
./loki-vl-proxy \
  -peer-self=10.0.0.1:3100 \
  -peer-discovery=static \
  -peer-static=10.0.0.1:3100,10.0.0.2:3100,10.0.0.3:3100 \
  -peer-auth-token=shared-secret
Helm Values
# Minimal: chart auto-wires peer-self, peer-discovery, and peer-dns
peerCache:
  enabled: true
# With AZ-aware peer selection: automatic from pod topology label
peerCache:
  enabled: true
  topologyLabel: "topology.kubernetes.io/zone" # default
# Ensure pods carry the label (set by platform or explicitly):
podLabels:
  topology.kubernetes.io/zone: "us-east-1a"
# With AZ-aware peer selection: explicit zone
peerCache:
  enabled: true
  selfAZ: "us-east-1a"
When you use the Helm chart, prefer peerCache.enabled=true and let the chart wire the discovery flags. Use peerCache.authToken or peerCache.existingSecret when you need to provide the shared secret yourself; extraArgs.peer-auth-token is intentionally rejected while peerCache.enabled=true because the chart owns that CLI flag.
Performance Characteristics
| Metric | Value |
|---|---|
| L1 latency | ~2µs |
| L2 latency | ~1ms |
| L3 latency (peer) | ~1-5ms |
| VL latency | ~10-100ms |
| Background traffic | Near zero; only request-path peer fetches and write-through pushes |
| Startup warmup VL queries | ≤W (one per window, regardless of fleet size, with peer-first warmup) |
| /_cache/has response size | ~50 bytes per key (JSON metadata, no values) |
| Max VL calls per key | 1 (per owner) |
| Shadow copy overhead | ~0 (uses owner's remaining TTL) |
| Hash ring lookup | O(log N) |
| Discovery refresh | Every 15s (dns / srv / http modes) |
Peer fetch behavior details:
- larger /_cache/get payloads are compressed when peers send Accept-Encoding, preferring zstd and falling back to gzip
- when -peer-write-through=true, non-owner writes above -peer-write-through-min-ttl are pushed to owners via /_cache/set
- set -peer-auth-token fleet-wide in Kubernetes deployments so peer fetches authenticate by token instead of only by the currently discovered peer IP set
- when -peer-auth-token is set, both peer fetch and peer write-through calls must carry the shared token or endpoints fail closed
Fleet Metrics
The /metrics endpoint exports fleet-specific visibility for peer-cache behavior:
loki_vl_proxy_peer_cache_peers # remote peers, excluding self
loki_vl_proxy_peer_cache_cluster_members # total ring members, including self
loki_vl_proxy_peer_cache_hits_total # successful peer fetches
loki_vl_proxy_peer_cache_misses_total # owner returned miss / near-expiry miss
loki_vl_proxy_peer_cache_errors_total # peer fetch failures
loki_vl_proxy_peer_cache_write_through_pushes_total # successful owner write-through pushes
loki_vl_proxy_peer_cache_write_through_errors_total # failed owner write-through pushes
Use these together with the normal client metrics to tell apart:
- backend pain caused by specific Grafana users or tenants
- cache-ring imbalance or shrinking fleets
- peer-to-peer failures that are forcing traffic back to VictoriaLogs
Collapse Forwarding Status
Current behavior already includes request collapsing in two critical places:
- Proxy -> VictoriaLogs collapse uses singleflight coalescing (internal/middleware/coalescer.go) so concurrent identical requests share one upstream call.
- Peer-cache /_cache/get collapse uses per-key in-flight dedupe (internal/cache/peer.go) so concurrent non-owner pulls for the same key share one owner fetch.
Recent verification coverage:
- TestCoalescer_DedupConcurrentRequests
- TestCoalescer_TenantIsolation
- TestPeerCache_CoalescingAndCacheIntegration
- TestPeerCache_ThreePeers_ShadowCopiesAvoidRepeatedOwnerFetches
Peer payload exchange already prefers zstd, then gzip, then identity.
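The per-key in-flight dedupe described in this section follows the singleflight pattern. The sketch below is a standard-library-only illustration, not the code in internal/cache/peer.go; the group type, its methods, and the demo helper are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// call tracks one in-flight fetch and the callers waiting on it.
type call struct {
	wg      sync.WaitGroup
	val     string
	err     error
	waiters int
}

type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

// do runs fn once per key at a time; concurrent callers share the result.
func (g *group) do(key string, fn func() (string, error)) (string, error) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		c.waiters++
		g.mu.Unlock()
		c.wg.Wait() // join the in-flight fetch
		return c.val, c.err
	}
	c := &call{}
	c.wg.Add(1)
	g.calls[key] = c
	g.mu.Unlock()

	c.val, c.err = fn() // only the first caller reaches the owner

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	c.wg.Done()
	return c.val, c.err
}

// waiting reports how many callers have joined the in-flight fetch for key.
func (g *group) waiting(key string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	if c, ok := g.calls[key]; ok {
		return c.waiters
	}
	return 0
}

// demo starts n concurrent callers for one key and reports how many fetches ran.
func demo(n int) (fetches int64, allSame bool) {
	var g group
	var wg sync.WaitGroup
	started, release := make(chan struct{}), make(chan struct{})
	results := make([]string, n)
	wg.Add(1)
	go func() {
		defer wg.Done()
		results[0], _ = g.do("k", func() (string, error) {
			atomic.AddInt64(&fetches, 1)
			close(started)
			<-release // hold the fetch open until everyone has joined
			return "value", nil
		})
	}()
	<-started
	for i := 1; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i], _ = g.do("k", func() (string, error) {
				atomic.AddInt64(&fetches, 1)
				return "value", nil
			})
		}(i)
	}
	for g.waiting("k") < n-1 {
		runtime.Gosched()
	}
	close(release)
	wg.Wait()
	allSame = true
	for _, r := range results {
		if r != "value" {
			allSame = false
		}
	}
	return atomic.LoadInt64(&fetches), allSame
}

func main() {
	fetches, same := demo(8)
	fmt.Printf("8 concurrent callers, %d owner fetch(es), identical results: %v\n", fetches, same)
}
```

Results are not cached by the group itself; once the fetch completes, the next caller for the same key starts a new fetch unless the cache layer above already holds the value.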
Hot Read-Ahead (Bounded)
Bounded hot read-ahead is implemented and remains disabled by default (-peer-hot-read-ahead-enabled=false).
Runtime behavior:
- Owners expose a compact hot-key index on /_cache/hot (top N keys with score, size, and remaining TTL).
- Peers pull owner hot indexes on a periodic, jittered loop.
- Prefetch selection is bounded and tenant-fair:
  - remaining TTL must be above threshold
  - object size must stay below prefetch object limit
  - selected keys must stay within key budget
  - selected bytes must stay within byte budget
  - first pass enforces per-tenant fairness cap, second pass backfills remaining budget
- Prefetch fetches use existing /_cache/get with Accept-Encoding: zstd, gzip.
- Prefetched values are inserted as local shadow copies (no write-through fanout loops).
- Existing collapse-forwarding stays in place: concurrent pulls for the same key coalesce.
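The bounded, tenant-fair selection described in the list above can be sketched as a two-pass filter. The struct fields, budget names, and the score-descending ordering are assumptions for illustration; the source specifies only the constraints, not the data model.

```go
package main

import (
	"fmt"
	"sort"
)

// hotKey is one entry of an owner's hot-key index (hypothetical shape).
type hotKey struct {
	Key    string
	Tenant string
	Score  float64
	Size   int   // bytes
	TTLMs  int64 // remaining TTL in ms
}

// budget holds the per-interval limits (all names are illustrative).
type budget struct {
	MinTTLMs      int64
	MaxObjectSize int
	MaxKeys       int
	MaxBytes      int
	PerTenantKeys int
}

// selectPrefetch picks keys to prefetch within the budget.
func selectPrefetch(index []hotKey, b budget) []hotKey {
	// Eligibility: enough TTL left and small enough to be worth copying.
	elig := make([]hotKey, 0, len(index))
	for _, k := range index {
		if k.TTLMs > b.MinTTLMs && k.Size < b.MaxObjectSize {
			elig = append(elig, k)
		}
	}
	sort.SliceStable(elig, func(i, j int) bool { return elig[i].Score > elig[j].Score })

	var picked []hotKey
	taken := make([]bool, len(elig))
	perTenant := map[string]int{}
	usedBytes := 0
	fits := func(k hotKey) bool {
		return len(picked) < b.MaxKeys && usedBytes+k.Size <= b.MaxBytes
	}
	// Pass 1: hottest keys first, capped per tenant.
	for i, k := range elig {
		if perTenant[k.Tenant] >= b.PerTenantKeys || !fits(k) {
			continue
		}
		picked = append(picked, k)
		taken[i] = true
		perTenant[k.Tenant]++
		usedBytes += k.Size
	}
	// Pass 2: backfill whatever budget is left, ignoring the tenant cap.
	for i, k := range elig {
		if taken[i] || !fits(k) {
			continue
		}
		picked = append(picked, k)
		taken[i] = true
		usedBytes += k.Size
	}
	return picked
}

func main() {
	index := []hotKey{
		{"a1", "A", 10, 100, 60000}, {"a2", "A", 9, 100, 60000},
		{"a3", "A", 8, 100, 60000}, {"b1", "B", 1, 100, 60000},
	}
	b := budget{MinTTLMs: 5000, MaxObjectSize: 1000, MaxKeys: 3, MaxBytes: 10000, PerTenantKeys: 1}
	for _, k := range selectPrefetch(index, b) {
		fmt.Println(k.Key, k.Tenant)
	}
}
```

The first pass keeps one busy tenant from consuming the whole budget; the second pass makes sure budget is not wasted when few tenants are active.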
Anti-storm controls:
- max concurrency for hot-index and prefetch pulls
- strict per-interval key/byte budgets
- jittered scheduling
- circuit-breaker-aware peer selection
- error-streak backoff before next read-ahead cycle
Read-ahead observability metrics:
loki_vl_proxy_peer_cache_hot_index_requests_total
loki_vl_proxy_peer_cache_hot_index_errors_total
loki_vl_proxy_peer_cache_read_ahead_prefetches_total
loki_vl_proxy_peer_cache_read_ahead_prefetch_bytes_total
loki_vl_proxy_peer_cache_read_ahead_budget_drops_total
loki_vl_proxy_peer_cache_read_ahead_tenant_skips_total
These are additive to existing peer-cache counters and are also used by CI regression guards.
Expected effect:
- Lower VictoriaLogs fetch rate for repeatedly accessed hot keys.
- Better p95/p99 cache hit latency on non-owner replicas.
- More even read pressure across a fleet behind L4/L7 load balancers.
Design Decisions
| Decision | Why |
|---|---|
| Consistent hashing (not gossip) | Zero background traffic, deterministic routing |
| Owner write-through + shadow copies | Preserve owner-centric cache warmth under skewed traffic while keeping non-owner shadows short-lived |
| TTL preservation (not extension) | Never serve stale data beyond original intent |
| MinUsableTTL=5s (force refresh) | Don't transfer data that expires in transit |
| Singleflight per key | Prevent cache stampede on L3 misses |
| Per-peer circuit breaker | Isolate failures, auto-recover after cooldown |
| No disk encryption | Delegated to cloud provider (EBS/PD encryption at rest) |
Startup Coordination and Fleet Restart Safety
The Problem: Thundering Herd
Without coordination, a rolling restart of N proxy instances causes every instance to fire expensive metadata warmup queries to VL simultaneously:
t=0s: instance-1 restarts → stream_field_names [1h] → stream_field_names [6h] → ... (4 queries)
t=0s: instance-2 restarts → stream_field_names [1h] → stream_field_names [6h] → ... (4 queries)
t=0s: instance-3 restarts → ...
...
t=0s: instance-9 restarts → stream_field_names [7d] → (4 queries)
Total: 9 instances × 4 windows = 36 wide-range VL queries in parallel (14-46 s each)
Result: VL OOM / restart
Solution: Three-Layer Startup Defense
Layer 1: Startup Jitter
Controlled by -warmup-max-jitter. Each instance sleeps for a random duration
[0, maxJitter) before starting warmup queries. This staggers the fleet so instances
don't all hit VL simultaneously.
Recommended settings:
| Fleet size | -warmup-max-jitter | Expected VL hits per window |
|---|---|---|
| 2-5 pods | 5s | 1 (first pod only) |
| 6-15 pods | 10s | 1-2 |
| 16-30 pods | 20s | 2-3 |
| 30+ pods | 30s | ≤3 |
A warmup of the 4 standard label windows takes ~2-8 s total. With maxJitter=10s
a pod waking up at t=4s will find fresh data from a pod that woke at t=0s.
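Drawing the jitter is a one-liner; the sketch below assumes a uniform distribution over [0, maxJitter), which matches the description of -warmup-max-jitter but is not confirmed against the proxy's source. The function and constant names are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// exampleMaxJitter is an illustrative value for -warmup-max-jitter.
const exampleMaxJitter = 10 * time.Second

// startupJitter returns a random delay in [0, maxJitter).
// A non-positive maxJitter disables jitter.
func startupJitter(maxJitter time.Duration) time.Duration {
	if maxJitter <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(maxJitter)))
}

func main() {
	d := startupJitter(exampleMaxJitter)
	fmt.Println("would sleep before warmup for:", d)
	// A real startup path would call time.Sleep(d) here before the first warmup query.
}
```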
Layer 2: Batch Peer Discovery (/_cache/has)
After jitter, each instance checks whether any peer already has the data before touching VL. This is a two-phase operation:
Phase 1: Discovery (metadata only, no value transfer).
Phase 2: Targeted fetch (values only from the freshest peer).
Layer 3: Inter-Window Sleep
When an instance must fetch from VL (it's the first one up or peers had nothing), a 500ms pause between each window prevents consecutive wide-range queries from monopolizing VL's query concurrency slots.
Network Traffic Analysis
Per-Restart Request Count
For a fleet of P peers warming W label windows (default W=4):
| Strategy | Requests per instance | Total fleet requests | Data transferred |
|---|---|---|---|
| Old (per-key get) | P × W (worst case) | P² × W | full values × P² × W |
| New (batch has + targeted get) | P + W' (W' ≤ W) | P² + P×W' | tiny JSON × P² + values × P×W' |
Example: 9-pod fleet restarting simultaneously
Example: 30-pod fleet
The peer-first strategy means only the first pod per window needs to hit VL; all subsequent pods pull from that pod. With 4 windows and staggered jitter, the realistic steady state is 4 VL warmup queries total regardless of fleet size.
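Plugging fleet sizes into the table's formulas shows how the two strategies scale. This is plain arithmetic on the stated formulas (old = P² × W, new = P² + P × W'), taking the worst case W' = W; the function name is hypothetical.

```go
package main

import "fmt"

// fleetRequests returns total peer requests for a full-fleet restart under
// both strategies, using the formulas from the table above.
// p is the fleet size, w the number of windows, wFetched the W' value.
func fleetRequests(p, w, wFetched int) (perKeyGet, batched int) {
	return p * p * w, p*p + p*wFetched
}

func main() {
	for _, p := range []int{9, 30} {
		perKeyGet, batched := fleetRequests(p, 4, 4)
		fmt.Printf("P=%d W=4: per-key get=%d requests, batch has + targeted get=%d requests\n",
			p, perKeyGet, batched)
	}
}
```

For 9 pods this gives 324 versus 117 requests, and for 30 pods 3600 versus 1020. The larger saving is in bytes, since the P² term of the new strategy carries only small JSON metadata.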
Timeline: 30-Pod Rolling Restart
Key observation: VL only sees warmup queries from the first 2 instances. All
subsequent instances pull from peers. This is true for any fleet size as long
as maxJitter is larger than the warmup duration (~6-8s for 4 windows).
/_cache/has Endpoint Reference
GET /_cache/has?keys=key1,key2,key3
Batch key-presence check. Returns JSON presence and remaining TTL for each requested key. No value data is transferred, so responses are tiny (~50 bytes per key).
Query parameters:
| Parameter | Description |
|---|---|
| keys | Comma-separated cache keys (max 200) |
Response: 200 OK, Content-Type: application/json:
{
  "labels:start=1716278400000000000&end=1716282000000000000&query=%2A": {
    "ok": true,
    "ttl_ms": 55000
  },
  "labels:start=1716257200000000000&end=1716282000000000000&query=%2A": {
    "ok": false
  }
}
| Field | Description |
|---|---|
| ok | true if key is present and has > MinUsableTTL (5s) remaining |
| ttl_ms | Remaining TTL in milliseconds; only present when ok=true |
Behavior:
- Keys near expiry (remaining < 5s) are reported as ok: false (treat as miss)
- Response can be zstd- or gzip-compressed when Accept-Encoding header is set
- Protected by the same X-Peer-Token authentication as /_cache/get and /_cache/set
Use case (pick the freshest peer before fetching):
caller → each peer: GET /_cache/has?keys=k1,k2,k3,k4 (metadata, ~200 bytes/peer)
caller ← each peer: {k1: {ok:true, ttl_ms:55000}, k2: {ok:false}, ...}
caller selects peer with highest ttl_ms per key
caller → best peer: GET /_cache/get?key=k1 (value fetch, only if needed)
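The selection step in this flow can be sketched by decoding each peer's /_cache/has body and keeping the highest ttl_ms per key. The JSON field names come from the response reference above; the function name and the map-based input are hypothetical, and the HTTP calls themselves are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hasEntry mirrors one value in the /_cache/has response.
type hasEntry struct {
	OK    bool  `json:"ok"`
	TTLMs int64 `json:"ttl_ms"`
}

// freshestPeers takes raw /_cache/has bodies keyed by peer address and
// returns, for each key, the peer reporting the highest remaining TTL.
// Keys that no peer reports as ok are absent from the result.
func freshestPeers(responses map[string][]byte) (map[string]string, error) {
	best := map[string]string{}
	bestTTL := map[string]int64{}
	for peer, body := range responses {
		var entries map[string]hasEntry
		if err := json.Unmarshal(body, &entries); err != nil {
			return nil, fmt.Errorf("peer %s: %w", peer, err)
		}
		for key, e := range entries {
			if e.OK && e.TTLMs > bestTTL[key] {
				best[key] = peer
				bestTTL[key] = e.TTLMs
			}
		}
	}
	return best, nil
}

func main() {
	best, err := freshestPeers(map[string][]byte{
		"10.0.0.1:3100": []byte(`{"k1":{"ok":true,"ttl_ms":55000},"k2":{"ok":false}}`),
		"10.0.0.2:3100": []byte(`{"k1":{"ok":true,"ttl_ms":20000},"k2":{"ok":false}}`),
	})
	fmt.Println(best, err)
}
```

Keys missing from the result are the ones the caller still has to fetch from VictoriaLogs.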
Peer Endpoint Summary
| Endpoint | Method | Purpose | Body transferred |
|---|---|---|---|
| /_cache/get?key=K | GET | Fetch value for one key | Full value (compressed) |
| /_cache/set?key=K&ttl_ms=T | POST | Push a value to a peer (write-through) | Full value |
| /_cache/has?keys=k1,k2,... | GET | Batch presence + TTL check | JSON metadata only (~50B/key) |
| /_cache/hot?limit=N | GET | Top N hot keys with scores and TTL | JSON index (no values) |
All endpoints respect X-Peer-Token when -peer-auth-token is configured.
Responses ≥1 KB are offered compressed (zstd preferred, gzip fallback).
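The encoding preference can be sketched as a small negotiation helper. The 1 KB threshold and the zstd-then-gzip order come from the text above; the function name is hypothetical, and q-values in Accept-Encoding are ignored here for brevity, which a production implementation would likely handle.

```go
package main

import (
	"fmt"
	"strings"
)

// minCompressBytes reflects the 1 KB threshold stated above.
const minCompressBytes = 1024

// chooseEncoding picks the response encoding for a payload of the given size:
// zstd if offered, else gzip, else identity. Small payloads are never compressed.
func chooseEncoding(acceptEncoding string, size int) string {
	if size < minCompressBytes {
		return "identity"
	}
	offered := map[string]bool{}
	for _, part := range strings.Split(acceptEncoding, ",") {
		// Strip parameters such as ";q=0.8" and normalise case.
		name := strings.TrimSpace(strings.SplitN(part, ";", 2)[0])
		offered[strings.ToLower(name)] = true
	}
	switch {
	case offered["zstd"]:
		return "zstd"
	case offered["gzip"]:
		return "gzip"
	}
	return "identity"
}

func main() {
	fmt.Println(chooseEncoding("zstd, gzip", 4096))
	fmt.Println(chooseEncoding("gzip", 4096))
	fmt.Println(chooseEncoding("zstd, gzip", 200))
}
```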
Large-Fleet Configuration Reference
Kubernetes (30+ pods)
# values.yaml
extraArgs:
  peer-self: "$(POD_IP):3100"
  peer-discovery: "dns"
  peer-dns: "loki-vl-proxy-headless.monitoring.svc.cluster.local"
  peer-auth-token: "$(PEER_AUTH_TOKEN)" # from Secret
  warmup-max-jitter: "20s" # spread 30 pods over 20s window
# Headless service for peer discovery
# (chart creates this automatically when peerCache.enabled=true)
Jitter Sizing Formula
recommended_jitter = max(single_warmup_duration × 1.5, 5s)
single_warmup_duration ≈ 4 windows × (avg_VL_latency + 500ms_inter_window_sleep)
≈ 4 × (2s + 0.5s) = 10s (typical)
recommended_jitter ≈ 10s × 1.5 = 15s
For large fleets (30+ pods) add extra buffer: recommended_jitter = 20-30s.
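The sizing formula translates directly into code. The constants are the 500 ms inter-window sleep and 5 s floor from the formula above; the function name is hypothetical, and the extra large-fleet buffer is left to the operator.

```go
package main

import (
	"fmt"
	"time"
)

const (
	interWindowSleep = 500 * time.Millisecond // Layer 3 pause between windows
	minJitter        = 5 * time.Second        // floor from the formula above
)

// recommendedJitter applies max(single_warmup_duration * 1.5, 5s), where
// single_warmup_duration = windows * (avgVLLatency + interWindowSleep).
func recommendedJitter(windows int, avgVLLatency time.Duration) time.Duration {
	warmup := time.Duration(windows) * (avgVLLatency + interWindowSleep)
	jitter := warmup * 3 / 2
	if jitter < minJitter {
		return minJitter
	}
	return jitter
}

func main() {
	fmt.Println("4 windows, 2s VL latency:", recommendedJitter(4, 2*time.Second))
	fmt.Println("4 windows, 200ms VL latency:", recommendedJitter(4, 200*time.Millisecond))
}
```

With the typical values from the formula (4 windows, 2 s average VL latency) this yields 15 s; very fast backends bottom out at the 5 s floor.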
Expected Steady-State VL Load
| Fleet size | maxJitter | VL warmup queries per full restart |
|---|---|---|
| 3 pods | 5s | ≤4 (1 per window) |
| 9 pods | 10s | ≤4 |
| 20 pods | 15s | 4-8 |
| 30 pods | 20s | 4-8 |
| 50 pods | 30s | ≤8 |
The theoretical minimum is W (one VL query per label window, regardless of fleet size) because the peer-first strategy means only the first instance per window touches VL. In practice, 1-2 additional instances may overlap before the first completes, giving ≤2W queries for the most contended windows.