
Operations Guide

Deployment

Minimum Requirements

| Resource | Minimum | Recommended |
|----------|---------|-------------|
| CPU | 50m | 200m |
| Memory | 64Mi | 256Mi |
| Replicas | 1 | 2+ (with PDB) |

The proxy is stateless (except for the optional disk cache); scale it horizontally without coordination.

Current implementation note:

  • per-client rate limiting, global concurrent-query protection, and the backend circuit breaker are built in with fixed defaults today
  • the current CLI does not expose direct tuning flags for those controls
  • use Grafana refresh policy, ingress shaping, HPA, and cache tuning as the primary operator levers

Helm Deployment

helm install loki-vl-proxy oci://ghcr.io/reliablyobserve/charts/loki-vl-proxy \
  --version <release> \
  --set extraArgs.backend=http://victorialogs:9428 \
  --set extraArgs.label-style=underscores

# Local chart (development)
helm install loki-vl-proxy ./charts/loki-vl-proxy \
  --set extraArgs.backend=http://victorialogs:9428 \
  --set extraArgs.label-style=underscores

For multi-replica fleets with HPA, prefer peerCache.enabled=true over static peer lists. The chart creates a headless service and the proxy refreshes DNS-discovered peers automatically, so scaling events do not require manual replica or peer updates.
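A minimal values sketch for this mode; `peerCache.enabled` comes from the text above, and the HPA keys mirror the Horizontal Scaling example later in this guide. Verify the exact structure against the chart's values.yaml.

```yaml
# Hedged sketch: DNS-based peer discovery via the chart's headless
# service, combined with HPA so scaling events need no peer updates.
peerCache:
  enabled: true
horizontalPodAutoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
```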

For Grafana Logs Drilldown pattern discovery, keep the default extraArgs.patterns-enabled=true or set it explicitly during rollout if you need to control the surface area:

extraArgs:
  backend: http://victorialogs:9428
  label-style: underscores
  patterns-enabled: "true"

Required Configuration

| Flag | Required | Description |
|------|----------|-------------|
| -backend | Yes | VictoriaLogs URL |
| -listen | No | Listen address (default :3100) |
| -label-style | No | passthrough (default) or underscores |

Backend Auth Forwarding

If VictoriaLogs authentication is delegated from upstream clients, you can explicitly forward the client's Authorization header to the backend:

-forward-authorization=true

Equivalent manual mode:

-forward-headers=Authorization

Use this only in trusted topologies (for example Grafana/auth-proxy -> Loki-VL-proxy -> VictoriaLogs).
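In Helm terms, the same flag can be passed through extraArgs; this is a sketch following the extraArgs pattern used elsewhere in this guide, not a verified chart snippet.

```yaml
extraArgs:
  backend: http://victorialogs:9428
  forward-authorization: "true"
```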


Operational Assets

Treat these as one versioned operational package:

| Asset | Canonical source | Purpose |
|-------|------------------|---------|
| Grafana operations dashboard | dashboard/loki-vl-proxy.json | Operator view for Client -> Proxy -> VictoriaLogs, route-aware RED signals, cache behavior, long-range query tuning, and operational resources |
| Alert rules | alerting/loki-vl-proxy-prometheusrule.yaml | PrometheusRule/vmalert-oriented alert set with standardized labels and annotations |
| SRE runbooks | docs/runbooks/alerts.md | Index plus per-alert runbook files referenced directly from alert runbook_url |

When using the Helm chart, the runtime templates consume synced copies in charts/loki-vl-proxy/{dashboards,alerting}. Keep canonical and chart copies aligned with:

./scripts/ci/sync_observability_assets.sh sync
./scripts/ci/sync_observability_assets.sh --check

--check is already enforced in CI to prevent drift.


Preventive Scaling And Deployment

Use the dedicated guide for prevention-oriented operations hardening.

Critical defaults to reduce incident frequency:

  • run at least 2 replicas with PDB enabled
  • enable HPA with conservative downscale
  • tune cache TTLs differently for query paths vs metadata paths
  • monitor backend p95 and proxy p99 histograms, not averages
  • add synthetic in-cluster e2e query probes in addition to /ready
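The last point can be sketched as a Kubernetes CronJob that exercises a real query path rather than only /ready. All names, the image, and the target URL below are illustrative; the /loki/api/v1/labels endpoint is the one this guide uses for verification.

```yaml
# Illustrative synthetic e2e probe; adjust service name, namespace,
# schedule, and the probed endpoint to your deployment.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: loki-vl-proxy-e2e-probe
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: probe
              image: curlimages/curl:latest
              args:
                - -fsS
                - "http://loki-vl-proxy:3100/loki/api/v1/labels"
```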

Translation Modes

Translation guidance has moved to the dedicated docs.

Operational recommendation:

  • use label-style=underscores when upstream VL stores dotted OTel fields
  • use metadata-field-mode=hybrid for mixed Loki + OTel field workflows
  • use metadata-field-mode=translated for strict Loki-style field surfaces
  • use metadata-field-mode=native for OTel-native field-only surfaces
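One plausible combination of these recommendations for a mixed Loki + OTel workflow, expressed as Helm extraArgs (flag names are taken from the bullets above; verify them against your release):

```yaml
extraArgs:
  backend: http://victorialogs:9428
  label-style: underscores
  metadata-field-mode: hybrid
```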

Capacity Planning

Memory

| Component | Memory per Unit |
|-----------|-----------------|
| L1 cache (default 10k entries) | ~50MB |
| L2 disk cache (bbolt) | ~10MB mmap overhead |
| Per active query | ~1-5MB (depends on result size) |
| Singleflight coalescing buffer | Up to 256MB per unique query |
| Base process | ~20MB |

Formula: base(20MB) + cache(entries × 5KB) + concurrent_queries × 3MB

For 10k cache entries and 100 concurrent queries: ~370MB recommended limit.
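The sizing formula can be sketched as shell arithmetic; the per-unit costs are the approximations from this section, not measured values.

```shell
# Sizing sketch for: base(20MB) + cache(entries x 5KB) + queries x 3MB
entries=10000        # -cache-max
concurrent=100       # concurrent in-flight queries
base_mb=20
cache_mb=$(( entries * 5 / 1024 ))   # ~5KB per cache entry
query_mb=$(( concurrent * 3 ))       # ~3MB per active query
total_mb=$(( base_mb + cache_mb + query_mb ))
echo "recommended memory limit: ~${total_mb}MB"   # ~368MB, round up to ~370MB
```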

CPU

The proxy is CPU-light. Main costs:

  • JSON marshaling/unmarshaling (~70% of CPU)
  • LogQL→LogsQL translation (~10%)
  • Label translation (~5%)
  • HTTP overhead (~15%)

Guideline: 1 CPU core handles ~2000 req/s.
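Applying the guideline to a target request rate is a ceiling division; the 5000 req/s figure below is only an example.

```shell
# Rough core count for a target request rate, assuming the
# ~2000 req/s per core guideline above.
target_rps=5000
per_core=2000
cores=$(( (target_rps + per_core - 1) / per_core ))   # ceiling division
echo "cores needed: ${cores}"
```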

Disk Cache

L2 disk cache with bbolt:

  • 1 million entries ≈ 2-5GB on disk (gzip compressed)
  • Write amplification: ~2x with bbolt
  • Use fast SSD (NVMe) for the cache volume
  • Set disk-cache-flush-size=500 and disk-cache-flush-interval=10s for batched writes
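Expressed as Helm extraArgs, the batched-write settings above might look like the sketch below. The disk-cache-path value is an assumed example; mount it on the fast SSD volume recommended above.

```yaml
extraArgs:
  disk-cache-path: /cache/loki-vl-proxy.db   # example path, adjust to your volume
  disk-cache-flush-size: "500"
  disk-cache-flush-interval: 10s
```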

Performance Tuning

Cache TTLs

Default TTLs are conservative. Adjust for your query patterns:

-cache-ttl=120s # Increase for stable label sets
-cache-max=50000 # Increase for high-cardinality environments

| Endpoint | Default TTL | Recommendation |
|----------|-------------|----------------|
| labels | 60s | 120-300s if label set is stable |
| label_values | 60s | 60-120s |
| series | 30s | 30-60s |
| detected_fields | 30s | 30-60s |
| query_range | 10s | 5-30s depending on freshness needs |
| query | 10s | 5-30s |

Concurrency Limits

-http-max-header-bytes=1048576 # 1MB default
-http-max-body-bytes=10485760 # 10MB default

The proxy uses singleflight to coalesce identical concurrent queries. N identical requests → 1 backend request.

Built-In Traffic Guards

The current code uses these built-in defaults:

  • per-client rate limit: 50 req/s
  • per-client burst: 100
  • global concurrent backend queries: 100
  • circuit breaker: open after 5 failures, remain open for 10s

These values are not exposed as CLI or Helm flags today. If they are too strict or too loose for your workload, mitigate at the surrounding layers:

  • reduce Grafana auto-refresh and retry pressure
  • add ingress or service-mesh shaping in front of the proxy
  • scale out replicas and raise cache effectiveness before pushing more uncached load through the same pods
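As one example of ingress shaping, ingress-nginx supports per-client rate annotations. This assumes ingress-nginx; other controllers expose equivalent knobs, and the numbers are illustrative only.

```yaml
metadata:
  annotations:
    # ingress-nginx per-client shaping; tune below the proxy's
    # built-in 50 req/s / burst 100 defaults if you want the
    # ingress to trip first.
    nginx.ingress.kubernetes.io/limit-rps: "40"
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "2"
```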

Monitoring

See the dedicated Observability Guide for the full metrics catalog, JSON log schema, OTLP push configuration, and collector/agent integration examples.

Metrics

The proxy exposes Prometheus metrics at /metrics:

Use the Observability Guide as the canonical catalog for:

  • every documented loki_vl_proxy_* metric family
  • cardinality level (Low, Medium, High (capped)) for each family
  • scrape versus OTLP field/label mapping
  • the new fanout and proxy-internal operation metrics/log fields

| Metric | Type | Primary dimensions | Description |
|--------|------|--------------------|-------------|
| loki_vl_proxy_requests_total | counter | system, direction, endpoint, route, status | Total requests by downstream Loki route or upstream backend route |
| loki_vl_proxy_request_duration_seconds | histogram | system, direction, endpoint, route | End-to-end request latency |
| loki_vl_proxy_backend_duration_seconds | histogram | system, direction, endpoint, route | Upstream-only latency for VictoriaLogs and rules/alerts backends |
| loki_vl_proxy_cache_hits_by_endpoint / loki_vl_proxy_cache_misses_by_endpoint | counter | system, direction, endpoint, route | Cache efficiency by normalized route |
| loki_vl_proxy_tenant_requests_total / loki_vl_proxy_client_requests_total | counter | tenant/client plus route dimensions | Hot tenants and clients per route |
| loki_vl_proxy_process_* | gauges/counters | metric family specific | Runtime, CPU, memory, disk, network, and PSI health |

Key Ratios to Monitor

  • Route cache hit ratio: cache_hits_by_endpoint / (cache_hits_by_endpoint + cache_misses_by_endpoint) by endpoint,route — target >80% on stable metadata paths
  • Downstream error rate: requests_total{system="loki",direction="downstream",status=~"5.."} over total downstream requests — target <1%
  • Upstream latency: backend_duration_seconds by endpoint,route — use this to separate VictoriaLogs slowness from proxy-side work
  • End-to-end latency: request_duration_seconds{system="loki",direction="downstream"} by endpoint,route — compare with upstream latency and request logs
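The first two ratios can be sketched as PromQL, using the metric families from the table above; windows and thresholds are illustrative starting points.

```promql
# Route cache hit ratio by endpoint/route (target > 80% on stable metadata paths)
sum by (endpoint, route) (rate(loki_vl_proxy_cache_hits_by_endpoint[5m]))
/
sum by (endpoint, route) (
    rate(loki_vl_proxy_cache_hits_by_endpoint[5m])
  + rate(loki_vl_proxy_cache_misses_by_endpoint[5m])
)

# Downstream 5xx error rate (target < 1%)
sum(rate(loki_vl_proxy_requests_total{system="loki",direction="downstream",status=~"5.."}[5m]))
/
sum(rate(loki_vl_proxy_requests_total{system="loki",direction="downstream"}[5m]))
```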

OTLP Push

Push metrics to an OTLP collector:

-otlp-endpoint=http://otel-collector:4318/v1/metrics
-otlp-interval=30s
-otlp-compression=gzip

The OTLP exporter reuses the same core proxy metric names that /metrics exposes, so dashboards and alert logic can stay aligned across scrape and push modes.

For exact proxy-only overhead on translated paths, use structured request logs with proxy.overhead_ms, proxy.duration_ms, and upstream.duration_ms. The metrics intentionally keep route-aware end-to-end and upstream histograms, while logs carry the per-request decomposition.


Troubleshooting

No Data in Grafana

  1. Check proxy health: curl http://proxy:3100/ready
  2. Check VL backend: curl http://vl:9428/health
  3. Check proxy logs for translation errors
  4. Verify label-style matches your VL ingestion format
  5. Check /loki/api/v1/labels for available labels

If /ready stays non-ok immediately after a restart, also check whether patterns or indexed label-values startup warm-up is configured. Those persistence restores can intentionally hold readiness at 503 until warm-up completes.

Label Names Don't Match

| Symptom | Cause | Fix |
|---------|-------|-----|
| Dots in Grafana labels | label-style=passthrough with dotted VL data | Set label-style=underscores |
| Empty label_values for service_name | VL stores service.name, query asks service_name | Set label-style=underscores |
| Grafana Drilldown "failed to fetch" | Volume/stats endpoint issue | Check proxy logs, ensure VL v1.49+ |

High Memory Usage

  • Reduce -cache-max (default 10000)
  • Reduce -http-max-body-bytes
  • Add memory limits in Kubernetes
  • Check for singleflight amplification (many unique queries)

High Latency

  • Keep -response-compression=gzip for broad Loki/Grafana compatibility; auto now behaves the same on the frontend for legacy configs
  • Set -response-compression-min-bytes around 1024 to avoid wasting CPU on small metadata/control responses
  • Increase cache TTLs
  • Check VL backend latency via metrics
  • Rely on built-in singleflight coalescing for identical concurrent reads

Circuit Breaker Tripping

The circuit breaker opens after consecutive backend 5xx responses. Check:

  • VL backend health and logs
  • Network connectivity between proxy and VL
  • VL resource usage (CPU/memory/disk)

Backup & Recovery

The proxy is stateless. Only the optional disk cache needs backup:

  • L1 cache: In-memory, rebuilds on restart
  • L2 disk cache: bbolt file at -disk-cache-path. Can be deleted safely — will be repopulated.
  • Configuration: All config is CLI flags / env vars. Store in Helm values or ConfigMap.

Scaling

Horizontal Scaling

horizontalPodAutoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Pod Disruption Budget

podDisruptionBudget:
  enabled: true
  minAvailable: 1

Multi-Zone Deployment

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: loki-vl-proxy