LokiVLProxy Operational Resources
Alerts Covered
- LokiVLProxySystemMetricsMissing
- LokiVLProxySystemMemoryHigh
- LokiVLProxySystemCPUPressureHigh
- LokiVLProxySystemIOPressureHigh
Symptoms
- System metrics disappear from /metrics or dashboards.
- Memory usage ratio remains above 90%.
- PSI cpu or io pressure (60s window) remains elevated for 10+ minutes.
- User-facing query latency and timeout/error rates increase during pressure windows.
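The PSI symptom can be spot-checked directly on a node. The following is a minimal sketch, assuming a Linux kernel with PSI enabled so that /proc/pressure/cpu and /proc/pressure/io exist. The helper name psi_avg60 is hypothetical and not part of the proxy. The kernel reports avg60 as a percentage; this runbook does not state how the proxy scales that value into its ratio metrics, so compare trends rather than absolute numbers.

```shell
# psi_avg60 KIND: read PSI text on stdin and print the avg60 value of the
# line whose first field is KIND ("some" or "full"; defaults to "some").
# Expected kernel line format: some avg10=0.00 avg60=0.00 avg300=0.00 total=0
psi_avg60() {
  awk -v kind="${1:-some}" '
    $1 == kind {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^avg60=/) {
          sub(/^avg60=/, "", $i)   # strip the key, keep the number
          print $i
          exit
        }
      }
    }'
}

# Example usage on a node (not executed here):
#   psi_avg60 some < /proc/pressure/cpu
#   psi_avg60 some < /proc/pressure/io
```

Sampling this a few times over several minutes gives a rough picture of whether pressure is sustained or a short spike.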
Checks
- Confirm proxy health:
curl -fsS http://<proxy>:3100/ready
- Confirm metrics endpoint includes system families:
curl -fsS http://<proxy>:3100/metrics | rg "loki_vl_proxy_process_memory_usage_ratio|loki_vl_proxy_process_cpu_usage_ratio|loki_vl_proxy_process_pressure_|loki_vl_proxy_process_disk_(read|write)_operations_total"
- Inspect startup diagnostics in proxy logs for system-metrics check output.
- Check dashboard section Operational Resources for memory, CPU, PSI, disk IOPS/throughput, and network trends.
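The metrics check above matches if any one of the system families is present, so a partially missing set can go unnoticed. The sketch below reports which families are absent. It assumes the family names are exactly those in the pattern above and that the endpoint serves the Prometheus text format; the helper name missing_system_families is hypothetical. It uses awk rather than rg so it runs where ripgrep is not installed.

```shell
# missing_system_families: read /metrics text on stdin and print every
# expected family prefix that has no sample line.
# Only lines that START with the prefix count, so "# HELP" and "# TYPE"
# comment lines do not mask a family with no samples.
missing_system_families() (
  families="loki_vl_proxy_process_memory_usage_ratio \
loki_vl_proxy_process_cpu_usage_ratio \
loki_vl_proxy_process_pressure_ \
loki_vl_proxy_process_disk_read_operations_total \
loki_vl_proxy_process_disk_write_operations_total"
  awk -v families="$families" '
    BEGIN { n = split(families, fam, " ") }
    {
      for (i = 1; i <= n; i++)
        if (index($0, fam[i]) == 1) seen[i] = 1
    }
    END {
      for (i = 1; i <= n; i++)
        if (!seen[i]) print fam[i]
    }'
)

# Example usage (not executed here):
#   curl -fsS http://<proxy>:3100/metrics | missing_system_families
```

Empty output means every listed family has at least one sample line.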
Kubernetes-Specific Checks
- Ensure pod /proc scope is used for container-level visibility. If host /proc is mounted, metrics reflect host scope:
  - systemMetrics.hostProc.enabled: true
  - chart auto-sets -proc-root=/host/proc
- Verify container has read access to mounted proc path.
- If scraping is disabled (server.register-instrumentation=false), ensure the OTLP pipeline is healthy and metrics are queryable in your backend.
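The read-access check can be scripted from any shell that sees the same mount as the container. This is a sketch under stated assumptions: the default file list (stat, meminfo, pressure/cpu, pressure/io) is a guess at what system metrics typically need, not the proxy's documented read set, and unreadable_proc_files is a hypothetical helper name. Pass an explicit file list to override the default.

```shell
# unreadable_proc_files ROOT [FILE...]: print each ROOT/FILE that the
# current user cannot read. Prints nothing when everything is readable.
unreadable_proc_files() (
  root="${1:?usage: unreadable_proc_files ROOT [FILE...]}"
  shift
  # Assumed default file list; adjust to what your deployment actually reads.
  [ "$#" -gt 0 ] || set -- stat meminfo pressure/cpu pressure/io
  for f in "$@"; do
    [ -r "$root/$f" ] || printf '%s\n' "$root/$f"
  done
)

# Example usage (not executed here):
#   unreadable_proc_files /host/proc
#   unreadable_proc_files /proc stat meminfo
```

A missing pressure file usually points at PSI being unavailable on the node rather than at a permissions problem, so treat the two cases separately.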
Mitigation
- Reduce expensive query pressure:
  - identify top endpoints/tenants/clients from proxy metrics dashboards
  - temporarily tighten query ranges or concurrency
- Scale capacity:
  - increase proxy replicas for CPU-bound contention
  - move to nodes with higher memory/IO capacity when sustained pressure persists
- Validate backend health:
  - correlate with VictoriaLogs latency and error metrics
  - inspect backend disk/network saturation and compare with proxy-side disk/network direction graphs
Recovery Criteria
- process_memory_usage_ratio < 0.85 sustained.
- PSI cpu and io some-ratio values drop below alert thresholds.
- Query latency/error alerts recover and remain stable for at least one alert interval.
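The "sustained" condition in the criteria above can be expressed as a check over several consecutive samples rather than a single reading. The sketch below assumes you have already collected the samples (for example, one per scrape interval from your metrics backend); both helper names are hypothetical. The 0.85 memory threshold comes from the criteria above, while the PSI alert thresholds are not stated in this runbook and must be supplied from your alert rules.

```shell
# below_threshold VALUE THRESHOLD: exit 0 when VALUE is strictly below
# THRESHOLD. awk is used because sh arithmetic is integer-only.
below_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !((v + 0) < (t + 0)) }'
}

# all_below THRESHOLD VALUE...: exit 0 only when at least one sample is
# given and every sample is strictly below THRESHOLD.
all_below() (
  threshold="$1"
  shift
  [ "$#" -gt 0 ] || exit 1
  for v in "$@"; do
    below_threshold "$v" "$threshold" || exit 1
  done
)

# Example usage with memory-ratio samples (not executed here):
#   all_below 0.85 0.81 0.79 0.80 && echo "memory recovered"
```

Requiring every sample to pass avoids declaring recovery on a single dip below the threshold.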