LokiVLProxy Operational Resources

Alerts Covered

System metrics disappear from /metrics or dashboards.
Memory usage ratio remains above 90%.
PSI CPU or I/O pressure (60s window) remains elevated for 10+ minutes.
User-facing query latency and timeout/error rates increase during pressure windows.

Confirm proxy health:
- curl -fsS http://<proxy>:3100/ready
Confirm metrics endpoint includes system families:
- curl -fsS http://<proxy>:3100/metrics | rg "loki_vl_proxy_process_memory_usage_ratio|loki_vl_proxy_process_cpu_usage_ratio|loki_vl_proxy_process_pressure_|loki_vl_proxy_process_disk_(read|write)_operations_total"
Inspect startup diagnostics in proxy logs for system-metrics check output.
Check dashboard section Operational Resources for memory, CPU, PSI, disk IOPS/throughput, and network trends.
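
The metrics-endpoint check above can be scripted so that it reports which families are absent rather than only printing the ones that match. The following is a minimal sketch, not part of the proxy tooling: the function name missing_system_metrics is hypothetical, and the prefixes are taken from the rg pattern above.

```shell
# Report which expected system metric families are absent from a /metrics payload.
# Reads Prometheus text exposition on stdin, prints one line per missing prefix,
# and returns 1 if anything is missing. (Hypothetical helper, not shipped with the proxy.)
missing_system_metrics() {
  payload=$(cat)
  missing=0
  for prefix in \
    loki_vl_proxy_process_memory_usage_ratio \
    loki_vl_proxy_process_cpu_usage_ratio \
    loki_vl_proxy_process_pressure_ \
    loki_vl_proxy_process_disk_read_operations_total \
    loki_vl_proxy_process_disk_write_operations_total
  do
    # Anchor at line start so "# HELP" and "# TYPE" comment lines do not count.
    if ! printf '%s\n' "$payload" | grep -q "^${prefix}"; then
      echo "missing: ${prefix}"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   curl -fsS "http://<proxy>:3100/metrics" | missing_system_metrics
```

An empty output with exit status 0 means every listed family has at least one sample line.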

Use the pod's own /proc when you need container-level visibility. When host /proc is mounted, metrics reflect the whole host instead. Host scope is configured as follows:
- systemMetrics.hostProc.enabled: true
- the chart then sets -proc-root=/host/proc automatically
Verify that the container has read access to the mounted proc path.
If scraping is disabled (server.register-instrumentation=false), ensure the OTLP pipeline is healthy and that metrics are queryable in your backend.
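
Read access to the proc path can be checked from inside the container with a short script. This is a sketch under assumptions: the function name check_proc_root is hypothetical, and the list of files is a guess at what a system-metrics collector typically reads (the PSI files under pressure/ exist only on Linux kernels built with PSI support); it is not taken from the proxy source.

```shell
# Verify that a proc root is readable and exposes the files a system-metrics
# collector usually needs. The file list is an assumption, adjust as required.
# Usage: check_proc_root /host/proc   (defaults to /proc)
check_proc_root() {
  root=${1:-/proc}
  status=0
  for f in stat meminfo pressure/cpu pressure/io pressure/memory; do
    if [ ! -r "${root}/${f}" ]; then
      echo "unreadable: ${root}/${f}"
      status=1
    fi
  done
  return "$status"
}
```

Run it against /host/proc when host scope is enabled and against /proc otherwise. Any "unreadable" line points at a missing mount, a permissions problem, or a kernel without PSI.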

Reduce expensive query pressure:
- identify top endpoints/tenants/clients from proxy metrics dashboards
- temporarily tighten query ranges or concurrency
Scale capacity:
- increase proxy replicas for CPU-bound contention
- move to nodes with higher memory/IO capacity when sustained pressure persists
Validate backend health:
- correlate with VictoriaLogs latency and error metrics
- inspect backend disk/network saturation and compare with proxy-side disk/network direction graphs

loki_vl_proxy_process_memory_usage_ratio stays below 0.85.
PSI CPU and I/O some-ratio values drop below alert thresholds.
Query latency/error alerts recover and remain stable for at least one alert interval.
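
The memory criterion can be spot-checked against the live endpoint. The sketch below assumes the sample lines carry no trailing timestamp, so the last field is the value; the function name memory_ratio_recovered is hypothetical.

```shell
# Return 0 when every memory usage ratio sample on stdin is below the threshold
# (default 0.85). Returns 1 if any sample is at or above it, or if none is found.
memory_ratio_recovered() {
  threshold=${1:-0.85}
  awk -v t="$threshold" '
    /^loki_vl_proxy_process_memory_usage_ratio/ {
      seen = 1
      if ($NF + 0 >= t + 0) bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Usage:
#   curl -fsS "http://<proxy>:3100/metrics" | memory_ratio_recovered 0.85
```

A single passing check only shows the current value. To confirm the ratio is sustained, repeat it over at least one alert interval or rely on the dashboard trend.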