Observability Guide
Loki-VL-proxy emits the same core operational signals in two forms:
| Signal | Transport | Format | Best use |
|---|---|---|---|
| Metrics | Pull | Prometheus text at /metrics | Prometheus, Grafana Agent, Alloy, VictoriaMetrics, kube scraping |
| Metrics | Push | OTLP HTTP JSON to /v1/metrics | OpenTelemetry Collector, vendor OTLP gateways |
| Logs | Stream | Structured JSON on stdout/stderr | Fluent Bit, Vector, OpenTelemetry Collector, Docker/Kubernetes log agents |
The intent is parity, not two separate products. Prometheus scrape and OTLP push carry the same proxy-centric metric families, units, and low-cardinality request dimensions for the important operational paths. Prometheus uses label keys such as system, direction, endpoint, route, and status; OTLP exports the same dimensions as semantically aligned attributes such as loki.api.system, proxy.direction, loki.request.type, http.route, and http.response.status_code.
That shared model is what makes the packaged dashboard portable across scrape-backed and OTLP-backed setups without rewriting the operator view:
- request volume and latency
- backend latency
- cache hit/miss behavior
- translation failures
- tenant and client hot spots
- client status and query-length outliers
- process/runtime and host-level health
The proxy also now keeps more expensive metadata paths deliberately warmer than live log queries:
- live query and tail paths stay on short TTLs so fresh log visibility is not stretched unnecessarily
- slower-changing metadata such as labels, field lists, field values, and patterns are cached more aggressively
- Drilldown prefers backend-native metadata discovery where it is safe, which reduces proxy-side rescans and lowers CPU pressure on repeated field/label browsing
Observability Endpoints
| Endpoint | Purpose |
|---|---|
GET /ready | Readiness probe (checks backend /health and circuit-breaker state) |
GET /metrics | Prometheus text exposition (-server.register-instrumentation, bounded by -server.metrics-max-concurrency) |
GET /debug/queries | Query analytics endpoint (disabled by default, -server.enable-query-analytics) |
GET /debug/pprof/ | Go pprof profiling endpoints (disabled by default, -server.enable-pprof) |
Logs
JSON Log Shape
Default logs are emitted as JSON and already use OTel-friendly top-level keys:
{
"timestamp": "2026-04-05T18:03:27.214918Z",
"severity": {
"text": "INFO",
"number": 9
},
"body": "request",
"component": "proxy",
"http.route": "/loki/api/v1/query_range",
"url.path": "/loki/api/v1/query_range",
"http.request.method": "GET",
"http.response.status_code": 200,
"loki.request.type": "query_range",
"loki.api.system": "loki",
"proxy.direction": "downstream",
"event.duration": 42000000,
"loki.tenant.id": "team-a",
"loki.query": "{service_name=\"api\"} |= \"error\"",
"client.address": "10.0.0.12:51884",
"enduser.id": "grafana-user@example.com",
"enduser.source": "grafana_user",
"cache.result": "miss",
"proxy.duration_ms": 42,
"upstream.calls": 1,
"upstream.calls_by_type": {
"vl:select_logsql_query": 1
},
"upstream.status_code": 200,
"upstream.duration_ms": 31,
"upstream.duration_ms_by_type": {
"vl:select_logsql_query": 31
},
"proxy.operations_by_type": {
"translate_query:translated": 1
},
"proxy.operation_duration_ms_by_type": {
"translate_query:translated": 4
},
"proxy.overhead_ms": 11
}
That makes the log stream usable in two ways:
- plain JSON ingestion with no transformation
- low-friction mapping into the OpenTelemetry log data model
Log Sources
The proxy writes structured logs for:
- request lifecycle and status
- query translation and backend request flow
- tail/WebSocket behavior
- delete audit events
- cache warmer and disk cache internals
- OTLP export failures
OpenTelemetry Fields Used in Logs
| Field | Meaning |
|---|---|
timestamp | event time |
severity.text / severity.number | log severity |
body | message body |
component | internal subsystem (proxy, disk_cache, cache_warmer, otlp_metrics) |
http.* / url.path | request semantics and normalized route vs actual request path |
http.parent_route | parent downstream route template on upstream child-call logs |
event.duration | request or upstream call duration in nanoseconds |
client.address | remote address |
enduser.id | stable trusted user/client identity when available |
enduser.name | display/login user name from trusted user headers when available |
enduser.source | trusted header source for end-user attribution (grafana_user, forwarded_user, etc.) |
auth.* | datasource/auth principal context (separate from enduser.id) |
cache.result | compatibility cache result (hit, miss, bypass) |
proxy.* | proxy-facing convenience fields such as total request duration and measured proxy overhead |
upstream.* | backend call count, status, and latency |
loki.* | Loki/proxy-specific attributes |
Additional request-scope aggregate fields used for fanout visibility:
| Field | Meaning |
|---|---|
loki.parent_request.type | parent downstream request type on upstream child-call logs |
upstream.calls_by_type | per-parent aggregate map keyed by <system>:<request_type> |
upstream.duration_ms_by_type | per-parent aggregate latency map keyed by <system>:<request_type> |
proxy.operations_by_type | per-parent aggregate map keyed by <operation>:<outcome> for proxy-only work |
proxy.operation_duration_ms_by_type | per-parent aggregate latency map keyed by <operation>:<outcome> |
These aggregate map keys are intentionally bounded by route templates and hardcoded operation/outcome enums. They are log fields, not metric labels.
Metrics
Export Modes
Prometheus Scrape
scrape_configs:
- job_name: loki-vl-proxy
scrape_interval: 15s
static_configs:
- targets:
- loki-vl-proxy:3100
OTLP Push
./loki-vl-proxy \
-backend=http://victorialogs:9428 \
-otlp-endpoint=http://otel-collector:4318/v1/metrics \
-otlp-interval=30s \
-otlp-compression=gzip \
-otlp-headers='Authorization=Bearer example-token'
If the OTLP endpoint is passed as a collector base URL like http://collector:4318 or http://collector:4318/v1, the proxy normalizes it to /v1/metrics.
OpenTelemetry Resource Attributes for Metrics and Logs
These flags shape OTLP metric resource attributes. Structured logs intentionally do
not duplicate resource attributes per line; keep service identity in collector/OTLP
resource metadata to avoid message.service.* duplication in storage.
| Flag | Meaning |
|---|---|
-otel-service-name | service.name |
-otel-service-namespace | service.namespace |
-otel-service-instance-id | service.instance.id |
-deployment-environment | deployment.environment.name |
Request Dimensions
Request-oriented metrics use stable low-cardinality dimensions so dashboards can slice by user-visible API shape without leaking raw paths or query content.
| Dimension | Prometheus scrape | OTLP push | Example |
|---|---|---|---|
| API system | system | loki.api.system | loki, vl |
| Direction | direction | proxy.direction | downstream, upstream |
| Request type | endpoint | loki.request.type | query_range, labels, patterns |
| Route template | route | http.route | /loki/api/v1/query_range, /select/logsql/query |
| Final status | status | http.response.status_code | 200, 429, 500 |
Downstream routes are the normalized Loki API templates registered by the proxy. Upstream routes are the stable VictoriaLogs or rules/alerts backend path templates used by the proxy itself. Raw request paths and query strings stay in logs, not in metric labels.
Tenant and client metric families are the only intentionally high-cardinality families, and even those are bounded with -metrics.max-tenants and -metrics.max-clients; excess identities collapse to __overflow__.
Histogram helper series (_bucket, _sum, _count) inherit the same label set and cardinality as the parent metric family.
Cardinality Levels
| Level | Meaning |
|---|---|
Low | no labels or only fixed route templates / small enums (status, direction, mode, reason) |
Medium | bounded internal enums that may grow slowly with feature surface but not with traffic shape |
High (capped) | user or tenant identity dimensions; bounded by -metrics.max-tenants / -metrics.max-clients with __overflow__ fallback |
Core Proxy Metrics
All rows below are exposed through Prometheus scrape and OTLP push unless noted otherwise.
| Metric | Type | Labels | Cardinality | Description |
|---|---|---|---|---|
loki_vl_proxy_requests_total | counter | system, direction, endpoint, route, status | Low | all proxied requests, sliced by downstream Loki path or upstream backend path |
loki_vl_proxy_request_duration_seconds | histogram | system, direction, endpoint, route | Low | end-to-end request latency |
loki_vl_proxy_backend_duration_seconds | histogram | system, direction, endpoint, route | Low | upstream backend latency only (system="vl", direction="upstream") |
loki_vl_proxy_upstream_calls_per_request | histogram | system, direction, endpoint, route | Low | number of upstream child requests fanned out under a single downstream request |
loki_vl_proxy_cache_hits_total | counter | none | Low | global cache hits |
loki_vl_proxy_cache_misses_total | counter | none | Low | global cache misses |
loki_vl_proxy_cache_hits_by_endpoint | counter | system, direction, endpoint, route | Low | cache hits per normalized route |
loki_vl_proxy_cache_misses_by_endpoint | counter | system, direction, endpoint, route | Low | cache misses per normalized route |
loki_vl_proxy_translations_total | counter | none | Low | successful LogQL to LogsQL translations |
loki_vl_proxy_translation_errors_total | counter | none | Low | failed translations |
loki_vl_proxy_internal_operation_total | counter | operation, outcome | Medium | proxy-only work such as translation, parser preference, and response-label rewrites |
loki_vl_proxy_internal_operation_duration_seconds | histogram | operation, outcome | Medium | latency spent in proxy-only work not covered by backend timings |
loki_vl_proxy_coalesced_total | counter | none | Low | requests served from coalesced results |
loki_vl_proxy_coalesced_saved_total | counter | none | Low | backend requests saved by coalescing |
loki_vl_proxy_response_tuple_mode_total | counter | mode | Low | emitted log tuple contract mode by client behavior (Prometheus scrape only today) |
loki_vl_proxy_uptime_seconds | gauge | none | Low | process uptime |
loki_vl_proxy_active_requests | gauge | none | Low | current in-flight requests |
loki_vl_proxy_circuit_breaker_state | gauge | none | Low | 0=closed, 1=open, 2=half-open |
loki_vl_proxy_http_connections | gauge | state | Low | current downstream HTTP server connections by state |
loki_vl_proxy_http_connection_transitions_total | counter | state | Low | downstream HTTP server connection state transitions |
loki_vl_proxy_http_connection_rotations_total | counter | reason | Low | downstream HTTP/1.x connection rotations triggered by the proxy |
Operational notes for these hot paths:
query_rangeandlabelsbenchmarks in CI track both cache-hit and cache-bypass behavior- multi-tenant read fanout and merged response bodies are capped to keep a single request from exhausting proxy memory
- synthetic tail keeps bounded dedup state so long-running websocket sessions do not grow without limit
Query-Range Windowing Metrics
These are the primary signals for long-range query performance and backend protection:
| Metric | Type | Labels | Cardinality | Description |
|---|---|---|---|---|
loki_vl_proxy_window_cache_hit_total | counter | none | Low | cached split windows served without backend scan |
loki_vl_proxy_window_cache_miss_total | counter | none | Low | split windows requiring backend scan |
loki_vl_proxy_window_fetch_seconds | histogram | none | Low | backend fetch duration per split window |
loki_vl_proxy_window_merge_seconds | histogram | none | Low | merge duration for split-window responses |
loki_vl_proxy_window_count | histogram | none | Low | split windows per query_range request |
loki_vl_proxy_window_prefilter_attempt_total | counter | none | Low | prefilter runs against /select/logsql/hits |
loki_vl_proxy_window_prefilter_error_total | counter | none | Low | prefilter failures (proxy safely falls back to full window fanout) |
loki_vl_proxy_window_prefilter_kept_total | counter | none | Low | split windows retained for real log fanout |
loki_vl_proxy_window_prefilter_skipped_total | counter | none | Low | split windows skipped as empty by prefilter |
loki_vl_proxy_window_prefilter_hit_ratio | gauge | none | Low | current prefilter kept/total ratio (0-1) |
loki_vl_proxy_window_retry_total | counter | none | Low | per-window retry attempts after retryable backend failures |
loki_vl_proxy_window_degraded_batch_total | counter | none | Low | batches that were downgraded to lower parallelism |
loki_vl_proxy_window_partial_response_total | counter | none | Low | partial query-range responses returned when slow windows exceed budget |
loki_vl_proxy_window_prefilter_duration_seconds | histogram | none | Low | prefilter latency |
loki_vl_proxy_window_adaptive_parallel_current | gauge | none | Low | current adaptive split-window parallelism |
loki_vl_proxy_window_adaptive_latency_ewma_seconds | gauge | none | Low | adaptive EWMA latency |
loki_vl_proxy_window_adaptive_error_ewma | gauge | none | Low | adaptive EWMA backend error ratio |
Patterns Snapshot Metrics
These metrics track the proxy-side pattern cache and snapshot lifecycle.
| Metric | Type | Labels | Cardinality | Description |
|---|---|---|---|---|
loki_vl_proxy_patterns_detected_total | counter | none | Low | unique patterns detected from pattern mining |
loki_vl_proxy_patterns_stored_total | counter | none | Low | pattern entries stored in proxy cache or snapshot updates |
loki_vl_proxy_patterns_restored_from_disk_total | counter | none | Low | pattern entries restored from on-disk snapshots |
loki_vl_proxy_patterns_restored_from_peers_total | counter | none | Low | pattern entries restored from peer snapshots |
loki_vl_proxy_patterns_restored_disk_entries_total | counter | none | Low | snapshot cache keys restored from disk |
loki_vl_proxy_patterns_restored_peer_entries_total | counter | none | Low | snapshot cache keys restored from peers |
loki_vl_proxy_patterns_deduplicated_total | counter | source | Low | duplicate pattern snapshot entries removed by source (mem, disk, peer) |
loki_vl_proxy_patterns_in_memory | gauge | none | Low | current number of patterns held in in-memory snapshot state |
loki_vl_proxy_patterns_cache_keys | gauge | none | Low | current number of pattern cache keys held in memory |
loki_vl_proxy_patterns_in_memory_bytes | gauge | none | Low | current bytes used by in-memory pattern snapshot payloads |
loki_vl_proxy_patterns_last_response_patterns | gauge | none | Low | pattern entries returned in the most recent /patterns response |
loki_vl_proxy_patterns_last_response_bytes | gauge | none | Low | encoded size of the most recent /patterns response |
loki_vl_proxy_patterns_persisted_disk_entries | gauge | none | Low | snapshot cache keys present in the last persisted disk snapshot |
loki_vl_proxy_patterns_persisted_disk_patterns | gauge | none | Low | pattern entries present in the last persisted disk snapshot |
loki_vl_proxy_patterns_persisted_disk_bytes | gauge | none | Low | last persisted pattern snapshot size on disk |
loki_vl_proxy_patterns_persist_writes_total | counter | none | Low | completed pattern snapshot writes to disk |
loki_vl_proxy_patterns_persist_write_bytes_total | counter | none | Low | cumulative bytes written by pattern snapshot persistence |
loki_vl_proxy_patterns_restored_disk_bytes_total | counter | none | Low | cumulative bytes restored from on-disk pattern snapshots |
loki_vl_proxy_patterns_restored_peer_bytes_total | counter | none | Low | cumulative bytes restored from peer snapshot warmup |
loki_vl_proxy_patterns_source_lines_requested_total | counter | none | Low | source lines requested from backend pattern fetches |
loki_vl_proxy_patterns_source_lines_scanned_total | counter | none | Low | source lines scanned from backend responses |
loki_vl_proxy_patterns_source_lines_observed_total | counter | none | Low | source lines accepted into the pattern miner |
loki_vl_proxy_patterns_windows_attempted_total | counter | none | Low | pattern fetch windows attempted |
loki_vl_proxy_patterns_windows_accepted_total | counter | none | Low | pattern fetch windows accepted into the merged response |
loki_vl_proxy_patterns_windows_capped_total | counter | none | Low | pattern fetch windows that hit the per-window source line cap |
loki_vl_proxy_patterns_second_pass_windows_total | counter | none | Low | pattern fetch windows retried with a higher line limit |
loki_vl_proxy_patterns_mined_pre_merge_total | counter | none | Low | pattern entries mined before cross-window merge |
loki_vl_proxy_patterns_mined_post_merge_total | counter | none | Low | pattern entries after cross-window merge |
loki_vl_proxy_patterns_snapshot_hits_total | counter | none | Low | pattern snapshot fallback lookups that found cached payloads |
loki_vl_proxy_patterns_snapshot_misses_total | counter | none | Low | pattern snapshot fallback lookups that missed |
loki_vl_proxy_patterns_snapshot_reused_total | counter | none | Low | cached snapshot payloads actually reused in /patterns responses |
loki_vl_proxy_patterns_low_coverage_responses_total | counter | none | Low | responses flagged as likely degraded by capped or incomplete mining coverage |
Peer Cache Metrics
These families are currently exposed on Prometheus scrape at /metrics.
| Metric | Type | Labels | Cardinality | Description |
|---|---|---|---|---|
loki_vl_proxy_peer_cache_peers | gauge | none | Low | remote peers currently in the fleet-cache ring |
loki_vl_proxy_peer_cache_cluster_members | gauge | none | Low | total fleet-cache ring members including self |
loki_vl_proxy_peer_cache_hits_total | counter | none | Low | successful peer-cache fetches |
loki_vl_proxy_peer_cache_misses_total | counter | none | Low | peer-cache lookups that missed on the owner |
loki_vl_proxy_peer_cache_errors_total | counter | none | Low | peer-cache fetch errors |
loki_vl_proxy_peer_cache_write_through_pushes_total | counter | none | Low | successful owner write-through pushes from non-owner peers |
loki_vl_proxy_peer_cache_write_through_errors_total | counter | none | Low | owner write-through push errors |
loki_vl_proxy_peer_cache_hot_index_requests_total | counter | none | Low | peer hot-index requests |
loki_vl_proxy_peer_cache_hot_index_errors_total | counter | none | Low | peer hot-index request errors |
loki_vl_proxy_peer_cache_read_ahead_prefetches_total | counter | none | Low | successful hot read-ahead prefetches |
loki_vl_proxy_peer_cache_read_ahead_prefetch_bytes_total | counter | none | Low | bytes prefetched by hot read-ahead |
loki_vl_proxy_peer_cache_read_ahead_budget_drops_total | counter | none | Low | hot read-ahead candidates dropped by budget or size filters |
loki_vl_proxy_peer_cache_read_ahead_tenant_skips_total | counter | none | Low | hot read-ahead candidates skipped by tenant fairness |
Tenant and Client Metrics
These are the metrics to use when you want to identify the users or tenants actually causing backend load.
| Metric | Type | Labels | Cardinality | Description |
|---|---|---|---|---|
loki_vl_proxy_tenant_requests_total | counter | system, direction, tenant, endpoint, route, status | High (capped) | request volume by tenant |
loki_vl_proxy_tenant_request_duration_seconds | histogram | system, direction, tenant, endpoint, route | High (capped) | latency by tenant |
loki_vl_proxy_client_requests_total | counter | system, direction, client, endpoint, route | High (capped) | request volume by client identity |
loki_vl_proxy_client_response_bytes_total | counter | client | High (capped) | response bytes by client |
loki_vl_proxy_client_status_total | counter | system, direction, client, endpoint, route, status | High (capped) | final status breakdown by client |
loki_vl_proxy_client_inflight_requests | gauge | client | High (capped) | current parallelism by client |
loki_vl_proxy_client_request_duration_seconds | histogram | system, direction, client, endpoint, route | High (capped) | request latency by client |
loki_vl_proxy_client_query_length_chars | histogram | system, direction, client, endpoint, route | High (capped) | query size outliers by client |
loki_vl_proxy_client_errors_total | counter | system, direction, endpoint, route, reason | Low | categorized downstream client errors |
This is one of the main advantages of putting an explicit proxy between the Grafana Loki datasource and VictoriaLogs: the read path becomes attributable.
Instead of only seeing aggregate datasource traffic, operators can see:
- which Grafana user or trusted client identity is generating load
- which tenant is hot
- which route is expensive for that client or tenant
- which client is producing the largest responses, longest queries, or most bad requests
Grafana Client Visibility, Offenders, and User Patterns
When -metrics.trust-proxy-headers=true is enabled behind a trusted Grafana or
auth proxy, the proxy can turn northbound identity into durable read-path
signals without using raw datasource credentials as the end-user key.
That gives you:
- per-client request rate by route via
loki_vl_proxy_client_requests_total - per-client latency by route via
loki_vl_proxy_client_request_duration_seconds - per-client response-volume visibility via
loki_vl_proxy_client_response_bytes_total - per-client query-size outlier visibility via
loki_vl_proxy_client_query_length_chars - per-client bad-request and error clustering via
loki_vl_proxy_client_status_totalandloki_vl_proxy_client_errors_total - per-tenant volume and latency visibility via
loki_vl_proxy_tenant_*
Those tenant and client identity series are opt-in. Set -metrics.export-sensitive-labels=true only on trusted scrape or OTLP paths where exposing identity labels is acceptable.
At log level, the same request can also carry:
enduser.idenduser.nameenduser.sourceauth.principalauth.sourceloki.tenant.idhttp.route
Per-request by-type breakdown maps such as upstream.calls_by_type and
proxy.operations_by_type are emitted only at debug level. The default info-level
request logs keep aggregate counts while the detailed per-type visibility stays in
Prometheus/OTLP metrics. This avoids log-body field explosion in pipelines that
flatten structured JSON bodies into discoverable message.* fields.
That separation matters:
enduser.*answers "which Grafana user or trusted client triggered this?"auth.*answers "which datasource or auth principal was used on the request path?"loki.tenant.idanswers "which tenant boundary did the request execute in?"
This is what makes offender analysis practical on the read path instead of only looking at coarse IP-level traffic.
Northbound and Southbound Auth Boundaries
The same proxy layer also improves trust separation between components.
| Boundary | Main controls | Why it matters operationally |
|---|---|---|
| Grafana or client -> proxy | -auth.enabled, -tls-client-ca-file, -tls-require-client-cert, trusted user headers with -metrics.trust-proxy-headers | Lets the proxy require tenant context, optionally require client certs, and attribute read traffic to the actual Grafana user or trusted upstream identity when sensitive metrics export is explicitly enabled. |
| Proxy -> VictoriaLogs | -backend-basic-auth, -forward-authorization, -forward-headers | Lets the lower layer keep its own auth boundary while the proxy preserves full Loki-client compatibility on the northbound side. |
| Proxy -> peer cache | -peer-auth-token | Prevents peer-cache reuse from becoming an unauthenticated east-west path when the fleet spans a broader network boundary. |
| Operator -> admin/debug endpoints | -server.admin-auth-token | Protects admin and troubleshooting surfaces without weakening the main read path. Non-loopback listeners now require this token before /debug/queries or /debug/pprof can be enabled. |
When trusted proxy headers are enabled, the proxy also forwards derived context headers to VictoriaLogs:
X-Loki-VL-Client-IDX-Loki-VL-Client-Source
That gives the lower layer better context about who is really behind the read traffic while still preserving datasource compatibility at the Grafana edge.
Runtime and Process Metrics
The proxy also exports a lightweight built-in set of runtime and process/container health metrics.
App-scoped aliases are emitted with the loki_vl_proxy_ prefix, while legacy go_* and process_* families remain for compatibility:
Grouped family rows below mean every concrete metric name in that family shares the same cardinality profile.
| Metric family | Labels | Cardinality | Description |
|---|---|---|---|
loki_vl_proxy_go_memstats_*, loki_vl_proxy_go_goroutines, loki_vl_proxy_go_gc_cycles_total, loki_vl_proxy_go_gc_duration_seconds | none | Low | Go runtime health |
loki_vl_proxy_process_resident_memory_bytes, loki_vl_proxy_process_open_fds | none | Low | process resource usage |
loki_vl_proxy_process_cpu_usage_ratio | mode | Low | CPU pressure split by user, system, iowait |
loki_vl_proxy_process_memory_* | none | Low | total, free, available, usage ratio |
loki_vl_proxy_process_disk_*_bytes_total | none | Low | disk I/O byte counters |
loki_vl_proxy_process_disk_*_operations_total | none | Low | disk read/write operation counters |
loki_vl_proxy_process_network_*_bytes_total | none | Low | network I/O counters |
loki_vl_proxy_process_pressure_*_{some,full}_ratio | window | Low | Linux PSI gauges when available (10s, 60s, 300s) |
Legacy unprefixed compatibility aliases (go_*, process_*) follow the same label sets and cardinality profile as their loki_vl_proxy_* counterparts.
Kubernetes notes:
- These runtime/system metrics are read from
/procand do not require Kubernetes RBAC permissions. - PSI metrics (
process_pressure_*) depend on kernel support and may be absent on nodes without/proc/pressure/*. - On startup, the proxy logs a system-metrics readiness check with missing families and remediation hints instead of failing silently.
- If you mount host
/proc(-proc-root=/host/proc), these metrics will reflect host scope; keep default pod/procfor pod/container scope. - For per-pod attribution in OTLP backends, set
OTEL_SERVICE_INSTANCE_IDfrom pod name andOTEL_SERVICE_NAMESPACEfrom pod namespace (the upstream chart now injects these by default). - CI includes a metric-name guard so new app metrics must stay under the
loki_vl_proxy_*prefix unless explicitly allowlisted for compatibility.
PromQL Drilldowns For Slowness And Client Errors
Use these queries to quickly isolate downstream client pain, upstream slowness, and route-specific cache efficiency:
| Goal | Query |
|---|---|
| Downstream p95 latency by route | histogram_quantile(0.95, sum(rate(loki_vl_proxy_request_duration_seconds_bucket{system="loki",direction="downstream"}[5m])) by (le, endpoint, route)) |
| Upstream p95 latency by route | histogram_quantile(0.95, sum(rate(loki_vl_proxy_backend_duration_seconds_bucket{system="vl",direction="upstream"}[5m])) by (le, endpoint, route)) |
| Downstream 5xx rate by route | sum(rate(loki_vl_proxy_requests_total{system="loki",direction="downstream",status=~"5.."}[5m])) by (endpoint, route) |
| Tenant p99 latency by route | histogram_quantile(0.99, sum(rate(loki_vl_proxy_tenant_request_duration_seconds_bucket{system="loki",direction="downstream"}[5m])) by (le, tenant, endpoint, route)) |
| Route cache hit ratio | sum(rate(loki_vl_proxy_cache_hits_by_endpoint{system="loki",direction="downstream"}[5m])) by (endpoint, route) / clamp_min(sum(rate(loki_vl_proxy_cache_hits_by_endpoint{system="loki",direction="downstream"}[5m])) by (endpoint, route) + sum(rate(loki_vl_proxy_cache_misses_by_endpoint{system="loki",direction="downstream"}[5m])) by (endpoint, route), 1) |
| Client bad_request by route | sum(rate(loki_vl_proxy_client_errors_total{system="loki",direction="downstream",reason="bad_request"}[5m])) by (endpoint, route) |
For latency histograms, keep dashboards on p50, p95, and p99 rather than averages. Averages hide tail latency incidents. For exact proxy-only overhead, use structured logs (proxy.overhead_ms) alongside the latency histograms; subtracting histogram quantiles is not mathematically reliable.
The packaged Loki-VL-Proxy dashboard includes an Operational Resources section with:
- memory saturation and memory footprint/headroom
- CPU usage split by mode
- disk IOPS up/down and disk throughput up/down
- network up/down
- PSI pressure (cpu/memory/io)
- process RSS and open file descriptors by pod
The top of the dashboard is organized as a left-to-right operator flow:
Main Overview - Client -> Proxy -> VictoriaLogsClient Edge - Request Quality & ShapeHeavy Consumers - Client Load DriversProxy -> VictoriaLogs Query Pipeline
It also includes a Query-Range Windowing section for cache/tuning signals:
- window fetch p50/p95 latency
- window merge p50/p95 latency
- window cache hit ratio
- adaptive window parallelism + EWMA latency/error
It also includes a Long-Range Resilience KPIs section for phase tuning:
- prefilter kept/skipped rate
- retry/degraded-batch/partial-response rate
- prefilter hit ratio
Dashboard datasource notes:
- datasource variable regex is intentionally permissive (
/.*/) so the dashboard works with scrape-backed and OTLP-backed metric datasources without renaming - key stat panels use explicit zero fallbacks so dashboards remain readable during cold starts and low-traffic windows
Active Backend E2E Healthchecks
/ready confirms backend reachability, but production health should also include synthetic end-to-end probes with real query traffic shape.
Recommended pattern:
- Probe
/readyevery 15-30s for hard availability. - Run a lightweight synthetic
query_rangeevery 1-5m from inside the cluster. - Alert when synthetic query latency or error ratio breaches SLO even if
/readyis green.
This catches backend partial degradation (slow scans, storage pressure, auth drift) earlier than readiness alone.
Choosing Client Identity
Per-client metrics and request logs can use trusted upstream identity instead of only remote IP:
-metrics.trust-proxy-headers=true
When enabled, the proxy prefers:
- Trusted user headers (
X-Grafana-User,X-Forwarded-User,X-Webauth-User,X-Auth-Request-User) - tenant
- trusted forwarded client IP (
X-Forwarded-For) - remote IP
Datasource/basic-auth credentials are reported separately under auth.* and are not used as end-user identity.
Only enable trusted proxy headers when the proxy sits behind a trusted auth proxy or Grafana instance.
Integration Examples
OpenTelemetry Collector: scrape /metrics and export OTLP
receivers:
prometheus:
config:
scrape_configs:
- job_name: loki-vl-proxy
scrape_interval: 15s
static_configs:
- targets: ["loki-vl-proxy:3100"]
processors:
batch: {}
exporters:
otlphttp:
endpoint: https://otel-gateway.example.com
headers:
Authorization: Bearer ${OTLP_TOKEN}
service:
pipelines:
metrics:
receivers: [prometheus]
processors: [batch]
exporters: [otlphttp]
OpenTelemetry Collector: collect JSON logs from container stdout
receivers:
filelog:
include:
- /var/log/containers/*loki-vl-proxy*.log
operators:
- type: json_parser
processors:
batch: {}
exporters:
otlphttp:
endpoint: https://otel-gateway.example.com
service:
pipelines:
logs:
receivers: [filelog]
processors: [batch]
exporters: [otlphttp]
Vector: ship structured JSON logs
[sources.proxy_logs]
type = "kubernetes_logs"
[transforms.proxy_json]
type = "remap"
inputs = ["proxy_logs"]
source = '''
. = parse_json!(string!(.message))
'''
[sinks.proxy_otlp]
type = "opentelemetry"
inputs = ["proxy_json"]
protocol.type = "http"
protocol.uri = "https://otel-gateway.example.com/v1/logs"
Fluent Bit: tail container logs and keep JSON structure
[INPUT]
Name tail
Path /var/log/containers/*loki-vl-proxy*.log
Parser docker
Tag loki_vl_proxy
[FILTER]
Name parser
Match loki_vl_proxy
Key_Name log
Parser json
[OUTPUT]
Name opentelemetry
Match loki_vl_proxy
Host otel-collector
Port 4318
Logs_uri /v1/logs
Recommended Dashboards and Alerts
Start with:
- request rate and error rate by
endpoint - backend latency p95/p99 by
endpoint - cache hit ratio overall and by
endpoint - top
clientby request rate, bytes, and query length - top
tenantby request volume and latency - circuit breaker state
- process RSS and open file descriptors
Dashboard Catalog
| Dashboard | Source | Primary use |
|---|---|---|
dashboard/loki-vl-proxy.json | Prometheus metrics | Service health, SLOs, cache and endpoint latency trends |
Metrics Dashboard Setup (Scrape and OTLP Push)
The metrics dashboard includes a Datasource variable and works with either metric transport mode:
- Prometheus scrape (
/metrics+ServiceMonitor) - OTLP push (
-otlp-endpoint=...) into a Prometheus-compatible backend
Recommended setup:
- Point
Datasourceto any Prometheus-compatible datasource that containsloki_vl_proxy_*metrics. - For scrape mode, use the datasource fed by your
ServiceMonitor/Prometheus scrape pipeline. - For OTLP push mode, use the datasource fed by your OTLP metrics pipeline.
- VictoriaMetrics can be used for both modes when it receives both scrape and OTLP streams.
Transport checklist:
- Scrape mode:
-server.register-instrumentation=true- Helm
serviceMonitor.enabled=true
- OTLP push mode:
-otlp-endpointconfigured-server.register-instrumentation=false(optional, recommended when you want push-only)
Quick validation in Grafana Explore against the selected datasource:
loki_vl_proxy_uptime_seconds
If this query has data, the Loki-VL-Proxy Metrics dashboard should populate out of the box.
High-signal alert ideas:
5xxrate rising on query endpoints- cache hit ratio collapsing
- backend latency p95 breaching SLO
- a single
clientdominating bytes or query length - circuit breaker opening repeatedly
The packaged alert set and incident procedures live in:
Notes
- OTLP push and Prometheus scrape share the same important proxy metrics and metric names.
- The OTLP export is intentionally lightweight and does not pull in the full OpenTelemetry Go SDK.
- Structured logs are already safe for JSON ingestion; agents can forward them directly or transform them into OTLP logs.