Merge Gateway Prometheus Virtual Host metrics random assigned wrong with multiple listeners #3305

owenhaynes · 2024-04-29T17:38:35Z

Description:
Prometheus Virtual Host metrics are being assigned incorrectly; in some cases the wrong host is attached to the wrong vhost.

I am not sure what the correct behaviour is. In some cases we just get metrics like envoy_vhost_vcluster_upstream_rq_retry, and in the current case we get vhost.<virtual host name>.vcluster.<virtual cluster name>, which is what the Envoy docs describe, but with the wrong virtual host name attached to the wrong virtual cluster.

A. The envoy_vhost_vcluster_upstream_rq_retry metrics that get emitted do have labels for both envoy_virtual_cluster and envoy_virtual_host.

B. The vhost.<virtual host name>.vcluster.<virtual cluster name> metrics only have the envoy_virtual_host label attached and no cluster label.

I hope A is the correct way, as it is easier to build dashboards from.

Repro steps:

Take the following example:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: foo
spec:
  gatewayClassName: merge-gateway
  listeners:
  - name: foo.com
    protocol: HTTP
    hostname: foo.com
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
  - name: bar.com
    protocol: HTTP
    hostname: bar.com
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: foobar-combined
spec:
  parentRefs:
    - name: foo
      group: gateway.networking.k8s.io
      kind: Gateway 
  hostnames:
  - foo.com
  - foobar.com
  rules:
  - backendRefs:
    - kind: Service
      group: ''
      name:  my-svc
      port: 80
      weight: 1
    matches:  
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: a
spec:
  gatewayClassName: merge-gateway
  listeners:
  - name: a.foo.com
    protocol: HTTP
    hostname: foo.com
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: a
spec:
  parentRefs:
    - name: foo
      group: gateway.networking.k8s.io
      kind: Gateway 
  hostnames:
  - a.foo.com
  rules:
  - backendRefs:
    - kind: Service
      group: ''
      name:  my-svc
      port: 80
      weight: 1
    matches:  
    - path:
        type: PathPrefix
        value: /
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ab
spec:
  gatewayClassName: merge-gateway
  listeners:
  - name: ab.foo.com
    protocol: HTTP
    hostname: ab.foo.com
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: foobar-combined
spec:
  parentRefs:
    - name: ab
      group: gateway.networking.k8s.io
      kind: Gateway 
  hostnames:
  - ab.foo.com
  rules:
  - backendRefs:
    - kind: Service
      group: ''
      name:  my-svc
      port: 80
      weight: 1
    matches:  
    - path:
        type: PathPrefix
        value: /

It is hard to pinpoint what is going on, and the way it picks names seems to be random. For example, I got the following metric name:
envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_retry{envoy_virtual_host="api-gateway/ab/ab"} 0

Metrics dump

# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_retry counter
envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_retry{envoy_virtual_host="api-gateway/ab/ab"} 0
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_retry_limit_exceeded counter
envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_retry_limit_exceeded{envoy_virtual_host="api-gateway/ab/ab"} 0
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_retry_overflow counter
envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_retry_overflow{envoy_virtual_host="api-gateway/ab/ab"} 0
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_retry_success counter
envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_retry_success{envoy_virtual_host="api-gateway/ab/ab"} 0
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_timeout counter
envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_timeout{envoy_virtual_host="api-gateway/ab/ab"} 0
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_total counter
envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_total{envoy_virtual_host="api-gateway/ab/ab"} 0
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_retry counter
envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_retry{envoy_virtual_host="api-gateway/ab/ab"} 0
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_retry_limit_exceeded counter
envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_retry_limit_exceeded{envoy_virtual_host="api-gateway/ab/ab"} 0
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_retry_overflow counter
envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_retry_overflow{envoy_virtual_host="api-gateway/ab/ab"} 0
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_retry_success counter
envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_retry_success{envoy_virtual_host="api-gateway/ab/ab"} 0
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_timeout counter
envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_timeout{envoy_virtual_host="api-gateway/ab/ab"} 0
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_total counter
envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_total{envoy_virtual_host="api-gateway/ab/ab"} 0

It looks like it is merging the virtual clusters.
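To make the mismatch in the dump easier to see, here is a minimal Python sketch that splits each flattened metric name into the vhost and vcluster segments embedded in the name and pairs them with the envoy_virtual_host label. The helper names (`parse_line`, `group_by_label`) and the suffix list are my own assumptions based only on the lines in this dump; this is not how Envoy or Envoy Gateway parses stats.

```python
import re
from collections import defaultdict

# Two data lines copied from the dump above, plus one TYPE comment.
SAMPLE = """\
# TYPE envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_retry counter
envoy_vhost_foo_com_ab_foo_com_vcluster_ab_foo_com_upstream_rq_retry{envoy_virtual_host="api-gateway/ab/ab"} 0
envoy_vhost_foo_com_ab_foo_com_vcluster_other_upstream_rq_total{envoy_virtual_host="api-gateway/ab/ab"} 0
"""

# Stat suffixes seen in the dump (assumption: the list is not exhaustive).
# Longest first so that "upstream_rq_retry_success" wins over "upstream_rq_retry".
SUFFIXES = sorted(
    [
        "upstream_rq_retry",
        "upstream_rq_retry_limit_exceeded",
        "upstream_rq_retry_overflow",
        "upstream_rq_retry_success",
        "upstream_rq_timeout",
        "upstream_rq_total",
    ],
    key=len,
    reverse=True,
)

PREFIX = "envoy_vhost_"
LINE_RE = re.compile(r"^(?P<name>envoy_vhost_\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Return the name segments and labels of one sample line, or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # comments, blank lines, unrelated metrics
    name = m.group("name")
    for suffix in SUFFIXES:
        if name.endswith("_" + suffix):
            body = name[len(PREFIX):-(len(suffix) + 1)]
            break
    else:
        return None  # unknown stat suffix
    vhost_part, sep, vcluster_part = body.partition("_vcluster_")
    if not sep:
        return None  # form A style name without embedded segments
    return {
        "vhost_in_name": vhost_part,
        "vcluster_in_name": vcluster_part,
        "stat": suffix,
        "labels": dict(LABEL_RE.findall(m.group("labels"))),
    }


def group_by_label(text):
    """Map each envoy_virtual_host label to the (vhost, vcluster) pairs in names."""
    groups = defaultdict(set)
    for line in text.splitlines():
        parsed = parse_line(line)
        if parsed is not None:
            label = parsed["labels"].get("envoy_virtual_host", "<none>")
            groups[label].add((parsed["vhost_in_name"], parsed["vcluster_in_name"]))
    return dict(groups)


if __name__ == "__main__":
    for label, pairs in group_by_label(SAMPLE).items():
        print(label, sorted(pairs))
```

Running this against a full /stats/prometheus dump should list, per envoy_virtual_host label, which host segments ended up in the metric names, which makes entries like foo_com_ab_foo_com under the ab gateway stand out.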

Environment:
k8s 1.29.0
Envoy Gateway 1.0.1 with merged gateways


owenhaynes · 2024-04-29T20:32:34Z

I have also seen that Gateway resource names that contain "." cause issues like the ones above.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: foo.foo

From http://localhost:19000/stats:
vhost.my-ns/foo.foo-public
I assume the same applies to HTTPRoute names as well. I have not checked, but I assume they need to be escaped like we do for hostnames?
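As a rough illustration of why an unescaped "." could matter, the Python sketch below builds a dot-delimited stat name from the vhost name seen above and reads the virtual host back by splitting on dots. Everything here is an assumption for illustration: `build_stat`, `naive_vhost`, and `sanitize` are hypothetical helpers, the vcluster name other is taken from the metrics dump, and Envoy's real tag extraction is regex-based and may behave differently.

```python
def build_stat(vhost: str, vcluster: str, stat: str) -> str:
    """Compose a stat name in the vhost.<vh>.vcluster.<vc>.<stat> layout."""
    return f"vhost.{vhost}.vcluster.{vcluster}.{stat}"


def naive_vhost(stat_name: str) -> str:
    """Read the segment right after 'vhost.' the way a simple dot-based parser would."""
    return stat_name.split(".")[1]


def sanitize(name: str) -> str:
    """Hypothetical escaping: replace dots so the name stays a single segment."""
    return name.replace(".", "_")


if __name__ == "__main__":
    raw_name = "my-ns/foo.foo-public"  # vhost name from the /stats output above

    unescaped = build_stat(raw_name, "other", "upstream_rq_total")
    escaped = build_stat(sanitize(raw_name), "other", "upstream_rq_total")

    # With the raw name the dot splits the vhost into two segments,
    # so the parser only sees the part before the dot.
    print(naive_vhost(unescaped))
    # With the escaped name the whole vhost survives as one segment.
    print(naive_vhost(escaped))
```

If something similar happens in the real tag extraction, the remainder of the name after the dot would leak into the metric name instead of the label, which would be consistent with the odd names in the dump, but I have not confirmed that this is the cause.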

owenhaynes added the triage label Apr 29, 2024

arkodg added this to the v1.1.0-rc1 milestone May 23, 2024

arkodg added the help wanted label and removed the triage label May 23, 2024
