Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambient mesh: upstream connect error or disconnect/reset before headers when using Gateway API #50983

Closed
2 tasks done
zsparal opened this issue May 10, 2024 · 11 comments · Fixed by #51013
Closed
2 tasks done
Assignees
Labels
area/ambient Issues related to ambient mesh area/networking

Comments

@zsparal
Copy link

zsparal commented May 10, 2024

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

I have been trying to get Ambient Mesh working on one of my side projects at home. I'm using version 1.22.0-beta.1 because 1.21 had some issues with init containers that were fixed in this version.

Everything seems to work fine up until the point I try to set up a Gateway to my services. Without ambient mesh the Gateway works properly, but when I turn on ambient for the gateway and service namespaces, whenever I try to make a connection I get the following in the ztunnel logs (and the response that I included in the issue title):

2024-05-10T12:05:00.233728Z    error    access    connection complete    src.addr=10.244.2.6:49660 src.workload="gateway-istio-c86c7677b-l662p" src.namespace="ingress-infra" src.identity="spiffe://cluster.local/ns/ingress-infra/sa/gateway-istio" dst.addr=10.244.2.4:15008 dst.hbone_addr="10.244.2.4:15008" dst.workload="console-759ddbb7d-ctg4t" dst.namespace="operator" dst.identity="spiffe://cluster.local/ns/minio/sa/console-sa" direction="outbound" bytes_sent=0 bytes_recv=0 duration="0ms" error="http status: 400 Bad Request"
2024-05-10T12:05:00.234184Z    warn    access    connection failed    src.addr=10.244.2.6:57577 dst.addr=10.244.2.4:15008 direction="inbound" error=attempted recursive call to ourselves
2024-05-10T12:05:00.234217Z    error    access    connection complete    src.addr=10.244.2.6:49668 src.workload="gateway-istio-c86c7677b-l662p" src.namespace="ingress-infra" src.identity="spiffe://cluster.local/ns/ingress-infra/sa/gateway-istio" dst.addr=10.244.2.4:15008 dst.hbone_addr="10.244.2.4:15008" dst.workload="console-759ddbb7d-ctg4t" dst.namespace="operator" dst.identity="spiffe://cluster.local/ns/minio/sa/console-sa" direction="outbound" bytes_sent=0 bytes_recv=0 duration="0ms" error="http status: 400 Bad Request"
2024-05-10T12:05:00.234430Z    info    access    connection complete    src.addr=10.244.2.1:22512 dst.addr=10.244.2.6:80 dst.service="gateway-istio.ingress-infra.svc.cluster.local" dst.workload="gateway-istio-c86c7677b-l662p" dst.namespace="gateway-istio" direction="inbound" bytes_sent=470 bytes_recv=146 duration="1ms"

Removing the ambient annotation on either the GW namespace or the service namespace fixes the issue.

You can use the following script & kind configuration to reproduce the issue (I'm using rootless podman for the cluster - you might need to change the LB IPs to be on the same subnet):

kind.config.yaml

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: sandbox
featureGates:
  SidecarContainers: true
nodes:
  - role: control-plane
    image: kindest/node:v1.29.2@sha256:51a1434a5397193442f0be2a297b488b6c919ce8a3931be0ce822606ea5ca245
    kubeadmConfigPatches:
      - |
        kind: KubeletConfiguration
        localStorageCapacityIsolation: true
  - role: worker
    image: kindest/node:v1.29.2@sha256:51a1434a5397193442f0be2a297b488b6c919ce8a3931be0ce822606ea5ca245
  - role: worker
    image: kindest/node:v1.29.2@sha256:51a1434a5397193442f0be2a297b488b6c919ce8a3931be0ce822606ea5ca245

repro.sh

#!/usr/bin/env bash

set -o errexit
set -o nounset
set -o pipefail

# Reset kind cluster
systemd-run --scope --user --property=Delegate=yes kind delete cluster --name sandbox --kubeconfig "${KUBECONFIG}"
systemd-run --scope --user --property=Delegate=yes kind create cluster --config kind-config.yaml --kubeconfig "${KUBECONFIG}"

# Add relevant helm repositories
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo add minio https://operator.min.io/
helm repo update

# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/experimental-install.yaml

# Install Istio
ISTIO_VERSION=1.22.0-beta.1
helm install istio-base istio/base --version "${ISTIO_VERSION}" -n istio-system --create-namespace --wait
helm install istio-cni istio/cni --version "${ISTIO_VERSION}" -n istio-system --set profile=ambient --wait
helm install istiod istio/istiod --version "${ISTIO_VERSION}" -n istio-system --set profile=ambient --wait
helm install ztunnel istio/ztunnel --version "${ISTIO_VERSION}" -n istio-system --wait

# Install MetalLB
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.5/config/manifests/metallb-native.yaml
kubectl wait --for=condition=ready pod -l "app=metallb" -l "component=controller" -n metallb-system
kubectl apply -f - <<EOF
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ip-address-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.89.255.0-10.89.255.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-advertisement
  namespace: metallb-system
EOF

# Install Minio
helm install minio-operator minio/operator -n minio --create-namespace --wait
kubectl label namespace minio istio.io/dataplane-mode=ambient
kubectl apply -f - <<EOF
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: gitops-grant
  namespace: minio
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: ingress-infra
  to:
    - group: ""
      kind: Service
      name: console
EOF

# Create gateway
kubectl create namespace ingress-infra
kubectl label namespace ingress-infra istio.io/dataplane-mode=ambient
kubectl apply -f - <<EOF
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway
  namespace: ingress-infra
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      hostname: "*.test"
  infrastructure:
    annotations:
      metallb.universe.tf/loadBalancerIPs: "10.89.255.50"
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: minio
  namespace: ingress-infra
spec:
  parentRefs:
    - name: gateway
  hostnames:
    - "minio.test"
  rules:
    - backendRefs:
        - name: console
          namespace: minio
          port: 9090
EOF

curl

podman unshare --rootless-netns
curl -vvvv http://minio.test -HHost minio.test --resolve minio.test:80:10.89.255.50

Version

$ istioctl version
client version: 1.21.2
control plane version: 1.22.0-beta.1
data plane version: 1.22.0-beta.1 (4 proxies)

$ kubectl version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2

$ helm version --short
v3.14.3+gf03cc04

Additional Information

bug-report.tar.gz

@istio-policy-bot istio-policy-bot added area/ambient Issues related to ambient mesh area/networking labels May 10, 2024
@howardjohn
Copy link
Member

Didn't get a chance to try to fully reproduce, but I fixed a bug very similar to this recently. Worth giving https://github.com/istio/istio/releases/tag/1.22.0-rc.0 a shot.

Another question is why does the gateway have ztunnel capture enable on it. This sounds like what 385708e was meant to fix (the gateway doesn't need a ztunnel capture since it can natively do all the functionality). That commit should be in beta.1. 291cba8 is after beta.1 and touches the same code, though I doubt its fixing this

@howardjohn
Copy link
Member

Huh, the internal view (configz) of the GW has ambient.istio.io/redirection: disabled. But the live on in the cluster, and the Deployment and pod, do not. I think we are mutating an object in the k8s informer cache

CNI shows

cni/istio-cni-node-mhht4/cni.log
229:2024-05-10T12:20:19.809862Z warn    ambient pod gateway-istio-c86c7677b-sxv78 does not appear to have any assigned IPs, not capturing
230:2024-05-10T12:20:20.154815Z info    ambient in pod mode - adding pod ingress-infra/gateway-istio-c86c7677b-sxv78 to ztunnel
254:2024-05-10T12:20:20.156442Z info    ambient About to send added pod: 7584a3d1-f530-4d24-a9dc-08cab5dd50e6 to ztunnel: add:{uid:"7584a3d1-f530-4d24-a9dc-08cab5dd50e6"  workload_info:{name:"gateway-istio-c86c7677b-sxv78"  namespace:"ingress-infra"  service_account:"gateway-istio"  trust_domain:"cluster.local"}}

Seems like a bug. will try to replicate. Thanks!

@zsparal
Copy link
Author

zsparal commented May 10, 2024

Thanks for looking into it! I tested with rc.0 just in case and it's still happening

@zsparal
Copy link
Author

zsparal commented May 10, 2024

I tested it by removing the ambient annotation from the ingress namespace at everything started working again with the traffic still being mTLS. Since I don't really have anything else running in the NS, I'll just do this as a workaround for now. Thanks!

@howardjohn howardjohn self-assigned this May 10, 2024
@howardjohn
Copy link
Member

Very interesting, doesn't repropduce for me on rc0 or beta1 unless I run it in a script instead of interactively. Maybe a race condition

@howardjohn
Copy link
Member

Hmm, its pretty rare for me to trigger it actually. I got it once but struggling to do so again (which hints again to a race)

@zsparal
Copy link
Author

zsparal commented May 10, 2024

I originally got the issue with Flux, and it reproduced pretty reliably for me there but I wanted to share a more minimal example

@howardjohn
Copy link
Member

I've reproduced a few times on the beta but not yet on the RC. Will try more tomorrow. I think its

if _, ok := gw.Spec.Infrastructure.Labels[constants.DataplaneModeLabel]; ok {
that is the problem (around there). I see the annotation is not getting set, just not sure how that can be yet...

@howardjohn
Copy link
Member

Really frusterating debugging this because I can only reproduce 1/10 times or so...

I know there is a bug around mutating the object that we shouldn't in a shared cache. Just cannot verify 100% since I cannot reliably reproduce 😬 . I suppose I can just send a fix for that and we can see how it goes

howardjohn added a commit to howardjohn/istio that referenced this issue May 13, 2024
Fixes istio#50983, I think

Basically, we have logic to use `infra.Labels || meta.Labels`. We modify
meta.Labels.

This is wrong, we should modify AFTER we pick which set of labels to
use.

Also, we need to deepcopy anything we mutate since its accessing a
shared object cache
@howardjohn
Copy link
Member

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway
  namespace: ingress-infra
  annotations:
    metallb.universe.tf/loadBalancerIPs: "10.89.255.50"
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      hostname: "*.test"

I think you can workaround by this (infrasrtucture.annotations -> metadata). If I am right about #51013.

howardjohn added a commit to howardjohn/istio that referenced this issue May 13, 2024
Fixes istio#50983, I think

Basically, we have logic to use `infra.Labels || meta.Labels`. We modify
meta.Labels.

This is wrong, we should modify AFTER we pick which set of labels to
use.

Also, we need to deepcopy anything we mutate since its accessing a
shared object cache
howardjohn added a commit to howardjohn/istio that referenced this issue May 13, 2024
Fixes istio#50983, I think

Basically, we have logic to use `infra.Labels || meta.Labels`. We modify
meta.Labels.

This is wrong, we should modify AFTER we pick which set of labels to
use.

Also, we need to deepcopy anything we mutate since its accessing a
shared object cache
istio-testing pushed a commit that referenced this issue May 13, 2024
Fixes #50983, I think

Basically, we have logic to use `infra.Labels || meta.Labels`. We modify
meta.Labels.

This is wrong, we should modify AFTER we pick which set of labels to
use.

Also, we need to deepcopy anything we mutate since its accessing a
shared object cache
istio-testing pushed a commit to istio-testing/istio that referenced this issue May 13, 2024
Fixes istio#50983, I think

Basically, we have logic to use `infra.Labels || meta.Labels`. We modify
meta.Labels.

This is wrong, we should modify AFTER we pick which set of labels to
use.

Also, we need to deepcopy anything we mutate since its accessing a
shared object cache
istio-testing added a commit that referenced this issue May 14, 2024
Fixes #50983, I think

Basically, we have logic to use `infra.Labels || meta.Labels`. We modify
meta.Labels.

This is wrong, we should modify AFTER we pick which set of labels to
use.

Also, we need to deepcopy anything we mutate since its accessing a
shared object cache

Co-authored-by: John Howard <john.howard@solo.io>
@zsparal
Copy link
Author

zsparal commented May 14, 2024

Thank you for the investigation and the fix. Was this included in 1.22 or should I use the workaround for now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ambient Issues related to ambient mesh area/networking
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants