Cannot connect to host 10.0.0.1:443 ssl:default #582

Open
KamranAzeem opened this issue Jun 14, 2022 · 2 comments

Comments

KamranAzeem commented Jun 14, 2022

Alternate issue description:

ConnectionAbortedError: SSL handshake is taking longer than 60.0 seconds: aborting the connection
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.0.0.1:443 ssl:default [None]

Problem / symptom:

In the logs from controller-dask-gateway, we see this:

ConnectionAbortedError: SSL handshake is taking longer than 60.0 seconds: aborting the connection
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.0.0.1:443 ssl:default [None]

Details of the problem:

We are running JupyterHub and DaskHub (through the official Helm chart) in AKS, using the Azure CNI plugin. We have some network policies implemented in the daskhub namespace.

While using Jupyter notebooks (through a browser, of course), the following piece of code would wait for a minute and then throw errors.

from dask_gateway import Gateway

gateway = Gateway()                  # connect to the gateway configured for this hub
options = gateway.cluster_options()  # hangs for about a minute, then raises
options

The logs from controller-dask-gateway showed the following errors:

ConnectionAbortedError: SSL handshake is taking longer than 60.0 seconds: aborting the connection
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.0.0.1:443 ssl:default [None]
  • From the error messages, it looks like the dask-gateway kube controller cannot talk to the Kubernetes API on 10.0.0.1:443.
  • At first glance it appears that some firewall/network policy is blocking communication from the Dask controller to the Kubernetes API.
    • This is not the case, though. Inspection of the network policies shows that egress traffic is allowed.
    • A connection from inside the Dask controller to the Kubernetes API IP and port works at the TCP/IP level, so it is not a firewall or network policy issue (see the sketch after this list).
    • There seems to be an SSL handshake problem, either because of a certificate mismatch or because the CA certificate issued by Kubernetes is not available to the Dask controller, which causes this failure.
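
To make the distinction concrete, here is a minimal sketch (not part of dask-gateway) that can be run from inside the controller pod, assuming a Python interpreter is available in the image. It separates "TCP reachable" (which only rules out firewalls/network policies) from "TLS verified against the in-cluster CA" (which is what the controller actually needs):

import socket
import ssl

API_IP, API_PORT = "10.0.0.1", 443
CA_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"

# 1) Plain TCP connect: succeeds even if certificates are broken.
with socket.create_connection((API_IP, API_PORT), timeout=5):
    print("TCP connect: OK")

# 2) TLS handshake verified against the service-account CA.
#    'kubernetes.default.svc' is normally among the API server certificate's SANs.
ctx = ssl.create_default_context(cafile=CA_FILE)
with socket.create_connection((API_IP, API_PORT), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname="kubernetes.default.svc") as tls:
        print("TLS handshake: OK,", tls.version())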

Below are logs from the controller-dask-gateway pod.

[kamran@kworkhorse ~]$ kubectl -n daskhub logs -f controller-dask-gateway-668498765b-67lmk 

[I 2022-06-02 20:04:27.349 KubeController] Starting dask-gateway-kube-controller - version 2022.4.0
[I 2022-06-02 20:04:27.501 KubeController] dask-gateway-kube-controller started!
[I 2022-06-02 20:04:27.502 KubeController] API listening at http://:8000
[E 2022-06-05 01:02:20.171 KubeController] Error in endpoints informer, retrying...
Traceback (most recent call last):
  File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 986, in _wrap_create_connection
    return await self._loop.create_connection(*args, **kwargs)  # type: ignore[return-value]  # noqa
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1089, in create_connection
    transport, protocol = await self._create_connection_transport(
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1119, in _create_connection_transport
    await waiter
ConnectionAbortedError: SSL handshake is taking longer than 60.0 seconds: aborting the connection

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dask/.local/lib/python3.10/site-packages/dask_gateway_server/backends/kubernetes/utils.py", line 149, in run
    initial = await method(**self.method_kwargs)
  File "/home/dask/.local/lib/python3.10/site-packages/dask_gateway_server/backends/kubernetes/utils.py", line 47, in func
    return await method(*args, **kwargs)
  File "/home/dask/.local/lib/python3.10/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
    response_data = await self.request(
  File "/home/dask/.local/lib/python3.10/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
    return (await self.request("GET", url,
  File "/home/dask/.local/lib/python3.10/site-packages/kubernetes_asyncio/client/rest.py", line 177, in request
    r = await self.pool_manager.request(**args)
  File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/client.py", line 535, in _request
    conn = await self._connector.connect(
  File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 542, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 907, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
  File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
    raise last_exc
  File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
  File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.0.0.1:443 ssl:default [None]

[E 2022-06-05 10:24:44.857 KubeController] Error in endpoints informer, retrying...
Traceback (most recent call last):
  File "/home/dask/.local/lib/python3.10/site-packages/aiohttp/connector.py", line 986, in _wrap_create_connection
    return await self._loop.create_connection(*args, **kwargs)  # type: ignore[return-value]  # noqa
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1089, in create_connection
    transport, protocol = await self._create_connection_transport(
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1119, in _create_connection_transport
    await waiter
ConnectionAbortedError: SSL handshake is taking longer than 60.0 seconds: aborting the connection

. . . 

^C

Check whether we can reach 10.0.0.1:443 from controller-dask-gateway:

It looks like we can reach 10.0.0.1:443, but there is a certificate error. So at least it is not a firewall/network policy issue.

dask@controller-dask-gateway-668498765b-67lmk:~$ openssl s_client -connect 10.0.0.1:443

CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 CN = apiserver
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = apiserver
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 CN = apiserver
verify return:1
---
Certificate chain
 0 s:CN = apiserver
   i:CN = ca
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIF+TCCA+GgAwIBAgIQcwW8FOcPZD0aaEJ8SB8aGTANBgkqhkiG9w0BAQsFADAN
MQswCQYDVQQDEwJjYTAeFw0yMjA2MDIxODE0MjBaFw0yNDA2MDIxODI0MjBaMBQx
EjAQBgNVBAMTCWFwaXNlcnZlcjCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoC
ggIBAJ48Qfk5HhAf71Cb9MzvY2hwb+tA3H022thdUiI3nxYhrkSUiXA+GzyZjb8t
ChV8Ecjxn1m/WeKfvuQ32T19PDmu2rlYhr2J1VPwd2r6ZTsJesi4R98EhxnxKZ7W
GGsjWu/E45yIuOFojIGBGDCEbKYAHe6U9xvEUUruGpY8gJQ8ms+sH6UBYz3aGqfv
oxaMMiuqC5FMgbnsle1JubpryyyaGwrk7m5OAn1aeB1qKfO85OhVl9oKXS3e2J2E
80uslbbqF/KP8zm1k5ilHEzwbP1eisqqWFcqWov0rZrfgWGrIYj2dNCVSAjl2iLM
VqKVFj7ki9uOhitCGInBQIfjvyzwtv1GrioZuAepL1/L1AJjfF4dsmcMCBX3WjzF
hwqjGaDk4/n4JoF8bYoXP1npfbtFWsqvDWAOwNUDSBvK4gePuBTjGyn0/YRS084F
OcG5npQyjD0aM/rQUv2pHA7esUQwQdMTUX4an6WBVJTyd/fRpvECvjtf3BNL/hUC
gUO3esC+K7KsJpWc0ZIOXIA3lmDN7vmivqvF2s4Xnd86QQhY7Txxr6aMfVby
-----END CERTIFICATE-----
subject=CN = apiserver

issuer=CN = ca

---
Acceptable client certificate CA names
CN = ca
CN = agg-ca
Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 2456 bytes and written 393 bytes
Verification error: unable to verify the first certificate
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 4096 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 21 (unable to verify the first certificate)
---
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
    Protocol  : TLSv1.3
    Cipher    : TLS_AES_256_GCM_SHA384
    Session-ID: A592BA296F8ADD7A8F8E50D4C424A876A10964DCA7DAD7D591F5170147716A53
    Session-ID-ctx: 
    Resumption PSK: C11FDA4709BFC86AC6E79F20274D63EAB7E356538FED72690D21859BD74600A6894FBEDA37E6EDBF4796C1D495C349CC
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 604800 (seconds)
    TLS session ticket:
    0000 - c3 31 95 4d c0 a4 8c 9c-87 dc dc 1f 36 78 0a 41   .1.M........6x.A
    0010 - 15 2b b6 36 db 49 fa f3-8c 6a 0f af 74 54 18 e3   .+.6.I...j..tT..
    0020 - 93 cc 2b 2b ec 1d 8b 90-70 c9 0b f4 8e 1d 64 f3   ..++....p.....d.
    0060 - 74 9e e4 28 c1 7d b5 e3-4b 81 7e 59 a5 d2 f6 f3   t..(.}..K.~Y....
    0070 - be b9 3e 72 b6 0e e6 59-8c 15 c4 03 65 cb bf 74   ..>r...Y....e..t
    0080 - ad                                                .

    Start Time: 1654596393
    Timeout   : 7200 (sec)
    Verify return code: 21 (unable to verify the first certificate)
    Extended master secret: no
    Max Early Data: 0
---
read R BLOCK
closed
dask@controller-dask-gateway-668498765b-67lmk:~$ 

Try accessing 10.0.0.1:443 by providing the CA certificate manually:

This seems to work.

dask@controller-dask-gateway-668498765b-lgcks:~$ openssl s_client -connect 10.0.0.1:443 -CAfile /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 

CONNECTED(00000003)
Can't use SSL_get_servername
depth=1 CN = ca
verify return:1
depth=0 CN = apiserver
verify return:1
---
Certificate chain
 0 s:CN = apiserver
   i:CN = ca
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIF+TCCA+GgAwIBAgIQcwW8FOcPZD0aaEJ8SB8aGTANBgkqhkiG9w0BAQsFADAN
MQswCQYDVQQDEwJjYTAeFw0yMjA2MDIxODE0MjBaFw0yNDA2MDIxODI0MjBaMBQx
EjAQBgNVBAMTCWFwaXNlcnZlcjCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoC
ggIBAJ48Qfk5HhAf71Cb9MzvY2hwb+tA3H022thdUiI3nxYhrkSUiXA+GzyZjb8t
ChV8Ecjxn1m/WeKfvuQ32T19PDmu2rlYhr2J1VPwd2r6ZTsJesi4R98EhxnxKZ7W
GGsjWu/E45yIuOFojIGBGDCEbKYAHe6U9xvEUUruGpY8gJQ8ms+sH6UBYz3aGqfv
dMY59wPo7gEh9PMN6+N/OioPxb6nCi8Bw7hPaDN8KLitHwwJwVklZgZTl8/DPZ2h
4YJIV/1k4BdVQ7rBCALbnAexreiHgiUbxaLFfyYujI3ITWG4zta4LC5JsZTajdaT
oxaMMiuqC5FMgbnsle1JubpryyyaGwrk7m5OAn1aeB1qKfO85OhVl9oKXS3e2J2E
80uslbbqF/KP8zm1k5ilHEzwbP1eisqqWFcqWov0rZrfgWGrIYj2dNCVSAjl2iLM
VqKVFj7ki9uOhitCGInBQIfjvyzwtv1GrioZuAepL1/L1AJjfF4dsmcMCBX3WjzF
hwqjGaDk4/n4JoF8bYoXP1npfbtFWsqvDWAOwNUDSBvK4gePuBTjGyn0/YRS084F
OcG5npQyjD0aM/rQUv2pHA7esUQwQdMTUX4an6WBVJTyd/fRpvECvjtf3BNL/hUC
gUO3esC+K7KsJpWc0ZIOXIA3lmDN7vmivqvF2s4Xnd86QQhY7Txxr6aMfVby
-----END CERTIFICATE-----
subject=CN = apiserver

issuer=CN = ca

---
Acceptable client certificate CA names
CN = ca
CN = agg-ca
Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 2456 bytes and written 393 bytes
Verification: OK
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 4096 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
    Protocol  : TLSv1.3
    Cipher    : TLS_AES_256_GCM_SHA384
    Session-ID: 4ACA638F9D4AEB2BD704C6F6264EC3E381B936C5A4D5A1CA543856E23F5B6A49
    Session-ID-ctx: 
    Resumption PSK: 4C3BEE30B6E4F0AE977B0B8426B157BCA50F0AD15562B670055970B0096EA244485B33E3225BBA1934AB6E0D2A73F5DF
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 604800 (seconds)
    TLS session ticket:
    0000 - c3 31 95 4d c0 a4 8c 9c-87 dc dc 1f 36 78 0a 41   .1.M........6x.A
    0010 - 66 36 e6 2c 50 2c ea 57-ba b2 db 89 5b 16 ee 80   f6.,P,.W....[...
    0020 - 34 60 67 54 80 39 c3 88-16 2f 1e 1e d4 2a fc b3   4`gT.9.../...*..
    0060 - 04 55 ce 3b 9e e4 8d 83-0a 97 c7 13 b2 73 37 0a   .U.;.........s7.
    0070 - 56 dc 50 61 4a 0d 76 71-1d 9b 9c b9 2c aa 79 a3   V.PaJ.vq....,.y.
    0080 - a6                                                .

    Start Time: 1654599650
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
    Extended master secret: no
    Max Early Data: 0
---
read R BLOCK
closed

The above tests simply prove that controller-dask-gateway is actually able to reach 10.0.0.1:443, and that no firewall or network policy is preventing this from happening.

We noticed that there is no setting in the dask-gateway API server configuration to provide a CA file. I would assume that dask-gateway knows how to use the correct certificate (ca.crt) from the mounted/projected service account inside the pod, so ideally this should not be the cause of the problem.
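
To double-check that assumption at the Python level, something like the following sketch can be used. It is hypothetical (not a dask-gateway setting); it simply reproduces, with aiohttp and the standard projected service-account files, roughly what the controller's Kubernetes client does. The namespace/resource in the URL are only examples, and it assumes the API server certificate lists 10.0.0.1 in its SANs (standard for the cluster service IP):

import asyncio
import ssl

import aiohttp

SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"
API_URL = "https://10.0.0.1:443/api/v1/namespaces/daskhub/endpoints"

async def main():
    ctx = ssl.create_default_context(cafile=f"{SA_DIR}/ca.crt")
    with open(f"{SA_DIR}/token") as f:
        token = f.read().strip()
    connector = aiohttp.TCPConnector(ssl=ctx)
    async with aiohttp.ClientSession(connector=connector) as session:
        # Aborting after ~60s means the TLS handshake never completed;
        # a 200/403 response means connectivity and verification are fine.
        async with session.get(API_URL, headers={"Authorization": f"Bearer {token}"}) as resp:
            print(resp.status)

asyncio.run(main())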

So, what is the problem then?

Errors from the traefik pod:

A separate error was being registered in the traefik pod, shown below.

time="2022-06-13T09:08:23Z" level=debug msg="vulcand/oxy/roundrobin/rr: Forwarding this request to URL" Request="{\"Method\":\"GET\",\"URL\":{\"Scheme\":\"\",\"Opaque\":\"\",\"User\":null,\"Host\":\"\",\"Path\":\"/api/v1/options\",\"RawPath\":\"\",\"ForceQuery\":false,\"RawQuery\":\"\",\"Fragment\":\"\",\"RawFragment\":\"\"},\"Proto\":\"HTTP/1.1\",\"ProtoMajor\":1,\"ProtoMinor\":1,\"Header\":{\"Accept\":[\"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\"],\"Accept-Encoding\":[\"gzip, deflate, br\"],\"Accept-Language\":[\"en-US,en;q=0.9\"],\"Connection\":[\"close\"],\"Cookie\":[\"jupyterhub-services=2|1:0|10:1654628753|19:jupyterhub-services|0:|20e5bf6702fbf493a47a021c7d133f62fc9c215a403a938cc541da628eb86c55; _odp_apps_oauth_session=s%3Ad8efd49c-e99f-4936-b016-61baff34592c.Ylp1d%2Fkuuw6sZsuBOA9KrmZeN73TIBdOipBVSd3Ul%2BM; jupyterhub-session-id=fd3293661df24d4bbae9ac6904f4815b\"],\"Sec-Fetch-Dest\":[\"document\"],\"Sec-Fetch-Mode\":[\"navigate\"],\"Sec-Fetch-Site\":[\"none\"],\"Sec-Fetch-User\":[\"?1\"],\"Sec-Gpc\":[\"1\"],\"Upgrade-Insecure-Requests\":[\"1\"],\"User-Agent\":[\"Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36\"],\"X-Forwarded-Host\":[\"dask.dev.oceandata.xyz\"],\"X-Forwarded-Port\":[\"80\"],\"X-Forwarded-Prefix\":[\"/services/dask-gateway\"],\"X-Forwarded-Proto\":[\"http\"],\"X-Forwarded-Server\":[\"traefik-dask-gateway-5b64bff4f7-2qrh7\"],\"X-Real-Ip\":[\"10.240.0.96\"]},\"ContentLength\":0,\"TransferEncoding\":null,\"Host\":\"dask.dev.oceandata.xyz\",\"Form\":null,\"PostForm\":null,\"MultipartForm\":null,\"Trailer\":null,\"RemoteAddr\":\"10.240.0.96:55150\",\"RequestURI\":\"/api/v1/options\",\"TLS\":null}" ForwardURL="http://10.240.0.192:8000"

time="2022-06-13T09:08:53Z" level=debug msg="'504 Gateway Timeout' caused by: dial tcp 10.240.0.192:8000: i/o timeout"

From the above log entry, it looks like traefik-dask-gateway is unable to access api-dask-gateway (10.240.0.192:8000).
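
One way to confirm this independently of traefik is a plain TCP check from a pod subject to the same network policies (a hypothetical sketch, assuming a Python interpreter is available there); a connection timeout corresponds to traefik's "dial tcp ... i/o timeout":

import socket

try:
    with socket.create_connection(("10.240.0.192", 8000), timeout=5):
        print("api-dask-gateway:8000 is reachable")
except OSError as exc:
    print("blocked or unreachable:", exc)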


Solution:

The above is a completely different problem, but we added an entry to the network policy of api-dask-gateway so that traefik can talk to api-dask-gateway on port 8000. As soon as we applied this network policy, the SSL errors in controller-dask-gateway stopped appearing.

As you can see, the error messages at the beginning of this issue were completely misleading. They made us focus on fixing communication between controller-dask-gateway and the Kubernetes API server/endpoint, whereas the actual problem was traefik being unable to reach api-dask-gateway. This caused us a lot of frustration, not to mention the hair loss in the process! The whole thing is documented here so others can benefit from it, and hopefully the error messages in the Dask components can be improved.

The ingress rule we added to the api-dask-gateway network policy:

  - from:
    - podSelector: 
        matchLabels:
          app.kubernetes.io/component: traefik
    ports:
    - port: 8000
      protocol: TCP

The complete policy file for api-dask-gateway looks like this:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: daskhub-dask-gateway
  labels:
    policyMaker: hubocean
    
spec:

  policyTypes:
  - Ingress
  - Egress

  podSelector:
    matchLabels:
      app.kubernetes.io/component: gateway

  ingress:
  
  - from:
    - podSelector: 
        matchLabels:
          component: hub
    ports:
    - port: 8000
      protocol: TCP


  - from:
    - podSelector: 
        matchLabels:
          app.kubernetes.io/component: dask-scheduler
    ports:
    - port: 8000
      protocol: TCP

  - from:
    - podSelector: 
        matchLabels:
          app.kubernetes.io/component: traefik
    ports:
    - port: 8000
      protocol: TCP

  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring


  egress:

  - to:
    - podSelector:
        matchLabels:
          app: jupyterhub
          component: hub
    ports:
    - port: 8081
      protocol: TCP

  - to:
    - podSelector:
        matchLabels:
          app.kubernetes.io/component: dask-scheduler
    ports:
    - port: 8788
      protocol: TCP
      

  - to:
    ports:
    - port: 53
      protocol: "TCP"
    - port: 53
      protocol: "UDP"

  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - port: 443
      protocol: TCP
@jrbourbeau (Member)

This looks like a dask-gateway ticket, so transferring to the dask/dask-gateway repo

jrbourbeau transferred this issue from dask/dask Jun 14, 2022
@consideRatio (Collaborator)

Thanks for the detailed writeup! I'm on mobile and have only skimmed through the issue so far.

We have some network policies implemented in the daskhub namespace.

As soon as you let a NetworkPolicy target a pod, it becomes locked down to whatever is explicitly allowed by the NetworkPolicies targeting it. The dask-gateway Helm chart does not bundle NetworkPolicies, and therefore isn't locked down by default to what is needed for core functionality, so by adding a NetworkPolicy targeting the pods it works with, you could have caused this, and it would be expected.

In my mind, the dask-gateway Helm chart should ideally do that, and it seems you have figured out a lot of the networking you had to allow for core functionality after you ended up locking things down. Nice!
