
UCX not working with NAT #9526

Open
ziegenbalg opened this issue Dec 4, 2023 · 1 comment
Describe the bug

This is a continuation of a bug I've come across with the DAOS filesystem. UCX, by the looks of it, does not work behind NAT.

Steps to Reproduce

Set up a DAOS server where the client is hosted on a virt-manager VM using a NAT'ed network. This also applies to any other NAT.

What's happening:

While running the command 'daos pool query tank', a UCX connection gets established. Somewhere along the way, UCX tries to negotiate the session onto a higher-range dynamic port. This outgoing SYN request does not get properly routed by the client's NAT. I know this is technically a NAT configuration problem, but I was wondering if this is a known issue and whether anyone knows of a workaround/solution that doesn't involve advanced NAT table rules.

I've tried using UCX_TCP_CM_REUSEADDR and UCX_TCP_PORT_RANGE, but neither helps in this scenario.
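For reference, this is roughly how I set those knobs before running the client (the 10000-10100 range is an arbitrary example, not a value from the UCX docs):

```shell
# Constrain the ports UCX's TCP transport may use, so a static
# NAT/port-forward rule could cover them (example range only).
export UCX_TCP_PORT_RANGE=10000-10100

# Allow the TCP connection manager to rebind a recently used address.
export UCX_TCP_CM_REUSEADDR=y

# Then run the failing client command with the settings in effect:
daos pool query tank
```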

Setup and versions

debian 12
ucx+tcp
daos 2.4

Additional information (depending on the issue)

Attached pcap output.

Also:

[1701709287.814849] [elster-storage:3353 :a]   tcp_sockcm_ep.c:1124 UCX  DEBUG server created an endpoint on tcp_sockcm 0x7f918402f8c0 id: -1 state: 1
[1701709287.814853] [elster-storage:3353 :a]           async.c:230  UCX  DEBUG added async handler 0x7f9174001a90 [id=723 ref 1] uct_tcp_sa_data_handler() to hash
[1701709287.814860] [elster-storage:3353 :a]           async.c:508  UCX  DEBUG listening to async event fd 723 events 0x5 mode thread_spinlock
[1701709287.815390] [elster-storage:3353 :a]            sock.c:967  UCX  DEBUG check ifname for socket on 10.35.0.110:0
[1701709287.815498] [elster-storage:3353 :a]            sock.c:985  UCX  DEBUG matching ip found iface on eno1
[1701709287.815503] [elster-storage:3353 :a]   tcp_sockcm_ep.c:648  UCX  DEBUG fd 723: remote_data: (field_mask=15) dev_addr: <invalid address family> (length=6), conn_priv_data_length=47
[1701709287.815505] [elster-storage:3353 :a]       wireup_cm.c:1130 UCX  DEBUG server received a connection request on the rdmacm sockaddr transport (worker=0x7f9184039eb0 cm=0x7f918402f8c0 worker_cms_index=0)
[1701709287.815529] [elster-storage:3353 :1]          ucp_ep.c:354  UCX  DEBUG created ep 0x7f918804b000 to <no debug data> conn_request on uct_listener
[1701709287.815609] [elster-storage:3353 :1]          wireup.c:1071 UCX  DEBUG   ep 0x7f918804b000: am_lane 1 wireup_msg_lane 1 cm_lane 0 keepalive_lane <none> reachable_mds 0x1
[1701709287.815613] [elster-storage:3353 :1]          wireup.c:1094 UCX  DEBUG   ep 0x7f918804b000: lane[0]: cm <unknown>
[1701709287.815617] [elster-storage:3353 :1]          wireup.c:1094 UCX  DEBUG   ep 0x7f918804b000: lane[1]:  0:tcp/eno1.0 md[0]              -> addr[0].md[0]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
[1701709287.815620] [elster-storage:3353 :1]          tcp_ep.c:259  UCX  DEBUG   tcp_ep 0x7f9184068190: created on iface 0x7f91840376b0, fd -1
[1701709287.815623] [elster-storage:3353 :1]       wireup_ep.c:543  UCX  DEBUG   ep 0x7f918804b000: wireup_ep 0x7f9184296c70 created next_ep 0x7f9184068190 to <no debug data> using tcp/eno1
[1701709287.816574] [elster-storage:3353 :1]          tcp_cm.c:96   UCX  DEBUG   tcp_ep 0x7f9184068190: CLOSED -> CONNECTING for the [10.35.0.110:60099]<->[192.168.50.111:40695]:125 connection [-:Rx]
[1701709287.816587] [elster-storage:3353 :1]          tcp_cm.c:96   UCX  DEBUG   tcp_ep 0x7f9184068190: CONNECTING -> CONNECTING for the [10.35.0.110:60099]<->[192.168.50.111:40695]:125 connection [-:Rx]
[1701709287.818856] [elster-storage:3353 :1]            sock.c:325  UCX  ERROR   connect(fd=724, dest_addr=192.168.50.111:40695) failed: Connection refused
[1701709287.818864] [elster-storage:3353 :1]       wireup_cm.c:1239 UCX  WARN  server ep 0x7f918804b000 failed to connect to remote address on device eno1, tl_bitmap 0x1 0x0, status Destination is unreachable
[1701709287.818883] [elster-storage:3353 :1]           async.c:155  UCX  DEBUG removed async handler 0x7f9174001a90 [id=723 ref 1] uct_tcp_sa_data_handler() from hash
[1701709287.818888] [elster-storage:3353 :1]           async.c:561  UCX  DEBUG removing async handler 0x7f9174001a90 [id=723 ref 1] uct_tcp_sa_data_handler()
[1701709287.818894] [elster-storage:3353 :1]           async.c:170  UCX  DEBUG release async handler 0x7f9174001a90 [id=723 ref 0] uct_tcp_sa_data_handler()
[1701709287.818908] [elster-storage:3353 :1]          ucp_ep.c:1209 UCX  DEBUG ep 0x7f918804b000: destroy
[1701709287.818909] [elster-storage:3353 :1]          ucp_ep.c:1459 UCX  DEBUG ep 0x7f918804b000: cleanup lanes
[1701709287.818911] [elster-storage:3353 :1]          ucp_ep.c:1469 UCX  DEBUG ep 0x7f918804b000: pending & destroy uct_ep[1]=0x7f9184296c70
[1701709287.818914] [elster-storage:3353 :1]       wireup_ep.c:471  UCX  DEBUG ep 0x7f918804b000: destroy wireup ep 0x7f9184296c70
[1701709287.818916] [elster-storage:3353 :1]          ucp_ep.c:1267 UCX  DEBUG ep 0x7f918804b000: unprogress iface 0x7f91840376b0 tcp/eno1
[1701709287.819885] [elster-storage:3353 :1]          tcp_ep.c:358  UCX  DEBUG tcp_ep 0x7f9184068190: purge outstanding operations with status Request canceled
[1701709287.819895] [elster-storage:3353 :1]          tcp_cm.c:96   UCX  DEBUG tcp_ep 0x7f9184068190: CONNECTING -> CLOSED for the [10.35.0.110:60099]<->[192.168.50.111:40695]:125 connection [-:-]
[1701709287.819897] [elster-storage:3353 :1]          tcp_ep.c:408  UCX  DEBUG tcp_ep 0x7f9184068190: destroyed on iface 0x7f91840376b0

packets.zip

^ Notice how, after the successful UCX connection with 10.35.0.110:31416, at packet #32 10.35.0.110 tried to open a new connection to port 57213 on the client's machine. This is not routed successfully by the client's NAT.

What could be going on here, and what would be the correct approach to solving this issue? Any advice on NAT configuration is also appreciated, though ideally I'd like to solve this with as little NAT configuration as possible.
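If NAT rules turn out to be unavoidable, a minimal sketch would be to forward a fixed port range from the virtualization host into the guest. This assumes libvirt's default iptables-based NAT, the guest at 192.168.50.111 (as in the logs above), and UCX restricted to a known range via UCX_TCP_PORT_RANGE, e.g. 10000-10100:

```shell
# On the virtualization host: DNAT the UCX TCP port range (example:
# 10000-10100) arriving from outside to the NAT'ed guest.
iptables -t nat -A PREROUTING -p tcp --dport 10000:10100 \
         -j DNAT --to-destination 192.168.50.111
# Permit the forwarded traffic through the FORWARD chain.
iptables -A FORWARD -p tcp -d 192.168.50.111 --dport 10000:10100 -j ACCEPT
```

Note this only handles the server's back-connection; any addresses UCX embeds in its connection payload still refer to the guest's private IP, so bridged or routed networking is likely more robust than port forwarding.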

Thank you very much for your time.

@ziegenbalg ziegenbalg added the Bug label Dec 4, 2023
@sb22bs
Copy link

sb22bs commented Dec 23, 2023

Hi

Why do you think this should work?

Usually you have all parties on the same subnet when working with MPI/UCX and friends.

Also the daos.io homepage states:

Subnet (https://docs.daos.io/v2.4/admin/predeployment_check/#subnet)

Since all engines need to be able to communicate, the different network interfaces must be on the same subnet, or you must configure routing across the different subnets.

So just make sure that everyone is seeing each other.
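In practice "seeing each other" means either bridging the VM onto the server's subnet, or switching the virt-manager network from NAT to routed mode and adding a route on the server side. A sketch, with the subnets taken from the logs and <virt-host-ip> as a placeholder for the virtualization host's address on the server's network:

```shell
# On the DAOS server (10.35.0.110): route the VM subnet via the
# virtualization host, so back-connections to 192.168.50.0/24 are
# delivered directly instead of being dropped at the NAT.
ip route add 192.168.50.0/24 via <virt-host-ip> dev eno1
```

The guest network must also be in routed (not NAT) mode for return traffic to work, since with NAT the guest's 192.168.50.x address is never reachable from outside.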
