nexus does not re-use CruciblePantryClient during disk imports. #5717

Closed
faithanalog opened this issue May 7, 2024 · 3 comments · Fixed by #5750

Comments

@faithanalog (Contributor) commented May 7, 2024

While attempting to upload 64 400-MiB disk images concurrently to madrid (a 4-sled racklette), I caused Nexus to consume 100% of the sled's CPU resources. The uploads were extremely slow, and over 50% of them failed to fully upload their disks.

madrid is running commit fcf7980.

You can read the full adventure here: https://github.com/oxidecomputer/artemiscellaneous/blob/main/journal/2024-05-06-slow-disk-uploads.md#adventures-in-parallel-boot-disk-uploads

Here are the important bits:

I uploaded the images like this:

seq 1 64 | xargs -I '%' -P 64 sh -c '
    oxide disk import \
    --project iodriver --description "iodriver boot medium" \
    --path "/dev/shm/iodriver.iso" \
    --disk "iodriver-boot-%"
'

Each image uploaded at about 300 KiB/s, or about 18 MiB/s in aggregate. This generated tremendous resource contention in Nexus.

prstat output from the sled running Nexus:

root@BRM42220046:~# prstat
Total: 332 processes, 12302 lwps, load averages: 127.20, 120.51, 83.41
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 12198 root     1007M  943M cpu96    0    0  84:49:00  98% nexus/145
   163 root     7156K 5616K sleep   59    0   0:40:44 0.2% devfsadm/14
  4207 root      792M  606M sleep   59    0   0:16:42 0.1% cockroach/136
     5 root        0K    0K sleep   98  -20   0:02:22 0.0% zpool-rpool/327
  1496 root        0K    0K sleep   98  -20   0:04:13 0.0% zpool-oxp_4a1be/327
   652 root       92M   57M sleep   59    0   0:05:02 0.0% sled-agent/135
  1569 root        0K    0K sleep   98  -20   0:01:30 0.0% zpool-oxp_bc8f5/327
 10782 root       57M   47M sleep   59    0   0:00:02 0.0% dtrace/2
  1546 root        0K    0K sleep   98  -20   0:01:00 0.0% zpool-oxp_8ada9/327
   142 root       33M   29M sleep   59    0   0:00:02 0.0% syseventd/19
   587 root     4540K 2392K sleep   59    0   0:00:02 0.0% syslogd/10
   543 root     2152K  728K sleep   59    0   0:00:00 0.0% utmpd/1
   484 root     7444K 4944K sleep   59    0   0:00:06 0.0% nscd/23
    48 root     3076K 1996K sleep   59    0   0:00:00 0.0% dlmgmtd/7
   732 root        0K    0K sleep   98  -20   0:00:00 0.0% zpool-oxi_f32c2/327

Flame graph of Nexus (image: disk-upload-flamegraph-nexus; view the raw file for interactivity)

I was uploading these images from an instance in dogfood, which is located in the same physical place as madrid, so I had a high-speed connection. Despite this, the graph of network traffic from the dogfood instance is very sporadic.

(image: network traffic graph from the dogfood instance)

The problem, ultimately, is that Nexus is not caching its CruciblePantryClient. The oxide CLI splits the image into 512 KiB blocks, and each time one of these blocks is uploaded, Nexus creates a new CruciblePantryClient from scratch and pays all the overhead of setting up that client. The flamegraph shows the overhead is mostly down to reqwest preparing a TLS-capable connector (even though TLS is never used on this connection), which causes openssl to read and parse the system's trusted certificate roots.

We are managing to upload 37 of these blocks per second, but we have 64 CLI invocations × 8 worker threads per invocation = 512 client worker threads fighting over this throughput, which leads to a number of them failing, presumably due to connection timeouts.

This problem is probably not unique to massively parallel uploads, and uploading a few images in parallel likely generates a considerable load (though much less contention), but I have not measured this.

Caching the CruciblePantryClient between block uploads should make this a lot better.
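
For illustration, here is a minimal Rust sketch of the difference between the two patterns. This is not Nexus's actual import path: the function names and the bulk-write URL are hypothetical, and only reqwest itself is assumed.

// Illustrative sketch only: not Nexus's actual code.
use std::time::Duration;

// Anti-pattern: a new reqwest::Client (and with it a fresh TLS-capable
// connector, including reading the system certificate roots) is built for
// every 512 KiB chunk.
async fn import_chunk_per_client(pantry_url: &str, chunk: Vec<u8>) -> reqwest::Result<()> {
    let client = reqwest::Client::new(); // expensive, repeated per chunk
    client
        .post(format!("{pantry_url}/bulk-write")) // hypothetical endpoint
        .body(chunk)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}

// Fix: build the client once and reuse it for every chunk. Reusing one
// client also lets reqwest pool the underlying connections.
async fn import_all_chunks(pantry_url: &str, chunks: Vec<Vec<u8>>) -> reqwest::Result<()> {
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(60)) // arbitrary value for the sketch
        .build()?;
    for chunk in chunks {
        client
            .post(format!("{pantry_url}/bulk-write")) // hypothetical endpoint
            .body(chunk)
            .send()
            .await?
            .error_for_status()?;
    }
    Ok(())
}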

@iliana (Contributor) commented May 7, 2024

It probably makes sense to audit Nexus for any reqwest::Client creation inside async tasks, because that is evidently a blocking operation that can result in some pretty nasty resource contention.

Client is relatively cheap to clone, because it's an Arc<ClientRef>. ("You do not have to wrap the Client in an Rc or Arc to reuse it, because it already uses an Arc internally.") So we could build it once, keep it around in the ServerContext, and pass a clone around to any code that needs to create a new Progenitor client.

I think we have a number of differently-configured clients (different timeouts, for instance), so we may need to be a bit more clever, but Artemis and I are going to try out a basic version of this change with the Crucible Pantry client.
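
As a rough sketch of that idea (the struct layout and method names here are invented for illustration; crucible_pantry_client is assumed to be the Progenitor-generated pantry client crate, using the new_with_client constructor discussed below):

use crucible_pantry_client::Client as PantryClient;

pub struct ServerContext {
    // Built once at startup. reqwest::Client is an Arc internally, so
    // cloning it is cheap and all clones share one connection pool.
    reqwest_client: reqwest::Client,
    // ... other Nexus state ...
}

impl ServerContext {
    pub fn new() -> Self {
        Self {
            // The TLS-connector / certificate-root setup cost is paid once,
            // off the per-request path.
            reqwest_client: reqwest::Client::new(),
        }
    }

    // Hand out a pantry client without rebuilding the underlying
    // reqwest::Client.
    pub fn pantry_client(&self, base_url: &str) -> PantryClient {
        PantryClient::new_with_client(base_url, self.reqwest_client.clone())
    }
}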

@jgallagher (Contributor) commented:

> Client is relatively cheap to clone, because it's an Arc<ClientRef>. ("You do not have to wrap the Client in an Rc or Arc to reuse it, because it already uses an Arc internally.") So we could build it once, keep it around in the ServerContext, and pass a clone around to any code that needs to create a new Progenitor client.

I realize I just suggested @smklein deprioritize connection pooling a bit, but I think this is the kind of thing we need it for. Keeping individual clients around is definitely wrong if they're configured with specific IP addresses (which is pretty common throughout omicron, I think?), since they may go stale if reconfigurator adjusts services. I'm less sure about keeping clients around that are given DNS names, but then we're subject to however reqwest does DNS resolution (which is also something we ultimately want to control as part of the connection pool).

@iliana (Contributor) commented May 7, 2024

It's pretty common to have one-off Progenitor-generated clients with specific IPs around, but that's a level up from reqwest::Client (we pass the IP as part of the base URL for the Progenitor-generated client). So if we keep one (or a few, depending on what timeout settings we need) around and reuse them with <progenitor_crate>::Client::new_with_client, that will get us Reqwest's connection pooling. We may want to swap that out later.
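
One possible shape for the "one or a few clients" version, with invented names and arbitrary timeout values purely for illustration:

use std::time::Duration;

// A small set of shared reqwest::Clients with different configurations,
// built once and cloned wherever a Progenitor-generated client is needed.
pub struct SharedHttpClients {
    // For ordinary request/response calls.
    pub default: reqwest::Client,
    // For long-running operations such as pantry bulk writes.
    pub long_timeout: reqwest::Client,
}

impl SharedHttpClients {
    pub fn new() -> Result<Self, reqwest::Error> {
        Ok(Self {
            default: reqwest::Client::builder()
                .timeout(Duration::from_secs(15))
                .build()?,
            long_timeout: reqwest::Client::builder()
                .timeout(Duration::from_secs(120))
                .build()?,
        })
    }
}

// Usage sketch: reuse a shared client when constructing the generated
// client, which also gets us reqwest's connection pooling.
//
//     let pantry = crucible_pantry_client::Client::new_with_client(
//         &format!("http://{}", pantry_address),
//         shared.long_timeout.clone(),
//     );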

I think this and the CRDB connection pooling issues are not orthogonal, but a stop-gap solution for HTTP seems easier at least. It's kind of weird how expensive reqwest::Client::new is.

iliana added a commit that referenced this issue May 14, 2024
Fixes #5717. @faithanalog will follow up with performance numbers she's running now.

We should probably do this to other clients we generate on-the-fly in
Nexus, too, but this is probably the worst offender to fix.