Multipart layer fetch #10177

Open · wants to merge 1 commit into base: main

Conversation

@azr (Contributor) commented May 6, 2024

TLDR: this makes pulls of big images ~2x faster, and closes #9922. Questions first, explanation second, metrics third, observations last.

cc: #8160, #4989


I have two (and a half) questions:

  • Do you want this?
  • How would you rather I pass the parallelism + chunk_size parameters, instead of via getenv? (A minimal sketch of the getenv approach follows this list.)
    • Would a config file setting be better?
      • If so, do you have a sane example I can start from?
    • Or are env variables – like the ones for enabling tracing – better?
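
For illustration only, here is a minimal sketch of what the getenv approach looks like. The variable names `CONTAINERD_FETCH_PARALLELISM` and `CONTAINERD_FETCH_CHUNK_SIZE_MB` are made up for this sketch and are not the names the draft actually uses:

```go
// Illustrative only: how getenv-style knobs could be read with defaults.
// The variable names are invented for this sketch, not the draft's actual names.
package main

import (
	"os"
	"strconv"
)

func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

var (
	chunkParallelism = envInt("CONTAINERD_FETCH_PARALLELISM", 0)    // 0 keeps the current single-stream behaviour
	chunkSizeMB      = envInt("CONTAINERD_FETCH_CHUNK_SIZE_MB", 32) // size of each ranged GET
)
```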

Hello containerd people, I have a draft PR here that I would like to get your eyes on.

It makes pulls faster while trying to keep the memory impact small, by fetching consecutive chunks of each layer in parallel and immediately pushing them, in order, into the existing pipe (the one that writes to a file and feeds the digest/checksum verification).
With the right settings I saw pulls get ~2x faster.
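
To make the idea concrete, here is a minimal sketch of that scheme (not the PR's actual code): range requests are issued with bounded parallelism per layer, and completed chunks are written into the destination strictly in order so the digest verification keeps seeing a sequential stream. `parallelism` and `chunkSize` stand in for the draft's settings.

```go
// Sketch only, not the PR's code: fetch a blob with HTTP Range requests,
// bounded parallelism, and strictly in-order writes into w (e.g. a
// digest-verifying writer), so the checksum still sees a sequential stream.
package main

import (
	"fmt"
	"io"
	"net/http"

	"golang.org/x/sync/errgroup"
)

func fetchMultipart(client *http.Client, url string, size int64, w io.Writer, parallelism int, chunkSize int64) error {
	nChunks := int((size + chunkSize - 1) / chunkSize)
	chunks := make([]chan []byte, nChunks)
	for i := range chunks {
		chunks[i] = make(chan []byte, 1)
	}

	var g errgroup.Group
	g.SetLimit(parallelism) // at most `parallelism` in-flight range requests for this layer

	for i := 0; i < nChunks; i++ {
		i := i
		g.Go(func() error {
			start := int64(i) * chunkSize
			end := start + chunkSize
			if end > size {
				end = size
			}
			req, err := http.NewRequest(http.MethodGet, url, nil)
			if err != nil {
				return err
			}
			req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end-1))
			resp, err := client.Do(req)
			if err != nil {
				return err
			}
			defer resp.Body.Close()
			buf, err := io.ReadAll(resp.Body)
			if err != nil {
				return err
			}
			chunks[i] <- buf // buffered, so the fetcher never blocks on the writer
			return nil
		})
	}

	// Drain chunks in order so the downstream pipe (file + checksum) stays sequential.
	writeDone := make(chan error, 1)
	go func() {
		for i := 0; i < nChunks; i++ {
			if _, err := w.Write(<-chunks[i]); err != nil {
				writeDone <- err
				return
			}
		}
		writeDone <- nil
	}()

	if err := g.Wait(); err != nil {
		return err
	}
	return <-writeDone
}
```

A real implementation would also cancel outstanding requests on error and bound how many completed chunks may sit in memory waiting for their turn to be written; the sketch only bounds the number of in-flight requests.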

The settings have a big impact, so I ran a bunch of perf tests with different combinations. Here are some results for a ~8GB image on an r6id.4xlarge instance, pulling from S3.
Gains are similar on a ~27GB and a ~100GB image (with a tiny bit of slowdown).
I also tried NVMe and EBS drives; they are of course slower, but the gains are still the same.


Metrics from an r6id.4xlarge, timing `crictl pull` of an 8.6GB image.

The first table (13 runs) is with 0 parallelism, i.e. the current code.
The rest are runs with different settings:

  • c_para (max number of chunks being pulled per layer at once)
  • chunk_size_mb (size of chunks in MB)
  • ctd_max_con (max number of layers pulled at once)
tmpfs tests:
dst    avg_time          count(*)
-----  ----------------  --------
tmpfs  44.0761538461539  13      

dst    c_para  chunk_size_mb  ctd_max_con  avg_time  count(*)
-----  ------  ------------  -----------  --------  --------
tmpfs  110     32            3            22.625    4       
tmpfs  100     32            3            22.64     5       
tmpfs  130     32            2            22.76     1       
tmpfs  120     32            4            22.824    5       
tmpfs  110     32            2            22.85     1       
tmpfs  80      32            4            22.99     1       
tmpfs  110     32            4            23.018    5       
tmpfs  90      64            4            23.09     1       
tmpfs  90      32            3            23.18     1       
tmpfs  110     64            3            23.2125   4       
tmpfs  80      64            3            23.29     1       
tmpfs  90      64            3            23.32     1       
tmpfs  100     32            4            23.352    5       
tmpfs  70      15            4            23.4      1       
tmpfs  100     64            3            23.65     5       
tmpfs  120     15            3            23.68     1       
tmpfs  110     64            2            23.74     1       
tmpfs  100     64            4            23.77     5       
tmpfs  70      32            4            23.81     5       
tmpfs  120     32            3            23.83     5
[...]
nvme (885GB) tests:
dst         avg_time          count(*)
----------  ----------------  --------
added-nvme  47.4008333333333  12      

dst         c_para  chunk_size_mb  ctd_max_con  avg_time  count(*)
----------  ------  ------------  -----------  --------  --------
added-nvme  130     32            3            25.24     1       
added-nvme  70      32            4            26.1      1       
added-nvme  80      32            3            26.31     1       
added-nvme  100     32            3            26.38     1       
added-nvme  120     32            4            26.58     1       
added-nvme  130     32            2            26.71     1       
added-nvme  80      32            4            26.73     1       
added-nvme  120     10            3            26.82     1       
added-nvme  80      64            3            26.93     1       

Observations: I wrote a little Go program that multipart-downloads big files directly into a file at different positions, one request per range, and that was much faster than piping single-threadedly into a file. containerd pipes through a checksummer and then into a file; I think that in some conditions this can create some sort of thrashing, which is why the parameters matter so much here.

That simple Go program had pretty bad performance with a single connection, but with multiple connections I was able to saturate the network, with performance better than or on par with aws-crt.

I think that for maximum performance we could re-architect things a bit: concurrently write directly into the temp file at the right offsets, tell the checksummer our progress so it can hash in parallel, and then carry on as usual. A rough sketch of that idea follows.
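
A minimal sketch of that re-architecture idea, assuming the blob size is known up front: each worker writes its range straight into the preallocated temp file at the right offset with WriteAt, so no in-order pipe is needed; verification would have to be fed separately, as discussed further down.

```go
// Sketch only: download ranges concurrently and write each one directly at its
// offset in a preallocated temp file (os.File.WriteAt is safe for concurrent use).
// Verification is not handled here.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"

	"golang.org/x/sync/errgroup"
)

func downloadToFile(client *http.Client, url string, size int64, path string, parallelism int, chunkSize int64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := f.Truncate(size); err != nil { // reserve the full size up front (sparse "holes")
		return err
	}

	var g errgroup.Group
	g.SetLimit(parallelism)
	for off := int64(0); off < size; off += chunkSize {
		off := off
		g.Go(func() error {
			end := off + chunkSize
			if end > size {
				end = size
			}
			req, err := http.NewRequest(http.MethodGet, url, nil)
			if err != nil {
				return err
			}
			req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, end-1))
			resp, err := client.Do(req)
			if err != nil {
				return err
			}
			defer resp.Body.Close()
			buf, err := io.ReadAll(resp.Body)
			if err != nil {
				return err
			}
			_, err = f.WriteAt(buf, off) // lands at the right position, no in-order pipe needed
			return err
		})
	}
	return g.Wait()
}
```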

@k8s-ci-robot

Hi @azr. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@azr azr force-pushed the azr/parallel-layer-fetch branch 4 times, most recently from 2531c18 to 0748b0f on May 7, 2024 09:57
@azr azr marked this pull request as ready for review May 7, 2024 09:59
@azr azr changed the title from "Parallel layer fetch" to "Multipart layer fetch" May 7, 2024
@swagatbora90 (Contributor) commented May 8, 2024

@azr Thanks for the PR, this looks promising. I wonder if you were able to get any memory usage data from your tests? A previous effort to use the ECR containerd resolver, which does a similar multipart layer download, showed that it can take up a disproportionate amount of memory, especially as the number of parallel chunks increases (without providing a significant latency benefit). The high memory utilization came mainly from the htcat library that the ECR resolver uses to perform parallel ranged GETs. I think we should understand these tradeoffs.

Also, can you share some information about your test image? Number of layers? Size of individual layers?

@akhilerm (Member) commented May 9, 2024

/ok-to-test

@azr (Contributor, Author) commented May 14, 2024

Hey @swagatbora90, of course!

My working theory is that, in the worst case, this should use max_concurrent_downloads * (max_parallelism * goroutine footprint) memory, where the goroutine footprint is: the goroutine stack, a 32 * 1024 byte buffer, and a request clone. io.Copy will create buffers of 32 * 1024 bytes here; I have not tried playing with buffer sizes, which could be an option too. A rough back-of-the-envelope sketch follows.
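
A back-of-the-envelope version of that worst case, plus how a different buffer size could be tried via io.CopyBuffer (32 KiB is io.Copy's default buffer size); the concrete numbers are only examples:

```go
// Back-of-the-envelope for the buffer part of that worst case; the concrete
// numbers are only examples, and goroutine stacks / request clones add more.
package main

import "io"

const (
	maxConcurrentDownloads = 3         // layers pulled at once
	maxParallelism         = 110       // chunks in flight per layer
	copyBufSize            = 32 * 1024 // io.Copy's default buffer size
)

// ~3 * 110 * 32 KiB ≈ 10 MiB of copy buffers in the worst case.
const worstCaseCopyBuffers = maxConcurrentDownloads * maxParallelism * copyBufSize

// io.CopyBuffer lets the caller choose the buffer size instead of the 32 KiB default.
func copyWithBuffer(dst io.Writer, src io.Reader, bufSize int) (int64, error) {
	return io.CopyBuffer(dst, src, make([]byte, bufSize))
}
```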

I think memory usage would be better if we wrote in parallel directly into the file at different positions, with 'holes', and then reported our progress to the checksummer, e.g. with no-op writers that just record where we are. (Downloading was actually much faster this way in a test program I wrote, but it was not doing any unpacking, etc.) Something like the sketch below.
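
For the "tell the checksummer our progress" part, one possible shape (purely illustrative, and assuming ranges are written with WriteAt as sketched earlier): a goroutine hashes the file's contiguous completed prefix as it grows, so verification overlaps with the download. The `progress` channel is a hypothetical notification carrying the size of the contiguous prefix that is fully written.

```go
// Sketch: hash the temp file's contiguous completed prefix while later ranges
// are still being written (with WriteAt, as above). `progress` is a hypothetical
// notification carrying the size of the contiguous prefix that is fully written.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"os"
)

func hashAsWeGo(f *os.File, progress <-chan int64) (string, error) {
	h := sha256.New()
	var hashed int64
	for committed := range progress {
		if committed <= hashed {
			continue
		}
		// Hash only the newly completed bytes, in order, straight from the file.
		if _, err := io.Copy(h, io.NewSectionReader(f, hashed, committed-hashed)); err != nil {
			return "", err
		}
		hashed = committed
	}
	return "sha256:" + hex.EncodeToString(h.Sum(nil)), nil
}
```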

I also think it could be nice to have a per-registry parallelism setting, because not all registries are S3-backed, and docker.io seems to throttle things at around 60 MB/s.


Topology of images:

~8GB image

From crictl images, size is 3.97GB

dive info:

[Screenshot: dive output for the ~8GB image]

Total Image size: 8.6 GB
Potential wasted space: 34 MB
Image efficiency score: 99 %

~27GB image

From crictl images, size is 17.7GB

dive info:

[Screenshot: dive output for the ~27GB image]

Total Image size: 27 GB
Potential wasted space: 147 MB
Image efficiency score: 99 %

Here are the memory usages, where I periodically record `ps -p $pid -o rss=` for containerd running in debug mode (started with vscode), with GC traces enabled.

~27GB image pull, max_concurrent_downloads: 2, 0 parallelism (before)

[Chart: memory_usage_17g_pull_before]
(Typo in the chart: read KB as MB.)

~27GB image pull, max_concurrent_downloads: 2, 110 parallelism, 32mb chunks

[Chart: memory_usage_17g_pull_110p_32mbc]
(Typo in the chart: read KB as MB.)


GC traces:

8GB image with `GODEBUG=gctrace=1`, parallelism set to 110 and chunksize set to 32
INFO[2024-05-13T14:35:20.998417039Z] PullImage "..." 
gc 6 @7.661s 0%: 0.050+1.4+0.049 ms clock, 0.80+0.24/4.2/9.7+0.79 ms cpu, 7->7->5 MB, 7 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 7 @8.021s 0%: 0.053+1.5+0.053 ms clock, 0.86+0.057/4.5/9.6+0.85 ms cpu, 11->12->6 MB, 12 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 81 @2400.006s 0%: 0.039+0.20+0.002 ms clock, 0.15+0/0.18/0.43+0.010 ms cpu, 0->0->0 MB, 1 MB goal, 0 MB stacks, 0 MB globals, 4 P (forced)
gc 8 @8.246s 0%: 0.14+1.5+0.069 ms clock, 2.3+0.091/4.9/11+1.1 ms cpu, 13->13->9 MB, 14 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 9 @10.168s 0%: 0.70+5.3+0.056 ms clock, 11+12/15/0.12+0.91 ms cpu, 18->20->11 MB, 19 MB goal, 1 MB stacks, 0 MB globals, 16 P
gc 10 @10.181s 0%: 0.32+4.0+0.10 ms clock, 5.2+13/10/0+1.6 ms cpu, 21->22->11 MB, 25 MB goal, 1 MB stacks, 0 MB globals, 16 P
gc 11 @10.868s 0%: 0.16+2.0+0.008 ms clock, 2.5+5.2/6.9/8.7+0.14 ms cpu, 26->26->19 MB, 26 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 12 @11.141s 0%: 0.10+2.3+0.055 ms clock, 1.6+0.23/7.3/14+0.88 ms cpu, 37->37->21 MB, 41 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 13 @11.366s 0%: 0.94+3.0+0.051 ms clock, 15+0.19/8.1/13+0.82 ms cpu, 40->40->22 MB, 43 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 14 @11.940s 0%: 0.81+2.0+0.047 ms clock, 12+0.29/7.1/13+0.76 ms cpu, 41->41->22 MB, 45 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 15 @12.879s 0%: 0.45+2.8+0.084 ms clock, 7.3+0.18/6.8/14+1.3 ms cpu, 43->43->22 MB, 45 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 16 @13.172s 0%: 0.052+2.5+0.089 ms clock, 0.83+0.21/8.1/13+1.4 ms cpu, 45->45->23 MB, 46 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 17 @13.453s 0%: 0.22+3.7+0.069 ms clock, 3.5+0.21/7.6/14+1.1 ms cpu, 46->47->23 MB, 47 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 18 @13.867s 0%: 0.14+2.4+0.080 ms clock, 2.2+0.17/7.2/14+1.2 ms cpu, 47->47->23 MB, 48 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 19 @14.120s 0%: 0.051+3.1+0.047 ms clock, 0.82+0.63/7.6/14+0.75 ms cpu, 48->49->25 MB, 49 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 20 @14.352s 0%: 0.055+2.8+0.007 ms clock, 0.88+0.14/6.4/12+0.11 ms cpu, 50->51->20 MB, 52 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 21 @14.490s 0%: 0.12+2.7+0.052 ms clock, 1.9+0.14/6.7/12+0.84 ms cpu, 41->42->13 MB, 42 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 22 @14.528s 0%: 0.12+1.8+0.051 ms clock, 2.0+0.22/6.5/12+0.81 ms cpu, 27->27->19 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 23 @15.572s 0%: 0.14+2.1+0.078 ms clock, 2.2+0.083/6.6/12+1.2 ms cpu, 39->39->20 MB, 40 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 24 @15.737s 0%: 0.053+1.5+0.076 ms clock, 0.85+0.092/5.3/12+1.2 ms cpu, 40->41->20 MB, 42 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 25 @15.963s 0%: 0.60+2.4+0.082 ms clock, 9.6+0.067/6.4/11+1.3 ms cpu, 39->40->12 MB, 41 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 26 @28.410s 0%: 0.18+1.4+0.004 ms clock, 2.9+0.064/4.8/9.9+0.072 ms cpu, 24->25->13 MB, 26 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 27 @28.530s 0%: 0.054+2.2+0.051 ms clock, 0.86+0/5.8/8.6+0.82 ms cpu, 27->27->13 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 28 @28.638s 0%: 0.044+1.7+0.052 ms clock, 0.71+0.063/5.2/9.0+0.84 ms cpu, 27->27->13 MB, 27 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 29 @28.771s 0%: 0.043+1.7+0.047 ms clock, 0.69+0.087/5.3/10+0.75 ms cpu, 27->27->14 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 30 @29.766s 0%: 0.11+2.1+0.087 ms clock, 1.8+0/6.7/9.7+1.3 ms cpu, 28->28->14 MB, 29 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 31 @34.550s 0%: 0.054+1.9+0.004 ms clock, 0.87+0.062/5.3/9.8+0.072 ms cpu, 28->28->14 MB, 29 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 32 @34.665s 0%: 0.046+1.5+0.051 ms clock, 0.75+0.057/4.9/10+0.83 ms cpu, 29->29->15 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 33 @34.779s 0%: 0.043+1.4+0.008 ms clock, 0.70+0.076/4.7/10+0.13 ms cpu, 30->30->15 MB, 31 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 34 @34.915s 0%: 0.12+1.7+0.010 ms clock, 1.9+0.10/5.2/10+0.16 ms cpu, 30->30->15 MB, 31 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 35 @35.284s 0%: 0.052+1.4+0.005 ms clock, 0.84+0.072/4.7/10+0.081 ms cpu, 31->31->15 MB, 31 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 36 @35.414s 0%: 0.11+1.7+0.047 ms clock, 1.9+0.095/6.0/10+0.75 ms cpu, 31->32->16 MB, 32 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 37 @35.544s 0%: 0.049+2.2+0.055 ms clock, 0.79+0.081/6.7/11+0.89 ms cpu, 32->32->17 MB, 33 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 38 @35.695s 0%: 0.10+2.3+0.004 ms clock, 1.6+0.058/5.6/9.6+0.077 ms cpu, 34->34->17 MB, 35 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 39 @35.876s 0%: 0.14+2.4+0.047 ms clock, 2.2+0.073/6.0/9.6+0.75 ms cpu, 34->34->17 MB, 35 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 40 @35.997s 0%: 0.046+2.2+0.006 ms clock, 0.74+0.064/6.1/10+0.11 ms cpu, 34->35->17 MB, 35 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 41 @36.117s 0%: 0.10+2.3+0.046 ms clock, 1.7+0.058/5.8/10+0.74 ms cpu, 35->35->17 MB, 36 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 42 @36.237s 0%: 0.039+2.4+0.048 ms clock, 0.63+0.069/4.9/11+0.77 ms cpu, 35->35->18 MB, 36 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 43 @36.362s 0%: 0.038+1.9+0.020 ms clock, 0.61+0.077/5.8/10+0.33 ms cpu, 36->36->18 MB, 37 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 44 @36.488s 0%: 0.031+2.5+0.008 ms clock, 0.49+0.079/5.7/9.7+0.13 ms cpu, 36->37->18 MB, 37 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 82 @2430.001s 0%: 0.029+0.17+0.003 ms clock, 0.11+0/0.16/0.41+0.012 ms cpu, 0->0->0 MB, 1 MB goal, 0 MB stacks, 0 MB globals, 4 P (forced)
gc 45 @39.116s 0%: 0.50+2.7+0.089 ms clock, 8.0+0.24/6.5/10+1.4 ms cpu, 37->37->19 MB, 38 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 46 @39.838s 0%: 0.053+2.5+0.051 ms clock, 0.85+0.078/6.8/11+0.81 ms cpu, 38->38->20 MB, 39 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 47 @39.957s 0%: 0.059+1.4+0.055 ms clock, 0.94+0.11/4.5/10+0.88 ms cpu, 40->40->13 MB, 41 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 48 @40.036s 0%: 0.041+1.7+0.056 ms clock, 0.66+0.066/5.5/8.8+0.91 ms cpu, 27->28->7 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 49 @40.061s 0%: 0.030+2.1+0.059 ms clock, 0.49+0.060/6.3/9.7+0.94 ms cpu, 13->14->7 MB, 15 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 50 @40.083s 0%: 0.075+1.7+0.048 ms clock, 1.2+0.17/5.0/8.8+0.78 ms cpu, 13->14->7 MB, 15 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 51 @40.104s 0%: 0.12+1.5+0.081 ms clock, 1.9+0.049/4.9/9.6+1.3 ms cpu, 14->15->9 MB, 16 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 52 @40.115s 0%: 0.18+5.4+0.15 ms clock, 2.9+0.51/11/16+2.4 ms cpu, 17->18->13 MB, 19 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 53 @40.157s 0%: 0.083+1.8+0.092 ms clock, 1.3+0.078/5.2/8.5+1.4 ms cpu, 26->26->13 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 54 @40.847s 0%: 0.052+1.5+0.050 ms clock, 0.83+0.060/4.6/10+0.80 ms cpu, 26->26->13 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 55 @41.094s 0%: 0.14+2.5+0.046 ms clock, 2.2+0.14/6.5/10+0.74 ms cpu, 25->26->13 MB, 27 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 56 @41.191s 0%: 0.80+1.4+0.050 ms clock, 12+0.062/4.3/10+0.80 ms cpu, 26->27->14 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 57 @41.464s 0%: 0.053+1.5+0.004 ms clock, 0.85+0.069/4.7/10+0.078 ms cpu, 27->27->14 MB, 29 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 58 @41.838s 0%: 0.053+1.4+0.005 ms clock, 0.85+0.067/4.8/11+0.084 ms cpu, 28->28->14 MB, 29 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 59 @41.993s 0%: 0.049+2.4+0.065 ms clock, 0.79+0.071/5.2/9.7+1.0 ms cpu, 29->29->15 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 60 @42.457s 0%: 0.052+1.7+0.050 ms clock, 0.83+0.080/5.5/10+0.81 ms cpu, 30->30->15 MB, 31 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 61 @42.654s 0%: 0.053+1.7+0.095 ms clock, 0.85+0.064/4.9/10+1.5 ms cpu, 30->30->15 MB, 31 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 62 @42.936s 0%: 0.051+1.7+0.049 ms clock, 0.83+0.079/5.5/10+0.79 ms cpu, 30->30->15 MB, 32 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 63 @43.177s 0%: 0.050+1.6+0.057 ms clock, 0.81+0.068/5.5/11+0.92 ms cpu, 31->31->16 MB, 32 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 64 @43.333s 0%: 0.12+2.3+0.005 ms clock, 2.0+0.061/5.8/9.8+0.094 ms cpu, 32->32->17 MB, 33 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 65 @43.477s 0%: 0.051+2.0+0.004 ms clock, 0.82+0.067/6.6/11+0.076 ms cpu, 34->34->17 MB, 35 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 66 @43.619s 0%: 0.14+2.0+0.10 ms clock, 2.3+0.058/5.3/10+1.7 ms cpu, 34->34->17 MB, 35 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 67 @44.696s 0%: 0.053+1.4+0.006 ms clock, 0.85+0.073/4.6/10+0.099 ms cpu, 34->35->14 MB, 35 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 68 @44.768s 0%: 0.034+1.4+0.004 ms clock, 0.55+0.051/4.6/10+0.075 ms cpu, 28->28->6 MB, 29 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 69 @44.814s 0%: 0.034+1.7+0.048 ms clock, 0.55+0.071/4.9/10+0.77 ms cpu, 13->13->6 MB, 13 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 70 @44.845s 0%: 0.12+3.1+0.12 ms clock, 1.9+0/5.4/11+2.0 ms cpu, 17->17->12 MB, 17 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 71 @44.966s 0%: 0.086+1.5+0.005 ms clock, 1.3+0.10/4.7/10+0.089 ms cpu, 24->24->13 MB, 25 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 72 @45.108s 0%: 0.16+2.1+0.082 ms clock, 2.5+0.16/5.8/9.7+1.3 ms cpu, 26->26->13 MB, 27 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 73 @45.260s 0%: 0.10+1.3+0.005 ms clock, 1.6+0.058/4.3/10+0.094 ms cpu, 27->27->13 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 74 @45.463s 0%: 0.045+1.5+0.004 ms clock, 0.73+0.11/4.9/9.4+0.074 ms cpu, 27->27->14 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 75 @45.641s 0%: 0.14+1.7+0.005 ms clock, 2.2+0.063/5.0/10+0.088 ms cpu, 28->28->14 MB, 29 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 76 @45.797s 0%: 0.039+1.3+0.006 ms clock, 0.63+0.067/4.6/10+0.097 ms cpu, 29->29->13 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 77 @45.860s 0%: 0.030+1.1+0.004 ms clock, 0.48+0.85/3.8/9.2+0.075 ms cpu, 29->30->11 MB, 29 MB goal, 0 MB stacks, 0 MB globals, 16 P
8GB image with `GODEBUG=gctrace=1`, parallelism set to 0 (existing code)
gc 6 @6.173s 0%: 0.044+1.4+0.051 ms clock, 0.70+0.15/4.5/9.4+0.82 ms cpu, 8->8->5 MB, 9 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 7 @6.557s 0%: 0.12+1.6+0.004 ms clock, 1.9+0.42/5.7/7.0+0.070 ms cpu, 11->13->8 MB, 12 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 8 @12.802s 0%: 0.16+1.8+0.096 ms clock, 2.5+0.90/5.1/7.6+1.5 ms cpu, 18->19->15 MB, 18 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 9 @13.092s 0%: 0.11+1.2+0.041 ms clock, 1.8+0.095/4.3/9.7+0.67 ms cpu, 29->29->15 MB, 32 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 10 @13.231s 0%: 0.047+1.2+0.054 ms clock, 0.76+0.11/4.0/9.6+0.87 ms cpu, 27->27->15 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 11 @13.707s 0%: 0.047+1.3+0.056 ms clock, 0.76+0.21/4.3/9.9+0.90 ms cpu, 28->28->15 MB, 31 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 111 @3300.010s 0%: 0.051+0.19+0.002 ms clock, 0.20+0/0.17/0.44+0.011 ms cpu, 0->0->0 MB, 1 MB goal, 0 MB stacks, 0 MB globals, 4 P (forced)
gc 12 @13.868s 0%: 0.15+1.9+0.004 ms clock, 2.5+0.088/5.2/9.2+0.077 ms cpu, 28->28->16 MB, 32 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 13 @14.677s 0%: 0.22+1.8+0.095 ms clock, 3.6+0.13/5.4/10+1.5 ms cpu, 31->31->16 MB, 32 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 14 @14.908s 0%: 0.21+1.9+1.0 ms clock, 3.4+0.14/6.6/10+16 ms cpu, 32->33->16 MB, 33 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 15 @15.091s 0%: 0.18+2.5+0.058 ms clock, 2.9+0.085/6.7/11+0.94 ms cpu, 33->34->17 MB, 34 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 16 @15.312s 0%: 0.053+2.4+0.050 ms clock, 0.86+0.080/6.0/9.8+0.80 ms cpu, 34->35->18 MB, 35 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 17 @15.650s 0%: 0.049+2.1+0.005 ms clock, 0.79+0.063/5.7/10+0.088 ms cpu, 36->36->18 MB, 37 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 18 @15.829s 0%: 0.11+3.2+0.058 ms clock, 1.9+0.084/6.7/9.4+0.93 ms cpu, 36->37->18 MB, 37 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 19 @16.008s 0%: 0.050+2.9+0.005 ms clock, 0.80+0.070/7.4/10+0.080 ms cpu, 37->38->20 MB, 38 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 20 @16.184s 0%: 0.049+1.8+0.047 ms clock, 0.79+0.11/4.9/9.7+0.76 ms cpu, 40->41->15 MB, 42 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 21 @16.289s 0%: 0.052+2.5+0.005 ms clock, 0.84+0.087/5.3/9.6+0.095 ms cpu, 31->31->8 MB, 32 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 22 @16.338s 0%: 0.28+1.3+0.11 ms clock, 4.5+2.0/4.6/8.1+1.8 ms cpu, 16->22->13 MB, 21 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 23 @17.335s 0%: 0.073+1.2+0.026 ms clock, 1.1+0.20/4.2/9.2+0.42 ms cpu, 27->27->14 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 24 @17.461s 0%: 0.051+1.5+0.049 ms clock, 0.82+0.094/4.6/5.8+0.78 ms cpu, 29->29->15 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 25 @17.592s 0%: 0.11+1.5+0.004 ms clock, 1.8+0.047/4.6/9.6+0.069 ms cpu, 30->30->15 MB, 31 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 26 @17.779s 0%: 0.048+1.3+0.047 ms clock, 0.77+0.019/4.2/9.2+0.75 ms cpu, 31->31->8 MB, 32 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 112 @3330.001s 0%: 0.033+0.17+0.003 ms clock, 0.13+0/0.16/0.41+0.015 ms cpu, 0->0->0 MB, 1 MB goal, 0 MB stacks, 0 MB globals, 4 P (forced)
gc 27 @58.347s 0%: 0.14+1.9+0.056 ms clock, 2.2+0/5.3/8.8+0.90 ms cpu, 17->17->14 MB, 17 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 28 @58.451s 0%: 0.13+1.5+0.049 ms clock, 2.0+0.079/4.9/9.6+0.78 ms cpu, 29->30->12 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 29 @58.573s 0%: 0.073+1.4+0.046 ms clock, 1.1+0.11/4.5/9.1+0.74 ms cpu, 25->25->13 MB, 25 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 30 @58.673s 0%: 0.050+1.4+0.046 ms clock, 0.80+0.10/4.5/9.1+0.74 ms cpu, 26->26->13 MB, 26 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 31 @58.818s 0%: 0.050+1.2+0.048 ms clock, 0.80+0.071/4.1/9.3+0.77 ms cpu, 26->26->13 MB, 27 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 32 @62.929s 0%: 0.13+1.7+0.071 ms clock, 2.1+0.083/5.1/9.1+1.1 ms cpu, 27->27->13 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 33 @64.699s 0%: 0.11+1.8+0.053 ms clock, 1.8+0.086/4.7/9.0+0.85 ms cpu, 27->27->14 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 34 @64.801s 0%: 0.082+2.2+0.099 ms clock, 1.3+0.048/5.3/9.0+1.5 ms cpu, 27->28->14 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 35 @64.910s 0%: 0.050+1.8+0.051 ms clock, 0.81+0.11/5.2/9.2+0.83 ms cpu, 28->28->14 MB, 29 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 36 @65.038s 0%: 0.079+1.7+0.050 ms clock, 1.2+0.056/5.2/9.2+0.81 ms cpu, 29->29->14 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 37 @65.383s 0%: 0.14+1.9+0.004 ms clock, 2.3+0.069/6.1/9.2+0.076 ms cpu, 29->30->15 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 38 @65.507s 0%: 0.050+1.7+0.060 ms clock, 0.81+0.15/5.2/9.4+0.96 ms cpu, 30->30->15 MB, 31 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 39 @65.636s 0%: 0.050+2.4+0.054 ms clock, 0.81+0.15/6.3/9.4+0.86 ms cpu, 30->31->16 MB, 31 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 40 @65.770s 0%: 0.050+2.1+0.052 ms clock, 0.80+0/5.3/10+0.83 ms cpu, 32->32->16 MB, 33 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 41 @65.942s 0%: 0.052+2.2+0.054 ms clock, 0.83+0.083/5.8/9.5+0.87 ms cpu, 32->32->16 MB, 33 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 42 @66.057s 0%: 0.047+1.9+0.052 ms clock, 0.75+0.054/5.3/9.5+0.84 ms cpu, 32->33->16 MB, 33 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 43 @66.171s 0%: 0.040+2.0+0.004 ms clock, 0.65+0.068/5.6/10+0.077 ms cpu, 33->34->17 MB, 34 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 44 @66.290s 0%: 0.037+1.8+0.050 ms clock, 0.60+0.079/4.9/9.2+0.80 ms cpu, 34->34->17 MB, 35 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 45 @66.407s 0%: 0.063+2.2+0.046 ms clock, 1.0+0.20/5.5/9.6+0.74 ms cpu, 34->34->17 MB, 35 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 46 @66.527s 0%: 0.048+2.4+0.046 ms clock, 0.78+0.078/6.3/8.2+0.73 ms cpu, 35->35->17 MB, 36 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 47 @69.148s 0%: 0.058+2.5+0.075 ms clock, 0.93+0.17/5.9/10+1.2 ms cpu, 35->35->18 MB, 36 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 48 @69.296s 0%: 0.31+3.0+0.055 ms clock, 5.0+0.057/7.4/8.7+0.88 ms cpu, 36->36->19 MB, 37 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 49 @70.016s 0%: 0.052+1.4+0.050 ms clock, 0.84+0.16/4.4/9.9+0.81 ms cpu, 39->39->13 MB, 40 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 50 @70.060s 0%: 0.092+1.5+0.054 ms clock, 1.4+0.15/4.7/8.2+0.87 ms cpu, 26->27->13 MB, 27 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 51 @70.111s 0%: 0.099+1.5+0.054 ms clock, 1.5+0.15/4.8/8.5+0.87 ms cpu, 26->27->6 MB, 27 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 52 @70.134s 0%: 0.032+1.5+0.008 ms clock, 0.51+0.17/4.2/9.2+0.13 ms cpu, 13->13->7 MB, 14 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 53 @70.159s 0%: 0.083+1.4+0.084 ms clock, 1.3+0.12/4.2/8.5+1.3 ms cpu, 13->14->7 MB, 14 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 54 @70.179s 0%: 0.029+1.4+0.053 ms clock, 0.46+0.053/4.5/8.7+0.85 ms cpu, 13->13->6 MB, 14 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 55 @70.211s 0%: 0.091+1.2+0.004 ms clock, 1.4+0.89/4.4/8.6+0.068 ms cpu, 12->13->6 MB, 14 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 56 @70.217s 0%: 0.070+1.2+0.080 ms clock, 1.1+0.25/4.0/8.7+1.2 ms cpu, 12->13->11 MB, 14 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 57 @70.296s 0%: 0.14+2.2+0.10 ms clock, 2.3+0/5.2/9.3+1.7 ms cpu, 23->23->12 MB, 24 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 58 @71.177s 0%: 0.051+1.3+0.007 ms clock, 0.82+0.049/4.1/9.3+0.11 ms cpu, 24->24->12 MB, 26 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 59 @71.259s 0%: 0.048+1.5+0.048 ms clock, 0.77+0.043/4.5/8.9+0.77 ms cpu, 24->25->13 MB, 26 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 60 @71.383s 0%: 0.10+2.2+0.11 ms clock, 1.7+0.048/5.6/9.5+1.8 ms cpu, 25->26->13 MB, 27 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 61 @71.904s 0%: 0.052+2.0+0.005 ms clock, 0.83+0.11/5.4/9.4+0.081 ms cpu, 27->27->13 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 62 @72.029s 0%: 0.048+1.5+0.046 ms clock, 0.78+0.18/4.2/9.3+0.74 ms cpu, 27->27->14 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 63 @72.482s 0%: 0.15+1.8+0.008 ms clock, 2.4+0.060/4.9/9.5+0.14 ms cpu, 27->28->14 MB, 29 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 64 @72.640s 0%: 0.052+1.4+0.057 ms clock, 0.83+0.065/4.4/10+0.91 ms cpu, 28->28->14 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 65 @72.849s 0%: 0.051+1.5+0.049 ms clock, 0.82+0.055/4.8/9.8+0.78 ms cpu, 29->29->14 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 66 @73.122s 0%: 0.050+2.2+0.085 ms clock, 0.81+0.11/5.1/9.3+1.3 ms cpu, 29->30->15 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 67 @73.342s 0%: 0.14+1.6+0.055 ms clock, 2.3+0.087/4.8/10+0.88 ms cpu, 30->30->15 MB, 31 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 68 @73.502s 0%: 0.14+1.8+0.004 ms clock, 2.2+0.16/6.0/9.7+0.074 ms cpu, 31->31->16 MB, 32 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 69 @73.641s 0%: 0.050+2.3+0.005 ms clock, 0.81+0.053/5.9/9.2+0.081 ms cpu, 32->33->16 MB, 33 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 113 @3360.001s 0%: 0.048+0.46+0.002 ms clock, 0.19+0/0.43/0.17+0.011 ms cpu, 0->0->0 MB, 1 MB goal, 0 MB stacks, 0 MB globals, 4 P (forced)
gc 70 @73.786s 0%: 0.052+3.2+0.058 ms clock, 0.83+0.063/7.2/10+0.93 ms cpu, 32->33->16 MB, 33 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 71 @74.857s 0%: 0.055+1.4+0.048 ms clock, 0.88+0.091/4.3/9.3+0.78 ms cpu, 33->33->13 MB, 34 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 72 @74.925s 0%: 0.042+1.3+0.005 ms clock, 0.67+0.071/4.1/9.2+0.080 ms cpu, 26->26->5 MB, 27 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 73 @74.966s 0%: 0.036+1.2+0.049 ms clock, 0.59+0.067/3.8/9.1+0.78 ms cpu, 11->11->5 MB, 12 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 74 @75.002s 0%: 0.10+1.3+0.005 ms clock, 1.6+1.6/4.6/7.5+0.081 ms cpu, 11->12->6 MB, 12 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 75 @75.006s 0%: 0.020+1.3+0.10 ms clock, 0.32+0.053/3.8/9.0+1.6 ms cpu, 11->12->11 MB, 13 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 76 @75.119s 0%: 0.049+1.4+0.067 ms clock, 0.78+0.10/4.2/9.3+1.0 ms cpu, 22->22->12 MB, 24 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 77 @75.222s 0%: 0.11+1.7+0.077 ms clock, 1.9+0.086/5.0/10+1.2 ms cpu, 24->24->12 MB, 25 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 78 @75.342s 0%: 0.049+1.6+0.046 ms clock, 0.78+0.19/5.4/10+0.73 ms cpu, 24->25->13 MB, 26 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 79 @75.552s 0%: 0.12+1.3+0.056 ms clock, 2.0+0.052/4.2/9.6+0.90 ms cpu, 25->25->13 MB, 27 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 80 @75.734s 0%: 0.048+1.5+0.047 ms clock, 0.77+0.052/4.5/9.0+0.76 ms cpu, 27->27->13 MB, 27 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 81 @75.860s 0%: 0.31+1.4+0.093 ms clock, 5.0+0.055/4.5/9.9+1.4 ms cpu, 27->27->14 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 82 @75.985s 0%: 0.044+1.3+0.051 ms clock, 0.70+0.045/3.9/8.8+0.82 ms cpu, 28->28->5 MB, 28 MB goal, 0 MB stacks, 0 MB globals, 16 P
gc 83 @76.008s 0%: 0.031+1.3+0.046 ms clock, 0.50+0.70/4.5/9.9+0.75 ms cpu, 15->15->11 MB, 15 MB goal, 0 MB stacks, 0 MB globals, 16 P

gRPC tracing screenshots from the same run (8GB image with `GODEBUG=gctrace=1`, parallelism set to 110 and chunk size set to 32):

[Screenshot: gRPC trace]

Screenshot from another run with a ~27GB image: after a while, all chunks take about the same amount of time, ~22s; we've probably hit the write-speed burst limit and are slowly taking longer to do things:

[Screenshot: gRPC trace for the ~27GB pull]

@azr

This comment was marked as outdated.

@azr azr force-pushed the azr/parallel-layer-fetch branch 10 times, most recently from c13969f to 8fc47db on May 21, 2024 15:07
fetch big layers of images using more than one connection

Signed-off-by: Adrien Delorme <azr@users.noreply.github.com>
@swagatbora90 (Contributor)

@azr Thanks for adding the performance numbers. I ran some tests as well using your patch, and the memory usage looks better than what I saw in the htcat implementation, especially with a high parallelism count.

However, I do observe that increasing parallelism does not yield better latency and may lead to higher memory usage (there are a number of other factors to consider here, mainly the instance type used for testing and the network bandwidth). I limited the test to a single image with a single layer and fixed the chunk size to 20 MB. A lower parallelism count (3 or 4) may be preferable to setting parallelism upwards of 10.

Using a c7.12xlarge instance to pull a 3GB single-layer image from an ECR private repo:

| Parallelism Count | Chunk Size (MB) | Total Download Time (sec) | Network Pull Time (sec) | Download Speed (MB/s) | Max Memory Used (from cgroups memory.peak) |
| --- | --- | --- | --- | --- | --- |
| 1 | 20 | 65.9 | 51.78 | 53 | 15.9 |
| 2 | 20 | 39.3 | 32.08 | 88.8 | 17.5 |
| 3 | 20 | 36.6 | 22.57 | 95.4 | 18 |
| 4 | 20 | 36.8 | 16.82 | 94.8 | 17 |
| 5 | 20 | 36.8 | 14.78 | 94.8 | 17 |
| 10 | 20 | 36.9 | 13.92 | 94.6 | 20 |
| 20 | 20 | 36.9 | 14.31 | 94.6 | 22 |
| 30 | 20 | 36.7 | 14.97 | 95.1 | 26 |
| 40 | 20 | 36.8 | 14.21 | 94.8 | 31 |
| 50 | 20 | 36.7 | 14.3 | 95.1 | 36 |
| 100 | 20 | 36.8 | 14.91 | 94.8 | 52 |

[Chart: multipart1]

Also, the network download itself was much faster (~15 sec, see Network Pull Time), while containerd took an additional ~20 sec to complete the pull (before it started unpacking). I calculated the network download time by periodically calling `/containerd.services.content.v1.Content/ListStatuses`, filtering on the layer digest, and checking when content.Offset == content.Size. I am still not sure why containerd takes so much time after it has already committed to the content store; pprof does not show any significant CPU usage by containerd during this time either. Are we blocked on GC or on some underlying syscall (fp.Sync) completing?
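
Roughly how that polling can look with the containerd Go client (a sketch, not the exact script used; import paths are for the 1.x client module, and in the Go API the total size field is `Status.Total` rather than `Size`):

```go
// Sketch of the polling described above (not the exact script used).
package main

import (
	"context"
	"strings"
	"time"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func waitForLayerDownload(layerDigest string) error {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		return err
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	for {
		statuses, err := client.ContentStore().ListStatuses(ctx)
		if err != nil {
			return err
		}
		for _, st := range statuses {
			// Ingest refs embed the digest, e.g. "layer-sha256:...".
			if strings.Contains(st.Ref, layerDigest) && st.Total > 0 && st.Offset == st.Total {
				return nil // the network part of this layer's pull is done
			}
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```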

@swagatbora90 (Contributor)

@dmcgowan @kzys

@dmcgowan dmcgowan added this to the 2.1 milestone May 22, 2024
@dmcgowan (Member)

Thanks @azr, the numbers on this look good. @swagatbora90 super helpful stats.

We should continue to respect max_concurrent_downloads as the upper limit on actively downloading connections. Can we make the parallelism dynamic, maybe using TryAcquire until we hit the limit to determine the parallelism? It definitely makes sense to use the concurrency allowance to download the first layers faster rather than spreading it across all layers, since unpack still takes up a significant amount of the pull time.
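
A sketch of what that TryAcquire idea could look like with golang.org/x/sync/semaphore (illustrative, not the PR's code): a single weighted semaphore sized at max_concurrent_downloads, where each layer holds one guaranteed slot and its chunk workers opportunistically grab extra slots without blocking, so per-layer parallelism adapts to how many layers are currently downloading.

```go
// Illustrative sketch of the TryAcquire idea (not the PR's code).
package main

import (
	"context"

	"golang.org/x/sync/semaphore"
)

type downloadLimiter struct {
	sem *semaphore.Weighted
}

func newDownloadLimiter(maxConcurrentDownloads int64) *downloadLimiter {
	return &downloadLimiter{sem: semaphore.NewWeighted(maxConcurrentDownloads)}
}

// acquireBase blocks for the layer's own guaranteed connection slot.
func (l *downloadLimiter) acquireBase(ctx context.Context) error {
	return l.sem.Acquire(ctx, 1)
}

// tryAcquireExtra grabs up to n additional slots without blocking and reports
// how many it got; the caller sizes its chunk parallelism to 1 + got.
func (l *downloadLimiter) tryAcquireExtra(n int) int {
	got := 0
	for i := 0; i < n; i++ {
		if !l.sem.TryAcquire(1) {
			break
		}
		got++
	}
	return got
}

func (l *downloadLimiter) release(n int64) {
	l.sem.Release(n)
}
```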

For configuration, let's use the transfer service configuration. This won't make it in for 2.0, and CRI will be switching to the transfer service by 2.1.

@azr azr force-pushed the azr/parallel-layer-fetch branch from 8fc47db to 504bd15 on May 23, 2024 07:49
@azr (Contributor, Author) commented May 27, 2024

Hey @dmcgowan! Nice, thanks.
Good idea to keep the max_concurrent_downloads docs/behaviour consistent. I was poking around your suggestion to limit to N goroutines, and the thing is that max_concurrent_downloads also controls how many unpacks happen at once, because these are all called in stacked handlers.

I have some options in mind and need to think about/test a good way to do this. I might have to introduce a max_concurrent_unpacks / max_concurrent_ops setting (hypothetical sketch below). On it! 😊
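
For what it's worth, a hypothetical shape for splitting the two limits; `max_concurrent_unpacks` is only the name floated above, not an existing containerd setting:

```go
// Hypothetical sketch of splitting the two limits; max_concurrent_unpacks is
// only the name floated above, not an existing containerd setting.
package main

import (
	"context"

	"golang.org/x/sync/semaphore"
)

type pullLimits struct {
	downloads *semaphore.Weighted // caps open registry connections
	unpacks   *semaphore.Weighted // caps layers being unpacked at once
}

func newPullLimits(maxConcurrentDownloads, maxConcurrentUnpacks int64) *pullLimits {
	return &pullLimits{
		downloads: semaphore.NewWeighted(maxConcurrentDownloads),
		unpacks:   semaphore.NewWeighted(maxConcurrentUnpacks),
	}
}

// withDownload runs f while holding a download slot; withUnpack does the same
// for an unpack slot, so the two stages no longer share one limit.
func (p *pullLimits) withDownload(ctx context.Context, f func() error) error {
	if err := p.downloads.Acquire(ctx, 1); err != nil {
		return err
	}
	defer p.downloads.Release(1)
	return f()
}

func (p *pullLimits) withUnpack(ctx context.Context, f func() error) error {
	if err := p.unpacks.Acquire(ctx, 1); err != nil {
		return err
	}
	defer p.unpacks.Release(1)
	return f()
}
```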


Successfully merging this pull request may close these issues.

Parallelise layer downloads
5 participants