
Downloads going over cdn.dl.k8s.io are much slower than direct downloads from the bucket #5755

Closed
xmudrii opened this issue Aug 24, 2023 · 21 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra.

Comments

@xmudrii
Member

xmudrii commented Aug 24, 2023

I've observed that downloads using curl going over cdn.dl.k8s.io (dl.k8s.io) are much slower than direct downloads from the bucket (storage.googleapis.com/kubernetes-release).

For example, downloading kubelet v1.28.1 directly from the bucket yields the following results:

curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.28.1/bin/linux/amd64/kubelet
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  105M  100  105M    0     0  23.8M      0  0:00:04  0:00:04 --:--:-- 23.8M

The download took 4 seconds in total. However, downloading via the CDN yields very different results:

curl -LO https://dl.k8s.io/v1.28.1/bin/linux/amd64/kubelet
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   138  100   138    0     0    744      0 --:--:-- --:--:-- --:--:--   745
100  105M  100  105M    0     0  1643k      0  0:01:05  0:01:05 --:--:-- 1784k

It took one minute and five seconds to download the same file.

Update: it turns out that cache-miss downloads are slow, while cache-hit downloads are fast. This can be determined from the x-cache: MISS and x-cache: HIT headers. Once the file is cached on Fastly's side, downloads are fast, but prior to that, downloads are insanely slow.
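For reference, a quick way to check the cache status without downloading the whole file is a HEAD request (a rough probe only; caching behaviour for HEAD may not perfectly mirror a full GET):

# Probe only the response headers; look for x-cache: HIT/MISS and the cached object's age
curl -sIL https://dl.k8s.io/v1.28.1/bin/linux/amd64/kubelet | grep -iE 'x-cache|^age:'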

/sig k8s-infra
/priority important-soon
/kind bug
cc @ameukam @BenTheElder

@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/bug Categorizes issue or PR as related to a bug. labels Aug 24, 2023
@xmudrii
Member Author

xmudrii commented Aug 24, 2023

Update: it turns out that cache-miss downloads are slow, while cache-hit downloads are fast. This can be determined from the x-cache: MISS and x-cache: HIT headers. Once the file is cached on Fastly's side, downloads are fast, but prior to that, downloads are insanely slow.

@xmudrii xmudrii changed the title Downloads using curl going over cdn.dl.k8s.io are much slower than direct downloads from the bucket Downloads going over cdn.dl.k8s.io are much slower than direct downloads from the bucket Aug 25, 2023
@xrstf

xrstf commented Sep 26, 2023

This might be related: the CDN is not just slow, it's also inconsistent. 1.29-alpha.1 was released yesterday, but depending on where you perform a curl -L https://dl.k8s.io/release/latest-1.29.txt, you receive either alpha.0 or alpha.1.

This even changes on the same computer if you just re-run the same curl command a few seconds later. I'm not sure whether individual CDN servers "downgrade" their data or whether I'm just hitting lots of random CDN nodes that all have inconsistent state, but it's weird and sadly unreliable :/

These two requests happened basically at the same time:

< HTTP/2 200 
< x-guploader-uploadid: ADPycdutDBgx7kyHbX7GUaTmNyxVRNVE82erWSx3_jmUaV5c01OeI7dkYmcu9pfg9gj5BTsgpYgYhWRUMYxkNtP4PVKi26f6HtKM
< expires: Sun, 24 Sep 2023 12:42:09 GMT
< last-modified: Wed, 26 Jul 2023 09:06:19 GMT
< etag: "9b59bd47d18f2395481cf230a43a56e0"
< content-type: text/plain
< cache-control: private, no-store
< accept-ranges: bytes
< date: Tue, 26 Sep 2023 10:40:55 GMT
< via: 1.1 varnish
< age: 165525
< x-served-by: cache-fra-etou8220117-FRA
< x-cache: HIT
< x-cache-hits: 1
< access-control-allow-origin: *
< content-length: 15
< 
* Connection #1 to host cdn.dl.k8s.io left intact
v1.29.0-alpha.0

and

< HTTP/2 200
< x-guploader-uploadid: ADPycds7gWeT690zb-SSaamOrnGHAi6AgaV_K0SWCSe5XMLoJ1zFIE0NiJNe0v8Nr0STrfLXh5GwEv5JBgB6RhU6cqOdVHcHyJIy
< expires: Tue, 26 Sep 2023 07:08:47 GMT
< last-modified: Mon, 25 Sep 2023 20:56:50 GMT
< etag: "7d852bf327f00c76b50173de7dbaebf6"
< content-type: text/plain
< cache-control: private, no-store
< accept-ranges: bytes
< date: Tue, 26 Sep 2023 10:40:50 GMT
< via: 1.1 varnish
< age: 12723
< x-served-by: cache-muc13944-MUC
< x-cache: HIT
< x-cache-hits: 1
< access-control-allow-origin: *
< content-length: 15
<
* Connection #1 to host cdn.dl.k8s.io left intact
v1.29.0-alpha.1

Both claim a cache hit, but return different results.
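For reference, a minimal way to reproduce this comparison from a single machine (plain curl plus grep; which POP you hit depends on DNS and routing, so results will vary):

# Fetch the marker a few times and print the value returned plus the serving POP and cache status
for i in 1 2 3; do
  curl -sSL -D /tmp/hdrs https://dl.k8s.io/release/latest-1.29.txt; echo
  grep -iE 'x-served-by|x-cache:|^age:' /tmp/hdrs
done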

@xmudrii
Member Author

xmudrii commented Sep 26, 2023

This can lead to serious issues. It looks like you're being served from FRA and MUC, and these nodes might indeed have different caches. I think we should exclude version markers from the cache; they can change often, especially the latest ones.
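For example, a stale marker is easy to spot by comparing what the CDN returns with what the bucket returns directly (URLs reused from earlier in this thread; this assumes the marker lives at the same path in the bucket):

# Value served via the CDN vs. the value currently in the bucket
curl -sSL https://dl.k8s.io/release/latest-1.29.txt; echo
curl -sSL https://storage.googleapis.com/kubernetes-release/release/latest-1.29.txt; echo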

@ameukam
Member

ameukam commented Sep 26, 2023

Yeah. The cache configuration is not specific about file extensions.

I'll open a PR to fix it this week. Another option could be to serve those version markers directly through the nginx instance instead of the CDN provider.

@ameukam
Member

ameukam commented Sep 26, 2023

@xrstf can you open a new issue with what you described, so we can better track what's happening? Thanks!

@xrstf

xrstf commented Sep 26, 2023

Can do, done => #5900.

@ameukam
Member

ameukam commented Sep 27, 2023

We increased the TTL for the different objects in #5871. Hopefully the situation will improve.

The current CDN is a "pull-through" cache, so a MISS is expected for any object at the POP close to the client on the first request. Our real issue is the number of objects that need to be cached at the edge: we have a lot of objects (in this case binaries) that are rarely pulled. I don't think there is an efficient mechanism to warm all the POPs of the CDN provider for all the objects we currently host, but I'm open to any suggestions.
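For illustration only, naive warming from a single machine would amount to something like the loop below (the binary names and version are just example placeholders), which is exactly why it doesn't scale: it primes only the POP closest to wherever it runs, and only for the objects it happens to list.

# Hypothetical warm-up loop: primes only the nearest POP, only for the listed objects
for bin in kubeadm kubelet kubectl; do
  curl -sSL -o /dev/null "https://dl.k8s.io/v1.28.1/bin/linux/amd64/${bin}"
done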

Note that our cache hit rate is currently over 99%. I don't think we can do much more than that.

[screenshot: cache hit rate above 99%]

@BenTheElder
Member

IIRC a mid-level cache was mentioned when talking to Fastly previously?

@ameukam
Member

ameukam commented Sep 27, 2023

IIRC a mid-level cache was mentioned when talking to Fastly previously?

Maybe you're talking about Origin Shield? If that's the case, the feature is mostly effective with regional buckets, which is not the case for gs://kubernetes-release. I'll ask about the exact requirements for this feature.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2024
@ameukam
Member

ameukam commented Jan 29, 2024

@xmudrii is the problem still happening?

@xmudrii
Member Author

xmudrii commented Jan 29, 2024

@ameukam I'll check and get back to you

@xmudrii
Member Author

xmudrii commented Feb 12, 2024

@ameukam This is still an issue for non-cached artifacts downloaded over dl.k8s.io; see the screenshot:

[screenshot: slow download of a non-cached artifact over dl.k8s.io]

@xmudrii
Member Author

xmudrii commented Feb 12, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 12, 2024
@ameukam
Member

ameukam commented Feb 12, 2024

Non-cached artifacts going through Fastly will always be slow for the first request at the POP close to the requester. Fastly doesn't replicate all the objects over its entire network; objects are cached based on requests. If an object is not present at the Fastly edge, fetching it will always be slower than from the origin.
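As a rough way to see this in practice (a sketch using curl's built-in timing, with the kubelet URL from earlier in the thread): the first attempt is typically a MISS at the local POP, the second a HIT.

# Compare total download time for a first (likely MISS) and second (likely HIT) request
for i in 1 2; do
  curl -sSL -o /dev/null -w "attempt $i: %{time_total}s\n" https://dl.k8s.io/v1.28.1/bin/linux/amd64/kubelet
done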

@xmudrii
Member Author

xmudrii commented Feb 12, 2024

@ameukam Is there anything we can do to make it at least a little faster? The difference is huge: it takes 5 seconds when downloading directly from the bucket, but about 1 minute and 30 seconds when downloading from the CDN. Subsequent requests might be slow as well because there's a chance you'll get redirected to some other edge location.

@ameukam
Member

ameukam commented Feb 12, 2024

One possibility could be Fastly Origin Shield, but we would need to switch the origin to a regional bucket.

@xmudrii
Member Author

xmudrii commented Feb 12, 2024

Even cached requests are much slower for me. Something that takes 3-5 seconds when downloaded directly from the bucket takes 30-40 seconds when downloaded via the CDN. I double-checked with @xrstf and he sees okay speeds on the 2nd and 3rd try (the 1st try is also slow for him), but that's not the case for me.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 12, 2024
@xmudrii
Member Author

xmudrii commented May 20, 2024

I think this has been mostly fixed; I haven't observed it for a while. Closing the issue for now.
/close

@k8s-ci-robot
Contributor

@xmudrii: Closing this issue.

In response to this:

I think this has been mostly fixed; I haven't observed it for a while. Closing the issue for now.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
