docker tasks on generic worker sometimes hit issues with caches #538
@matt-boris and I dug into this a bunch today. Here's what we found:
There are two very confusing parts of this still:
I think the next step here is to add a bunch of additional logging to …
Analysis done in mozilla#538 (comment) shows that the problems are intermittent and largely related to spot terminations. We're seeing the latter on the existing workers anyway, so unless we find more serious issues with generic-worker for CPU tasks, we may as well go ahead with this.
I managed to reproduce this issue with additional debugging information in this task. In it, we have one cache configured:
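(The exact cache definition isn't preserved above, but for context: generic-worker writable directory caches are declared in the task payload's mounts, roughly like the following. The cache and directory names here are illustrative, not the real ones from the task.)

```json
{
  "mounts": [
    {
      "cacheName": "d2g-image-cache",
      "directory": "checkouts"
    }
  ]
}
```

A writable directory cache of this kind persists on the worker's disk between tasks and is re-mounted by cacheName on subsequent runs.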
My additional debugging consisted of dumping all the files in the cache directory, as well as the …

Interestingly (and annoyingly), the immediately previous run was in a very odd state: it was claimed by a worker that didn't seem to exist in GCP (at least, not under the worker id that was given). It's not clear to me whether this is related to the issue, but it's coincidental enough to note here. (The previous time we saw this, the errors came after a well-handled spot termination.) It does seem like there's a legitimate bug in …
Is the cache local to the worker, or downloaded? If local, I'm especially confused how two runs of a task on different workers could interfere with one another in terms of local cache. A few guesses:
Thanks Dustin; you were the closest thing we had to an expert based on the blame 😬
In this case, we have a mounted cache volume, which (I believe) is used across multiple workers, which could explain this part?
I'll check into these theories; the last one in particular is interesting. We already do things in …
I apologize, I only barely remember this! But my reading of the … So, I think the place to look when this occurs is the previous run on the same worker.

Another theory, and this one I'm more hopeful about: there is some background processing in the worker to clean up "old" caches. I would imagine that doing so involves walking the directory tree and deleting things, and it would seem sensible for that to start with …
Sorry, I only saw after pinging you that it was… a decade ago that you reviewed it 😬
Right, right, thank you for pointing this out! I kept reading the cache definition as a mount definition, but on a re-read that's clearly not the case.
And indeed, that's right where we find this in my most recent tests:
Curiously, though, the original task this was reported in doesn't seem to have this correlation. Unfortunately, we no longer have history for those workers :(.
That does sound very plausible, indeed! Is garbageCollection what you're referring to here?
One thing I realized while looking at this just now: the reason we don't hit this on the non-d2g tasks is that none of them have caches configured. The d2g ones have caches configured, ostensibly for the docker images, but AFAICT they end up getting used for the VCS checkout as well.
That sounds like a promising lead! I don't know the relevant GW code, but …
For example: https://firefox-ci-tc.services.mozilla.com/tasks/IvbeCQBuRuKIOaeOIGEfHg
We had an initial run of this task which got spot killed. Subsequent runs failed with: