
figure out DinD + podutils #8780

Closed
BenTheElder opened this issue Jul 23, 2018 · 18 comments
@BenTheElder
Member

This is a specific issue to track the need for Docker in Docker (DinD) with the podutils. I am also filing a related but more general issue for the podutils and init / forking that may wind up solving this one. xref: #8779

In short, with the old bootstrap + scenario scripts and their images we have support for using docker at runtime in prowjobs; with the podutils we do not. We need to sort this out. A complication is that the podutils now control the entrypoint. We can run a similar script under the entrypoint, before the actual test command, but that leaves other questions, like how we will handle zombie processes...

xref: #8763 (comment), #7844 (comment)

@BenTheElder
Member Author

/assign

@stevekuznetsov
Contributor

The current implementation of the utils is super generic and there is merit to that, but if we wanted to have a dind job type that we could handle in the infra by allowing access to a socket in the test container, that seems OK with me. That would allow us to run the container runtime in a separate container and share the socket, while still making guarantees about lifecycle and logging, since the infra would know about the extra container.
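
A minimal sketch of how a test container could consume such a shared socket, assuming (hypothetically) that a runtime sidecar and the test container mount the same emptyDir at /var/run/dind and the sidecar's dockerd listens there; this is not an existing podutils feature:

```bash
# All paths and names here are illustrative.
# The sidecar would run: dockerd --host=unix:///var/run/dind/docker.sock
export DOCKER_HOST=unix:///var/run/dind/docker.sock

# Wait for the sidecar's daemon to come up before the tests start.
until docker info >/dev/null 2>&1; do
  sleep 1
done

docker run --rm busybox echo "runtime is reachable"
```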

@BenTheElder
Member Author

BenTheElder commented Jul 23, 2018

@stevekuznetsov I don't think that will work properly due to volume mounts. See #8779 (comment)

Separately from DinD, most things we run fork, so avoiding zombies is something to handle more generally. What needs to happen here specifically is a different snippet to boot docker (instead of having it in the images/bootstrap entrypoint.sh).
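
As a rough illustration, the factored-out snippet could look something like this (a sketch only, assuming the image ships dockerd; the real images/bootstrap logic differs in detail):

```bash
#!/usr/bin/env bash
# Hypothetical boot-docker snippet, pulled out of the image ENTRYPOINT so
# the podutils entrypoint can run it just before the test command.

# Start the daemon in the background, keeping its logs inspectable.
dockerd >/var/log/dockerd.log 2>&1 &

# Block until the daemon answers, with a bounded number of retries.
for _ in $(seq 1 30); do
  docker info >/dev/null 2>&1 && break
  sleep 1
done

# Hand off to whatever command the job actually wants to run.
exec "$@"
```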

@BenTheElder
Member Author

We should be able to refactor out just the DinD parts from images/bootstrap/runner and let the podutils job be `entrypoint test_cmd`, where `test_cmd` is something like `dind_runner actual-test-command`.

This just reminded me that zombies are also coming to eat our nodes. For that, something like `/bin/sh entrypoint test_cmd` would probably be sufficient.

I'll handle the refactor bit now and circle back on zombies. Kubernetes may soon be better able to manage this problem (it has some alpha-gated enhancements now, IIRC)... we're not the only ones with "the PID 1 problem".
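
Concretely, the composition described above might look like the following, where `entrypoint` is the podutils binary and `dind_runner` is the hypothetical factored-out wrapper:

```bash
# The podutils job's command, per the comment above (names illustrative):
#   entrypoint dind_runner actual-test-command
#
# To also get a zombie reaper, keep a shell at PID 1 above everything.
# The explicit exit stops some shells from exec-ing the single command
# (which would replace the shell and lose the reaper) and preserves the
# test command's exit code:
/bin/sh -c 'entrypoint dind_runner actual-test-command; exit $?'
```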

@BenTheElder added the area/images and priority/important-soon labels Jul 23, 2018
@BenTheElder added this to the 1.12 milestone Jul 23, 2018
@stevekuznetsov
Contributor

> I'll handle the refactor bit now and circle back on zombies. Kubernetes may soon be better able to manage this problem

I'd check where the pause container work is; that was supposed to handle this issue, IIRC.

@BenTheElder
Member Author

It looks like we are waiting on kubernetes/kubernetes#1615 before the pause container can reap zombies. It has some logic already, but currently it doesn't work because the pod doesn't share a PID namespace yet.
The last update was that 1.10 had this as an alpha feature (kubernetes/kubernetes#1615 (comment)); just two hours ago a PR was opened to push it to beta: kubernetes/kubernetes#66507 🤞

It will be a while before Prow is running on this though, as even 1.11 has it as alpha; 1.12 may have it as beta.
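
For reference, once the feature gate is on, opting in is a single pod-spec field; a minimal sketch (assumes a cluster with the PodShareProcessNamespace feature gate enabled):

```bash
# With a shared PID namespace, the pause container sits at PID 1 and can
# reap zombies for every container in the pod.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: shared-pid-demo
spec:
  shareProcessNamespace: true
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
EOF
```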

@stevekuznetsov
Contributor

> It will be a while before Prow is running on this though

You are on stock GKE, right? I think we may be running with it right now on ours, but we fiddle with cluster flags a lot.

@stevekuznetsov
Contributor

> let the podutils job be `entrypoint test_cmd`, where `test_cmd` is something like `dind_runner actual-test-command`

Are there benefits to integrating more tightly for classes of jobs (DinD, jobs that want VMs, stuff like that), or do we just want to black-box everything and keep it as run-by-entrypoint?

@BenTheElder
Member Author

Stock GKE does not enable alpha features, to my knowledge. We could almost certainly sidestep this, but it would be hacky.

> Are there benefits to integrating more tightly for classes of jobs (DinD, jobs that want VMs, stuff like that), or do we just want to black-box everything and keep it as run-by-entrypoint?

We should keep it flexible. I don't think we want to support VMs etc. as part of Prow. I do think we need to support forking, as tons of useful utilities fork. This issue is really for tracking that our images also need an update, because they assume they can do $magic in the entrypoint before running bootstrap.

@verb
Contributor

verb commented Jul 23, 2018

FYI, GKE supports alpha clusters, but they're only suitable for testing.

@stevekuznetsov Are you running with PodShareProcessNamespace=true? I haven't received much feedback on this feature. I'd like to gather a few examples of people using it successfully.
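
For completeness, a GKE alpha cluster can be created like so (cluster name and zone are placeholders):

```bash
# Alpha clusters enable Kubernetes alpha features, but they expire after
# 30 days and cannot be upgraded, so they are test-only.
gcloud container clusters create alpha-test \
  --enable-kubernetes-alpha \
  --zone us-central1-a
```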

@BenTheElder
Member Author

Yeah, alpha clusters are not a suitable option for Prow. Some end to end testing uses them though.

@BenTheElder
Member Author

In theory, after #8874 a job should be able to use SysV DinD + podutils relatively well. I'll experiment with this, or help someone (perhaps @MHBauer) port a job to it, soonish.
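
A hypothetical shape for such a job, sketched as a decorated (podutils) presubmit entry; the repo, image, and command names are illustrative, not an existing config:

```bash
# Sketch: append a podutils-decorated DinD presubmit to a job config file.
cat <<'EOF' >> config/jobs/example/example-dind.yaml
presubmits:
  example-org/example-repo:
  - name: pull-example-dind
    decorate: true                 # use the podutils
    spec:
      containers:
      - image: gcr.io/example/dind-test-image:latest
        command: ["dind_runner"]   # hypothetical wrapper from above
        args: ["actual-test-command"]
        securityContext:
          privileged: true         # DinD needs a privileged container
EOF
```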

@spiffxp
Member

spiffxp commented Sep 28, 2018

/milestone v1.13
I actually have no clue here; I'm going based off the priority label, otherwise I would have punted this from the milestone. Any updates?

@k8s-ci-robot modified the milestones: v1.12, v1.13 Sep 28, 2018
@BenTheElder
Member Author

This is definitely v1.13. I expect to start poking at this more next week :-)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 28, 2018
@idealhack
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Dec 28, 2018
@spiffxp
Member

spiffxp commented Jan 3, 2019

/milestone 2019-goals

@BenTheElder I punted this out, given that I saw no activity here during the v1.13 cycle. Please add it back to v1.14, close it out, or do nothing, as appropriate.

@k8s-ci-robot modified the milestones: v1.13, 2019-goals Jan 3, 2019
@BenTheElder
Member Author

pod-utils + DinD works with our SysV DinD. It's not quite ideal, but we should probably track options for not hijacking ENTRYPOINT in another issue.
