Drivers build not triggered for x86_64 #1158

maxgio92 · 2023-05-17T12:46:24Z

Describe the bug

Not all driver building-postsubmit jobs have been triggered after new driverkit configurations have been introduced - as usual weekly crawling result from kernel-crawler.

How to reproduce it

Please check:

jobs for x86_64 vs
jobx for arm64.

Expected behaviour

All the jobs for x86 to have been triggered accordingly.

FedeDP · 2023-05-19T07:44:48Z

Note: all the jobs have always been instead triggered correctly during normal weekly kernel-crawler CI.
This time, instead, as we added new driver version (therefore copy/pasting an existing driver version to a new folder) and removed the oldest one (2.0.0+driver), not all jobs were triggered.

At the same time, PR CI correctly triggered all presubmit jobs instead.

FedeDP · 2023-07-04T07:26:52Z

I might have found a good candidate; note the difference between the presubmit and postsubmit Prow documentation about run_if_changed:

Pre:

Presubmit jobs apply run_if_changed and skip_if_only_changed filters based on which files were modified in any of the commits in the pull request.

Post:

Postsubmit jobs apply run_if_changed and skip_if_only_changed filters based on which files were modified by the commits included in the specific push event from github.

So, postsubmit are based on the specific push event triggered by github.
What if, given the large size of the diff/high number of files changed, the push event is somewhat truncated, ie: not all changed files are pushed?

Payloads are capped to 25M: https://docs.github.com/en/webhooks-and-events/webhooks/webhook-events-and-payloads#push

Note: Payloads are capped at 25 MB. If your event generates a larger payload, a webhook will not be fired. This may happen, for example, on a create event if many branches or tags are pushed at once. We suggest monitoring your payload size to ensure delivery.

But this is not our case imho because Prow receives the event with a full payload (otherwise it won't be able to decode it); but there might be some other hard limits on the number of files changes pushed with the event.

Prow code confirms this behavior: https://github.com/kubernetes/test-infra/blob/b7c54c4d7991d27ee926d184d3a09c7b7738c65b/prow/plugins/trigger/push.go#L28: as you can see, it uses PushEvent data to fetch all changed files.
Note how the presubmit instead uses normal git client to fetch list of changed files: https://github.com/kubernetes/test-infra/blob/b7c54c4d7991d27ee926d184d3a09c7b7738c65b/prow/plugins/trigger/pull-request.go#L358C37-L358C37 (https://github.com/kubernetes/test-infra/blob/master/prow/config/jobs.go#L537)

FedeDP · 2023-07-10T08:44:04Z

So, github http api does limit list of files changed to 3000: https://docs.github.com/en/rest/pulls/pulls?apiVersion=2022-11-28#list-pull-requests-files

This is the method used to fetch pull request changes; therefore i'd expect them to be cut at 3000 files too. Instead, we are experiencing the issue only on push to master. Push events have the same 3000 changes hard limit.

I don't get why this is working on pull request but not on push to master :/ I'd expect the issue to be present on pull request too!

FedeDP · 2023-07-10T09:05:35Z

I now get why it is working on presubmits!
Our presubmits jobs are not arch dependent; therefore we are effectively mostly triggering on aarch64 configs, BUT we trigger presubmit jobs for both arch (since they are not arch dependent).
On postsubmit, instead, given that we have aarch64 and x86_64 build jobs, we mostly trigger only aarch64 build and some x86_64 until we reach the 3000 files changed limit.

FedeDP · 2023-07-10T10:32:12Z

I see 2 main paths forward:

open a PR on kubernetes Prow to properly fetch changed files through git client using before and after refs from Push github event: https://docs.github.com/en/webhooks-and-events/webhooks/webhook-events-and-payloads#push
split our workflow to open up a PR on test-infra for each crawled distro; this does not require any change to Prow (so, much quicker fix), but at the same time, it means we will open (and have to check/approve) like 15PRs each week...

FedeDP · 2023-07-17T10:01:28Z

Adding 2 more possible "fixes":

Run the update-kernels workflow daily instead of weekly: running it more frequently will generate less diff thus we would be more confident that all jobs get triggered; it has also the benefit that we would be much quicker to support new kernels
having a single post submit job that just triggers whenever driverkit/config path is changed, and its entrypoint would diff HEAD~1 on master to manually check changed files and then trigger the correct job for each changed <distro,arch>.

FedeDP · 2023-09-05T08:32:37Z

split our workflow to open up a PR on test-infra for each crawled distro; this does not require any change to Prow (so, much quicker fix), but at the same time, it means we will open (and have to check/approve) like 15PRs each week...

Run the update-kernels workflow daily instead of weekly: running it more frequently will generate less diff thus we would be more confident that all jobs get triggered; it has also the benefit that we would be much quicker to support new kernels

Note: these ones would not fix the "new driverversion added" issues.

FedeDP · 2023-09-07T11:54:45Z

A short term solution would be to change all post submit triggers to trigger to any change to driverkit/config/... folder, ie from this:

run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/almalinux_.+'

to this:

run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/.+'

In this case, we would trigger ALL jobs everytime, even for unmodified configs, then each job would build its own configs.

PROs:

super easy solution

CONs:

we would spawn useless pods
we would spawn more nodes than needed

Let's assume that each week we build likely ~90% of supported distro, it would not be that impactful (ie: we would just spawn 10% useless pods).
The "big" issue would rise when we manually trigger kernel-crawler on a single distro; in that case, instead of just building new drivers for that distro, we would spawn all pods to build all drivers (whose build should be skipped anyway because they are already present on download.falco.org).

@maxgio92 wdyt?

FedeDP · 2023-09-07T11:57:15Z

In this case, we would trigger ALL jobs everytime, even for unmodified configs, then each job would build its own configs.

We could also let the entrypoint of each job do the diff HEAD~1 and decide whether it should do anything, even if i'd avoid adding too much logic in those scripts.

Signed-off-by: Federico Di Pierro <nierro92@gmail.com>

maxgio92 · 2023-09-14T11:56:58Z

A short term solution would be to change all post submit triggers to trigger to any change to driverkit/config/... folder, ie from this:
run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/almalinux_.+'
to this:
run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/.+'
In this case, we would trigger ALL jobs everytime, even for unmodified configs, then each job would build its own configs.

PROs:
* super easy solution
CONs:
* we would spawn useless pods

* we would spawn more nodes than needed
Let's assume that each week we build likely ~90% of supported distro, it would not be that impactful (ie: we would just spawn 10% useless pods). The "big" issue would rise when we manually trigger kernel-crawler on a single distro; in that case, instead of just building new drivers for that distro, we would spawn all pods to build all drivers (whose build should be skipped anyway because they are already present on download.falco.org).

@maxgio92 wdyt?

Thank you for this detailed investigation and sorry for my late response.

I'd try as hard as possible to avoid triggering all distros jobs.

maxgio92 · 2023-09-14T12:00:27Z

What about instead of reducing the granularity of the grid to increase the jobs frequency as you suggested, from weekly to one every two days for example, @FedeDP?

PS: in any case IMHO are both short term solutions.

FedeDP · 2023-09-14T12:16:53Z

What about instead of reducing the granularity of the grid to increase the jobs frequency as you suggested, from weekly to one every two days for example

That would unfortunately not solve the issues when we add a completely new driver version from scratch.

FedeDP · 2023-09-14T12:17:27Z

PS: in any case IMHO are both short term solutions.

Agree btw :/

Signed-off-by: Federico Di Pierro <nierro92@gmail.com>

poiana · 2023-12-13T15:47:12Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

FedeDP · 2023-12-13T15:48:18Z

/remove-lifecycle stale

poiana · 2024-03-12T15:49:44Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

FedeDP · 2024-03-12T15:59:00Z

/remove-lifecycle stale

maxgio92 added the kind/bug label May 17, 2023

This was referenced Sep 5, 2023

wip: fix(prow/plugins/trigger): use git client instead of github PushEvent while checking files changed kubernetes/test-infra#30615

Closed

new(driverkit): add new driver version 6.0.0+driver and remove the old 3.0.1+driver #1229

Merged

FedeDP added a commit that referenced this issue Sep 12, 2023

chore(config/jobs): workaround #1158

6bcd02d

Signed-off-by: Federico Di Pierro <nierro92@gmail.com>

FedeDP mentioned this issue Sep 12, 2023

chore(config/jobs): workaround #1158 #1236

Open

FedeDP added a commit that referenced this issue Sep 19, 2023

chore(config/jobs): workaround #1158

3f0d56e

Signed-off-by: Federico Di Pierro <nierro92@gmail.com>

poiana added the lifecycle/stale label Dec 13, 2023

poiana removed the lifecycle/stale label Dec 13, 2023

poiana added the lifecycle/stale label Mar 12, 2024

poiana removed the lifecycle/stale label Mar 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drivers build not triggered for x86_64 #1158

Drivers build not triggered for x86_64 #1158

maxgio92 commented May 17, 2023

FedeDP commented May 19, 2023

FedeDP commented Jul 4, 2023 •

edited

FedeDP commented Jul 10, 2023

FedeDP commented Jul 10, 2023

FedeDP commented Jul 10, 2023

FedeDP commented Jul 17, 2023 •

edited

FedeDP commented Sep 5, 2023 •

edited

FedeDP commented Sep 7, 2023 •

edited

FedeDP commented Sep 7, 2023 •

edited

maxgio92 commented Sep 14, 2023

maxgio92 commented Sep 14, 2023 •

edited

FedeDP commented Sep 14, 2023

FedeDP commented Sep 14, 2023

poiana commented Dec 13, 2023

FedeDP commented Dec 13, 2023

poiana commented Mar 12, 2024

FedeDP commented Mar 12, 2024

Drivers build not triggered for x86_64 #1158

Drivers build not triggered for x86_64 #1158

Comments

maxgio92 commented May 17, 2023

FedeDP commented May 19, 2023

FedeDP commented Jul 4, 2023 • edited

FedeDP commented Jul 10, 2023

FedeDP commented Jul 10, 2023

FedeDP commented Jul 10, 2023

FedeDP commented Jul 17, 2023 • edited

FedeDP commented Sep 5, 2023 • edited

FedeDP commented Sep 7, 2023 • edited

FedeDP commented Sep 7, 2023 • edited

maxgio92 commented Sep 14, 2023

maxgio92 commented Sep 14, 2023 • edited

FedeDP commented Sep 14, 2023

FedeDP commented Sep 14, 2023

poiana commented Dec 13, 2023

FedeDP commented Dec 13, 2023

poiana commented Mar 12, 2024

FedeDP commented Mar 12, 2024

FedeDP commented Jul 4, 2023 •

edited

FedeDP commented Jul 17, 2023 •

edited

FedeDP commented Sep 5, 2023 •

edited

FedeDP commented Sep 7, 2023 •

edited

FedeDP commented Sep 7, 2023 •

edited

maxgio92 commented Sep 14, 2023 •

edited