Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create github-actions-usage lambda #810

Merged
merged 25 commits into from Feb 27, 2024
Merged

Create github-actions-usage lambda #810

merged 25 commits into from Feb 27, 2024

Conversation

akash1810
Copy link
Member

@akash1810 akash1810 commented Feb 22, 2024

The contents field of the github_workflows table contains the contents of a repository's GitHub Workflows, as a YAML string. Querying YAML in SQL isn't trivial.

What does this change?

This change adds a new table guardian_github_actions_usage, to track the Actions we're using. It has the schema:

CREATE TABLE guardian_github_actions_usage (
    evaluated_on TIMESTAMP(3) NOT NULL,
    full_name TEXT NOT NULL,
    workflow_path TEXT NOT NULL,
    workflow_uses TEXT[] NOT NULL
);

For ease, a view view_github_actions is also created, which yields results similar to1:

full_name workflow_path action_name version
guardian/service-catalogue .github/workflows/ci.yml actions/checkout v4
guardian/service-catalogue .github/workflows/ci.yml actions/setup-node v4
guardian/service-catalogue .github/workflows/snyk.yml guardian/.github/.github/workflows/sbt-node-snyk.yml main

The guardian_github_actions_usage table is populated by a new AWS Lambda, triggered once the GitHubRepositories ECS task completes successfully.

The Lambda works as follows:

  1. Get all records from the github_workflows table
  2. For each record returned, parse the contents field
    1. If the field is empty, log a message
    2. Use GitHub's @actions/workflow-parser to parse the contents field2
    3. If validation fails, log a message
  3. Extract the uses string from the contents
  4. Save results to database

Validating the Workflow content against the schema allows us to type-cast the rows with confidence, and without any complex parsing.

The Lambda is invoked when the GitHubRepositories task successfully ends.

Note

  • This change requires a database migration
  • We should monitor the logs, and reduce the number of validation errors over time
  • We're not filtering out archived repositories

Why?

To observe how we're using GitHub Actions across the organisation, and how it compares to the guidance provided by GitHub.

Other usage for this data includes answering:

How has it been verified?

  • Ran locally
  • Deployed to CODE, started the GitHubRepositories task, observed lambda starting once the task completes, and the new table being populated

Notes

Comparing @actions/workflow-parser to Schema Store

Previous commits in this branch used Schema Store to parse the contents field of the github_workflows table.

Schema Store is a community maintained set of YAML schemas. It's also used by JetBrains IDEs (and others, I imagine!).

@actions/workflow-parser is used within the GitHub Actions VS Code extension. In theory, @actions/workflow-parser will be kept up to date. It's unclear if this library is intended for external use, or if its an internal library. However, it's on NPM...

Its worth noting Schema Store and @actions/workflow-parser yield slightly different results. At present, the github_workflows table has 1315 rows. Schema Store can process 1305 of them, and @actions/workflow-parser can process 1301 of them.

Additionally, Schema Store and @actions/workflow-parser have differing results for some files:

I'll create an issue on the two repositories to ask for clarification here.

The default error messages provided by @actions/workflow-parser are more helpful, describing the line and column of an error:

Failed to parse workflow - path:.github/workflows/ci.yml repository:guardian/grid errors:2 [
  `.github/workflows/ci.yml (Line: 82, Col: 13): Unexpected symbol: '"'. Located at position 1 within expression: "! github.event.pull_request.head.repo.fork"`,
  `.github/workflows/ci.yml (Line: 87, Col: 13): Unexpected symbol: '"'. Located at position 1 within expression: "! github.event.pull_request.head.repo.fork"`
]
Full comparison

[!NOTE]
I've since dropped the memory size of the lambda to 512MB, as 1024MB was evidently too much. These statistics are not exact anymore, but the trend still holds.

Metric Schema Store @actions/workflow-parser
Duration 178020.30 ms 6453.02 ms
Billed Duration 178021 ms 6454 ms
Memory Size 1024 MB 1024 MB
Max Memory Used 256 MB 243 MB
Init Duration 369.99 ms 286.61 ms

Alternative solutions

Just use SQL

An alternative to this change would be to add a contents_as_json column to the CloudQuery table - cloudquery/cloudquery#16846.

However, the schema of a Workflow file (specifically where the uses property can appear) is tricky. It can appear within steps, or outside.

That is, the SQL to extract these values would be complicated (we'd likely continue to abstract it away into a view though).

Footnotes

  1. Some columns removed, for brevity

  2. https://github.com/actions/languageservices/tree/main/workflow-parser is written by GitHub, and used by various GitHub authored GitHub VS Code extensions

packages/github-actions-usage/src/query.test.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/src/query.test.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/src/query.test.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/src/query.ts Outdated Show resolved Hide resolved
@akash1810 akash1810 marked this pull request as ready for review February 23, 2024 09:17
@akash1810 akash1810 requested review from a team as code owners February 23, 2024 09:17
Copy link

@adamnfish adamnfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Super useful feature, and I really like the code style overall. Separating side effects from logic make it easier to test the logic, perhaps we could take advantaghe of this with a couple more tests?

What's the behaviour we want when any part of this fails? Not something to worry about in this PR, but do we want best efforts for the data that it can process, or do we want it to stop completely? In either case, how would we find out that this happened, and resolve it?

I'm also a bit nervous about the schema validation, because it looks like a potential source of failures in the future. There are some comments about this below.

Great stuff though, really exciting to have this data.

packages/cdk/lib/github-actions-usage.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/README.md Outdated Show resolved Hide resolved
packages/github-actions-usage/README.md Outdated Show resolved Hide resolved
packages/github-actions-usage/src/index.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/src/query.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/src/query.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/src/query.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/src/query.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/src/query.ts Outdated Show resolved Hide resolved
scripts/ci.sh Outdated Show resolved Hide resolved
@akash1810 akash1810 force-pushed the aa/github-actions-usage branch 10 times, most recently from 346413a to 1789f8a Compare February 25, 2024 01:56
@akash1810 akash1810 marked this pull request as draft February 26, 2024 14:00
@akash1810 akash1810 force-pushed the aa/github-actions-usage branch 3 times, most recently from a176033 to 2c90698 Compare February 26, 2024 22:43
@akash1810 akash1810 marked this pull request as ready for review February 26, 2024 22:52
By joining `github_workflows` with `github_repositories` at query time,
the volume of code reduces, due to less validation being necessary.
Organise the logic into three distinct steps:
1. Read data
2. Transform data
3. Write data

This should improve readability.
This view shows the archived status of a repository, the name of the Action being used, and the version.
Triggered when the `GitHubRepositories` task stops successfully.
Copy link
Contributor

@chrislomaxjones chrislomaxjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!

packages/github-actions-usage/src/config.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/src/db-read.ts Show resolved Hide resolved
packages/github-actions-usage/src/index.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/src/transform.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/src/transform.ts Outdated Show resolved Hide resolved
packages/github-actions-usage/README.md Show resolved Hide resolved
…low-parser`

Swap the community maintained Schema Store parsing of GitHub Workflow files
for the GitHub authored `@actions/workflow-parser` NPM module.

Interestingly, the two approaches yield slightly different results:
- https://github.com/guardian/service-catalogue/blob/3f25b6c553e1ed2b192bce11cc15e6f25afa7239/.github/workflows/ecs-publish.yml
  - Schema Store fails to parse
  - `@actions/workflow-parser` able to parse
- https://github.com/guardian/grid/blob/e37e3acaeea198beb82896a143cc9c66d200aadc/.github/workflows/ci.yml
  - Schema Store able to parse
  - `@actions/workflow-parser` fails to parse

Anecdotally, this version also appears to be more performant.

`@actions/workflow-parser` is an ESM only module, and it was proving tricky to get Jest to work with it.
For this reason, the tests use Node's native test runner.
Copy link
Contributor

@chrislomaxjones chrislomaxjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff - this is going to be really useful!

@akash1810 akash1810 merged commit c1c687c into main Feb 27, 2024
2 checks passed
@akash1810 akash1810 deleted the aa/github-actions-usage branch February 27, 2024 12:07
@akash1810
Copy link
Member Author

akash1810 commented Feb 27, 2024

Rollout steps:

  • Perform database migration
  • Create database user
  • Manually run the lambda, to observe the lambda working
  • Manually run the GitHubRepositories ECS task, to trigger the lambda, to observe the infrastructure working

@tjsilver
Copy link
Contributor

This is great - thanks @akash1810 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants