Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Concurrent file uploads #53

Merged
merged 33 commits into from
Jul 3, 2023
Merged

feat: Concurrent file uploads #53

merged 33 commits into from
Jul 3, 2023

Conversation

js-murph
Copy link
Contributor

@js-murph js-murph commented Jun 1, 2023

What?

This adds the ability for the test-collector to upload multiple files in parallel.

Why?

For large projects we are seeing instances where the performance is significantly slower than alternatives for displaying test-summary annotations, sometimes in the magnitude of several minutes slower. This adds an option for configuring the concurrency of the uploads that in my testing so far has helped halve the time spent performing this work.

@mcncl mcncl requested a review from toote June 16, 2023 01:10
tests/pre-exit-success.bats Outdated Show resolved Hide resolved
Co-authored-by: Matías Bellone <toote@users.noreply.github.com>
Copy link
Contributor

@toote toote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have sworn your tests would have failed but there apparently is a bug in the stubbing library in which concurrent calls to the mock requiring a single stub for all calls 🤷

That said, I found that you were also missing 2 test cases:

  • what happens when one of the curl calls fail
  • waiting for uploads to finish (the iterations loop of the code you added)

I wanted to provide a rough implementation for you to look at but the first one is complicated due to my first comment. And for the latter I stumbled with a bug: apparently, the -p option is not universal for all implementations of ps.

If I create a test case with 2 concurrent uploads and change the stub for curl to be sleep 25; echo... . The test still passes unless I also add a assert_output --partial "Waiting for uploads to finish" in which case it fails and shows the output actually shows an error:

   output (23 lines):
     Uploading '3' files matching '**/*/junit-*.xml'
     Uploading './tests/fixtures/junit-1.xml'...
     Uploading './tests/fixtures/junit-2.xml'...
     ps: unrecognized option: p
     BusyBox v1.31.1 () multi-call binary.

Comment on lines 75 to 77
assert_output --partial "curl success 1"
assert_output --partial "curl success 1"
assert_output --partial "curl success 1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these three lines do not guarantee that they appear 3 times, if curl success 1 appears just one time, they would all pass.

If you take my previous suggestion to tailor the output according to the file being uploaded, you can check for each independently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative would be to actually count the line amounts:

  assert_equal "$(echo "$output" | grep -c "curl success")" "3"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented your alternative suggestion, I think that sufficiently covers the intent of the test.

Copy link
Contributor Author

@js-murph js-murph Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix the cross-platform compatibility issues I've replaced ps -p with kill -0 it was the only test I could find that had consistent behaviour across platforms to test a pid exists. kill -0 can have some surprising behaviour depending on the users permission to the pid... but I think given all of the pids we are dealing with are children of the script it should be fine.

Copy link
Contributor Author

@js-murph js-murph Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been attempting to add the missing tests, but I just seem to keep running into the error:

2023-06-19 06:37:02 UTC | ✗ Concurrency waits when the queue is full
-- | --
  | 2023-06-19 06:37:04 UTC | (from function `unstub' in file /usr/lib/bats/bats-mock/stub.bash, line 70,
  | 2023-06-19 06:37:04 UTC | in test file tests/pre-exit-success.bats, line 89)
  | 2023-06-19 06:37:04 UTC | `unstub curl' failed

I'm not very familiar with bats, it seems as though I'm likely configuring it wrong but bats-mock is hiding the underlying error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alrighty, I think I have a working implementation! Unfortunately it looks like I need to check for "error, continuing" rather than the timeout message specifically... but afaict that might be unavoidable as messages outside of the scope of the command group seem to get lost.

Regardless, it still tests that the first group is failing and that we return only one success, which might be close enough for our purposes here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, my tests suggest that nothing actually happens if the process is killed so you can not check for either output:

     Uploading '3' files matching '**/*/junit-*.xml'
     Uploading './tests/fixtures/junit-1.xml'...
     Uploading './tests/fixtures/junit-2.xml'...
     sleeping for data=@"./tests/fixtures/junit-1.xml"
     sleeping for data=@"./tests/fixtures/junit-2.xml"
     Upload '2678' has been running for more than '3' seconds, killing it
     Upload '2679' has been running for more than '3' seconds, killing it
     Uploading './tests/fixtures/junit-3.xml'...
     curl success data=@"./tests/fixtures/junit-3.xml"

But I believe it is testing that the process does not break explosively :)

Copy link
Contributor Author

@js-murph js-murph Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, weird, because I was seeing:

Uploading '3' files matching '**/*/junit-*.xml'
Uploading './tests/fixtures/junit-1.xml'...
Uploading './tests/fixtures/junit-3.xml'...
sleeping for data=@"./tests/fixtures/junit-1.xml"
curl success data=@"./tests/fixtures/junit-3.xml"
/plugin/hooks/pre-exit: line 180: kill: (83) - No such process
Uploading './tests/fixtures/junit-2.xml'...
Error uploading, will continue

I wonder why we're seeing different output 🤔. Probably not a big deal, interesting though.

Edit: The test command I am using is docker run --rm -ti -e CURL_STUB_DEBUG=/dev/tty -v $PWD:/plugin buildkite/plugin-tester:v4.1.0 bats tests/pre-exit-success.bats -f 'Concurrency gracefully handles command-group timeout'

Copy link
Contributor

@toote toote Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I run it, the first batch is 1 and 2, the second batch is just 3 so the first two processes get killed. In your run, the first batch was 1 and 3, only one of those gets killed so the second one generates the No such process line (which points to the fact that the command needs stderr redirection to /dev/null It may be related to how files are laid out in the filesystem for the order in which find reports them back.

Most interestingly: the last one fails because of the curl stub only having one line (in your scenario it would need two) and that causes the Error uploading line, otherwise you should be seeing sleeping for on the 2 file as well. Even if the test succeeded for the output, it would fail when unstubbing curl :(

Instead of trying to figure out exactly how to work around this non-determinism, I would suggest you add a | sort to the FIND_CMD[@] call in line 156 so that the order of files is always the same no matter what.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that fixed that issue :)

tests/pre-exit-success.bats Outdated Show resolved Hide resolved
hooks/pre-exit Outdated Show resolved Hide resolved
js-murph and others added 6 commits June 19, 2023 14:28
Co-authored-by: Matías Bellone <toote@users.noreply.github.com>
* fix: Use timeout var instead of magic numbers for concurrency
* fix: Add missing concurrency tests
hooks/pre-exit Outdated Show resolved Hide resolved
hooks/pre-exit Outdated
Comment on lines 198 to 200
if [[ "$ANNOTATE" != "false" ]]; then
save-report-url "${CURL_RESP_FILE}"
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized there is a bug here: the file is shared among all upload calls so there is a race condition as to which URL is saved for annotations and which are not. There are two alternatives to fix this:

1- make a note/check about annotation not compatible with concurrent uploads
2- change the upload function to accept another parameter that defines the filename to save the file that you can then pass through to the save-report-url

Whichever you prefer I'll be OK with

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch. I'll investigate this further.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fixed now, I use mktemp to generate a temporary file in the command group if CURL_RESP_FILE is unset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fixed now, I use mktemp to generate a temporary file in the command group if CURL_RESP_FILE is unset.

@js-murph
Copy link
Contributor Author

js-murph commented Jun 30, 2023

Hmmm it seems there is a failing test, but I'm unable to reproduce. The tests run successfully both locally for me and on our buildkite cluster.

@toote
Copy link
Contributor

toote commented Jul 1, 2023

Hmmm it seems there is a failing test, but I'm unable to reproduce. The tests run successfully both locally for me and on our buildkite cluster.

Same thing happens to me in my local environment. (see note below)

I've been thinking what can be done about it without actually committing some debugging information and the best I could come up with wold be to change that test's assertion (the one that is failing) to one that is more standard so that more information is printed when it fails. In particular, do something like:

  assert_line --regexp '^curl success .*junit-1.xml"'
  assert_line --regexp '^curl success .*junit-2.xml"'
  assert_line --regexp '^curl success .*junit-3.xml"'

While this may not cause the test to succeed when it wasn't, it will probably show more information (including the whole output) when it does fail.

Based on the explanation below, I believe we need to pick one of two bad options:
1- skip the tests by adding a skip 'bats-mock does not correctly support concurrent exeuctions and makes this test flaky' to the first line of the test

This would leave the tests ready to be run when the underlying bats-mock library is fixed (probably with some modifications) but means that functionality may break if the tests are never run.

2- ask someone to re-run the test until the race condition is not triggered

This takes advantage of all the hard-work you've done and actually tests this feauter. On the other hand, not only it puts additional burden on the staff when the test fails, but also reduces the reliability of tests as an indicator that something is wrong with changes for every PR and merge going forward.

@pzeballos: do you (or anyone else in the team) have any opinion on the matter?

Note

While writing the suggestion and making sure that it would work I was able to duplicate the issue. There are actually two scenarios in which the test can fail, but both are due to a race condition in bats-mock not correctly supporting concurrent execution of stubs.

In the first scenario (which is easier to explain), the 3rd scenario runs after the first and second one do concurrently. Because there is a single stub for curl, that call fails and causes the whole test to fail in the unstub curl call:

pre-exit-success.bats
 bats-mock(curl): got curl -X POST --silent --show-error --max-time 30 --form format=junit --form data=@"./tests/fixtures/junit-2.xml" --form run_env[CI]=buildkite --form run_env[key]=an-id --form run_env[url]=https://url.com/ --form run_env[branch]=a-branch --form run_env[commit_sha]=a-commit --form run_env[number]=123 --form run_env[job_id]=321 --form run_env[message]=A message --form run_env[collector]=test-collector-buildkite-plugin https://analytics-api.buildkite.com/v1/uploads -H Authorization: Token token="a-secret-analytics-token"
bats-mock(curl): arguments [31] = '-X' 'POST' '--silent' '--show-error' '--max-time' '30' '--form' 'format=junit' '--form' 'data=@\"./tests/fixtures/junit-2.xml\"' '--form' 'run_env\[CI\]=buildkite' '--form' 'run_env\[key\]=an-id' '--form' 'run_env\[url\]=https://url.com/' '--form' 'run_env\[branch\]=a-branch' '--form' 'run_env\[commit_sha\]=a-commit' '--form' 'run_env\[number\]=123' '--form' 'run_env\[job_id\]=321' '--form' 'run_env\[message\]=A\ message' '--form' 'run_env\[collector\]=test-collector-buildkite-plugin' 'https://analytics-api.buildkite.com/v1/uploads' '-H' 'Authorization:\ Token\ token=\"a-secret-analytics-token\"' 
bats-mock(curl): got curl -X POST --silent --show-error --max-time 30 --form format=junit --form data=@"./tests/fixtures/junit-1.xml" --form run_env[CI]=buildkite --form run_env[key]=an-id --form run_env[url]=https://url.com/ --form run_env[branch]=a-branch --form run_env[commit_sha]=a-commit --form run_env[number]=123 --form run_env[job_id]=321 --form run_env[message]=A message --form run_env[collector]=test-collector-buildkite-plugin https://analytics-api.buildkite.com/v1/uploads -H Authorization: Token token="a-secret-analytics-token"
bats-mock(curl): patterns  [31] = '-X' 'POST' '--silent' '--show-error' '--max-time' '30' '--form' 'format=junit' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '\*' '-H' '\*' 
bats-mock(curl): arguments [31] = '-X' 'POST' '--silent' '--show-error' '--max-time' '30' '--form' 'format=junit' '--form' 'data=@\"./tests/fixtures/junit-1.xml\"' '--form' 'run_env\[CI\]=buildkite' '--form' 'run_env\[key\]=an-id' '--form' 'run_env\[url\]=https://url.com/' '--form' 'run_env\[branch\]=a-branch' '--form' 'run_env\[commit_sha\]=a-commit' '--form' 'run_env\[number\]=123' '--form' 'run_env\[job_id\]=321' '--form' 'run_env\[message\]=A\ message' '--form' 'run_env\[collector\]=test-collector-buildkite-plugin' 'https://analytics-api.buildkite.com/v1/uploads' '-H' 'Authorization:\ Token\ token=\"a-secret-analytics-token\"' 
bats-mock(curl): running echo "curl success ${10}"
bats-mock(curl): command result was 0
bats-mock(curl): result 0
bats-mock(curl): patterns  [31] = '-X' 'POST' '--silent' '--show-error' '--max-time' '30' '--form' 'format=junit' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '\*' '-H' '\*' 
bats-mock(curl): running echo "curl success ${10}"
bats-mock(curl): command result was 0
bats-mock(curl): result 0
bats-mock(curl): got curl -X POST --silent --show-error --max-time 30 --form format=junit --form data=@"./tests/fixtures/junit-3.xml" --form run_env[CI]=buildkite --form run_env[key]=an-id --form run_env[url]=https://url.com/ --form run_env[branch]=a-branch --form run_env[commit_sha]=a-commit --form run_env[number]=123 --form run_env[job_id]=321 --form run_env[message]=A message --form run_env[collector]=test-collector-buildkite-plugin https://analytics-api.buildkite.com/v1/uploads -H Authorization: Token token="a-secret-analytics-token"
bats-mock(curl): no plan row found
bats-mock(curl): result 1
bats-mock(curl): unstubbing
 ✗ Uploads multiple files concurrently does not break basic functionality
   (from function `unstub' in file /usr/lib/bats/bats-mock/stub.bash, line 70,
    in test file tests/pre-exit-success.bats, line 69)
     `unstub curl' failed

1 test, 1 failure

In the other scenario, which I believe is what happened in the run of your commit, the 3rd stub is executed after the first one finished but before the second one

pre-exit-success.bats
 bats-mock(curl): got curl -X POST --silent --show-error --max-time 30 --form format=junit --form data=@"./tests/fixtures/junit-1.xml" --form run_env[CI]=buildkite --form run_env[key]=an-id --form run_env[url]=https://url.com/ --form run_env[branch]=a-branch --form run_env[commit_sha]=a-commit --form run_env[number]=123 --form run_env[job_id]=321 --form run_env[message]=A message --form run_env[collector]=test-collector-buildkite-plugin https://analytics-api.buildkite.com/v1/uploads -H Authorization: Token token="a-secret-analytics-token"
bats-mock(curl): arguments [31] = '-X' 'POST' '--silent' '--show-error' '--max-time' '30' '--form' 'format=junit' '--form' 'data=@\"./tests/fixtures/junit-1.xml\"' '--form' 'run_env\[CI\]=buildkite' '--form' 'run_env\[key\]=an-id' '--form' 'run_env\[url\]=https://url.com/' '--form' 'run_env\[branch\]=a-branch' '--form' 'run_env\[commit_sha\]=a-commit' '--form' 'run_env\[number\]=123' '--form' 'run_env\[job_id\]=321' '--form' 'run_env\[message\]=A\ message' '--form' 'run_env\[collector\]=test-collector-buildkite-plugin' 'https://analytics-api.buildkite.com/v1/uploads' '-H' 'Authorization:\ Token\ token=\"a-secret-analytics-token\"' 
bats-mock(curl): got curl -X POST --silent --show-error --max-time 30 --form format=junit --form data=@"./tests/fixtures/junit-2.xml" --form run_env[CI]=buildkite --form run_env[key]=an-id --form run_env[url]=https://url.com/ --form run_env[branch]=a-branch --form run_env[commit_sha]=a-commit --form run_env[number]=123 --form run_env[job_id]=321 --form run_env[message]=A message --form run_env[collector]=test-collector-buildkite-plugin https://analytics-api.buildkite.com/v1/uploads -H Authorization: Token token="a-secret-analytics-token"
bats-mock(curl): patterns  [31] = '-X' 'POST' '--silent' '--show-error' '--max-time' '30' '--form' 'format=junit' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '\*' '-H' '\*' 
bats-mock(curl): arguments [31] = '-X' 'POST' '--silent' '--show-error' '--max-time' '30' '--form' 'format=junit' '--form' 'data=@\"./tests/fixtures/junit-2.xml\"' '--form' 'run_env\[CI\]=buildkite' '--form' 'run_env\[key\]=an-id' '--form' 'run_env\[url\]=https://url.com/' '--form' 'run_env\[branch\]=a-branch' '--form' 'run_env\[commit_sha\]=a-commit' '--form' 'run_env\[number\]=123' '--form' 'run_env\[job_id\]=321' '--form' 'run_env\[message\]=A\ message' '--form' 'run_env\[collector\]=test-collector-buildkite-plugin' 'https://analytics-api.buildkite.com/v1/uploads' '-H' 'Authorization:\ Token\ token=\"a-secret-analytics-token\"' 
bats-mock(curl): running echo "curl success ${10}"
bats-mock(curl): command result was 0
bats-mock(curl): result 0
bats-mock(curl): got curl -X POST --silent --show-error --max-time 30 --form format=junit --form data=@"./tests/fixtures/junit-3.xml" --form run_env[CI]=buildkite --form run_env[key]=an-id --form run_env[url]=https://url.com/ --form run_env[branch]=a-branch --form run_env[commit_sha]=a-commit --form run_env[number]=123 --form run_env[job_id]=321 --form run_env[message]=A message --form run_env[collector]=test-collector-buildkite-plugin https://analytics-api.buildkite.com/v1/uploads -H Authorization: Token token="a-secret-analytics-token"
bats-mock(curl): no plan row found
bats-mock(curl): result 1
bats-mock(curl): patterns  [31] = '-X' 'POST' '--silent' '--show-error' '--max-time' '30' '--form' 'format=junit' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '--form' '\*' '\*' '-H' '\*' 
bats-mock(curl): running echo "curl success ${10}"
bats-mock(curl): command result was 0
bats-mock(curl): result 0
bats-mock(curl): unstubbing
 ✗ Uploads multiple files concurrently does not break basic functionality
   (from function `assert_line' in file /usr/lib/bats/bats-assert/src/assert_line.bash, line 219,
    in test file tests/pre-exit-success.bats, line 77)
     `assert_line --regexp '^curl success .*junit-3.xml"'' failed
   
   -- no output line matches regular expression --
   regexp : ^curl success .*junit-3.xml"
   output (7 lines):
     Uploading '3' files matching '**/*/junit-*.xml'
     Uploading './tests/fixtures/junit-1.xml'...
     Uploading './tests/fixtures/junit-2.xml'...
     Uploading './tests/fixtures/junit-3.xml'...
     curl success data=@"./tests/fixtures/junit-1.xml"
     Error uploading, will continue
     curl success data=@"./tests/fixtures/junit-2.xml"
   --
   

1 test, 1 failure

@pzeballos
Copy link
Contributor

pzeballos commented Jul 1, 2023

😳 this is definitely a tricky case! bats 😅
I'll go with option (1): "skip the tests by adding a skip 'bats-mock does not correctly support concurrent executions and makes this test flaky' to the first line of the test"

And we'll see what we can do to bring the tests back (I manually retired the step until it passed)

Thank you very much @js-murph for all your efforts and the amazing contribution you are making to the plugin 🤗 Do you think you can do that last change (add a skip with a message)? 🙏🏻 and I believe that we'll be ready to finally merge it 🤞🏻

@js-murph
Copy link
Contributor Author

js-murph commented Jul 2, 2023

Thank you both! I've pushed a skip on the concurrency does not break basic functionality test. Appreciate the explanation and helping diagnose the problem! :)

@pzeballos
Copy link
Contributor

Thank you very much @js-murph! 🤗

Copy link
Contributor

@toote toote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

@pzeballos pzeballos merged commit 8b7e6e2 into buildkite-plugins:main Jul 3, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants