
Tracking issue: intermittent test failures #200

Open · njsmith opened this issue Jun 11, 2017 · 49 comments
njsmith (Member) commented Jun 11, 2017

It's the nature of I/O libraries like Trio that their test suites are prone to weird intermittent failures. But they're often hard to track down, and usually the way you encounter them is that you're trying to land some other unrelated feature and the CI randomly fails, so the temptation is to click "re-run build" and worry about it later.

This temptation must be resisted. If left unchecked, you eventually end up with tests that fail all the time for unknown reasons, no one trusts them, and they become a constant drag on development. Flaky tests must be eradicated.

But to make things extra fun, there's another problem: CI services genuinely are a bit flaky, so when you see a weird failure or lock-up in the tests then it's often unclear whether this is a bug in our code, or just some cloud provider having indigestion. And you don't want to waste hours trying to reproduce indigestion. Which means we need to compare notes across multiple failures. Which is tricky when I see one failure, and you see another, and neither of us realizes that we're seeing the same thing. Hence: this issue.

What to do if you see a weird test failure that makes no sense:

  • Visit this issue; it's #200, so it's hopefully easy to remember.

  • Check to see if anyone else has reported the same failure

  • Either way, add a note recording what you saw. Make sure to link to the failed test log.

  • Special notes for specific CI services:

    • If it's a failed azure pipelines run, then go to the build page, click the "..." menu icon in the upper right, and select "Retain build". (Otherwise the logs will be deleted after 30 days or something like that.)
    • If it's a failed travis-ci run, DO NOT CLICK THE "RESTART BUILD" OR "RESTART JOB" BUTTON! That will wipe out the log and replace it with your new run, so we lose the information about what failed. Instead, close and then re-open the PR; this will tickle Travis into re-testing your commit, but in a way that gives the new build a new URL, so that the old log remains accessible.

Issues we're currently monitoring

njsmith (Member Author) commented Jun 17, 2017

Regarding the weird pypy nightly freeze in test_local: I downloaded the pypy-c-jit-91601-609a3cdf9cf7-linux64 nightly and have let it loop over the trio test suite for the last few hours on my laptop, and I haven't been able to reproduce the problem so far. (Though I did get two failures in test_ki.py, both on this line, which is getting "did not raise KeyboardInterrupt". The test sessions otherwise finished normally.)

njsmith changed the title from "testing.py needs to get broken down into submodules" to "Tracking issue: intermittent test failures" on Jun 17, 2017
njsmith (Member Author) commented Aug 4, 2017

#119 is now fixed, I think / hope. Demoting it to "on the radar".

njsmith (Member Author) commented Aug 8, 2017

Got fed up and fixed #140 :-)

njsmith (Member Author) commented Aug 18, 2017

#140 came back. This makes no sense at all.

njsmith (Member Author) commented Sep 7, 2017

Freeze in test_local.py::test_run_local_simultaneous_runs on pypy3 5.8 – maybe just travis being flaky, maybe something more.

njsmith (Member Author) commented Nov 6, 2017

Here's a weird one: https://travis-ci.org/python-trio/trio/jobs/298164123

It looks like in our test run for CPython 3.6.2 on macOS, one of our calls to the synchronous stdlib function SSLSocket.unwrap() raised an SSLWantWriteError. That should be impossible for a synchronous call, I think? Maybe this is some weird intermittent bug in the stdlib ssl module?

njsmith (Member Author) commented Dec 1, 2017

Here's a new one, I guess somehow introduced by #358: a timeout test failing on Windows because a 1 second sleep is measured to take just a tinnnnny bit less than 1 second.
https://ci.appveyor.com/project/njsmith/trio/build/1.0.768/job/3lbdyxl63q3h9s21
#361 attempts a diagnosis and fix.
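
Judging from the check_takes_about code quoted later in this thread (see the test_sleep log from Sep 2020), the tolerance ended up looking roughly like the sketch below; the helper name is illustrative, not Trio's:

        def duration_ok(dur, expected_dur):
            # A hair of negative slack (a few ULPs) absorbs Windows' coarse-clock
            # rounding; the 1.5x upper bound absorbs CI scheduling jitter.
            return (1 - 1e-8) <= (dur / expected_dur) < 1.5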

njsmith (Member Author) commented Dec 5, 2017

The weird SSL failure happened again: https://travis-ci.org/python-trio/trio/jobs/311618077
Again on CPython 3.6 on MacOS.

Filed upstream as bpo-32219. Possibly for now we should ignore SSLWantWriteError there as a workaround.

Edit: #365 is the PR for ignoring it.
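
For the record, the shape of that workaround is roughly the sketch below (the actual change in #365 may differ; the helper name is illustrative):

        import ssl

        def unwrap_ignoring_want_write(sslsock):
            # Workaround sketch for bpo-32219: on macOS CPython 3.6 the blocking
            # SSLSocket.unwrap() occasionally raises SSLWantWriteError; swallow it
            # and let the caller retry.
            try:
                return sslsock.unwrap()
            except ssl.SSLWantWriteError:
                return None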

njsmith (Member Author) commented Dec 21, 2017

Another freeze on PyPy nightly in tests/test_local.py::test_run_local_simultaneous_runs: https://travis-ci.org/python-trio/trio/jobs/319497598

Same thing happened on Sept. 7, above: #200 (comment)
And back in June: #200 (comment)

Filed a bug: #379

njsmith (Member Author) commented Feb 21, 2018

Sigh: #447

njsmith (Member Author) commented Apr 21, 2018

#508

njsmith (Member Author) commented May 19, 2018

There was a mysterious appveyor build failure here: #535

pquentin (Member) commented Jun 13, 2018

Strange PyPy nightly failure: https://travis-ci.org/python-trio/trio/jobs/391945139

Since it happened on master, I can't close/reopen a PR, but restarting the job reproduced the same failure.

(I think someone restarted the job above and it finally worked: the job is now green.)

Fuyukai (Member) commented Jul 29, 2018

Jenkins keeps eating pull requests with a segfault (but only sometimes). Looks like a bug in the immutables library - but I can't reproduce it locally, and I don't know how to get the core dump.

njsmith (Member Author) commented Jul 29, 2018

Here's a log with the segfault on Jenkins: https://ci.cryptography.io/blue/rest/organizations/jenkins/pipelines/python-trio/pipelines/trio/branches/PR-575/runs/2/nodes/6/steps/33/log/?start=0

The crash-handler traceback shows it as happening on line 27 of contextvars/__init__.py, which seems to be:

        self._data = immutables.Map()

And immutables.Map is a complex data structure implemented in C, so agreed that this smells like a bug in that code.

Filed a bug upstream here: MagicStack/immutables#7
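
For context, a minimal sketch of where that allocation happens, assuming the Jenkins workers run Python 3.6 and therefore use the contextvars backport from PyPI (the stdlib contextvars on 3.7+ has no immutables dependency):

        # Merely constructing a Context on the backport executes the
        # `self._data = immutables.Map()` line the crash handler pointed at.
        import contextvars

        ctx = contextvars.Context()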

njsmith (Member Author) commented Jul 30, 2018

The Jenkins thing has been worked around by #583... but now that we're running MacOS tests on Travis, we get #584 instead.

njsmith (Member Author) commented Sep 26, 2018

A segfault in PyPy 3.6 nightly, apparently related to the faulthandler timeout firing in trio/tests/test_ssl.py::test_renegotiation_randomized: https://travis-ci.org/python-trio/trio/jobs/433409577

Reported it on the #pypy IRC channel anyway, though there's not much to go on yet.

njsmith (Member Author) commented Oct 4, 2018

Another strange pypy 3.6 nightly faulthandler traceback: https://travis-ci.org/python-trio/trio/jobs/436962955

I don't really understand this one at all.

wgwz (Contributor) commented Feb 20, 2020

What if we change it to mkdir -p?

       -p, --parents
              no error if existing, make parent directories as needed

njsmith (Member Author) commented Feb 21, 2020

Sure, that might work too. We're just flailing in the dark here, so it's probably too optimistic to hope we can make rational fine-grained choices between similar options :-)

njsmith (Member Author) commented Apr 29, 2020

Recently hit on GitHub Actions, Windows (3.8, x64, with non-IFS LSP):

2020-04-29T19:02:16.5149922Z ================================== FAILURES ===================================
2020-04-29T19:02:16.5150729Z ____________________ test_WaitForMultipleObjects_sync_slow ____________________
2020-04-29T19:02:16.5150934Z 
2020-04-29T19:02:16.5151963Z     @slow
2020-04-29T19:02:16.5152264Z     async def test_WaitForMultipleObjects_sync_slow():
2020-04-29T19:02:16.5152519Z         # This does a series of test in which the main thread sync-waits for
2020-04-29T19:02:16.5153005Z         # handles, while we spawn a thread to set the handles after a short while.
2020-04-29T19:02:16.5153264Z     
2020-04-29T19:02:16.5153831Z         TIMEOUT = 0.3
2020-04-29T19:02:16.5154042Z     
2020-04-29T19:02:16.5154267Z         # One handle
2020-04-29T19:02:16.5154463Z         handle1 = kernel32.CreateEventA(ffi.NULL, True, False, ffi.NULL)
2020-04-29T19:02:16.5154710Z         t0 = _core.current_time()
2020-04-29T19:02:16.5154947Z         async with _core.open_nursery() as nursery:
2020-04-29T19:02:16.5155174Z             nursery.start_soon(
2020-04-29T19:02:16.5155411Z                 trio.to_thread.run_sync, WaitForMultipleObjects_sync, handle1
2020-04-29T19:02:16.5155659Z             )
2020-04-29T19:02:16.5156204Z             await _timeouts.sleep(TIMEOUT)
2020-04-29T19:02:16.5156565Z             # If we would comment the line below, the above thread will be stuck,
2020-04-29T19:02:16.5156788Z             # and Trio won't exit this scope
2020-04-29T19:02:16.5157024Z             kernel32.SetEvent(handle1)
2020-04-29T19:02:16.5157228Z         t1 = _core.current_time()
2020-04-29T19:02:16.5157443Z         assert TIMEOUT <= (t1 - t0) < 2.0 * TIMEOUT
2020-04-29T19:02:16.5157672Z         kernel32.CloseHandle(handle1)
2020-04-29T19:02:16.5157902Z         print('test_WaitForMultipleObjects_sync_slow one OK')
2020-04-29T19:02:16.5158129Z     
2020-04-29T19:02:16.5158349Z         # Two handles, signal first
2020-04-29T19:02:16.5158722Z         handle1 = kernel32.CreateEventA(ffi.NULL, True, False, ffi.NULL)
2020-04-29T19:02:16.5159122Z         handle2 = kernel32.CreateEventA(ffi.NULL, True, False, ffi.NULL)
2020-04-29T19:02:16.5159386Z         t0 = _core.current_time()
2020-04-29T19:02:16.5159658Z         async with _core.open_nursery() as nursery:
2020-04-29T19:02:16.5159891Z             nursery.start_soon(
2020-04-29T19:02:16.5160368Z                 trio.to_thread.run_sync, WaitForMultipleObjects_sync, handle1,
2020-04-29T19:02:16.5160603Z                 handle2
2020-04-29T19:02:16.5160837Z             )
2020-04-29T19:02:16.5161061Z             await _timeouts.sleep(TIMEOUT)
2020-04-29T19:02:16.5161320Z             kernel32.SetEvent(handle1)
2020-04-29T19:02:16.5161549Z         t1 = _core.current_time()
2020-04-29T19:02:16.5161867Z >       assert TIMEOUT <= (t1 - t0) < 2.0 * TIMEOUT
2020-04-29T19:02:16.5162633Z E       assert (70981.67862983143 - 70980.60035523142) < (2.0 * 0.3)
2020-04-29T19:02:16.5162845Z 
2020-04-29T19:02:16.5163600Z C:\hostedtoolcache\windows\Python\3.8.2\x64\Lib\site-packages\trio\tests\test_wait_for_object.py:106: AssertionError
2020-04-29T19:02:16.5164689Z ---------------------------- Captured stdout call -----------------------------
2020-04-29T19:02:16.5164985Z test_WaitForMultipleObjects_sync_slow one OK
2020-04-29T19:02:16.5173571Z ============================== warnings summary ===============================
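
For reference, the numbers in the failing assertion work out to 70981.679 - 70980.600 ≈ 1.08 s against an allowed ceiling of 2 × 0.3 s = 0.6 s, so the second sync wait overran its window by roughly half a second.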

njsmith (Member Author) commented May 29, 2020

Here's an interesting example of a flaky test: test_run_in_worker_thread_abandoned wants to make sure that a background thread didn't spew any traceback on the console. But it produces a false positive when random ResourceWarnings happen to be spewed on the console at just the wrong time:

        # Make sure we don't have a "Exception in thread ..." dump to the console:
        out, err = capfd.readouterr()
>       assert not out and not err
E       AssertionError: assert (not '' and not 'Exception ignored in: <socket.socket fd=65, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>\nReso...nResourceWarning: unclosed <socket.socket fd=80, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>\n')

https://github.com/python-trio/trio/pull/1551/checks?check_run_id=720114354

njsmith added a commit to njsmith/trio that referenced this issue May 29, 2020
njsmith (Member Author) commented Jun 1, 2020

Mysterious hang in test_warn_on_failed_cancel_terminate, on the Fedora build: https://travis-ci.org/github/python-trio/trio/jobs/693376980

There aren't really any useful details in the log. I think the only place where we could hang is at the end of the nursery block waiting for the task to exit, and we do call proc.terminate(), so either the terminate failed somehow or the detect-when-subprocess-exited code failed somehow? This build is using the pidfd subprocess management code so it could potentially indicate some flakiness there? Unclear.

pquentin (Member) commented Jun 4, 2020

I think I've seen this one a few times already.

https://github.com/python-trio/trio/pull/1574/checks?check_run_id=739572585

___________________ test_warn_deprecated_no_instead_or_issue ___________________

recwarn_always = WarningsRecorder(record=True)

    def test_warn_deprecated_no_instead_or_issue(recwarn_always):
        # Explicitly no instead or issue
        warn_deprecated("water", "1.3", issue=None, instead=None)
>       assert len(recwarn_always) == 1
E       assert 5 == 1
E         +5
E         -1

/Users/runner/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/trio/tests/test_deprecate.py:46: AssertionError

njsmith (Member Author) commented Jun 4, 2020 via email

pquentin (Member) commented Jun 8, 2020

@njsmith We're using recwarn + .pop already, which is identical to pytest.warns + .pop. I think @oremanj has found the correct fix in #1851, which I just merged.
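
For reference, the two patterns being compared look roughly like this (a sketch, not the actual tests in test_deprecate.py; trio._deprecate is an internal module):

        import pytest
        from trio import TrioDeprecationWarning
        from trio._deprecate import warn_deprecated

        def test_with_recwarn(recwarn):
            # pytest's recwarn fixture records every warning raised in the test body.
            warn_deprecated("water", "1.3", issue=None, instead=None)
            assert len(recwarn) == 1
            recwarn.pop(TrioDeprecationWarning)

        def test_with_pytest_warns():
            # pytest.warns() gives an equivalent recorder scoped to the with-block.
            with pytest.warns(TrioDeprecationWarning) as record:
                warn_deprecated("water", "1.3", issue=None, instead=None)
            assert len(record) == 1

Either way, unrelated warnings leaking in (e.g. ResourceWarnings) can inflate the count, which is presumably what the "assert 5 == 1" failure above shows.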

pquentin (Member) commented Jun 9, 2020

test_open_tcp_listeners_backlog failed again on macOS 3.8: https://github.com/python-trio/trio/pull/1575/checks?check_run_id=738751745

pquentin (Member):

New failure in the thread cache tests: #1604

njsmith (Member Author) commented Aug 27, 2020

Transient failure in test_pipes on macOS, maybe just a too-aggressive timeout: #1701 (comment)

pquentin (Member):

test_signals (as reported in #1170, it seems) failed again in #1705

pquentin (Member) commented Sep 4, 2020

test_pipes failed again on macOS 3.8: #1713 (comment)

njsmith (Member Author) commented Sep 5, 2020

> test_signals (as reported in #1170, it seems) failed again in #1705

This failure is super weird, and different from the issue in #1170. The test essentially does:

        with move_on_after(1.0) as scope:
            async with await open_process(SLEEP(3600)) as proc:
                proc.terminate()
        assert not scope.cancelled_caught
        assert proc.returncode == -SIGTERM

...and then the test fails on the last line because proc.returncode is -9 == -SIGKILL, when we were expecting -15 == -SIGTERM.

So it seems like somehow, the child process is dying from a SIGKILL. How could that be? The __aexit__ from the async with proc: block can send a SIGKILL to the process, but only after await proc.wait() either returns normally or is cancelled, and only after verifying that proc._proc.returncode is None:

        try:
            await self.wait()
        finally:
            if self._proc.returncode is None:
                self.kill()
                with trio.CancelScope(shield=True):
                    await self.wait()

I don't think wait could be cancelled here, because the test code does assert not scope.cancelled_caught to confirm that the timeout isn't firing. So that suggests that proc.wait must have returned after the SIGTERM, so proc._proc.returncode should have already been set to -SIGTERM. In fact, the code for wait even says assert self._proc.returncode is not None. So the finally: block in Process.aclose shouldn't have sent a SIGKILL.

So this seems to be one of those "that can't happen" errors... I don't know where this SIGKILL could be coming from.
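
As background for the -9 / -15 reasoning above, here is the returncode convention at work (POSIX-only sketch):

        import signal
        import subprocess

        p = subprocess.Popen(["sleep", "3600"])
        p.terminate()                            # sends SIGTERM (15)
        p.wait()
        assert p.returncode == -signal.SIGTERM   # -15; a SIGKILL death would give -9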

> test_pipes failed again on macOS 3.8: #1713 (comment)

I think this one is just a too-aggressive timeout: #1715

pquentin (Member):

test_sleep failure in #1722

2020-09-12T11:46:46.7648568Z 
2020-09-12T11:46:46.7648713Z     @slow
2020-09-12T11:46:46.7648917Z     async def test_sleep():
2020-09-12T11:46:46.7649137Z         async def sleep_1():
2020-09-12T11:46:46.7649425Z             await sleep_until(_core.current_time() + TARGET)
2020-09-12T11:46:46.7649860Z     
2020-09-12T11:46:46.7650124Z         await check_takes_about(sleep_1, TARGET)
2020-09-12T11:46:46.7650376Z     
2020-09-12T11:46:46.7650584Z         async def sleep_2():
2020-09-12T11:46:46.7650818Z             await sleep(TARGET)
2020-09-12T11:46:46.7651032Z     
2020-09-12T11:46:46.7651302Z >       await check_takes_about(sleep_2, TARGET)
2020-09-12T11:46:46.7651518Z 
2020-09-12T11:46:46.7652356Z C:\hostedtoolcache\windows\Python\3.8.5\x64\lib\site-packages\trio\tests\test_timeouts.py:54: 
2020-09-12T11:46:46.7652875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2020-09-12T11:46:46.7653051Z 
2020-09-12T11:46:46.7653400Z f = <function test_sleep.<locals>.sleep_2 at 0x000001BD6DA138B0>
2020-09-12T11:46:46.7653775Z expected_dur = 1.0
2020-09-12T11:46:46.7653996Z 
2020-09-12T11:46:46.7654303Z     async def check_takes_about(f, expected_dur):
2020-09-12T11:46:46.7654688Z         start = time.perf_counter()
2020-09-12T11:46:46.7655086Z         result = await outcome.acapture(f)
2020-09-12T11:46:46.7655492Z         dur = time.perf_counter() - start
2020-09-12T11:46:46.7655816Z         print(dur / expected_dur)
2020-09-12T11:46:46.7656243Z         # 1.5 is an arbitrary fudge factor because there's always some delay
2020-09-12T11:46:46.7656782Z         # between when we become eligible to wake up and when we actually do. We
2020-09-12T11:46:46.7657330Z         # used to sleep for 0.05, and regularly observed overruns of 1.6x on
2020-09-12T11:46:46.7657881Z         # Appveyor, and then started seeing overruns of 2.3x on Travis's macOS, so
2020-09-12T11:46:46.7658424Z         # now we bumped up the sleep to 1 second, marked the tests as slow, and
2020-09-12T11:46:46.7658950Z         # hopefully now the proportional error will be less huge.
2020-09-12T11:46:46.7659304Z         #
2020-09-12T11:46:46.7659687Z         # We also also for durations that are a hair shorter than expected. For
2020-09-12T11:46:46.7660219Z         # example, here's a run on Windows where a 1.0 second sleep was measured
2020-09-12T11:46:46.7660635Z         # to take 0.9999999999999858 seconds:
2020-09-12T11:46:46.7661220Z         #   https://ci.appveyor.com/project/njsmith/trio/build/1.0.768/job/3lbdyxl63q3h9s21
2020-09-12T11:46:46.7661934Z         # I believe that what happened here is that Windows's low clock resolution
2020-09-12T11:46:46.7662524Z         # meant that our calls to time.monotonic() returned exactly the same
2020-09-12T11:46:46.7663101Z         # values as the calls inside the actual run loop, but the two subtractions
2020-09-12T11:46:46.7663758Z         # returned slightly different values because the run loop's clock adds a
2020-09-12T11:46:46.7664297Z         # random floating point offset to both times, which should cancel out, but
2020-09-12T11:46:46.7664799Z         # lol floating point we got slightly different rounding errors. (That
2020-09-12T11:46:46.7665276Z         # value above is exactly 128 ULPs below 1.0, which would make sense if it
2020-09-12T11:46:46.7665694Z         # started as a 1 ULP error at a different dynamic range.)
2020-09-12T11:46:46.7666054Z >       assert (1 - 1e-8) <= (dur / expected_dur) < 1.5
2020-09-12T11:46:46.7666318Z E       assert (1.5026676000000023 / 1.0) < 1.5
2020-09-12T11:46:46.7666477Z 
2020-09-12T11:46:46.7667579Z C:\hostedtoolcache\windows\Python\3.8.5\x64\lib\site-packages\trio\tests\test_timeouts.py:35: AssertionError
2020-09-12T11:46:46.7668130Z ---------------------------- Captured stdout call -----------------------------
2020-09-12T11:46:46.7668367Z 1.0003469999999979
2020-09-12T11:46:46.7668537Z 1.5026676000000023
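
For reference, the second measurement means the 1.0 s sleep took about 1.503 s, exceeding the 1.5x fudge factor by less than 0.2%, so this was a marginal miss rather than a hang.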

pquentin (Member) commented Sep 15, 2020

macOS 3.8 test_interactive failure in #1726 (comment)

pquentin (Member) commented Sep 21, 2020

Another macOS 3.8 test_interactive failure in https://github.com/python-trio/trio/pull/1732/checks?check_run_id=1143057289

pquentin (Member):

Yet another macOS 3.8 test_interactive failure in https://github.com/python-trio/trio/pull/1733/checks?check_run_id=1175835741.

I think Python 3.8 changed subprocess handling for macOS (and Linux), but I don't know the details nor why those changes would affect Trio.

pquentin (Member) commented Oct 5, 2020

macOS 3.7 interactive failure: https://github.com/python-trio/trio/pull/1747/checks?check_run_id=1208098695

So at least we know that this isn't specific to 3.8

belm0 (Member) commented Oct 27, 2020

Timeouts on Windows builds are occurring often again:

#200 (comment)

belm0 (Member) commented Apr 6, 2021

"Windows (3.8, x64, with IFS LSP)" test is often running endlessly with this loop:

+ netsh winsock show catalog
+ grep 'Layered Chain Entry'
+ sleep 1
+ netsh winsock show catalog
+ grep 'Layered Chain Entry'
+ sleep 1
