Skip to content

test_asyncio: test_subprocess_consistent_callbacks() fails randomly #108973

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
vstinner opened this issue Sep 6, 2023 · 16 comments
Closed

test_asyncio: test_subprocess_consistent_callbacks() fails randomly #108973

vstinner opened this issue Sep 6, 2023 · 16 comments
Labels
tests Tests in the Lib/test dir topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Copy link
Member

vstinner commented Sep 6, 2023

The following test_asyncio test is unstable and fails randomly on buildbots. I saw failures on Linux and FreeBSD.

FAIL: test_subprocess_consistent_callbacks (test.test_asyncio.test_subprocess.SubprocessThreadedWatcherTests.test_subprocess_consistent_callbacks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto-pgo/build/Lib/test/test_asyncio/test_subprocess.py", line 788, in test_subprocess_consistent_callbacks
    self.loop.run_until_complete(main())
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto-pgo/build/Lib/asyncio/base_events.py", line 664, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z.lto-pgo/build/Lib/test/test_asyncio/test_subprocess.py", line 780, in main
    self.assertEqual(events, [
AssertionError: Lists differ: ['process_exited', ('pipe_data_received', 1, b'stdout')] != [('pipe_data_received', 1, b'stdout'), ('p[95 chars]ted']

First differing element 0:
'process_exited'
('pipe_data_received', 1, b'stdout')

Second list contains 3 additional elements.
First extra element 2:
'pipe_connection_lost'

- ['process_exited', ('pipe_data_received', 1, b'stdout')]
?  ------------------                                    ^

+ [('pipe_data_received', 1, b'stdout'),
?                                      ^

+  ('pipe_data_received', 2, b'stderr'),
+  'pipe_connection_lost',
+  'pipe_connection_lost',
+  'process_exited']

build: https://buildbot.python.org/all/#/builders/442/builds/4900

Linked PRs

@AlexWaygood AlexWaygood added type-bug An unexpected behavior, bug, or error tests Tests in the Lib/test dir topic-asyncio labels Sep 6, 2023
@github-project-automation github-project-automation bot moved this to Todo in asyncio Sep 6, 2023
@gvanrossum
Copy link
Member

Agreed. Feel free to suggest a PR that fixes this.

@vstinner
Copy link
Member Author

vstinner commented Sep 8, 2023

@kumaraditya303: It looks wrong that the "process_exited" event is reported after "pipe_data_received" event. Any idea on how to make sure that events are delivered in the right order?

First, I wanted to just tolerate events in any order in the test. But then I realized that maybe something is wrong here.

@kumaraditya303
Copy link
Contributor

"process_exited" should be the last event, also I am unable to reproduce it on any of my own testing, I guess something is special about freebsd but I don't have bandwidth to investigate for such platforms.

@vstinner
Copy link
Member Author

vstinner commented Sep 9, 2023

I saw failures on Linux and FreeBSD.

So far, I didn't manage to design a reliable way to trigger the bug. I will have a look at it later. I wanted to open an issue to keep track of it. There are many unstable tests these days in Python, and it annoys me to have flaky CI.

@vstinner
Copy link
Member Author

vstinner commented Sep 9, 2023

@kumaraditya303:

"process_exited" should be the last event, also I am unable to reproduce it on any of my own testing, I guess something is special about freebsd but I don't have bandwidth to investigate for such platforms.

Oh sure, it's perfectly fine if you don't have the bandwidth for that.

@kumaraditya303
Copy link
Contributor

Oh sure, it's perfectly fine if you don't have the bandwidth for that.

If you get a reliable way to trigger it on Linux then I will surely look into it.

@vstinner
Copy link
Member Author

vstinner commented Sep 9, 2023

Variant on FreeBSD:

FAIL: test_subprocess_consistent_callbacks (test.test_asyncio.test_subprocess.SubprocessThreadedWatcherTests.test_subprocess_consistent_callbacks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/cirrus-ci-build/Lib/test/test_asyncio/test_subprocess.py", line 788, in test_subprocess_consistent_callbacks
    self.loop.run_until_complete(main())
  File "/tmp/cirrus-ci-build/Lib/asyncio/base_events.py", line 664, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/tmp/cirrus-ci-build/Lib/test/test_asyncio/test_subprocess.py", line 780, in main
    self.assertEqual(events, [
AssertionError: Lists differ: [('pi[29 chars]t'), 'pipe_connection_lost', ('pipe_data_recei[57 chars]ted'] != [('pi[29 chars]t'), ('pipe_data_received', 2, b'stderr'), 'pi[57 chars]ted']
First differing element 1:
'pipe_connection_lost'
('pipe_data_received', 2, b'stderr')
  [('pipe_data_received', 1, b'stdout'),
-  'pipe_connection_lost',
   ('pipe_data_received', 2, b'stderr'),
   'pipe_connection_lost',
+  'pipe_connection_lost',
   'process_exited']

build: https://cirrus-ci.com/task/6142659970990080?logs=test#L878

@sorcio
Copy link
Contributor

sorcio commented Sep 12, 2023

One way to reproduce reliably is to have a child process that exits faster than python -c .... I used this test code to run a number of iterations quickly. On my machine, both ThreadedChildWatcher (on macOS and Linux) and PidfdChildWatcher produce a variety of event sequences with some frequency.

#97009 is insufficient. There is no timing guarantee, and delaying by two event loop iterations only prevents the error to be observable in most cases.

I'm unsure what asyncio wants to promise here. Should we call process_exited as soon as we are signaled that the child process exited, with the least delay? Then it can be out of order, and it often will be, reflecting what the OS is telling us. Should it strictly be called only after all other subprocess-related callbacks? Then it overlaps with connection_lost which is already serving the same purpose.

While the bug itself is not serious, if user code is written following the same pattern as test_subprocess_consistent_callbacks, i.e. expecting no more data after process_exited, then the impact is fairly bad, as it means that data can be lost.


A separate bit of inconsistency is to expect the two pipe_connection_lost to strictly follow the two pipe_data_received. I see no reason why they couldn't be interleaved:

[('pipe_data_received', 1, b'stdout'), 'pipe_connection_lost', ('pipe_data_received', 2, b'stderr'), 'pipe_connection_lost', 'process_exited']

In fact I'm surprised that the two data received are in a consistent order (stdout, then stderr) across thousands of runs. That's also something that shouldn't be guaranteed. I suspect it's an artifact of the order in which the waiters are registered.

To be pedantic, it's also not guaranteed that all data on a pipe is readable in a single call. For the purposes of this test it's unlikely to ever matter.


In short:

  1. process_exited has no guarantee to be called last (but connection_closed does), and I'm unsure if this is something that asyncio should ever promise. If user code relies on this, this is a serious bug.
  2. the only other guarantee is that each pipe_connection_closed comes after the corresponding pipe_data_receiveds, but there are a few orderings possible. Assuming the test is still considered meaningful, it should be relaxed.

@sorcio
Copy link
Contributor

sorcio commented Sep 12, 2023

...suspect it's an artifact of the order in which the waiters are registered.

Just for fun, I switched the order and I could see that every once in a while stderr comes first :) Best use of a multi-core machine is to have fun with accidental expectations of causality!

@vstinner
Copy link
Member Author

Issue seen on s390x RHEL7 LTO 3.x: https://buildbot.python.org/all/#/builders/402/builds/5394

@vstinner
Copy link
Member Author

vstinner added a commit to vstinner/cpython that referenced this issue Sep 14, 2023
Subprocess events can be delivered in any order: tolerate that in the
case.
vstinner added a commit to vstinner/cpython that referenced this issue Sep 14, 2023
Subprocess events can be delivered in any order: tolerate that in the
case.
vstinner added a commit to vstinner/cpython that referenced this issue Sep 14, 2023
Subprocess events can be delivered in any order: tolerate that in the
test.
@vstinner
Copy link
Member Author

I started to write a fix to ensure that events are always delivered in a specific order. For example, create 4 queues of callbacks:

  • priority 0 - pipe_data_received (highest)
  • priority 1 - pipe_connection_lost
  • priority 2 - process_exited
  • priority 3 - others (lowest)

The problem is that asynchronous programming is hard, system programming is hard, and as @sorcio wrote, multi-core CPUs, multithreading, and different event loop implementation make everything less deterministic.

My worry is if that for some reasons one of these events is not delivered, the other ones are queue... maybe forever :-(

When a process completes, I want to know it as soon as possible. Getting its pipe output is honestly secondary. Maybe the process was killed by a signal. Maybe something bad happens, who knows?

@sorcio:

I'm unsure what asyncio wants to promise here.

While it's technically possible to ensure that events are delivered in a determistic way... I'm not sure that it's worth it :-( The drawbacks seem to be expensive and risky to me.

Thanks @sorcio to digging deep into the issue and for your long analysis!

I'm now more confident that the test... should just endorse that async programming make such events less determistic and tolerate that these events can be delivery in any order. I wrote PR #109431 for that.

@gvanrossum
Copy link
Member

Is everyone here aware that "process exited" and "pipe closed" are orthogonal events? The process that exits may have shared the file descriptor with a child process that continues to exist. Or the process could close the file descriptor without exiting. I don't think we should enforce or require an order between such events.

@vstinner
Copy link
Member Author

@gvanrossum:

Is everyone here aware that "process exited" and "pipe closed" are orthogonal events? The process that exits may have shared the file descriptor with a child process that continues to exist. Or the process could close the file descriptor without exiting.

It just kills my idea of ordering events.

I don't think we should enforce or require an order between such events.

Ok, so we are on the same page: my PR #109431 fix the test by accepting that events are not ordered. With my change, the only thing which is tested is that we get all events, in any order.

@sorcio
Copy link
Contributor

sorcio commented Sep 15, 2023

In addition to the change to the test, it would make sense to revert #97009. It introduces unnecessary delay. The issue it was intended to fix was not diagnosed correctly and is not a bug. As I said before (see also Guido's comment), a program should not depend on process_exit to be delivered before any pipe event. It could be worth mentioning in the docs, if it's probable cause for confusion.

The original issue also points to the example in the docs which is not correct because it makes the same wrong assumption.

@vstinner
Copy link
Member Author

It could be worth mentioning in the docs, if it's probable cause for confusion.

Done in my (updated) PR.

The original issue also points to the example in the docs which is not correct because it makes the same wrong assumption.

Example fixed in my (updated) PR.

vstinner added a commit to vstinner/cpython that referenced this issue Sep 15, 2023
Subprocess events can be delivered in any order: tolerate that in the
test.

Revert commit 282edd7.
_child_watcher_callback() calls immediately _process_exited(): don't
add an additional delay with call_soon(). The reverted change didn't
make _process_exited() more determistic: it can still be called
before pipe_connection_lost() for example.

Co-authored-by: Davide Rizzo <[email protected]>
vstinner added a commit to vstinner/cpython that referenced this issue Sep 19, 2023
Subprocess events can be delivered in any order: tolerate that in the
test.

Revert commit 282edd7.
_child_watcher_callback() calls immediately _process_exited(): don't
add an additional delay with call_soon(). The reverted change didn't
make _process_exited() more determistic: it can still be called
before pipe_connection_lost() for example.

Co-authored-by: Davide Rizzo <[email protected]>
vstinner added a commit to vstinner/cpython that referenced this issue Sep 19, 2023
SubprocessProtocol process_exited() method can be called before
pipe_data_received() and pipe_connection_lost() methods. Document it
and adapt the test for that.

Revert commit 282edd7.
_child_watcher_callback() calls immediately _process_exited(): don't
add an additional delay with call_soon(). The reverted change didn't
make _process_exited() more determistic: it can still be called
before pipe_connection_lost() for example.

Co-authored-by: Davide Rizzo <[email protected]>
vstinner added a commit to vstinner/cpython that referenced this issue Sep 19, 2023
SubprocessProtocol process_exited() method can be called before
pipe_data_received() and pipe_connection_lost() methods. Document it
and adapt the test for that.

Revert commit 282edd7.
_child_watcher_callback() calls immediately _process_exited(): don't
add an additional delay with call_soon(). The reverted change didn't
make _process_exited() more determistic: it can still be called
before pipe_connection_lost() for example.

Co-authored-by: Davide Rizzo <[email protected]>
vstinner added a commit that referenced this issue Sep 20, 2023
SubprocessProtocol process_exited() method can be called before
pipe_data_received() and pipe_connection_lost() methods. Document it
and adapt the test for that.

Revert commit 282edd7.
_child_watcher_callback() calls immediately _process_exited(): don't
add an additional delay with call_soon(). The reverted change didn't
make _process_exited() more determistic: it can still be called
before pipe_connection_lost() for example.

Co-authored-by: Davide Rizzo <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 20, 2023
…ythonGH-109431)

SubprocessProtocol process_exited() method can be called before
pipe_data_received() and pipe_connection_lost() methods. Document it
and adapt the test for that.

Revert commit 282edd7.
_child_watcher_callback() calls immediately _process_exited(): don't
add an additional delay with call_soon(). The reverted change didn't
make _process_exited() more determistic: it can still be called
before pipe_connection_lost() for example.

(cherry picked from commit ced6924)

Co-authored-by: Victor Stinner <[email protected]>
Co-authored-by: Davide Rizzo <[email protected]>
@github-project-automation github-project-automation bot moved this from Todo to Done in asyncio Sep 20, 2023
vstinner added a commit that referenced this issue Sep 20, 2023
gh-108973: Fix asyncio SubprocessProtocol doc (#109431)

SubprocessProtocol process_exited() method can be called before
pipe_data_received() and pipe_connection_lost() methods. Document it
and adapt the example in the doc.

Co-authored-by: Davide Rizzo <[email protected]>
(cherry picked from commit ced6924)
csm10495 pushed a commit to csm10495/cpython that referenced this issue Sep 28, 2023
…ython#109431)

SubprocessProtocol process_exited() method can be called before
pipe_data_received() and pipe_connection_lost() methods. Document it
and adapt the test for that.

Revert commit 282edd7.
_child_watcher_callback() calls immediately _process_exited(): don't
add an additional delay with call_soon(). The reverted change didn't
make _process_exited() more determistic: it can still be called
before pipe_connection_lost() for example.

Co-authored-by: Davide Rizzo <[email protected]>
Yhg1s pushed a commit that referenced this issue Oct 2, 2023
…H-109431) (#109609)

gh-108973: Fix asyncio test_subprocess_consistent_callbacks() (GH-109431)

SubprocessProtocol process_exited() method can be called before
pipe_data_received() and pipe_connection_lost() methods. Document it
and adapt the test for that.

Revert commit 282edd7.
_child_watcher_callback() calls immediately _process_exited(): don't
add an additional delay with call_soon(). The reverted change didn't
make _process_exited() more determistic: it can still be called
before pipe_connection_lost() for example.

(cherry picked from commit ced6924)

Co-authored-by: Victor Stinner <[email protected]>
Co-authored-by: Davide Rizzo <[email protected]>
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
…ython#109431)

SubprocessProtocol process_exited() method can be called before
pipe_data_received() and pipe_connection_lost() methods. Document it
and adapt the test for that.

Revert commit 282edd7.
_child_watcher_callback() calls immediately _process_exited(): don't
add an additional delay with call_soon(). The reverted change didn't
make _process_exited() more determistic: it can still be called
before pipe_connection_lost() for example.

Co-authored-by: Davide Rizzo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Tests in the Lib/test dir topic-asyncio type-bug An unexpected behavior, bug, or error
Projects
Status: Done
Development

No branches or pull requests

5 participants