Skip to content
This repository was archived by the owner on Feb 25, 2025. It is now read-only.

Remove flake inducing timeouts #25847

Merged
merged 8 commits into from
Apr 30, 2021
Merged

Remove flake inducing timeouts #25847

merged 8 commits into from
Apr 30, 2021

Conversation

dnfield
Copy link
Contributor

@dnfield dnfield commented Apr 30, 2021

Fixes flutter/flutter#64087
Fixes flutter/flutter#66265

Per style guide (https://github.com/flutter/flutter/wiki/Style-guide-for-Flutter-repo#never-check-if-a-port-is-available-before-using-it-never-add-timeouts-and-other-race-conditions), we shouldn't have timeouts.

This changes all short timeouts that were intended to cause a timeout to zero.

It changes all long timeouts that were intended to be good enough to fml::TimeDelta::Max(). On a sufficiently slow machine, 1s was not enough time to get the first frame drawn, for example. The test will still get killed by the harness if it exceeds the total time budget (currently at 90s).

It also deletes one test entirely. This test is measuring how long the test itself takes to execute, and asserting that those times are greater or less than a certain number depending on the runtime mode. This is inherently flaky - it should not have been allowed in to begin with, and if we really need coverage of this it should be covered in some way that fakes out the clock. I assume, however, that this is covered by integration tests downstream.

@dnfield
Copy link
Contributor Author

dnfield commented Apr 30, 2021

Oh, one other problem with these timeouts - if they trigger in an unexpected way, the shell gets collected on the wrong thread, and the test process crashes with no backtrace.

Comment on lines -1306 to -1314
#if FLUTTER_RELEASE
// Our batch time is 1000ms. Hopefully the 800ms limit is relaxed enough to
// make it not too flaky.
ASSERT_TRUE(elapsed >= fml::TimeDelta::FromMilliseconds(800));
#else
// Our batch time is 100ms. Hopefully the 500ms limit is relaxed enough to
// make it not too flaky.
ASSERT_TRUE(elapsed <= fml::TimeDelta::FromMilliseconds(500));
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like you could get what you want by just removing this assert right? That would eliminate all timeout except the one that's for the whole test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just realizing that's the whole point of this test. It doesn't really test much without it =T

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting this LGTM, I'd double check with the author of it just to see if they can think of another way to assert what they want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Author is @liyuqian

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the more I think about this, this either needs to be a benchmark and run as one, or it needs to mock out the clock to assert how many ticks can pass before events are emitted.

Copy link
Member

@gaaclarke gaaclarke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo pinging that test's author.

@@ -1430,7 +1379,7 @@ TEST_F(ShellTest, WaitForFirstFrameTimeout) {

RunEngine(shell.get(), std::move(configuration));
fml::Status result =
shell->WaitForFirstFrame(fml::TimeDelta::FromMilliseconds(10));
shell->WaitForFirstFrame(fml::TimeDelta::Zero());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the assert below right? It seems to be expecting a timeout indeed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We're saying we want to wait zero ms now instead of waiting 10. Theoretically something could actually produce a frame in 10ms. Even this isn't bullet proof but it's better.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure that the intention is to time out as soon as possible, or to wait enough time? I wonder why 10 was chosen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very likely because that was short enough on the developer's machine to trigger the timeout. The problem is, on a fast enough machine or with enough improvements to the code under test, 10ms might not be long enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOW, this test is currently trying to wait 10ms of monotonic time and seeing if it can or can't do work in that time. But that's susceptible to flakes.

@@ -1762,12 +1762,23 @@ fml::Status Shell::WaitForFirstFrame(fml::TimeDelta timeout) {
"because it is responsible for generating the frame.");
}

// Check for overflow.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaaclarke - mind taking another look at this?

I had to change this because otherwise we're susceptible to overflows when adding timeout to std::chrono::steady_clock::now().

wait_for does that addition behind the scenes and I guess treats the overflow as UB. I'm using wait_until to avoid the steady clock advancing while we do other calcuations and overflowing even though we checked.

This is covered by tests now that the timeouts use max/zero values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this is an actual bug for us today - if someone were to pass a very large value into here, they'd get a timeout response due to overflow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like the API docs in the header file here https://github.com/flutter/engine/blob/master/shell/common/shell.h#L301 could use some more details or a warning about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a doc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I think an easier fix would just be to say auto duration = min(max_duration, desired_duration); and use wait_for. Since that time is so long, no one will ever be able to tell the difference in their lifetime. That removes a branch and is much easier to verify it is correct assuming what was there before was correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Though technically isn't std::min a branch? :) Anyway I agree the code is more readable/maintainable this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. We can't do that. wait_for is still going to add to stead_clock::now(), which has incremented since checked for overflow, and now might overflow again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do this:

  auto now = std::chrono::steady_clock::now();
  auto max_duration = std::chrono::steady_clock::time_point::max() - now;
  auto desired_duration = std::chrono::milliseconds(timeout.ToMilliseconds());

  std::unique_lock<std::mutex> lock(waiting_for_first_frame_mutex_);
  bool success = waiting_for_first_frame_condition_.wait_until(
      lock, now + std::min(desired_duration, max_duration),
      [&waiting_for_first_frame = waiting_for_first_frame_] {
        return !waiting_for_first_frame.load();
      });

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me too, as long as there isn't that branch I think it's more straightforward.

@@ -1762,12 +1762,23 @@ fml::Status Shell::WaitForFirstFrame(fml::TimeDelta timeout) {
"because it is responsible for generating the frame.");
}

// Check for overflow.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like the API docs in the header file here https://github.com/flutter/engine/blob/master/shell/common/shell.h#L301 could use some more details or a warning about this.

@@ -1410,8 +1357,7 @@ TEST_F(ShellTest, WaitForFirstFrameZeroSizeFrame) {

RunEngine(shell.get(), std::move(configuration));
PumpOneFrame(shell.get(), {1.0, 0.0, 0.0});
fml::Status result =
shell->WaitForFirstFrame(fml::TimeDelta::FromMilliseconds(1000));
fml::Status result = shell->WaitForFirstFrame(fml::TimeDelta::Zero());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be fml::TimeDelta::Max() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would make us wait forever and time out the test. I'm not sure how much value this test adds over the other test(s) that make sure we timeout if we don't wait long enough.

The problem is, even if we keep it at 1s, there's no good way of knowing whether we timed out because it fails to produce a frame by design, or we timed out because we were running on hardware slow enough that 1s isn't long enough to produce a frame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iskakaushik is in the blame for this, might have ideas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I missed the ASSERT below this line saying that the wait is expected to timeout. So, is this testing that PumpOneFrame doesn't do anything for a zero size frame, and that's why the wait will time out? Is there another way for PumpOneFrame to communicate that it isn't going to do anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that this test wants to assert that WaitForFirstFrame will time out if you pump a zero sized frame.

I think a better place to test this would be e.g. in the animator unittests, making sure that OnAnimatorDidDraw isn't called under this condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(that's where the shell would flip the wait condition)

@dnfield dnfield merged commit d462cfd into flutter:master Apr 30, 2021
@dnfield dnfield deleted the deflake branch April 30, 2021 21:51
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 1, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 1, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 1, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 1, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 1, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 1, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 1, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 1, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 2, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 2, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 3, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 3, 2021
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request May 3, 2021
zanderso pushed a commit to flutter/flutter that referenced this pull request May 3, 2021
* aac3658 Roll Skia from 14efdd3d50db to 3010f3d79193 (4 revisions) (flutter/engine#25857)

* f45f52f Move path part files to libraries (flutter/engine#25842)

* d96669f Roll Skia from 3010f3d79193 to 5276ba274b38 (1 revision) (flutter/engine#25859)

* 43994ab Use "blur_sigma" instead of "blur_radius" in Shadow. (flutter/engine#25760)

* 74ac010 Roll Dart SDK from 1e3e5efcd47e to 6397e8b91103 (18 revisions) (flutter/engine#25861)

* d462cfd Remove flake inducing timeouts (flutter/engine#25847)

* 258f63a Roll Skia from 5276ba274b38 to 31fddc376993 (14 revisions) (flutter/engine#25862)

* 1bf01a1 fuchsia: Reliably forward View insets (flutter/engine#25381)

* 2740062 Roll Skia from 31fddc376993 to 097263bb5089 (1 revision) (flutter/engine#25865)

* 4aae6b3 Roll Dart SDK from 6397e8b91103 to ee8eb0a65efa (1 revision) (flutter/engine#25869)

* d1f31c7 Roll Skia from 097263bb5089 to 5dfb3f40684b (1 revision) (flutter/engine#25870)

* 428ef10 deflake (flutter/engine#25864)

* 01b9db8 Roll Dart SDK from ee8eb0a65efa to 8c109a734bdc (1 revision) (flutter/engine#25872)

* e623c09 Roll Dart SDK from 8c109a734bdc to b98a1eec5eb5 (1 revision) (flutter/engine#25873)

* 7ce6226 Roll Dart SDK from b98a1eec5eb5 to 6ecae204598d (1 revision) (flutter/engine#25874)

* 554c24c Roll Skia from 5dfb3f40684b to 5c95bcb48b9b (1 revision) (flutter/engine#25876)

* 03bd2e3 Roll Dart SDK from 6ecae204598d to 3cc6cdab8eaf (2 revisions) (flutter/engine#25877)

* 53fdd89 Roll Skia from 5c95bcb48b9b to c779d432f336 (1 revision) (flutter/engine#25880)

* 98ee65b Roll Skia from c779d432f336 to ff8b52df55ff (2 revisions) (flutter/engine#25881)

* d84caa7 Roll Dart SDK from 3cc6cdab8eaf to b8f4018535fa (2 revisions) (flutter/engine#25885)

* 156760e Roll Skia from ff8b52df55ff to ec79349bad50 (1 revision) (flutter/engine#25886)

* 23cd8e5 Revert the Dart SDK to 1e3e5efcd47edeb7ae5a69e146c8ea0559305a98 (flutter/engine#25887)

* 578449f Fix crash when both FlutterFragmentActivity and FlutterFragment are destroyed and recreated (flutter/engine#25851)
gspencergoog pushed a commit to gspencergoog/engine that referenced this pull request May 5, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
4 participants