semaphore.h: handle spurious wakeups in TimedWait() on Linux #1021

Merged (4 commits) on Jul 11, 2022

dconeybe (Contributor) commented Jul 9, 2022

This fixes a latent bug where Future::Wait(int timeout_milliseconds) would occasionally return prematurely, when neither the timeout had expired nor the Future had completed. The cause was the implementation of Semaphore::TimedWait(int milliseconds), which calls sem_timedwait() on Linux and Android and neglected to check whether errno was EINTR, in which case the wait should be restarted.

This bug surfaced as the integration tests for Firestore's TransactionTest.TestMaxAttempts flakily failing due to a call to Future.Await(int timeout_milliseconds) returning as if it had timed out when, in fact, no timeout had occurred.

Note that this fix only affects Linux and Android (which runs Linux under the hood).
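
For readers unfamiliar with the EINTR case, the retry idea described above can be sketched roughly as follows. This is a minimal illustration for Linux, not the code from this PR: the function name TimedWaitRetryingOnEintr and its signature are hypothetical, and all errors other than EINTR are simply reported as failure.

```cpp
#include <cerrno>
#include <semaphore.h>
#include <time.h>

// Hypothetical free function (not the SDK's Semaphore::TimedWait).
// Waits on `sem` until the absolute CLOCK_REALTIME time `deadline`.
// Returns true if the semaphore was acquired, false on timeout or error.
bool TimedWaitRetryingOnEintr(sem_t* sem, const timespec& deadline) {
  while (true) {
    if (sem_timedwait(sem, &deadline) == 0) {
      return true;  // The semaphore was acquired.
    }
    if (errno == EINTR) {
      // A signal handler interrupted the wait. The deadline is absolute,
      // so retrying with the same value does not extend the total wait.
      continue;
    }
    // ETIMEDOUT, EINVAL, or another error: report failure to the caller.
    return false;
  }
}
```

Because sem_timedwait() takes an absolute deadline rather than a relative duration, the loop can reuse the same timespec on every retry without recomputing the remaining time.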

github-actions (bot) commented Jul 9, 2022

❌  Integration test FAILED

Requested by @dconeybe on commit 26e918b
Last updated: Mon Jul 11 12:05 PDT 2022
View integration test log & download artifacts

Failures: firestore
Configs:
  [TEST] [FAILURE] [Android] [1/3 os: windows] [1/2 android_device: android_target]
  (5 failed tests)
    ServerTimestampTest.TestServerTimestampsCanReturnPreviousValueOfDifferentType
    ServerTimestampTest.TestServerTimestampsWorkViaTransactionUpdate
    ServerTimestampTest.TestServerTimestampsWorkViaUpdate
    WriteBatchTest.TestBatchesCommitAtomicallyRaisingCorrectEvents
    WriteBatchTest.TestBatchesFailAtomicallyRaisingCorrectEvents
  [TEST] [FLAKINESS] [Android] [1/3 os: ubuntu] [1/2 android_device: android_target]
  (1 failed tests)
    CRASH/TIMEOUT

Add flaky tests to go/fpl-cpp-flake-tracker

dconeybe marked this pull request as ready for review on July 11, 2022 14:40
dconeybe merged commit 26e918b into main on Jul 11, 2022
dconeybe deleted the dconeybe/SemaphoreTimedWaitFix branch on July 11, 2022 16:17

    // Return failure, since the timeout expired.
    return false;
  case EINVAL:
    assert("sem_timedwait() failed with EINVAL" == 0);
Contributor

Just making sure, you want this to NOT actually assert in release builds, yes? (the default assert behavior?)
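
For context on the question above: the standard assert() macro expands to nothing when NDEBUG is defined, which is typical for release builds. A minimal sketch of what that means for an EINVAL branch follows; HandleWaitFailure is a hypothetical helper, the return values after the assert are assumptions, and the assertion is spelled `false && "message"` here rather than the `"message" == 0` form used in the PR.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical helper (not the SDK's code): maps the errno value of a
// failed wait to the boolean result returned to the caller.
bool HandleWaitFailure(int error) {
  switch (error) {
    case ETIMEDOUT:
      // Return failure, since the timeout expired.
      return false;
    case EINVAL:
      // Debug builds abort here with a diagnostic. When NDEBUG is defined,
      // assert() expands to nothing, so execution continues to the return
      // below and the caller only sees an ordinary failure.
      assert(false && "wait failed with EINVAL");
      return false;
    default:
      return false;
  }
}
```

If the failure should also be visible in release builds, the branch would need explicit logging or error handling in addition to the assert.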

@@ -172,7 +172,30 @@ class Semaphore {
     return WaitForSingleObject(semaphore_, milliseconds) == 0;
 #else  // not windows and not mac - should be Linux.
     timespec t = internal::MsToAbsoluteTimespec(milliseconds);
-    return sem_timedwait(semaphore_, &t) == 0;
+    while (true) {
Contributor

Would it be possible to add a test exercising this failure/fix to semaphore_test.cc? (Or, even better, does re-enabling the disabled MultithreadedStressTest in that file now work?)

Contributor, Author

I'll take a look.

Contributor, Author

Unit tests added in #1036 and #1037

dconeybe (Contributor, Author) commented

Vindication! One of the nightly test runs failed with this assertion failure:

semaphore.h:189: bool firebase::Semaphore::TimedWait(int): Assertion `"sem_timedwait() failed with EINVAL" == 0' failed.

https://github.com/firebase/firebase-cpp-sdk/runs/7338303645

According to https://linux.die.net/man/3/sem_timedwait, EINVAL occurs in one of two cases:

  • sem is not a valid semaphore.
  • the value of abs_timeout.tv_nsec is less than 0, or greater than or equal to 1000 million.
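
To illustrate the second case: any helper that adds a millisecond offset to a timespec has to carry nanosecond overflow into tv_sec, or sem_timedwait() will reject the result with EINVAL. The sketch below shows one way to keep tv_nsec in the valid range. AddMilliseconds is a hypothetical function written for this illustration; it is not the SDK's internal::MsToAbsoluteTimespec, whose implementation is not shown in this conversation.

```cpp
#include <cstdint>
#include <time.h>

// Hypothetical helper: returns `base` shifted by `milliseconds`, with
// tv_nsec normalized into the range [0, 1'000'000'000).
timespec AddMilliseconds(timespec base, int64_t milliseconds) {
  constexpr int64_t kNanosPerSecond = 1000000000;
  constexpr int64_t kNanosPerMilli = 1000000;
  constexpr int64_t kMillisPerSecond = 1000;

  // Split the offset into whole seconds and a sub-second remainder.
  int64_t nanos = static_cast<int64_t>(base.tv_nsec) +
                  (milliseconds % kMillisPerSecond) * kNanosPerMilli;
  int64_t seconds = static_cast<int64_t>(base.tv_sec) +
                    milliseconds / kMillisPerSecond +
                    nanos / kNanosPerSecond;
  nanos %= kNanosPerSecond;

  // A negative remainder (possible with negative offsets) borrows a second.
  if (nanos < 0) {
    nanos += kNanosPerSecond;
    --seconds;
  }

  timespec result;
  result.tv_sec = static_cast<time_t>(seconds);
  result.tv_nsec = static_cast<long>(nanos);
  return result;
}
```

With this normalization, adding 250 ms to a time whose tv_nsec is 900,000,000 rolls over into the next second instead of producing an out-of-range nanosecond field.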
dconeybe added a commit that referenced this pull request Jul 20, 2022
firebase locked and limited conversation to collaborators on Aug 11, 2022
Labels: tests: failed (This PR's integration tests failed.)
3 participants: @dconeybe, @jonsimantov, @DellaBitta