Allow Ffi calls to be marked as potentially blocking / exiting the isolate. #51261

Open
mkustermann opened this issue Feb 6, 2023 · 16 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. library-ffi

Comments

@mkustermann
Member

Some users are running into an issue where many isolates call out to C code that then blocks. This can cause the Dart app to stop working because of our limit on the maximum number of threads that can be active in an isolate group at any given time.

The limit is there to avoid having too many threads executing Dart code at the same time. Otherwise X threads may each hold a TLAB that still contains unallocated memory while thread X+1 tries to obtain a TLAB and fails, which causes it to trigger a GC (despite the other threads' TLABs still having unallocated memory).
=> Allowing an unbounded number of threads to enter an isolate group can lead to excessive triggering of GCs (despite free memory in other threads' TLABs).

See runtime/vm/heap/scavenger.h for the current calculation of the limit:

  // The maximum number of Dart mutator threads we allow to execute at the same
  // time.
  static intptr_t MaxMutatorThreadCount() {
    // With a max new-space of 16 MB and 512kb TLABs we would allow up to 8
    // mutator threads to run at the same time.
    const intptr_t max_parallel_tlab_usage =
        (FLAG_new_gen_semi_max_size * MB) / Scavenger::kTLABSize;
    const intptr_t max_pool_size = max_parallel_tlab_usage / 4;
    return max_pool_size > 0 ? max_pool_size : 1;
  }
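
To make the arithmetic in that comment concrete, here is a small standalone C++ sketch that mirrors the calculation. It is not the VM source: `kTLABSizeBytes` is assumed to be 512 KB based on the comment above, and the hypothetical `MaxMutatorThreadCountFor` takes the semi-space size in MB as a plain parameter instead of reading `FLAG_new_gen_semi_max_size`.

```cpp
#include <cstdint>

// Assumption: 512 KB TLABs, as stated in the scavenger.h comment.
constexpr std::int64_t kMB = 1024 * 1024;
constexpr std::int64_t kTLABSizeBytes = 512 * 1024;

// Mirrors MaxMutatorThreadCount(), with the flag value passed in explicitly.
inline std::int64_t MaxMutatorThreadCountFor(std::int64_t semi_max_size_mb) {
  // Number of TLABs that fit into one new-space semi-space.
  const std::int64_t max_parallel_tlab_usage =
      (semi_max_size_mb * kMB) / kTLABSizeBytes;
  // Only a quarter of those may be handed out to mutators at once.
  const std::int64_t max_pool_size = max_parallel_tlab_usage / 4;
  // Always allow at least one mutator.
  return max_pool_size > 0 ? max_pool_size : 1;
}
```

Under these assumptions a 16 MB semi-space yields 8 mutators, each additional 2 MB adds one more, and very small settings clamp to 1.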

We may consider adding a boolean flag to specify that an FFI call may be blocking / should exit the isolate.

// Static binding
@Native("sleep", exitIsolate: true)
external void sleep(int seconds);

// Dynamic binding
dylib.lookup().asFunction<...>(exitIsolate: true);

to automatically exit and re-enter the isolate to avoid custom C code like this:

auto isolate = Dart_CurrentIsolate();
Dart_ExitIsolate();
<... run blocking C Code, e.g. sleep() ...>
Dart_EnterIsolate(isolate);
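
One way to keep that exit/re-enter pairing from being skipped on an early return is a small RAII guard on the C++ side. The sketch below is only an illustration under stated assumptions: so that it compiles on its own, `Dart_Isolate` and the three `Dart_*` functions are replaced with trivial stand-ins in a `stub` namespace (in real code they would come from the Dart embedding header), and `IsolateExitScope` and `BlockingSleep` are hypothetical names.

```cpp
#include <chrono>
#include <thread>

namespace stub {
// Stand-ins for the real Dart C API, only so this sketch compiles standalone.
using Dart_Isolate = void*;
static Dart_Isolate g_current = nullptr;
inline Dart_Isolate Dart_CurrentIsolate() { return g_current; }
inline void Dart_ExitIsolate() { g_current = nullptr; }
inline void Dart_EnterIsolate(Dart_Isolate isolate) { g_current = isolate; }
}  // namespace stub

// Exits the current isolate on construction and re-enters it on destruction.
class IsolateExitScope {
 public:
  IsolateExitScope() : isolate_(stub::Dart_CurrentIsolate()) {
    stub::Dart_ExitIsolate();
  }
  ~IsolateExitScope() { stub::Dart_EnterIsolate(isolate_); }

  // Non-copyable: exactly one re-enter per exit.
  IsolateExitScope(const IsolateExitScope&) = delete;
  IsolateExitScope& operator=(const IsolateExitScope&) = delete;

 private:
  stub::Dart_Isolate isolate_;
};

// Hypothetical native function that Dart would call via FFI.
extern "C" void BlockingSleep(int milliseconds) {
  IsolateExitScope exit_scope;  // the thread is outside the isolate while blocked
  std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
}  // exit_scope's destructor re-enters the isolate here
```

While the guard is alive the thread has no current isolate, so the guarded region must not call Dart API functions that need one.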

See motivating use case: #51254

@mkustermann mkustermann added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. library-ffi labels Feb 6, 2023
@mkustermann
Member Author

The underlying issue is that new space (which we require for bump-allocation) doesn't scale with number of threads. The fact that this limitation would surface to the FFI API does seem a bit iffy.

We could devise a scheme where FFI calls will give up their TLAB on transitions to C and re-acquire it on the way back, and limit the number of outstanding TLABs instead of the number of active isolates. Though that would make transitions more heavyweight, and would make returning from C (as well as Dart C API calls) possibly blocking for an arbitrary amount of time. Seems less than ideal.

/cc @rmacnak-google

@rmacnak-google
Contributor

I think this can be for free for the uncontended case: What we could do is when a new mutator wants to enter the isolate and the limit has been reached, we can check if any existing mutators are in an ffi-exited safepoint state, CAS its safepoint state to one meaning it has been kicked out, causing the safepoint transition on the ffi-return to hit the slow path, and take its TLAB away. The safepoint transition slow path then has a new check if it needs to wait on the mutator count to re-enter as a mutator.

@dcharkes
Contributor

Though that would make returning from C (as well as Dart C API calls) possibly blocking for arbitrary amount of time.

It would still be compatible with Dart's semantics of synchronous code on an isolate running to completion before any other code is run on that isolate.

However, it would change the scheduling of which isolate runs once we have exhausted the maximum number of mutators in an isolate group, and that might be surprising. Do we have some kind of scheduling logic for that? @mkustermann

@mkustermann
Member Author

mkustermann commented Apr 3, 2023

I think this can be for free for the uncontended case: What we could do is when a new mutator wants to enter the isolate and the limit has been reached, we can check if any existing mutators are in an ffi-exited safepoint state, CAS its safepoint state to one meaning it has been kicked out, causing the safepoint transition on the ffi-return to hit the slow path, and take its TLAB away. The safepoint transition slow path then has a new check if it needs to wait on the mutator count to re-enter as a mutator.

That's an interesting idea.

I'm a little worried that doing this blindly can lead to situations where, e.g., the Flutter UI isolate does an FFI call and another thread then kicks the UI isolate out. When the FFI call on the UI isolate returns, it will take the slow path and block (which could freeze the Flutter UI).

This can happen to some extent today as well, but only at an event loop boundary (e.g. the Flutter UI isolate is idle, N threads enter the isolate group, and then the Flutter UI isolate cannot enter anymore and has to wait).

If one mutator has been kicked out and returns from an FFI call, then in the slow path it should be allowed to kick out another thread if that thread is in an FFI call. That would mean the system would work flawlessly irrespective of the number of threads, as long as no more than N threads are executing Dart code concurrently (which may be an OK restriction, as all Dart code being executed will eventually either go back to the event loop or do an FFI call, both of which are yield points). It will require some synchronization on both sides, though:

  • Calling native via ffi: If there's another thread waiting to execute Dart we need to notify it (could use similar mechanism as our existing "gc-safepoint-requested" bit which forces GeneratedToNative to slowpath)
  • Return from ffi call: Safepoint may have been stolen from the side, so we have to take slowpath and wait for mutator count (or kick another thread out if there's any in ffi-exited state)
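
As a rough illustration of the kick-out handshake described in this comment and the one quoted above, here is a toy C++ state machine built on a single atomic per thread. It is a sketch under assumptions, not the VM's safepoint code: the `MutatorState` values and all function names are hypothetical, and the actual waiting for a free mutator slot in the slow path is left to the caller.

```cpp
#include <atomic>

// Hypothetical per-thread states; the real VM encodes safepoint state differently.
enum class MutatorState { kRunningDart, kInFfi, kKickedOut };

struct Mutator {
  std::atomic<MutatorState> state{MutatorState::kRunningDart};
};

// Called by the owning thread right before it calls into C.
inline void EnterFfi(Mutator& self) {
  self.state.store(MutatorState::kInFfi, std::memory_order_release);
}

// Called by another thread that wants to enter when the limit is reached.
// Returns true if it took over the victim's slot (and TLAB).
inline bool TryKickOut(Mutator& victim) {
  MutatorState expected = MutatorState::kInFfi;
  return victim.state.compare_exchange_strong(expected,
                                              MutatorState::kKickedOut);
}

// Called by the owning thread when C returns. Returns true on the fast path
// (nobody kicked us out); false means the caller must take the slow path.
inline bool TryFastReturnFromFfi(Mutator& self) {
  MutatorState expected = MutatorState::kInFfi;
  return self.state.compare_exchange_strong(expected,
                                            MutatorState::kRunningDart);
}

// Slow path tail: call only after a mutator slot has been re-acquired,
// either by waiting or by kicking out another thread that is in FFI.
inline void FinishSlowReturn(Mutator& self) {
  self.state.store(MutatorState::kRunningDart, std::memory_order_release);
}
```

In this model the uncontended return is a single compare-and-swap that succeeds, and a thread only ends up in the slow path when another thread won the race with `TryKickOut`.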

@Piero512

Piero512 commented Apr 3, 2023

Hi. I have a complaint about this. If we're going to expose FFI developers to those kinds of isolate details, why can't we have some way for native code to at least return (synchronously) Dart objects that can be created through the Dart_CObject struct?

copybara-service bot pushed a commit that referenced this issue Apr 12, 2023
For applications that want to have arbitrary number of isolates call
into native code that may be blocking, we expose the API functions that
allows those native threads to exit an isolate before running
long/blocking code.

Without the ability to exit/re-enter isolate, one may experience
deadlocks as we have a fixed limit on the number of concurrently
executing isolates atm.

In the longer term we may find a way to do this automatically
with low overhead, see [0]. But since those API functions are quite
stable and we already expose e.g. `Dart_{Enter,Exit}Scope`, I don't
see a reason not to expose `Dart_{Enter,Exit}Isolate`.

[0] Issue #51261

Issue #51254

TEST=ffi{,_2}/dl_api_exit_enter_isolate_test

Change-Id: I91c772ca962fddb87919663fea07939a498fa205
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/292722
Commit-Queue: Martin Kustermann <[email protected]>
Reviewed-by: Daco Harkes <[email protected]>
Reviewed-by: Ryan Macnak <[email protected]>
copybara-service bot pushed a commit that referenced this issue Apr 12, 2023
This reverts commit a251281.

Reason for revert: FFI tests fail to link on Windows, fail to load on product-mode Android

Original change's description:
> [vm] Expose Dart_{CurrentIsolate,ExitIsolate,EnterIsolate}
>
> For applications that want to have arbitrary number of isolates call
> into native code that may be blocking, we expose the API functions that
> allows those native threads to exit an isolate before running
> long/blocking code.
>
> Without the ability to exit/re-enter isolate, one may experience
> deadlocks as we have a fixed limit on the number of concurrently
> executing isolates atm.
>
> In the longer term we may find a way to do this automatically
> with low overhead, see [0]. But since those API functions are quite
> stable and we already expose e.g. `Dart_{Enter,Exit}Scope`, I don't
> see a reason not to expose `Dart_{Enter,Exit}Isolate`.
>
> [0] Issue #51261
>
> Issue #51254
>
> TEST=ffi{,_2}/dl_api_exit_enter_isolate_test
>
> Change-Id: I91c772ca962fddb87919663fea07939a498fa205
> Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/292722
> Commit-Queue: Martin Kustermann <[email protected]>
> Reviewed-by: Daco Harkes <[email protected]>
> Reviewed-by: Ryan Macnak <[email protected]>

Change-Id: I05ad5b9ce24754a68693160e470f8eb71a812c75
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/294860
Auto-Submit: Ryan Macnak <[email protected]>
Commit-Queue: Rubber Stamper <[email protected]>
Bot-Commit: Rubber Stamper <[email protected]>
copybara-service bot pushed a commit that referenced this issue Apr 13, 2023
For applications that want to have arbitrary number of isolates call
into native code that may be blocking, we expose the API functions that
allows those native threads to exit an isolate before running
long/blocking code.

Without the ability to exit/re-enter isolate, one may experience
deadlocks as we have a fixed limit on the number of concurrently
executing isolates atm.

In the longer term we may find a way to do this automatically
with low overhead, see [0]. But since those API functions are quite
stable and we already expose e.g. `Dart_{Enter,Exit}Scope`, I don't
see a reason not to expose `Dart_{Enter,Exit}Isolate`.

Difference to original CL:

  Do use STL synchronization primitives (as the ones in runtime/bin
  are not always available in shared libraries)


[0] Issue #51261

Issue #51254

TEST=ffi{,_2}/dl_api_exit_enter_isolate_test

Change-Id: Id817e8d4edb3db35f029248d62388cbd0682001d
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/294980
Reviewed-by: Daco Harkes <[email protected]>
@blakeAspentech

@jonahwilliams this issue seems to have existed for a long time, yet I'm not seeing any workaround in this issue other than direct C code mentioned by @mkustermann . I haven't seen any documentation anywhere that 8+ isolates permanently freezes the whole app on iOS and Android (as per my reproduction). If there is no fix planned for this crash, still present on flutter 3.29.0 pre, then there should be documentation added to flutter isolate limitations.

Is there a plan of any sort to introduce a workaround, resolution, or documentation?

@mraleph
Member

mraleph commented Feb 19, 2025

@blakeAspentech

The limitation is documented here: https://dart.dev/language/concurrency#synchronous-blocking-communication-between-isolates.

Looking at your reproduction I don't think 2a08770 is going to help you in any way though. You are spawning isolates which never yield and just do busy work forever. That requires #54687 to be addressed fully (rather than the current workaround we put in place).

I don't know how realistic this reproduction is though - 8 concurrent threads doing busy work in Dart on a mobile device is not something that is very common. Do you have a specific use case for that?

That being said, I think we can solve the UI locking issue without solving #54687: we should probably keep 1 TLAB slot reserved for the UI isolate at all times.
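
A reserved slot of that kind could be modeled roughly as follows. This is a hedged sketch of the idea only, not how the VM tracks mutators: `MutatorSlots`, `Slot`, and the `is_ui_isolate` flag are hypothetical names, and the sketch assumes the UI isolate only ever uses its reserved slot.

```cpp
#include <mutex>

enum class Slot { kNone, kGeneral, kUiReserved };

class MutatorSlots {
 public:
  // `total` is the overall mutator limit; one slot is set aside for the UI.
  explicit MutatorSlots(int total)
      : general_free_(total > 1 ? total - 1 : 0) {}

  // Returns which kind of slot was granted, or kNone if the caller must wait.
  Slot TryAcquire(bool is_ui_isolate) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (is_ui_isolate) {
      if (!ui_free_) return Slot::kNone;
      ui_free_ = false;
      return Slot::kUiReserved;
    }
    if (general_free_ == 0) return Slot::kNone;
    --general_free_;
    return Slot::kGeneral;
  }

  // Gives back a slot previously returned by TryAcquire.
  void Release(Slot slot) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (slot == Slot::kUiReserved) ui_free_ = true;
    if (slot == Slot::kGeneral) ++general_free_;
  }

 private:
  std::mutex mutex_;
  int general_free_;
  bool ui_free_ = true;
};
```

With a limit of 8, background isolates can occupy at most 7 slots in this model, so the UI isolate can always enter even when all of them are busy.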

@blakeAspentech

blakeAspentech commented Feb 19, 2025

@mraleph I unfortunately do have a use case.

Our app uses a localhost webview to load a mapbox-gl instance. That uses 1 isolate for hosting, 1 for data processing, and 1 for fetching and syncing data with a server. We then have 3 different syncing isolates handling constant data inflow/outflow from different sources. We have an additional one specifically for auth event handling which brings us to 8-- causing us to now see this issue immediately upon login (although not present on windows platform, for some reason).

Dart Isolate Docs specify that a user could "have hundreds of isolates running concurrently and making progress". Is that statement false, or are we as an entire dev team using isolates wrong?

@Piero512

Piero512 commented Feb 19, 2025

I am not a Flutter dev expert, but I don't see how you are keeping all 8 isolates busy without yielding once.

But your situation is definitely complicated. Perhaps you would benefit from building your engine with a larger maximum number of mutators?

That would definitely be a way to tell whether you're hitting this bug rather than something else.

@blakeAspentech

We don't compile our own dart sdk, we use prepackaged flutter dart. I'm not sure what you mean in terms of isolates yielding-- we have all of our isolates waiting on refresh timer triggers to sync roughly once every 30 seconds. Are you suggesting that instead of having all of them run independently, we manually run a bunch of isolate.pause() and isolate.resume()? otherwise, not sure what you mean about the isolates yielding

@Piero512

Yielding happens when execution reaches an await.

What do you use for waiting?

You should use something like Stream.periodic(Duration(seconds: 30)) and await for it in a loop.

@blakeAspentech

@Piero512

we use Timer.periodic, and definitely have awaits all over.

void startDataSync() {
  // Call sync method every 30 seconds
  if (_dataSyncTimer?.isActive ?? false) {
    return;
  }
  servicesConfig.syncInterface.loadSyncTimes();
  _dataSyncTimer = makePeriodicTimer(syncInterval, dataSyncCallBack, fireNow: true);
}

void dataSyncCallBack(final Timer timer) async {
  try {
    await servicesConfig.syncInterface.sync();
  } on Object catch (e, st) {
    _logger.severe("Unhandled exception in sync process!", e, st);
  }
  try {
    var syncMap = await servicesConfig.syncInterface.getSyncUrlMap();
    await servicesConfig.attachmentInterface.manageAttachments(syncMap);
  } on Object catch (e, st) {
    _logger.severe("Unhandled exception managing attachments", e, st);
  }
}

Timer? makePeriodicTimer(final Duration duration, final Function(Timer) callBack, {final bool fireNow = false}) {
  return runZonedGuarded(() {
    Timer timer = Timer.periodic(syncInterval, callBack);
    if (fireNow) {
      callBack(timer);
    }
    return timer;
  }, (final o, final st) => _logger.severe("Uncaught error in sync isolate", o, st));
}

@mraleph
Member

mraleph commented Feb 20, 2025

@blakeAspentech if your isolates are doing work in slices (due to asynchrony) rather than just spinning busily (like the repro you made) then nothing should lock up. The limitation is on the number of isolates running at the same time. Isolates drain their message queues and once they have no pending messages they will yield their slot and another isolate can take it. Do you have a crash log from iOS killing your app, to see what threads are doing at the moment when things lock up?
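
The difference between working in slices and spinning can be shown with a tiny deterministic simulation. This is a toy model under assumptions, not the VM scheduler: time advances in ticks, isolates grab free slots in index order, an isolate gives its slot back once its pending work reaches zero, and `SimIsolate` and `Simulate` are hypothetical names.

```cpp
#include <vector>

struct SimIsolate {
  // Ticks of synchronous Dart work left; -1 models an isolate that never yields.
  explicit SimIsolate(int work) : pending_work(work) {}
  int pending_work;
  bool has_slot = false;
  int ticks_run = 0;
};

// Advances the model by `ticks` steps with `slots` mutator slots available.
inline void Simulate(std::vector<SimIsolate>& isolates, int slots, int ticks) {
  int free_slots = slots;
  for (int t = 0; t < ticks; ++t) {
    // Isolates with pending work try to take a free slot, in index order.
    for (auto& iso : isolates) {
      if (!iso.has_slot && iso.pending_work != 0 && free_slots > 0) {
        iso.has_slot = true;
        --free_slots;
      }
    }
    // Every isolate that holds a slot runs for one tick.
    for (auto& iso : isolates) {
      if (!iso.has_slot) continue;
      ++iso.ticks_run;
      if (iso.pending_work > 0 && --iso.pending_work == 0) {
        iso.has_slot = false;  // queue drained: yield the slot
        ++free_slots;
      }
    }
  }
}
```

With two slots, two isolates created with `-1` keep both slots forever and a third isolate never runs; if the first two instead have finite work, they release their slots when done and the third one gets its turn.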

@blakeAspentech

blakeAspentech commented Feb 20, 2025

@mraleph I have updated my reproduction demo to create 8 isolates at once, using similar logic to above. It does freeze when the isolates are working, despite awaits. Is there a way around this behavior? I think it might be what we are seeing on our main app.

While I'd love to provide you a crash log from our main app, which is still a hard no-recover freeze, I cannot do so due to security practices. I'm going to continue to try and make this repo app mimic the behavior of our main app as much as possible.

@Piero512

Maybe you should file a different issue now, because your updated repro demo still exhibits completely synchronous behavior even though you put async everywhere you can.

Probably best to do it in the flutter repo so we can also discard platform channels blocking the UI isolate and stuff...

@mraleph
Member

mraleph commented Feb 21, 2025

While I'd love to provide you a crash log from our main app, which is still a hard no-recover freeze I cannot do so due to security practices.

@blakeAspentech You can remove all the names and values which correspond to your app from the crash - replace them all with XXX - I don't really care about these. I only care about: symbols corresponding to OS internals, libc and Flutter.framework and whether code came from some native library or from Dart (App.framework). What I want to see for each thread is the following:

  • whether it has Dart code on the stack (e.g. any App.framework frames)
  • whether top frame is Dart or native frame, and what kind of native frame (FFI or Flutter engine internals or Dart VM internals)
  • whether thread is sleeping (e.g. on mutex) or actively running

That should be enough to get a good picture, and it will not reveal anything about your app.

Regarding the updated reproduction you have provided: it still exhibits the same problem. Isolates perform a large chunk of synchronous work (~5s) in Dart [1]. So if all of them go at it at the same time, they exhaust the mutator slots, making the UI isolate unable to run. Is this representative of what your mobile application does? E.g. do all auxiliary isolates go at it "all gas, no brakes" 5s at a time? Just to put this in perspective: iPhone 16 has 6 cores and of those only 2 are fast; top-of-the-line Android phones usually come with 8 cores, but again only 2 are actually fast (sometimes 2 of the remaining six are also okayish and sometimes all 6 remaining cores are slow). There is simply no space on mobile devices for 8 threads doing busy work for a prolonged amount of time.

Here is the lay of the land. We want to eventually lift the limitation on the number of concurrent mutators; #54687 is tracking that. However, it is not going to happen soon, and I fundamentally don't even believe that it is something mobile apps should even attempt to do.

This means you have a choice:

  • I think you should look at multiplexing your isolates, e.g. put some of this work together into a single isolate.
  • You could ask Flutter folks to allow configuring --new_gen_semi_max_size (currently it is not on the list of Dart runtime flags which you could pass into the Dart VM via Flutter). The default value for 64-bit platforms is 16 (MB). Every 2 MB adds one more mutator slot (so with 18 MB you will be able to have 9 isolates spinning in Dart at the same time, with 20 MB 10 isolates, and so on).
  • You could build your own Flutter engine with an increased --new_gen_semi_max_size.

@Piero512 Please don't suggest out of nowhere that people file different issues, especially when people from the Dart team are already discussing things on the issue. There is no reason to split the discussion into additional issues which will just need to be closed as duplicates anyway. If we feel that the discussion should go somewhere else, we will suggest that ourselves. Thanks for understanding.

Footnotes

  1. Work being done synchronously in Dart is the key here. If an isolate goes into native code via FFI and the long-running chunk of work actually happens there, then 2a08770 should fix it.
