[Proxying] Use a dedicated worker to pass messages between threads #18563

tlively · 2023-01-20T06:07:29Z

Rather than forwarding messages from one pthread to another through the main
thread, forward them through a dedicated worker, decreasing message latency in
cases where the main thread is busy. The new message relay worker is spawned via
a data URL so there are no new JS files for users to distribute. Whenever a
pthread is created, a new MessageChannel is created alongside it, with one
MessagePort sent to the relay and the other MessagePort sent to the pthread's
worker. The relay receives messages on these MessagePorts and forwards them on
to their recipients' MessagePorts.
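
For readers skimming the description, the per-thread channel setup can be sketched roughly as follows. This is only an illustration, not the code in the PR: `attachRelayChannel` and the `relayPort` command are hypothetical names, the `create` command is borrowed from the diff under review, and a runtime with a global `MessageChannel` (browsers, recent Node) is assumed.

```javascript
// Hypothetical helper: `worker` and `relay` only need a postMessage() method.
function attachRelayChannel(worker, relay, threadId) {
  const channel = new MessageChannel();
  // One end of the channel is transferred to the pthread's worker...
  worker.postMessage({ cmd: 'relayPort', port: channel.port1 }, [channel.port1]);
  // ...and the other end goes to the relay, tagged with the thread it serves.
  relay.postMessage(
    { cmd: 'create', thread: threadId, port: channel.port2 },
    [channel.port2]
  );
  return channel;
}
```

In the PR itself the worker's port travels with the existing thread start-up message rather than a separate command; the sketch separates the two only for readability.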

It is possible for the relay to receive a message for a thread for which the
relay has no MessagePort. This can happen when pthread_create returns a
pthread_t before the main thread has finished asynchronously spinning up a new
worker to run the thread. The spawning thread may then immediately proxy work to
the new thread, causing a message to be sent to the relay before the main thread
has notified the relay of the new thread's existence and sent the relay the new
thread's MessagePort. When this happens the relay buffers the message and
forwards it along once it has received the recipient's MessagePort.
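
A minimal sketch of the buffering behavior described above, assuming a port is any object with `postMessage()`. The names `threadPorts`, `bufferedMessages`, and `targetThread` come from the diff hunks in this review; `createRelay` and its method names are hypothetical.

```javascript
// Hypothetical relay state; not the actual worker source from the PR.
function createRelay() {
  const threadPorts = new Map();      // thread id -> MessagePort
  const bufferedMessages = new Map(); // thread id -> messages awaiting a port

  return {
    // The main thread announces a new thread and hands over its port.
    create(thread, port) {
      threadPorts.set(thread, port);
      const pending = bufferedMessages.get(thread);
      if (pending) {
        // Flush anything that arrived before the port did, in order.
        pending.forEach((msg) => port.postMessage(msg));
        bufferedMessages.delete(thread);
      }
    },
    // A message arrived from some sender's port.
    forward(msg) {
      const port = threadPorts.get(msg.targetThread);
      if (port) {
        port.postMessage(msg);
        return;
      }
      // No port yet: hold the message until create() is called.
      if (!bufferedMessages.has(msg.targetThread)) {
        bufferedMessages.set(msg.targetThread, []);
      }
      bufferedMessages.get(msg.targetThread).push(msg);
    },
    // The main thread reports that a thread was destroyed.
    destroy(thread) {
      threadPorts.delete(thread);
      bufferedMessages.delete(thread);
    },
  };
}
```

Messages sent before `create()` are delivered in arrival order once the port shows up, which is the property the spawning thread relies on when it proxies work immediately after `pthread_create` returns.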

When a pthread exits or is cancelled, it closes its MessagePort so that no
further messages will be received on it, then the main thread notifies the relay
that the thread has been destroyed so the relay can release its resources for
that thread. Messages that make it to the relay after the exiting thread has
closed its MessagePort will be silently dropped or sent to the next thread
spawned with the same ID, but we have always considered it user error to proxy
to an exiting thread, so it's unclear whether doing anything better in that case
would be worth it.

tlively · 2023-01-20T06:17:22Z

src/library_pthread.js

+#if ENVIRONMENT_MAY_BE_NODE
+              (ENVIRONMENT_IS_NODE ?
+               `
+(await import('node:worker_threads'))


Unfortunately I can't put this header in proxy_worker.js because Chrome complains about the existence of the import token, even if it isn't executed. Ideas for cleaner workarounds very welcome.

kripken · 2023-01-23T20:19:11Z

src/proxy_broker.js

+    let buffered = bufferedMessages.get(thread);
+    if (buffered === undefined) {
+      buffered = [];
+      bufferedMessages.set(thread, buffered);
+    }
+    buffered.push(msg);


JS has map.has which is shorter for stuff like this:

Suggested change

-    let buffered = bufferedMessages.get(thread);
-    if (buffered === undefined) {
-      buffered = [];
-      bufferedMessages.set(thread, buffered);
-    }
-    buffered.push(msg);
+    if (!bufferedMessages.has(thread)) {
+      bufferedMessages.set(thread, []);
+    }
+    bufferedMessages.get(thread).push(msg);

But it may be slightly less efficient OTOH, I'm not sure.

Yeah, I was trying to minimize the number of lookups to be 1 or 2 rather than 2 or 3. (I was annoyed that there's no way to do this in 1 lookup like I could in C++.) Do you think minimizing the code is more important than minimizing the lookups?

Hmm, in general I'd guess this is not very hot code (we don't pass that many messages using postMessage). So maybe size and clarity matter more?

kripken · 2023-01-23T20:24:03Z

src/worker.js

@@ -284,6 +287,7 @@ function handleMessage(e) {
      }
    } else if (e.data.cmd === 'cancel') { // Main thread is asking for a pthread_cancel() on this thread.
      if (Module['_pthread_self']()) {
+        closeProxyBrokerPort();


I think these need to go through Module, like the calls around them. worker.js is a separate file from the main JS.

Or, we can just add a proxy closing function here perhaps if it is short and trivial?

kripken · 2023-01-23T20:25:01Z

src/library_pthread.js

@@ -1094,8 +1135,10 @@ var LibraryPThread = {
  _emscripten_notify_task_queue: function(targetThreadId, currThreadId, mainThreadId, queue) {
    if (targetThreadId == currThreadId) {
      setTimeout(() => executeNotifiedProxyingQueue(queue));
+    } else if (targetThreadId == mainThreadId) {
+      postMessage({'cmd' : 'processProxyingQueue', 'queue' : queue});


Why is the main thread special here? (Is this just an optimization?)

Yeah, in principle the main thread could listen only on the message broker's channel for messages from all other threads, but since it already listens directly to every thread and the code isn't that complex, I figured it would be better to keep messages to the main thread as 1 hop instead of 2.

Perhaps worth a comment?
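
To make the three cases explicit, here is a rough sketch of the routing decision as a pure function. `chooseRoute` and its return values are hypothetical and exist only to illustrate the branches discussed in this thread.

```javascript
// Hypothetical helper mirroring the branches of _emscripten_notify_task_queue.
function chooseRoute(targetThreadId, currThreadId, mainThreadId) {
  if (targetThreadId === currThreadId) {
    return 'self';   // run locally on a later turn (setTimeout in the diff)
  }
  if (targetThreadId === mainThreadId) {
    return 'direct'; // 1 hop: the main thread already listens to every worker
  }
  return 'relay';    // 2 hops: sender -> relay -> target pthread
}
```

Keeping the main thread on the direct path means only pthread-to-pthread messages pay for the extra hop through the relay.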

sbc100 · 2023-01-23T21:44:21Z

I wonder if "broker" is the right word here? How about "relay" instead?

tlively · 2023-01-26T00:09:54Z

I wonder if "broker" is the right word here? How about "relay" instead?

Sure, I'll change the name to "messageRelay"

tlively · 2023-01-26T06:33:30Z

I'm pretty sure the tests are failing because I never terminate the message relay worker. I tried to figure out where a good place to do that is, but I couldn't find one. It seems that we shouldn't terminate it if we should keep the runtime alive, but AFAICT we never explicitly decide to stop keeping the runtime alive as we terminate the process. @sbc100 or @kripken, do you have suggestions about when and where we should terminate the message relay?

kripken · 2023-01-26T18:44:53Z

What do you mean by "terminate?" If it's waiting in the event loop, I think it's fine to leave it. It will be cleaned up by the browser when the tab closes.

sbc100 · 2023-01-26T18:54:30Z

I'm pretty sure the tests are failing because I never terminate the message relay worker. I tried to figure out where a good place to do that is, but I couldn't find one. It seems that we shouldn't terminate it if we should keep the runtime alive, but AFAICT we never explicitly decide to stop keeping the runtime alive as we terminate the process. @sbc100 or @kripken, do you have suggestions about when and where we should terminate the message relay?

How about doing it as part of terminateAllThreads?

sbc100 · 2023-01-26T18:56:46Z

src/library_pthread.js

@@ -97,7 +98,6 @@ var LibraryPThread = {
      }
 #endif
    },
-


Why delete this empty line but include one before initMessageRelay?

Oops, will be more consistent here.

sbc100 · 2023-01-26T18:59:29Z

src/library_pthread.js

+        function handleMessage(msg) {
+          const thread = msg.data.targetThread;
+          const port = threadPorts.get(thread);
+          if (port !== undefined) {


Can this just be if (port)?

sbc100 · 2023-01-26T19:01:38Z

src/library_pthread.js

+              bufferedMessages.delete(thread);
+            }
+            return;
+          }


else if instead of return here? .. and then maybe assert in the final else?

We don't have assert available in this worker, but I will print an error at least.

sbc100 · 2023-01-26T19:03:22Z

src/library_pthread.js

+    }
+#else
+    MsgChannel = MessageChannel;
+#endif


Can we avoid declaring a new name here and just polyfill on node using var MessageChannel = require ...?

(or global.MessageChannel = require ...)

tlively · 2023-01-26T19:15:40Z

The browser already works fine, I think. Specifically I'm seeing that node doesn't exit, but I traced everything that happened at process exit and it never actually calls terminateAllThreads, so killing the worker in terminateAllThreads wouldn't solve the problem.

sbc100 · 2023-01-26T19:24:10Z

The browser already works fine, I think. Specifically I'm seeing that node doesn't exit, but I traced everything that happened at process exit and it never actually calls terminateAllThreads, so killing the worker in terminateAllThreads wouldn't solve the problem.

Are you building with EXIT_RUNTIME? I know @RReverser recently did some work to make threads exit correctly, even when EXIT_RUNTIME is not used. Check out #18305.

tlively · 2023-01-26T20:20:20Z

I'm not building with EXIT_RUNTIME, but given that Node exited without that option before this PR, it would be a serious regression to start requiring EXIT_RUNTIME to get Node to exit.

sbc100

Nice! I'm surprised how simple this turned out to be. The only gross part is the data URL stuff.

sbc100 · 2023-01-28T22:15:29Z

src/library_pthread.js

@@ -104,6 +104,8 @@ var LibraryPThread = {
      // things.
      PThread['receiveObjectTransfer'] = PThread.receiveObjectTransfer;
      PThread['threadInitTLS'] = PThread.threadInitTLS;
+      PThread['receiveMessageRelayPort'] = PThread.receiveMessageRelayPort;
+      PThread['closeMessageRelayPort'] = PThread.closeMessageRelayPort;


We should probably figure out a way to do this that isn't so hacky (one day).

How about just calling these messageRelayClose and messageRelayReceive? (Probably just my personal preference for shorter names, so feel free to ignore.)

Agree that shorter names would be nicer, but I think it's useful to clarify that we're receiving the port and not an actual message from the relay.

sbc100 · 2023-01-28T22:16:51Z

src/library_pthread.js

@@ -586,6 +588,11 @@ Object.assign(global, {
      return PThread.unusedWorkers.pop();
    },

+    receiveMessageRelayPort: function(port) {
+      assert(ENVIRONMENT_IS_PTHREAD);


Wrap asserts in #if ASSERTIONS

sbc100 · 2023-01-28T22:18:57Z

src/library_pthread.js

+      if (ENVIRONMENT_IS_NODE) {
+        // TODO: Node 16+ has btoa, so remove this when we drop support for
+        // older Nodes.
+        global.btoa = (s) => { return Buffer.from(s).toString('base64'); };


just (s) => Buffer.from(s).toString('base64'); ?

How about wrapping this in if (!global.btoa) ? We might even consider using our src/polyfill/* mechanism?

Actually, I recently learned that I can use encodeURIComponent here and avoid all this btoa nonsense entirely.
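
A minimal sketch of that alternative, assuming the relay's source is available as a plain string. `workerSourceToDataUrl` is a hypothetical helper name and the `text/javascript` media type is my choice for the illustration.

```javascript
// Hypothetical helper: build a data URL from worker source without btoa.
function workerSourceToDataUrl(source) {
  // Percent-encoding keeps the URL valid without a base64 step, so nothing
  // needs to be polyfilled on Nodes that lack btoa.
  return 'data:text/javascript,' + encodeURIComponent(source);
}

// Example source string; the real relay code is not reproduced here.
const exampleSource = 'onmessage = (e) => postMessage(e.data);';
const exampleUrl = workerSourceToDataUrl(exampleSource);
```

In a browser the resulting string can be passed to `new Worker(...)` directly; elsewhere in this review it is noted that Node wants data URLs wrapped in a `URL` object.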

sbc100 · 2023-01-28T22:21:19Z

src/library_pthread.js

+      global.MessageChannel = require('worker_threads').MessageChannel;
+    }
+#endif
+    var channel = new MessageChannel();


Maybe call this relayChannel?

sbc100 · 2023-01-28T22:23:11Z

src/library_pthread.js

-    worker.postMessage(msg, threadParams.transferList);
+    worker.postMessage(msg, threadParams.transferList.concat([channel.port1]));
+    PThread.messageRelay.postMessage({
+      'cmd': 'create',


Perhaps add a comment here, such as "Send one end of the relay channel to the newly created thread, and the other end to the messageRelay worker"?

sbc100 · 2023-01-28T22:23:55Z

src/library_pthread.js

@@ -1094,8 +1135,10 @@ var LibraryPThread = {
  _emscripten_notify_task_queue: function(targetThreadId, currThreadId, mainThreadId, queue) {
    if (targetThreadId == currThreadId) {
      setTimeout(() => executeNotifiedProxyingQueue(queue));
+    } else if (targetThreadId == mainThreadId) {
+      postMessage({'cmd' : 'processProxyingQueue', 'queue' : queue});


Perhaps worth a comment?

sbc100 · 2023-01-28T22:25:32Z

src/shell.js

+   * @param {string|URL} url
+   */
+  let NodeWorker = nodeWorkerThreads.Worker;
+  // Node requires data and file protocol urls to be URLs.


How about "Create a polyfill for the Worker web API based on Node's worker_threads, which is slightly different"?
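
A rough sketch of what such a polyfill can look like, written as a factory so the base class is explicit. `makeUrlWorker` is a hypothetical name; in practice `BaseWorker` would be `worker_threads.Worker`. The URL conversion follows the source comment that Node requires data and file protocol URLs to be `URL` objects.

```javascript
// Hypothetical factory: pass in worker_threads.Worker as BaseWorker.
function makeUrlWorker(BaseWorker) {
  return class Worker extends BaseWorker {
    constructor(url, options) {
      // Upgrade data: and file: strings to URL objects; leave plain
      // paths and existing URL objects untouched.
      if (typeof url === 'string' && /^(data|file):/.test(url)) {
        url = new URL(url);
      }
      super(url, options);
    }
  };
}
```

This only covers the constructor difference; any other gaps between the web Worker API and Node's are outside the sketch.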

sbc100 · 2023-01-28T22:26:09Z

src/shell.js

@@ -270,7 +270,22 @@ if (ENVIRONMENT_IS_NODE) {
    console.error('The "worker_threads" module is not supported in this node.js build - perhaps a newer version is needed?');
    throw e;


Unrelated, but we should probably skip this whole try/catch in release builds.

sbc100 · 2023-01-28T22:27:05Z

src/worker.js

@@ -217,6 +217,8 @@ function handleMessage(e) {
 #endif
 #endif // MODULARIZE && EXPORT_ES6
    } else if (e.data.cmd === 'run') {
+      Module['PThread'].receiveMessageRelayPort(e.data.port);
+      e.data.port.onmessage = handleMessage;


So we use the same handler for messages regardless of where they come from? interesting...

sbc100 · 2023-01-28T22:28:17Z

BTW, I think on the web at least we can give these workers useful names. We should certainly name this one something obvious (at least in debug builds).

sbc100 · 2023-01-28T22:28:27Z

(the names show up in dev tools IIUC).

sbc100 · 2023-01-28T22:29:21Z

Do you know what the status of workers-creating-workers is? Because this thread could also be the thread manager if that works.

tlively · 2023-01-28T22:38:44Z

Do you know what the status of workers-creating-workers is? Because this thread could also be the thread manager if that works.

IIRC, the latest Safari allows this but previous versions did not. It would be great if we could make this worker responsible for managing thread lifetimes as well, because then we could get a single, consistent picture of which threads are or are not live.

kleisauke · 2023-01-29T13:31:49Z

src/shell.js

+   */
+  let NodeWorker = nodeWorkerThreads.Worker;
+  // Node requires data and file protocol urls to be URLs.
+  class Worker extends NodeWorker {


ES6 classes are not yet used within the library sources, so I think this also requires an update to settings.TRANSPILE_TO_ES5.

Details

--- a/emcc.py
+++ b/emcc.py
@@ -2176,13 +2176,14 @@ def phase_linker_setup(options, state, newargs):
   # Emscripten requires certain ES6 constructs by default in library code
   # - https://caniuse.com/let : EDGE:12 FF:44 CHROME:49 SAFARI:11
   # - https://caniuse.com/const : EDGE:12 FF:36 CHROME:49 SAFARI:11
+  # - https://caniuse.com/class : EDGE:13 FF:45 CHROME:49 SAFARI:9
   # - https://caniuse.com/arrow-functions: : EDGE:12 FF:22 CHROME:45 SAFARI:10
   # - https://caniuse.com/mdn-javascript_builtins_object_assign:
   #   EDGE:12 FF:34 CHROME:45 SAFARI:9
   # Taking the highest requirements gives is our minimum:
-  # Max Version: EDGE:12 FF:44 CHROME:49 SAFARI:11
-  settings.TRANSPILE_TO_ES5 = (settings.MIN_EDGE_VERSION < 12 or
-                               settings.MIN_FIREFOX_VERSION < 44 or
+  # Max Version: EDGE:13 FF:45 CHROME:49 SAFARI:11
+  settings.TRANSPILE_TO_ES5 = (settings.MIN_EDGE_VERSION < 13 or
+                               settings.MIN_FIREFOX_VERSION < 45 or
                                settings.MIN_CHROME_VERSION < 49 or
                                settings.MIN_SAFARI_VERSION < 110000 or
                                settings.MIN_IE_VERSION != 0x7FFFFFFF)

Browser versions old enough to not support classes also don't support threads at all, so I don't think this should ever need to be polyfilled. Furthermore, the polyfilling doesn't actually work because it causes node to error out with TypeError: Class constructor Worker cannot be invoked without 'new' 🤯

tlively · 2023-01-29T16:16:11Z

@sbc100 The deadlock in lsan.test_stdio_locking is reproducible on main and appears to be part of the cross-thread code catchup stuff you've been doing. I managed to look at a stack trace in dev tools and it was stuck doing code catchup while trying to allocate memory for the lsan instrumentation.

tlively · 2023-01-29T16:23:42Z

It also looks like the Worker polyfill isn't being pulled in for other.test_shared_memory_minimal_runtime. I don't know much about MINIMAL_RUNTIME. What is the best way to solve this?

kripken

lgtm % @sbc100 's comments

kripken · 2023-01-30T18:09:59Z

It also looks like the Worker polyfill isn't being pulled in for other.test_shared_memory_minimal_runtime. I don't know much about MINIMAL_RUNTIME. What is the best way to solve this?

I think we need to pull in the polyfill there. We try hard to avoid code size increases in that mode especially but this does fix deadlocks so we have no choice, as I see it.

tlively · 2023-01-31T00:15:30Z

I ended up copy-pasting the polyfill to shell_minimal.js. Adding a separate file to hold the shared code and #including it in both places seemed like overkill, but I can do something like that if you prefer.

tlively · 2023-02-01T00:14:37Z

The same 3 firefox tests keep failing with an unresponsive http server. @sbc100, is that safe to ignore? Should I merge this manually?

tlively · 2023-02-01T02:23:07Z

These three tests are failing reliably enough that I guess I'll install FF tomorrow and investigate further :/

tlively · 2023-02-01T18:52:20Z

Ok, wow, there is actually a difference in behavior between FF and Chrome here. This entire approach does not work in FF because the postMessage to the message relay via the MessagePort does not happen unless the sending worker returns to the event loop, which does not happen in the case of synchronous proxying.

sbc100 · 2023-02-01T18:57:07Z

Ok, wow, there is actually a difference in behavior between FF and Chrome here. This entire approach does not work in FF because the postMessage to the message relay via the MessagePort does not happen unless the sending worker returns to the event loop, which does not happen in the case of synchronous proxying.

Is it the case that postMessage normally does work without yielding, but postMessage via a MessagePort does not? I.e., are normal postMessages handled differently from MessagePort ones?

tlively · 2023-02-01T19:38:16Z

Yes, exactly.

tlively · 2023-02-01T19:50:12Z

I left a comment on the relevant firefox bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1752287

Closing this PR since it isn't portable across browsers. Maybe we can revisit this in the future along with #18633

tlively requested review from kripken and sbc100 January 20, 2023 06:07

kripken reviewed Jan 23, 2023

sbc100 reviewed Jan 26, 2023

tlively added 9 commits January 26, 2023 14:28

support older nodes

d039ca8

proxyBroker => messageRelay, nicer code string

fe7f584

more node/browser compat

7bdf0c5

access closeMessageRelayPort correctly in worker.js

86ef558

debug

645b48e

Catch messages that slip through to main

26091a3

concat

10c43d7

fix node hang and address feedback

f29f084

tlively force-pushed the proxy-broker branch from 7d2b094 to f29f084 Compare January 26, 2023 22:48

tlively added 3 commits January 26, 2023 21:31

remove debugging prints

f7d0857

polyfill to make closure happy

d830511

let worker access port even with closure

b9ad6fe

sbc100 approved these changes Jan 28, 2023

kleisauke reviewed Jan 29, 2023

Merge branch 'main' into proxy-broker

a7dfb86

rebaseline test

26db8fd

kripken approved these changes Jan 30, 2023

tlively added 4 commits January 30, 2023 14:51

duplicate Worker polyfill to shell_minimal.js

0d169c0

comments

12950af

Address more feedback

8b35a92

wrap assertions and encodeURIComponent

788727c

only use minimal runtime polyfill on node

610d9f8

tlively enabled auto-merge (squash) January 31, 2023 20:53

tlively disabled auto-merge February 1, 2023 19:50

tlively closed this Feb 1, 2023

tlively deleted the proxy-broker branch February 8, 2024 19:07
