DynamicLoaderDarwin load images in parallel #110439
Conversation
@augusto2112 Take a look when you have time. This is just one possible approach to parallelizing the initial image loading.
I've prototyped the approach with preload and published the implementation here for comparison. The overall performance of
Where
This is an interesting idea, thanks for looking into it. When I spoke with Augusto about it, my main concern was that all of the Modules would be trying to add strings to the constant string pool, and lock contention could become a bottleneck. I built github main unmodified, and with your parallel and parallel+preload patches, and tried a quick benchmark. I built with
My Slack has 936 binaries loaded, typical for a UI app these days. 10 of the binaries are outside the shared cache (app frameworks/dylibs). I did my tests on an M2 Max mac studio with 12 cores and 96GB of memory.
Built
Built
One important thing I'm doing in this case is not creating a Target with a binary. In
I'm curious what your machine/target process looks like, that we're seeing such different numbers. I'm guessing you were testing an unoptimized build, given the time amounts. Does it look like I missed something with my test case?
nb I used
Yeah it sure would be a bottleneck. I didn't measure it precisely, but I think I saw something like 30-50% of the time being spent on the mutexes in the string pools. This is one of the reasons I gated the parallelized implementation behind the flag. One possible approach here would be to split the work done in the
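To make the contention concern concrete, here is a minimal sketch: a hypothetical interned-string pool guarded by a single mutex, not lldb's actual ConstString implementation (which is sharded), showing why parallel module creation can end up serializing on string interning:

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for an interned-string pool such as lldb's
// ConstString pool; the real pool is sharded, but the contention
// pattern is the same in spirit.
class InternedStringPool {
public:
  const std::string *Intern(const std::string &s) {
    std::lock_guard<std::mutex> guard(m_mutex); // every caller serializes here
    return &*m_strings.insert(s).first;
  }

private:
  std::mutex m_mutex;
  std::unordered_set<std::string> m_strings;
};

// When N worker threads each parse a binary and intern thousands of symbol
// names, they all contend on m_mutex, which is why a large fraction of the
// parallel run can be spent waiting on these locks.
```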
Oh wow. 4.5 sec is amazingly fast. I'll try to reproduce your results.
The machine is an MBP M1, 10 cores/32 GB. I'm testing this patch on the swift fork built in the
Ah, yes, Simulator debugging is a bit different than macOS process debugging. With macOS process debugging -- with lldb and the inferior process running on the same computer -- lldb is using the same shared cache as the inferior process, so to read the libraries it reads them out of its own memory. (They don't exist as separate binaries on-disk any more anyway; the other alternative would be to read them out of inferior memory via debugserver.) A simulator process will have many libraries that don't exist in the shared cache in lldb's memory, so it will need to read them from a different source -- an on-disk discrete binary, or reading memory via debugserver -- so it makes sense that there could be a performance difference in this case.
That number is a little bit of a cheat because I was using llvm-project main, which doesn't demangle Swift method names. I also tried a run with the Xcode 16 lldb (which does demangle swift names, like the swiftlang main you're testing on) and it took around 7.3 seconds. There could be other differences, but there's a good bit of swift code in all the system libraries; that's probably what the difference was.
Love to see this kind of work done.
@jasonmolenda If you still have a build with the patch, can you please compare the unmodified version with the patched one with
@jasonmolenda I tried to reproduce your results but got drastically different numbers for parallel runs. Here's the setup:
Did you discard the first runs as well?
Hmmm, interesting. I built my optimized builds with
I had the same
The actual timings may be different because I'm not running a released version of macOS on this computer, but the relative difference between my different variations should be of the same order between different computers, I'd expect. You asked earlier what the times were like for one of the patched sources with the setting enabled and disabled; I was going to note that. Here's exactly what I saw with the RelWithDebInfo build of main plus the diff for this PR ("parallel"):
and
Ah wait, I see where the difference is coming in -- it's my own fault. I misread tcsh's time output and only looked at USER seconds. Looking at the wallclock times, we're 50% faster with the multithreading enabled for this PR's change. My apologies!
For completeness' sake, here's what I see for parallel+preload PR 110646 with llvm-project main (no swift demangling), attaching to Slack (nearly all libraries in the shared cache)
(showed the results for 4 runs)
and github main unmodified
So yes, I'm seeing a 30% improvement with the multithreaded creation of Modules, for a macOS app mostly in the shared cache, and no swift name demangling (which is expensive).
Nice. Also there's no significant difference between
Anyway, my main goal is iOS apps running in the simulator. And for them, the speed-up is much more noticeable (at least for big apps). Let me know if you'd like me to measure something else.
I know your benchmarking numbers are against the
The other thing we may want to consider (you don't need to take this on in your current work) is the module creation in
It is a bit interesting that the user and system time are so much higher -- looks like around 40% higher? -- than the single-threaded approach; I wonder if that's thread creation/teardown overhead, or if it's costlier to acquire locks and it's adding up. Wallclock is the right thing to optimize for, but it was a little surprising to see that. I'm not curious enough to look into it more, or ask anyone else to. Was the setting intended for testing purposes only, or did you intend to include it in a final PR? Sometimes when we land a new feature that may have unintended consequences, we'll have a flag enabling the feature by default, as a way for people impacted to quickly disable it. After the feature/change has had some more widespread use, we drop the setting.
I built the swiftlang
(don't read too much into this set of options, it's just an old command I have that I copy & paste) The clean sources
and the two different patchsets,
We're seeing better parallelism (~170% cpu usage with llvm-project main, ~235% cpu usage with swiftlang rebranch), likely because swiftlang rebranch can demangle swift mangled names, and that's quite expensive and parallelizable. One small thing that I'm not thrilled about is that we lose the progress updates currently; I don't know if the progress updates system was really intended to handle this. Instead of seeing a notification for each binary, we see a notification for the first binary that starts, and the other notifications while it is processing are lost. On my system, the llvm threadpool is running 9 worker threads. For local file loads the notifications go so quickly that it isn't much of a loss, but if module creation is slower -- if we're reading the binaries out of memory from an iOS device without an expanded shared cache, or we're able to find & load DWARF for all the binaries -- the loss of the updates is not great. I don't know what the best approach is here, or if we just accept that difference. I haven't started looking at the code changes themselves yet - so far I was just trying to play around with them a bit and see how the behavior works.
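One possible mitigation, sketched below with plain std::thread and a printf stand-in rather than lldb's real thread pool and progress-event machinery, is to aggregate per-binary completions into one shared counter so the updates describe the whole batch instead of just the first binary:

```cpp
#include <atomic>
#include <cstdio>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical progress reporting: one shared counter for the whole batch,
// incremented from each worker thread as a binary finishes loading.
void LoadImagesWithProgress(const std::vector<std::string> &paths,
                            std::function<void(const std::string &)> load_one) {
  std::atomic<size_t> completed{0};
  const size_t total = paths.size();

  std::vector<std::thread> workers;
  for (const auto &path : paths) {
    workers.emplace_back([&, path] {
      load_one(path); // expensive: locate the binary, parse it, intern names
      size_t done = ++completed;
      // A real implementation would feed this into lldb's progress events;
      // here we just print "N of M" so updates aren't tied to one binary.
      std::printf("Loaded %zu of %zu images\n", done, total);
    });
  }
  for (auto &w : workers)
    w.join();
}
```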
The latter. IMO the risks involved in parallelization are a bit too high to do it without a flag. I'm even thinking about having it opt-in rather than opt-out for some time.
I'm fine with having a temporary setting to disable it, which we can remove after it has been in a release or two and we've had time for people to live on it in many different environments. But it should definitely be enabled by default when we're at the point of merging it, and the setting should only be a safety mechanism if this turns out to cause a problem for a configuration we weren't able to test. We're not at that point yet, but just outlining my thinking on this. I would even put it explicitly under an experimental node (e.g. see
I was playing with the performance in a couple of different scenarios. For some reason that I haven't looked into, we're getting less parallelism when many of the binaries are in the shared cache in lldb. Maybe there is locking around the code which finds the binary in lldb's own shared cache, so when 9 threads try to do it at the same time, we have additional lock contention. That's why the simulator speedup is better than a macOS-native process speedup, and the speedup for a remote iOS debug process with an expanded shared cache on the mac (so all the libraries are in separate mach-o files) was faster still.
For what it's worth, this thread pool for parallel processing has been used in another part of lldb - it's used on ELF systems when processing DWARF, when we need to scan the debug info in the individual .o files, iirc. So we've had some living-on time with the thread pool approach there; not used on Darwin, but used on other targets. I was chatting with Jim Ingham and he was a little bummed that we're looking at doing this in a single DynamicLoader plugin, instead of having the DynamicLoader plugin create a list of ModuleSpecs and having a central method in ModuleList or something create Modules for each of them via a thread pool; the DynamicLoader plugin would then set the section load addresses in the Target, run any scripting resources (python in .dSYMs), call ModulesDidLoad, etc. I don't think you should have to do the more generalized approach in this PR, but a scheme where other targets like Linux can benefit from the same approach, without duplicating the thread pool code in their plugins, would be interesting. More like something for future work. I still haven't looked at the specific code changes yet :) I've been trying to exercise this approach in a few different environments and it seems like a benefit in nearly every case, to varying degrees. (The only case where it didn't benefit was when lldb had no binaries locally and had to read them all over the gdb remote serial protocol. In that case all threads were blocked on the communication over the USB cable, reading all the libraries. The multithreaded approach didn't make this slower, and didn't seem to cause a problem; those were the main things I was looking for.)
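A rough sketch of what that more generalized shape could look like, assuming lldb's existing ModuleSpec/ModuleSP types, Target::GetOrCreateModule, Debugger::GetThreadPool(), and llvm::ThreadPoolTaskGroup; the helper name is hypothetical and this is not code from this PR:

```cpp
#include "lldb/Core/Debugger.h"
#include "lldb/Core/ModuleSpec.h"
#include "lldb/Target/Target.h"
#include "llvm/Support/ThreadPool.h"
#include <vector>

using namespace lldb;
using namespace lldb_private;

// Hypothetical central helper: turn a batch of ModuleSpecs into Modules on
// the shared thread pool, so any DynamicLoader plugin could reuse it.
static std::vector<ModuleSP>
CreateModulesInParallel(Target &target, const std::vector<ModuleSpec> &specs) {
  std::vector<ModuleSP> modules(specs.size());
  llvm::ThreadPoolTaskGroup task_group(Debugger::GetThreadPool());
  for (size_t i = 0; i < specs.size(); ++i) {
    task_group.async([&, i] {
      // GetOrCreateModule consults the target/global module caches, so
      // duplicates are still shared; only the expensive parsing runs in
      // parallel here.
      modules[i] = target.GetOrCreateModule(specs[i], /*notify=*/false);
    });
  }
  task_group.wait();
  return modules;
}

// The DynamicLoader plugin would then, back on the calling thread, set the
// section load addresses in the Target, run any scripting resources (e.g.
// Python in .dSYMs), and call ModulesDidLoad with the resulting list.
```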
Either is fine with me.
Sure, but can you please clarify: should it be named
I'm looking at the implementation of
Expressing the same chain of actions in a generalized way might be much more complex than implementing parallelization in each of the dynamic loaders separately.
The
One is that the command interpreter treats a.b.experimental.c and a.b.c as aliases for one another. That's to support an experimental setting going from experimental to non-experimental without producing errors if people have used the experimental name in scripts.
The other is that if the user issues the command settings set a.b.experimental.c whatever and neither a.b.experimental nor a.b.c exist, that is not reported as an error. That's to support settings that we use to guard experimental features, like in this case, so we can remove them later without causing errors in scripts.
So if you did plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load, that would mean you would never get errors for misspelling anything under plugin, which doesn't seem like a great idea. So best practice is to have the experimental node control only the settings that actually are experimental: plugin.dynamic-loader.darwin.experimental.enable-parallel-image-load
✅ With the latest revision this PR passed the C/C++ code formatter.
I renamed the setting to
There is a lock_guard for
Thanks for all the work you've done on this, and for updating the setting. I looked over the implementations, and they all look like reasonable changes to me - I did laugh a little when I realized that 2/3rds of all the changes were related to adding the setting :) that's always a bit of boilerplate for the first setting in a plugin. In a process launch scenario, or attaching to a process when it was launched in a stopped state, there will only be two binaries, dyld and the main app binary, and I don't think there would be much perf benefit to the "preload" approach where we can parallelize the special binaries (dyld, main executable) -- dyld is just a little guy. But when we attach to a launched app, where we have the two special binaries and a thousand others, if we load those two special binaries sequentially, all the other binaries are blocked until a possibly-expensive main binary has been parsed, and I think that's where the real perf difference you were measuring kicked in (you showed a 10.5 second versus 13.4 second time difference for preload versus parallel in the beginning) -- do I have that right? Or were you attaching to a stopped process with just dyld+main binary? I can't imagine doing those two in parallel would lead to the savings you measured. I think the preload approach, where we parallelize the two special binaries along with all the others in a "fully launched process" attach scenario, is the right choice; thanks for investigating both of them and presenting both options. Do you prefer the non-preload approach? I know preload makes for a bigger patch, but I can see how the perf benefit of preload could be significant when attaching to a fully launched process, so we're not blocked on parsing the main binary before we begin all the others.
That's right.
No, I think we should stick to a more general approach unless we have to perform the loading itself differently for the special modules. And currently, there is no difference between loading them and all the others - the calls to
Sounds good, I agree. I marked #110646 as approved, please merge at your convenience. Thanks again for spotting this opportunity for multithreading and seeing it through!
I'm closing this in favor of #110646.
When plugin.dynamic-loader.darwin.enable-parallel-image-load is enabled, DynamicLoaderDarwin::AddModulesUsingImageInfos will load images in parallel using the thread pool. This gives a performance boost of up to 3-4 times for large projects (I'm measuring by the time of AddModulesUsingImageInfos).
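For reference, a minimal sketch of the gated parallel/sequential split described above, assuming lldb's shared Debugger::GetThreadPool() and llvm::ThreadPoolTaskGroup; the function and parameter names are illustrative rather than the exact code in this PR:

```cpp
#include "lldb/Core/Debugger.h"
#include "lldb/Core/Module.h"
#include "llvm/Support/ThreadPool.h"
#include <functional>
#include <vector>

using namespace lldb;
using namespace lldb_private;

// Illustrative sketch: given a callback that creates the Module for image
// index `i` (standing in for the plugin's per-ImageInfo work), run the batch
// either sequentially or on lldb's shared thread pool, depending on the
// setting.
std::vector<ModuleSP>
LoadImages(size_t image_count, std::function<ModuleSP(size_t)> load_image,
           bool enable_parallel_image_load) {
  std::vector<ModuleSP> modules(image_count);

  if (enable_parallel_image_load) {
    // Farm the per-image work out to lldb's shared thread pool.
    llvm::ThreadPoolTaskGroup task_group(Debugger::GetThreadPool());
    for (size_t i = 0; i < image_count; ++i)
      task_group.async([&, i] { modules[i] = load_image(i); });
    task_group.wait();
  } else {
    for (size_t i = 0; i < image_count; ++i)
      modules[i] = load_image(i);
  }

  // Setting section load addresses and calling ModulesDidLoad still happens
  // afterwards on the calling thread, as in the sequential implementation.
  return modules;
}
```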