
DynamicLoaderDarwin load images in parallel #110439


Conversation

@DmT021 (Contributor) commented Sep 29, 2024

When plugin.dynamic-loader.darwin.enable-parallel-image-load is enabled, DynamicLoaderDarwin::AddModulesUsingImageInfos will load images in parallel using the thread pool.

This gives a performance boost of up to 3-4x for large projects (measured as the time spent in AddModulesUsingImageInfos).
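
As a rough illustration of the pattern (a minimal standalone sketch using plain std::async rather than the shared LLVM thread pool the patch actually uses; ImageInfo and LoadImage are simplified stand-ins, not lldb's types):

```cpp
#include <cstdint>
#include <future>
#include <string>
#include <vector>

// Simplified stand-in for the per-image description lldb works with.
struct ImageInfo {
  std::string path;
  uint64_t load_address;
};

// Stand-in for the existing per-image work (creating the Module, reading the
// Mach-O header, parsing the symbol table, ...).
void LoadImage(const ImageInfo &info) { (void)info; }

// Instead of looping over the images sequentially, kick off one task per
// image and wait for all of them to finish.
void AddModulesInParallel(const std::vector<ImageInfo> &image_infos) {
  std::vector<std::future<void>> futures;
  futures.reserve(image_infos.size());
  for (const ImageInfo &info : image_infos)
    futures.push_back(
        std::async(std::launch::async, [&info] { LoadImage(info); }));
  for (auto &future : futures)
    future.wait();
}
```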

@DmT021 (Contributor, Author) commented Sep 29, 2024

@augusto2112 Take a look when you have time

This is just one possible approach to parallelizing the initial image loading.
The other solution I'm checking now is to parallelize loading somewhere earlier, perhaps in DynamicLoaderMacOS::DoInitialImageFetch. The difference is that the UpdateSpecialBinariesFromNewImageInfos function isn't parallelized right now, so we still load dyld and the main executable image sequentially. If we parallelize the loading of all image_infos before calling UpdateSpecialBinariesFromNewImageInfos and AddModulesUsingImageInfos we might gain even better utilization of multithreading.

@DmT021 (Contributor, Author) commented Oct 1, 2024

If we parallelize the loading of all image_infos before calling UpdateSpecialBinariesFromNewImageInfos and AddModulesUsingImageInfos we might gain even better utilization of multithreading.

I've prototyped the approach with preload and published the implementation here for comparison.

The overall performance of DynamicLoaderMacOS::DoInitialImageFetch() is better with preload, but the difference is not as dramatic, and I'm not sure the additional complexity of that implementation is worth it.
The results are the following:

| Variant | Time (sec) |
|---|---|
| no preload, sequential | 29.2 |
| no preload, parallel | 13.4 |
| preload, sequential | 28.9 |
| preload, parallel | 10.5 |

Where
"parallel" means plugin.dynamic-loader.darwin.enable-parallel-image-load=1
"preload" means this implementation

@jasonmolenda self-requested a review on October 6, 2024
@jasonmolenda (Collaborator)

This is an interesting idea, thanks for looking into it. When I spoke with Augusto about it, my main concern was that all of the Modules would be trying to add strings to the constant string pool and lock contention could become a bottleneck.

I built github main unmodified, and with your parallel and parallel+preload patches, and tried a quick benchmark. I built with -DCMAKE_BUILD_TYPE=Debug and with -DCMAKE_BUILD_TYPE=RelWithDebInfo because unoptimized lldb can have very different perf behaviors. I picked a random UI app -- Slack, a third-party app using Electron; I have my system configured so I can attach to it -- and I ran a command sequence like time build.parallel-with-preload/bin/lldb -x -b -O 'settings set plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load true' -o 'pro att -n Slack' -o 'det' a few times for every configuration to get a rough idea of what perf looked like.

My Slack has 936 binaries loaded, typical for a UI app these days. 10 of the binaries are outside the shared cache (app frameworks/dylibs). I did my tests on an M2 Max mac studio with 12 cores and 96GB of memory.

Built RelWithDebInfo, unmodified lldb took 4.5 seconds. Parallel took 6.6 seconds. Parallel with preload took 6.7 seconds.

Built Debug, unmodified lldb took 27.6 seconds. Parallel took 35.5 seconds. Parallel plus preload took 35.6 seconds.

One important thing I'm doing in this case is not creating a Target with a binary. Target::SetExecutableModule calls ObjectFileMachO::GetDependentModules to get the list of all linked libraries and loads the binaries at those filepaths into the target. Then at process launch/attach time, if we'd already parsed the symbol table for a binary, it would be reused.

I'm curious what your machine/target process looks like, that we're seeing such different numbers. I'm guessing you were testing an unoptimized build given the time amounts. Does it look like I missed something with my test case?

@jasonmolenda (Collaborator)

N.B. I used plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load as the settings name so I could use the same command for all three lldb builds -- normally the unmodified lldb would give an error on an unrecognized setting, and I didn't want to see that. Adding .experimental to a path makes an unrecognized setting a no-op.

@DmT021 (Contributor, Author) commented Oct 7, 2024

my main concern was that all of the Modules would be trying to add strings to the constant string pool and lock contention could become a bottleneck.

Yeah, it sure would be a bottleneck. I didn't measure it precisely, but I think I saw that something like 30-50% of the time is spent on the mutexes in the string pools. And this is one of the reasons I gated the parallelized implementation behind the flag. One possible approach here would be to split the work done in ObjectFileMachO::ParseSymtab into 3 phases: parsing, ConstString creation, and assigning the strings to the symbols. That way we can create all of the strings as one batch and minimize the locking penalty. ParseSymtab is a funny piece of code, though, and deserves its own pull request to address such things.
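
To make that concrete, here's a rough sketch of the three-phase idea; StringPool and Symbol are simplified stand-ins rather than lldb's actual ConstString pool and Symbol classes:

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_set>
#include <vector>

// Simplified stand-in for lldb's ConstString pool.
struct StringPool {
  std::mutex mutex;
  std::unordered_set<std::string> interned;

  // Phase 2: intern a whole batch of names under a single lock acquisition
  // instead of taking the lock once per symbol name.
  std::vector<const std::string *>
  InternBatch(const std::vector<std::string> &names) {
    std::vector<const std::string *> result;
    result.reserve(names.size());
    std::lock_guard<std::mutex> guard(mutex);
    for (const std::string &name : names)
      result.push_back(&*interned.insert(name).first);
    return result;
  }
};

struct Symbol {
  const std::string *name = nullptr;
};

// Phase 1 (parsing the raw names out of the symtab) is assumed to have
// produced raw_names already; phase 3 assigns the pooled strings to symbols.
void AssignBatchedNames(const std::vector<std::string> &raw_names,
                        StringPool &pool, std::vector<Symbol> &symbols) {
  std::vector<const std::string *> pooled = pool.InternBatch(raw_names);
  symbols.resize(pooled.size());
  for (size_t i = 0; i < pooled.size(); ++i)
    symbols[i].name = pooled[i];
}
```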

Built RelWithDebInfo, unmodified lldb took 4.5 seconds. Parallel took 6.6 seconds. Parallel with preload took 6.7 seconds.

Built Debug, unmodified lldb took 27.6 seconds. Parallel took 35.5 seconds. Parallel plus preload took 35.6 seconds.

Oh wow. 4.5 sec is amazingly fast. I'll try to reproduce your results.

I'm curious what your machine/target process looks like, that we're seeing such different numbers. I'm guessing you were testing an unoptimized build given the time amounts. Does it look like I missed something with my test case?

The machine is an MBP M1, 10 cores/32 GB. I'm testing this patch on the swift fork built in Release mode, and I'm attaching to an iOS app (in the simulator) that has 978 modules. I'm running lldb --local-lldbinit under the Instruments app with the following lldbinit file:

settings set plugin.dynamic-loader.darwin.enable-parallel-image-load 1
process attach --name browser_corp_ios
continue

@jasonmolenda (Collaborator)

Ah, yes, Simulator debugging is a bit different from macOS process debugging. With macOS process debugging -- with lldb and the inferior process running on the same computer -- lldb is using the same shared cache as the inferior process, so to read the libraries it reads them out of its own memory. (They don't exist as separate binaries on-disk any more anyway; the other alternative would be to read them out of inferior memory via debugserver.)

A simulator process will have many libraries that don't exist in the shared cache in lldb's memory, so lldb will need to read them from a different source -- a discrete on-disk binary, or reading memory via debugserver -- so it makes sense that there could be a performance difference in this case.

@jasonmolenda (Collaborator)

Built RelWithDebInfo, unmodified lldb took 4.5 seconds. Parallel took 6.6 seconds. Parallel with preload took 6.7 seconds.
Built Debug, unmodified lldb took 27.6 seconds. Parallel took 35.5 seconds. Parallel plus preload took 35.6 seconds.

Oh wow. 4.5 sec is amazingly fast. I'll try to reproduce your results.

That number is a little bit of a cheat because I was using llvm-project main, which doesn't demangle Swift method names. I also tried a run with the Xcode 16 lldb (which does demangle Swift names, like the swiftlang main you're testing on) and it took around 7.3 seconds. There could be other differences, but there's a good bit of Swift code in all the system libraries; that's probably what the difference was.

@bulbazord (Member)

Love to see this kind of work done.
FWIW I worked on improving the performance of ObjectFileMachO::ParseSymtab in #106791.
That change got reverted, and I haven't had the time to figure out what happened yet, though...

@DmT021 (Contributor, Author) commented Oct 7, 2024

@jasonmolenda If you still have a build with the patch, can you please compare the unmodified version with the patched one with enable-parallel-image-load disabled? There shouldn't be a noticeable difference. If you don't, it's ok; I haven't had time to inspect it yet, but I'll try to dig into it today or tomorrow.

@DmT021 (Contributor, Author) commented Oct 7, 2024

@jasonmolenda I tried to reproduce your results but got drastically different numbers for parallel runs. Here's the setup:

  • Using this (the main) repository of llvm-project.
  • cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lldb;lld" -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi;libunwind" -DLLDB_BUILD_FRAMEWORK=1 -DLLDB_ENABLE_PYTHON=1 -DLLDB_ENABLE_SWIG=1
  • csrutil enable --without debug
  • Attaching to Slack
  • Using the same cmd line as you did except for the setting name
    time ~/w/swift-project/build/lldb-main/bin/lldb -x -b -o 'pro att -n Slack' -o 'det'
    time ~/w/swift-project/build/lldb-main/bin/lldb -x -b -O 'settings set plugin.dynamic-loader.darwin.enable-parallel-image-load true' -o 'pro att -n Slack' -o 'det'
  • Important: I always discard the first run in a sequence because the first launch of lldb after a rebuild takes about 3-4 seconds longer than the subsequent ones. For example, the first run of #110439 with enable-parallel-image-load=0 took 8.772 sec instead of ~5.4.
| Run | unmodified | #110439 enable-parallel-image-load=0 | #110439 enable-parallel-image-load=1 | #110646 enable-parallel-image-load=0 | #110646 enable-parallel-image-load=1 |
|---|---|---|---|---|---|
| run 1 | 5.52 | 5.509 | 2.517 | 5.464 | 2.496 |
| run 2 | 5.475 | 5.451 | 2.492 | 5.372 | 2.506 |
| run 3 | 5.373 | 5.44 | 2.513 | 5.506 | 2.533 |
| run 4 | 5.55 | 5.333 | 2.54 | 5.493 | 2.552 |
| run 5 | 5.534 | 5.535 | 2.511 | 5.427 | 2.552 |
| avg | 5.4904 | 5.4536 | 2.5146 | 5.4524 | 2.5278 |

Did you discard the first runs as well?

@jasonmolenda (Collaborator) commented Oct 8, 2024

Hmmm, interesting. I built my optimized builds with

cmake ../llvm -G Ninja -DLLDB_ENABLE_SWIFT_SUPPORT=0 -DPython3_EXECUTABLE=/usr/bin/python3 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=1 -DLLVM_ENABLE_PROJECTS='llvm;lldb;clang' '-DLLVM_ENABLE_RUNTIMES=libcxx;libcxxabi;libunwind' -DLLDB_ENABLE_PYTHON=1

I had the same main sources built with RelWithDebInfo and with Debug. Then I applied this PR (110439)'s current diff and built RelWithDebInfo and Debug ("parallel"), then cleared my checkout, then applied the parallel+preload current diff (110646) and built RelWithDebInfo and Debug ("parallel-with-preload"). In short, I have six builds of github main right now, and I still have all of these build products.

The actual timings may be different because I'm not running a released version of macOS on this computer, but I'd expect the relative differences between my variations to be roughly the same across different computers.

You asked earlier what the times were like for one of the patched sources with the setting enabled and disabled; I was going to note that. Here's exactly what I saw with the RelWithDebInfo build of main plus the diff for this PR ("parallel"):

% time build.release.parallel/bin/lldb -x -b -o 'pro att -n Slack' -o 'det'
4.649u 2.614s 0:10.62 68.2%	0+0k 0+0io 3277pf+0w

% time build.release.parallel/bin/lldb -x -b -o 'pro att -n Slack' -o 'det'
4.581u 2.417s 0:10.57 66.1%	0+0k 0+0io 147pf+0w

% time build.release.parallel/bin/lldb -x -b -o 'pro att -n Slack' -o 'det'
4.534u 2.196s 0:09.92 67.7%	0+0k 0+0io 104pf+0w

and

% time build.release.parallel/bin/lldb -x -b -O 'settings set plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load true' -o 'pro att -n Slack' -o 'det'
6.621u 3.694s 0:06.00 171.8%	0+0k 0+0io 104pf+0w

% time build.release.parallel/bin/lldb -x -b -O 'settings set plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load true' -o 'pro att -n Slack' -o 'det'
6.562u 3.758s 0:06.15 167.6%	0+0k 0+0io 104pf+0w

% time build.release.parallel/bin/lldb -x -b -O 'settings set plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load true' -o 'pro att -n Slack' -o 'det'
6.562u 3.739s 0:06.19 166.2%	0+0k 0+0io 104pf+0w

time here is tcsh's time, which is unlike all the others lol; it shows user seconds, system seconds, elapsed (wallclock) seconds, and CPU usage. I'm not sure why it's consistently slower for me and not for you; that is surprising.

@jasonmolenda (Collaborator)

Ah wait, I see where the difference is coming in -- it's my own fault. I misread tcsh's time output and only looked at USER seconds. Looking at the wallclock times, we're 50% faster with the multithreading enabled for this PR's change. My apologies!

@jasonmolenda (Collaborator)

For completeness' sake, here's what I see for parallel+preload PR 110646 with llvm-project main (no Swift demangling) attaching to Slack (nearly all libraries in the shared cache).

% time build.release.parallel-with-preload//bin/lldb -x -b  -o 'pro att -n Slack' -o 'det'
4.643u 2.672s 0:10.63 68.7%	0+0k 0+0io 3277pf+0w
4.509u 2.120s 0:09.39 70.5%	0+0k 0+0io 148pf+0w
4.509u 2.120s 0:09.39 70.5%	0+0k 0+0io 148pf+0w
4.519u 2.124s 0:09.38 70.6%	0+0k 0+0io 104pf+0w

(showed the results for 4 runs)
and with the multithreading enabled

% time build.release.parallel-with-preload/bin/lldb -x -b -O 'settings set plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load true' -o 'pro att -n Slack' -o 'det'
6.635u 3.840s 0:06.30 166.1%	0+0k 0+0io 104pf+0w
6.635u 3.840s 0:06.30 166.1%	0+0k 0+0io 104pf+0w
6.573u 3.776s 0:06.18 167.3%	0+0k 0+0io 104pf+0w
6.601u 3.781s 0:06.20 167.4%	0+0k 0+0io 104pf+0w

and github main unmodified

% time build.release/bin/lldb -x -b  -o 'pro att -n Slack' -o 'det'
4.571u 2.311s 0:09.55 72.0%	0+0k 0+0io 3258pf+0w
4.535u 2.112s 0:09.35 71.0%	0+0k 0+0io 146pf+0w
4.581u 2.193s 0:09.52 71.1%	0+0k 0+0io 103pf+0w
4.554u 2.178s 0:09.54 70.4%	0+0k 0+0io 103pf+0w

@jasonmolenda (Collaborator) commented Oct 8, 2024

So yes, I'm seeing a 30% improvement with the multithreaded creation of Modules, for a macOS app mostly in the shared cache, and no swift name demangling (which is expensive).

@DmT021 (Contributor, Author) commented Oct 8, 2024

Nice. Also there's no significant difference between build.release and build.release.parallel-with-preload with sequential load, which is good.
I wonder why your parallel runs are slightly less performant than mine relative to the sequential ones. It's almost certainly the contention for the string pools' mutexes, but I still wonder why exactly.

@DmT021 (Contributor, Author) commented Oct 8, 2024

Anyway, my main goal is iOS apps running in the simulator. And for them, the speed-up is much more noticeable (at least for big apps). Let me know if you'd like me to measure something else.

@jasonmolenda (Collaborator)

I know your benchmarking numbers are against the swiftlang main branch, which is based on stable/20230725 -- a bit older than llvm-project main, as you'd guess from the date. They're bringing up the rebranch branch right now to become main at some point (I haven't followed closely enough to know when); that's a fork of llvm-project main from stable/20240723, a bit closer to main, and it is possible there have been other changes that have an impact on perf. I'll see if I can't build up rebranch and see how that behaves, but as far as github main goes, it's clearly a wallclock improvement for macOS apps.

The other thing we may want to consider (you don't need to take this on in your current work) is the module creation in Target::SetExecutableModule -- this happens when you run lldb MyBinary.app -- it will create Modules for MyBinary plus all libraries it links to, and all libraries they link to, etc., before it has run or executed anything. For a Simulator app, Xcode requests the app launch and then tells lldb to attach to the pid, so I don't think lldb is given the binary before that, and this is not important there. But it will be relevant for other workflows.

@jasonmolenda (Collaborator)

It is a bit interesting that the user and system time are so much higher -- looks like around 40% higher? -- than the single-threaded approach, I wonder if that's thread creation/teardown overhead, or if it's costlier to acquire locks and it's adding up. Wallclock is the right thing to optimize for, but it was a little surprising to see that. I'm not curious enough to look into it more, or ask anyone else to.

Was the setting intended for testing purposes only, or did you intend to include it in a final PR? Sometimes when we land a new feature that may have unintended consequences, we'll add a setting with the feature enabled by default, so that anyone impacted has a way to quickly disable it. After the feature/change has had some more widespread use, we drop the setting.

@jasonmolenda (Collaborator)

I built the swiftlang rebranch sources clean, with the parallel patch, with the parallel plus preload patch. I ran them against the same Slack binary with 934 binary images, 10 of them outside the shared cache. I built them something like

swift/utils/build-script --lldb --release --build-subdir=build.parallel -- --bootstrapping=hosttools --extra-cmake-options="-DPython3_EXECUTABLE=/usr/bin/python3" --libcxx --force-optimized-typechecker --skip-build-benchmarks --no-swift-stdlib-assertions --skip-test-swift --skip-test-cmark

(don't read too much into this set of options, it's just an old command I have that I copy & paste)

The clean sources

% time build/Ninja-ReleaseAssert+stdlib-Release/lldb-macosx-arm64//bin/lldb -x -b -O 'settings set plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load true' -o 'pro att -n Slack' -o 'det'
6.461u 2.179s 0:10.67 80.8%	0+0k 0+0io 170pf+0w
6.496u 2.166s 0:10.60 81.6%	0+0k 0+0io 152pf+0w
6.455u 2.181s 0:10.62 81.2%	0+0k 0+0io 152pf+0w

and the two different patchsets,

% time build/build.parallel/lldb-macosx-arm64//bin/lldb -x -b -O 'settings set plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load true' -o 'pro att -n Slack' -o 'det'
9.693u 3.935s 0:05.73 237.6%	0+0k 0+0io 171pf+0w
9.709u 3.941s 0:05.74 237.6%	0+0k 0+0io 152pf+0w
9.645u 3.913s 0:05.69 238.1%	0+0k 0+0io 152pf+0w

% time build/build.parallel-with-preload//lldb-macosx-arm64//bin/lldb -x -b -O 'settings set plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load true' -o 'pro att -n Slack' -o 'det'
9.611u 3.914s 0:05.72 236.3%	0+0k 0+0io 171pf+0w
9.878u 3.871s 0:05.71 240.6%	0+0k 0+0io 152pf+0w
9.634u 3.923s 0:05.78 234.4%	0+0k 0+0io 152pf+0w

We're seeing better parallelism (~170% cpu usage with llvm-project main, ~235% cpu usage with swiftlang rebranch), likely because swiftlang rebranch can demangle swift mangled names, and that's quite expensive, and parallelizable.

One small thing that I'm not thrilled about is that we lose the progress updates currently; I don't know if the progress updates system was really intended to handle this. Instead of seeing a notification for each binary, we see a notification for the first binary that starts, and the other notifications while it is processing are lost. On my system, the llvm threadpool is running 9 worker threads. For a local file load notification, they go so quickly it isn't much of a loss, but if module creation is slower -- if we're reading the binaries out of memory from an iOS device without an expanded shared cache, or we're able to find & load DWARF for all the binaries, the loss of the updates is not great. I don't know what the best approach is here, or if we just accept that difference.

I haven't started looking at the code changes themselves yet - so far I was just trying to play around with them a bit and see how the behavior works.

@DmT021 (Contributor, Author) commented Oct 8, 2024

Was the setting intended for testing purposes only, or did you intend to include that in a final PR?

The latter. IMO the risks introduced by parallelization are a bit too high to do it without a flag. I'm even thinking about having it opt-in rather than opt-out for some time.

@jasonmolenda (Collaborator)

Was the setting intended for testing purposes only, or did you intend to include that in a final PR?

The latter. IMO the risks introduced by parallelization are a bit too high to do it without a flag. I'm even thinking about having it opt-in rather than opt-out for some time.

I'm fine with having a temporary setting to disable it, which we can remove after it has been in a release or two and we've had time for people to live on it in many different environments. But it should definitely be enabled by default when we're at the point to merge it, and the setting should only be a safety mechanism if this turns out to cause a problem for a configuration we weren't able to test.

We're not at that point yet, but just outlining my thinking on this. I would even put it explicitly under an experimental node (e.g. see target.experimental.inject-local-vars we have currently) so if someone does disable it in their ~/.lldbinit file, and we remove the setting in a year or two, they won't get errors starting lldb, it will be silently ignored.

I was playing with the performance in a couple of different scenarios. For some reason that I haven't looked into, we're getting less parallelism when many of the binaries are in the shared cache in lldb. Maybe there is locking around the code which finds the binary in lldb's own shared cache, so when 9 threads try to do it at the same time, we have additional lock contention. That's why the simulator speedup is better than a macOS-native process speedup, and the speedup for a remote iOS debug process with an expanded shared cache on the Mac (so all the libraries are in separate Mach-O files) was faster still.

@jasonmolenda (Collaborator)

For what it's worth, this thread pool for parallel processing has been used in another part of lldb - it's used on ELF systems when processing DWARF, when we need to scan the debug info in the individual .o files iirc. So we've had some living-on time with the thread pool approach there, not used on Darwin, but used on other targets.

I was chatting with Jim Ingham, and he was a little bummed that we're looking at doing this in a single DynamicLoader plugin, instead of having the DynamicLoader plugin create a list of ModuleSpecs and having a central method in ModuleList or something create Modules for each of them via a thread pool; the DynamicLoader plugin would then set the section load addresses in the Target, run any scripting resources (python in .dSYMs), call ModulesDidLoad, etc. I don't think you should have to do the more generalized approach in this PR, but a scheme where other targets like Linux can benefit from the same approach would be interesting, without duplicating the thread pool code in their plugins. More like something for future work.

I still haven't looked at the specific code changes yet :) I've been trying to exercise this approach in a few different environments and it seems like a benefit in nearly every case, to varying degrees.

(the only case where it didn't benefit was when lldb had no binaries locally, and had to read them all over the gdb remote serial protocol. In that case all threads were blocked on the communication over the USB cable, reading all the libraries. The multithreaded approach didn't make this slower, and didn't seem to cause a problem, those were the main things I was looking for.)

@DmT021 (Contributor, Author) commented Oct 10, 2024

it should definitely be enabled by default when we're at the point to merge it, and the setting should only be a safety mechanism if this turns out to cause a problem for a configuration we weren't able to test.

That's fine with me either way.

I would even put it explicitly under an experimental node (e.g. see target.experimental.inject-local-vars we have currently) so if someone does disable it in their ~/.lldbinit file, and we remove the setting in a year or two, they won't get errors starting lldb, it will be silently ignored.

Sure, but can you please clarify whether it should be named plugin.dynamic-loader.darwin.experimental.enable-parallel-image-load or plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load as you wrote previously? It seems like the first option reads as "'enable-parallel-image-load' is an experimental property of 'plugin.dynamic-loader.darwin'" and the second one as "'dynamic-loader.darwin.enable-parallel-image-load' is an experimental property of 'plugin'".

@DmT021 (Contributor, Author) commented Oct 10, 2024

I was chatting with Jim Ingham, and he was a little bummed that we're looking at doing this in a single DynamicLoader plugin, instead of having the DynamicLoader plugin create a list of ModuleSpecs and having a central method in ModuleList or something create Modules for each of them via a thread pool; the DynamicLoader plugin would then set the section load addresses in the Target, run any scripting resources (python in .dSYMs), call ModulesDidLoad, etc.

I'm looking at the implementation of DynamicLoaderDarwin::FindTargetModuleForImageInfo, and to me it seems like moving the parallelization into a common place (let's say Target::GetOrCreateImages) would add significant complexity.
We can't just make a module_spec and call a method to load it for us; there would have to be either a complicated description of what to load or some kind of back-and-forth communication between GetOrCreateImages and the dynamic loader.
For Darwin:

  1. create a module_spec
  2. check whether there's already an image in the target's images that matches it
  3. if there isn't, and if we can use the shared cache, make a new module_spec and try to load a module with it
  4. if that still fails, try to load it with the original module_spec
  5. if that still fails, try to load it with m_process->ReadModuleFromMemory.

Expressing the same chain of actions in a generalized way might be much more complex than implementing parallelization in each of the dynamic loaders separately.
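
For illustration, the per-image chain above might look roughly like this (all helper functions are hypothetical stand-ins stubbed out for brevity, not lldb's actual API; the real logic lives in DynamicLoaderDarwin::FindTargetModuleForImageInfo):

```cpp
#include <cstdint>
#include <memory>
#include <string>

// Simplified stand-ins for lldb's ModuleSpec / ModuleSP -- illustrative only.
struct ModuleSpec { std::string path; std::string uuid; };
struct Module { ModuleSpec spec; };
using ModuleSP = std::shared_ptr<Module>;

struct ImageInfo { std::string path; std::string uuid; uint64_t load_address; };

// Hypothetical helpers, stubbed out; each corresponds to a step above.
ModuleSP FindExistingImage(const ModuleSpec &) { return nullptr; }   // step 2
bool CanUseSharedCache(const ImageInfo &) { return false; }          // step 3
ModuleSP LoadFromSharedCache(const ModuleSpec &) { return nullptr; } // step 3
ModuleSP LoadFromSpec(const ModuleSpec &) { return nullptr; }        // step 4
ModuleSP ReadModuleFromMemory(uint64_t) { return nullptr; }          // step 5

ModuleSP LoadDarwinImage(const ImageInfo &info) {
  ModuleSpec spec{info.path, info.uuid};                             // step 1
  if (ModuleSP m = FindExistingImage(spec))                          // step 2
    return m;
  if (CanUseSharedCache(info)) {                                     // step 3
    ModuleSpec cache_spec{info.path, info.uuid};
    if (ModuleSP m = LoadFromSharedCache(cache_spec))
      return m;
  }
  if (ModuleSP m = LoadFromSpec(spec))                               // step 4
    return m;
  return ReadModuleFromMemory(info.load_address);                    // step 5
}
```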

@jimingham (Collaborator)

it should definitely be enabled by default when we're at the point to merge it, and the setting should only be a safety mechanism if this turns out to cause a problem for a configuration we weren't able to test.

That's fine with me either way.

I would even put it explicitly under an experimental node (e.g. see target.experimental.inject-local-vars we have currently) so if someone does disable it in their ~/.lldbinit file, and we remove the setting in a year or two, they won't get errors starting lldb, it will be silently ignored.

Sure, but can you please clarify whether it should be named plugin.dynamic-loader.darwin.experimental.enable-parallel-image-load or plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load as you wrote previously? It seems like the first option reads as "'enable-parallel-image-load' is an experimental property of 'plugin.dynamic-loader.darwin'" and the second one as "'dynamic-loader.darwin.enable-parallel-image-load' is an experimental property of 'plugin'".

The experimental node does two things.

One is that the command interpreter treats:

a.b.experimental.c

and

a.b.c

as aliases for one another. That's to support an experimental setting going from experimental to non-experimental without producing errors if people have used the experimental in scripts.

The other is that if the user issues the command:

settings set a.b.experimental.c whatever

and neither a.b.experimental nor a.b.c exist, that is not reported as an error. That's to support commands that we use to guard experimental features, like in this case, so we can remove them later without causing errors in scripts.

So if you did:

plugin.experimental.dynamic-loader.darwin.enable-parallel-image-load

That would mean you would never get errors for misspelling anything under plugin, which doesn't seem like a great idea.

So best practice is to have the experimental node control only the settings that actually are experimental:

plugin.dynamic-loader.darwin.experimental.enable-parallel-image-load
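
For example, with the name this PR eventually adopted, disabling it from ~/.lldbinit would look like the line below, and (per the experimental-node behavior described above) it would be silently ignored if the setting were ever removed:

```
# ~/.lldbinit
settings set plugin.dynamic-loader.darwin.experimental.enable-parallel-image-load false
```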

@DmT021 force-pushed the wp/DynamicLoaderDarwin_parallel_image_load_main branch from d6183d6 to 133bcda on October 11, 2024

github-actions bot commented Oct 11, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@DmT021 force-pushed the wp/DynamicLoaderDarwin_parallel_image_load_main branch from 133bcda to 40b77ca on October 11, 2024
@DmT021 (Contributor, Author) commented Oct 11, 2024

I renamed the setting to plugin.dynamic-loader.darwin.experimental.enable-parallel-image-load and set the default to true.
Updated both PRs.

@DmT021 (Contributor, Author) commented Oct 11, 2024

I was playing with the performance in a couple of different scenarios. For some reason that I haven't looked into, we're getting less parallelism when many of the binaries are in the shared cache in lldb. Maybe there is locking around the code which finds the binary in lldb's own shared cache, so when 9 threads try to do it at the same time, we have additional lock contention.

There is a lock_guard for shared_module_list.m_modules_mutex at the beginning of ModuleList::GetSharedModule. I instrumented loading Slack and that lock_guard is responsible for about 7% of the total time. Most of the time is spent on constructing ConstString.
Here's a flame graph. It's captured with "Record Waiting Threads" enabled so the locks are visible.
lldb_parallel.svg.zip

@jasonmolenda (Collaborator) commented Oct 14, 2024

Thanks for all the work you've done on this, and for updating the setting. I looked over the implementations, and they all look like reasonable changes to me. I did laugh a little when I realized that 2/3rds of all the changes were related to adding the setting :) that's always a bit of boilerplate for the first setting in a plugin.

In a process launch scenario, or when attaching to a process that was launched in a stopped state, there will only be two binaries, dyld and the main app binary, and I don't think there would be much perf benefit from the "preload" approach where we can parallelize the special binaries (dyld, main executable) -- dyld is just a little guy. But when we attach to a launched app, where we have the two special binaries and a thousand others, if we load those two special binaries sequentially, all the other binaries are blocked until a possibly-expensive main binary has been parsed, and I think that's where the real perf difference you were measuring kicked in (you showed a 10.5 second versus 13.4 second time difference for preload versus parallel in the beginning), do I have that right? Or were you attaching to a stopped process with just dyld + the main binary? I can't imagine doing those two in parallel would lead to the savings you measured.

I think the preload approach, where we can parallelize the two special binaries along with all the others in a "fully launched process" attach scenario, is the right choice -- thanks for investigating both of them and presenting both options. Do you prefer the non-preload approach? I know preload makes for a bigger patch, but I can see how the perf benefit of preload could be significant when attaching to a fully launched process, so we're not blocked on parsing the main binary before we begin all the others.

@DmT021 (Contributor, Author) commented Oct 14, 2024

I think that's where the real perf difference you were measuring kicked in (you showed a 10.5 second versus 13.4 second time difference for preload versus parallel in the beginning), do I have that right?

That's right.

Do you prefer the non-preload approach?

No, I think we should stick to a more general approach unless we have to perform the loading itself differently for the special modules. And currently, we don't have any difference between loading them and all the others -- the calls to FindTargetModuleForImageInfo are all the same. So I'm in favor of the preload variant.

@jasonmolenda (Collaborator)

Sounds good, I agree. I marked #110646 as approved; please merge at your convenience. Thanks again for spotting this opportunity for multithreading and seeing it through!

@DmT021 (Contributor, Author) commented Oct 15, 2024

I'm closing this in favor of #110646.

@DmT021 closed this on Oct 15, 2024