[SR-9223] Inliner exhibits slow compilation time with a large static array #51712
Please send me a test case.
Comment by Andreas Wendleder (JIRA) The Makefile is the test case. It basically generates a Swift file that looks like this:
Comment by Andreas Wendleder (JIRA) Here are some timings and a diagram that shows quadratic behaviour: https://docs.google.com/spreadsheets/d/16VakcOYvcwZbmPRLxO_1HcLUwCx0xEVT5mXWYsYnQjY/edit?usp=sharing which I think comes from SILInstruction.cpp +88:

```cpp
// Update the parent fields in the instructions.
for (; first != last; ++first)
  first->ParentBB = ThisParent;
```

I have a static array with more than 50,000 elements which takes half an hour to compile. With C++ it takes 0.3 seconds.
That's right. Merging two basic blocks is actually O(n) because of the parent pointer update that happens. I changed the SILCloner API to simplify cloning block terminators, then worked around it in the inliner by merging the block, but that was a mistake. Now I'm adding back the ability of the SILCloner to insert cloned instructions in the middle of a block.

PR: #20630
Comment by Andreas Wendleder (JIRA) Better but still quadratic, see the updated diagram at http://bit.ly/2Dqiwx1. This time the culprit is MandatoryInlining.cpp:535. Here is the stack trace:

I think instead of repeatedly splitting the existing block, one could do something like this:

```cpp
for (auto BI = F->begin(), BE = F->end(); BI != BE; ++BI) {
  // Every instruction is copied and every callee is cloned into this basic block:
  auto BB = F->createBasicBlock();
  for (auto II = BI->begin(), nextI = II; II != BI->end(); II = nextI) {
    if (Callee)
      inlineFunctionInto(Callee, BB);
    else
      copyInstructionInto(BB);
  }
  replace(BI, BB);
  delete(BB);
}
```

It uses a little more memory but is O(n).
That's right. I deviated from the original goal last night and explored making the CFG more canonical in general. I think that will be a good thing, but back to your problem... I see two solutions:

1. Reintroduce SILInliner's special case for single-block callee functions to avoid splitting. This simply skips the `split` operation when it detects a single-block callee. This solves the problem of static array initialization. But the complexity bleeds into the SILCloner and SILInliner, which need to handle this special case all over the place. More importantly, it doesn't solve the problem that the SILInliner is quadratic in general (and always has been). I think what you're showing above is an alternate implementation of this special case where we emit into a fake block, then move the cloned instructions out of the block after inlining.

2. Run the SILInliner in reverse. See PR: #20630. This simplifies everything quite a bit. I just got this building, haven't tested or profiled it yet, and still need to figure out how to create a compile-time test case for the quadratic behavior.
Comment by Andreas Wendleder (JIRA) Cool. This solves my problem with the test case, see the above updated diagram (except for a SEGV with more than 70,000 elements). I'm still compiling my "real" program and am thinking of the best solution. Thanks, Andreas
Comment by Andreas Wendleder (JIRA) I can now also confirm that this fixes my application for non-optimising builds. Compilation times went from 9 minutes to 13 seconds. So for me this issue is solved, except for two things:

Thanks anyway for your time and dedication, Andreas
Comment by Andreas Wendleder (JIRA) Two additional thoughts:
> The SEGV for large matrices.

Does this refer to another bug? The RLE bug is the only one I saw.

> I still think copying/inlining into a new basic block would be the simpler solution, but that's up to you.

The problem isn't the inlined instructions, it's the instructions after the call in the same block. In general, they all need to be moved to a new block, which means adjusting all their parent pointers. Those moved instructions may themselves be calls that need to be inlined, so we end up moving each original instruction number-of-calls times. Inlining in reverse means we never need to revisit an instruction that was already moved.
Comment by Andreas Wendleder (JIRA)

> Does this refer to another bug? The RLE bug is the only one I saw.

Nope, I just applied your patch and tried several element sizes. Up to 70,000 everything was fine; after that I got a SEGV. No optimisation, no RLE at all.
Comment by Andreas Wendleder (JIRA)

> The problem isn't the inlined instructions, it's the instructions after the call in the same block. In general, they all need to be moved to a new block which means adjusting all their parent pointers. Those moved instructions may themselves be calls that need to be inlined, so we end up moving each original instruction number-of-calls times. Inlining in reverse means we never need to revisit an instruction that was already moved.

Reversing is fine in my case. I still think allocating a new basic block, copying all instructions there, and inlining all functions there might simplify the overall logic. But that's just a thought. I am a) not a compiler engineer and b) not too familiar with the Swift code base, so YMMV. 🙂
graydon (JIRA User) I'd like to add a scale test for the mandatory inlining of a giant array initialization. Quadratic behavior arises when the inliner splits basic blocks. Splitting a block visits all subsequent instructions and adjusts their parent block pointer. See:

```cpp
// Update the parent fields in the instructions.
for (; first != last; ++first)
  first->ParentBB = ThisParent;
```

That seems fairly intrusive. I could add a separate loop in the inliner to count the number of instructions after each split point. That seems wasteful. Can you think of a better way to verify that mandatory inlining isn't quadratic?
Comment by Andreas Wendleder (JIRA) This is the stack at the SEGV:
This is the SILGen assert for the initialization of 70k array elements:

```
Assertion failed: (params.size() == labels.size()), function relabelParams, file /s/sown/swift/lib/AST/ASTContext.cpp, line 3784.
```

This should be filed as a separate bug against SILGen. Someone may have thought it was ok to use 16 bits for a param index.
Comment by Andreas Wendleder (JIRA) I created #51762 for this.
Comment by Graydon Hoare (JIRA) @atrick I'm not sure what you mean by intrusive; are you concerned with the overhead of bumping a counter during that loop? |
graydon (JIRA User) I'm not too concerned about adding a pointer increment if that's really the best way to do it. By intrusive I mean:
Still, I think it would be worth it to have a test case. I just want to make sure it's the best way to do it.
Comment by Andreas Wendleder (JIRA) It's getting worse. Timings with Swift 5:

- 5,000 elements: 7.5 seconds
- 10,000 elements: 34 seconds
- 15,000 elements: 81 seconds
Comment by Andreas Wendleder (JIRA) Performance numbers with the latest snapshot from 2019-04-19 with fixes from @eeckstein:

- 5,000 elements: 1.4 seconds
- 10,000 elements: 5.5 seconds
- 15,000 elements: 12.2 seconds
- 20,000 elements: 18.9 seconds

Better but still quadratic.
Comment by Andreas Wendleder (JIRA) Swift 5.0.1 on Linux: 10,000 elements: 16 seconds.
Comment by Andreas Wendleder (JIRA) Swift 5.1 macOS: 10,000 elements: 11 seconds.
Comment by Andreas Wendleder (JIRA) Swift 5.1.1 on Linux: 10,000 elements: 3.6 seconds. Perf trace:
Comment by Andreas Wendleder (JIRA) Swift 5.2.4 Linux: 10,000 elements: 4.4 seconds; 15,000 elements: 10.1 seconds.
Swift 5.7 Linux: 15,000 elements: 0.8 seconds. cc @eeckstein @atrick @gottesmm Updated test case:
Swift 5.8 Linux: 15,000 elements: 0.9 seconds.
Is this still an issue with the optimization passes?
Depends on whether you find 11.3 seconds acceptable. C++ (like every other language I tried) is under 0.5 seconds.
No, definitely not acceptable. I was just wondering if you had verified if/how the distribution had changed.
I ran perf again:
Looks like a hash collision, maybe a better AliasCacheKey? @AnthonyLatsis @eeckstein @atrick |
Current stats make this a duplicate of #50243. Closing in favor of the older issue. |
Attachment: Makefile
Environment
Makefile
Additional Detail from JIRA
md5: c44b8d1ce0417361133fc33640032956
Issue Description:
Hi!
Please see the attached Makefile and perf report. This time it's the inliner (and verifier).
With 5,000 elements, compiling takes less than 3 seconds; with 10,000, more than 15 seconds; with 15,000, 37 seconds.
It seems that commit bd28b0e (SILCloner and SILInliner rewrite) is the culprit.
The functions inlined are from the builtin `UInt` and `Array` classes.