Counterintuitive effects of module import location on compilation performance #97239

kamrann · 2024-06-30T22:35:15Z

I recently experienced an extreme improvement and subsequent regression in compilation performance whilst migrating a library to modules. It turned out the key change had little to do with my gradually encapsulating more into modules, but was down to something that I did by accident. Here's a reduced and much simplified repro.

// lots_of_std.h

#include <chrono>
#include <format>
#include <mutex>
#include <thread>
#include <memory>
#include <ranges>
#include <algorithm>
#include <iostream>
#include <functional>
#include <filesystem>

// pmi_x.ixx

module;

#include "lots_of_std.h"

export module x;

// maybe_import_x.h

#ifdef import_x_in_gmf
import x;
#endif

// ip_y_a.ixx

module;

#include "maybe_import_x.h"
#include "lots_of_std.h"

export module y:a;

import x;

// ip_y_b.ixx

module;

#include "maybe_import_x.h"
#include "lots_of_std.h"

export module y:b;

import x;
import :a;

<ip_y_c.ixx, ip_y_d.ixx, ip_y_e.ixx, ip_y_f.ixx as above, each importing the preceding partition>

// pmi_y.ixx

export module y;

export import :a;
export import :b;
export import :c;
export import :d;
export import :e;
export import :f;

So a simple module including some heavy headers in its GMF, then a second module with a linear chain of interface partitions each including the same headers in their GMF, importing the first module, and importing the preceding partition. Each partition also conditionally (via preprocessor switch) imports the first module at the top of its GMF, on top of doing so in its purview.

Compilation times from -ftime-trace below. Note that in the default (fat BMI) mode, each TU generates two JSON files from the time trace; one appears largely insignificant time-wise, so below numbers are from the significant one only. By 'thin BMI', I'm referring to -fexperimental-modules-reduced-bmi.

In addition to the above, compilation time for pmi_x.ixx was constant at around 3.8s for all cases.

BMI sizes were as follows (MB):

	x [default]	x [import-in-gmf]	y partitions [default]	y partitions [import-in-gmf]
fat	30	30	30-32	14
thin	25	25	10	10

Observations:

Most obviously, including the additional import x; statement at the top of the partition GMFs produces a huge improvement in compilation performance. If the import were moved there from the purview then perhaps I might expect some improvement since if I understand right, that would be a semantic change as the declarations would not be passed on to downstream importing partitions. But since it's just being added there as an addition to the import in the purview, this is very counterintuitive to me. When I noticed I'd done this accidentally I wasn't even sure it was legal to import in the GMF. My understanding was that as all code becomes more and more modularized, the goal would be for the GMF to contain fewer and fewer things, so it was super surprising to find that my compilation times got slashed (by 3x in my real use case) by just dumping an additional, seemingly superfluous import statement into the GMF.
With the fat BMI, TU compilation time increases with depth of the TU in the DAG of module partitions. This is also kind of surprising, since there is nothing new being added at each level, and each partition only imports its directly preceding partition, so it's not clear why compiling f should take any longer than b. Note that from looking at the time traces, the increased time is entirely confined within WriteAST; the preceding frontend times barely change. Also, though in the fat BMI case there is a small variance in partition BMI sizes in the default (no import in GMF) case, it appeared somewhat arbitrary - the size was not increasing from a to f.

One other question is why the BMI contains much of anything at all in the above repro since nothing is exported or even declared in the module purviews. Though I'm guessing this must be known and is perhaps just a pending implementation issue.

The text was updated successfully, but these errors were encountered:

llvmbot · 2024-06-30T23:05:54Z

@llvm/issue-subscribers-clang-modules

Author: Cameron Angus (kamrann)

I recently experienced an extreme improvement and subsequent regression in compilation performance whilst migrating a library to modules. It turned out the key change had little to do with my gradually encapsulating more into modules, but was down to something that I did by accident. Here's a reduced and much simplified repro. ``` // lots_of_std.h

#include <chrono>
#include <format>
#include <mutex>
#include <thread>
#include <memory>
#include <ranges>
#include <algorithm>
#include <iostream>
#include <functional>
#include <filesystem>

// pmi_x.ixx

module;

#include "lots_of_std.h"

export module x;

// maybe_import_x.h

#ifdef import_x_in_gmf
import x;
#endif

// ip_y_a.ixx

module;

#include "maybe_import_x.h"
#include "lots_of_std.h"

export module y:a;

import x;

// ip_y_b.ixx

module;

#include "maybe_import_x.h"
#include "lots_of_std.h"

export module y:b;

import x;
import :a;

<ip_y_c.ixx, ip_y_d.ixx, ip_y_e.ixx, ip_y_f.ixx as above, each importing the preceding partition>

// pmi_y.ixx

export module y;

export import :a;
export import :b;
export import :c;
export import :d;
export import :e;
export import :f;

So a simple module including some heavy headers in its GMF, then a second module with a linear chain of interface partitions each including the same headers in their GMF, importing the first module, and importing the preceding partition. Each partition also conditionally (via preprocessor switch) imports the first module at the top of its GMF, on top of doing so in its purview.

Compilation times from `-ftime-trace` below. Note that in the default (fat BMI) mode, each TU generates two JSON files from the time trace; one appears largely insignificant time-wise, so below numbers are from the significant one only. By 'thin BMI', I'm referring to `-fexperimental-modules-reduced-bmi`.

![clang-modules-perf-chart](https://github.com/llvm/llvm-project/assets/2475657/f6a80fff-6860-49c4-bfc6-968d2eec995a)

In addition to the above, compilation time for `pmi_x.ixx` was constant at around 3.8s for all cases.

BMI sizes were as follows (MB):
| | x [default] | x [import-in-gmf] | y partitions [default] | y partitions [import-in-gmf] |
|--------|--------|--------|--------|--------|
| fat | 30 | 30 | 30-32 | 14 |
| thin | 25 | 25 | 10 | 10 |

Observations:
1. Most obviously, including the additional `import x;` statement at the top of the partition GMFs produces a huge improvement in compilation performance. If the import were moved there from the purview then perhaps I might expect some improvement since if I understand right, that would be a semantic change as the declarations would not be passed on to downstream importing partitions. But since it's just being added there as an addition to the import in the purview, this is very counterintuitive to me. When I noticed I'd done this accidentally I wasn't even sure it was legal to import in the GMF. My understanding was that as all code becomes more and more modularized, the goal would be for the GMF to contain fewer and fewer things, so it was super surprising to find that my compilation times got slashed (by 3x in my real use case) by just dumping an additional, seemingly superfluous import statement into the GMF.
2. With the fat BMI, TU compilation time increases with depth of the TU in the DAG of module partitions. This is also kind of surprising, since there is nothing new being added at each level, and each partition only imports its directly preceding partition, so it's not clear why compiling `f` should take any longer than `b`. Note that from looking at the time traces, the increased time is entirely confined within `WriteAST`; the preceding frontend times barely change. Also, though in the fat BMI case there is a small variance in partition BMI sizes in the default (no import in GMF) case, it appeared somewhat arbitrary - the size was not increasing from `a` to `f`.

One other question is why the BMI contains much of anything at all in the above repro since nothing is exported or even declared in the module purviews. Though I'm guessing this must be known and is perhaps just a pending implementation issue.
</details>

ChuanqiXu9 · 2024-07-01T02:30:38Z

Thanks for looking into this. The key point here is that we should try to reduce redeclarations between module units: https://clang.llvm.org/docs/StandardCPlusPlusModules.html#performance-tips. In short, it is worse to include <vector> in every module units than import std;. The actual reason is related with the implementation. And I don't think we can fix that fundamentally. Maybe we can mitigate that, but we can't fix the root problem as far as I see.

We can view this as a suggestion to get rid of things in GMF.

When I noticed I'd done this accidentally I wasn't even sure it was legal to import in the GMF. My understanding was that as all code becomes more and more modularized, the goal would be for the GMF to contain fewer and fewer things, so it was super surprising to find that my compilation times got slashed (by 3x in my real use case) by just dumping an additional, seemingly superfluous import statement into the GMF.

So let's look into this:

module;
#include "a_lot.h"
export module y:c;
import x;
import y:b;

In this case, the compiler can see the GMF from module X, the GMF from y:a and the GMF from y:b. The compiler has no idea for the relationships between the (re)-declarations in the GMFs, so the compiler have to put additional effort to calculate the redeclarations. Then here is the reduced time you observed.

Why import in GMFs can mitigate the impact?

Since import that in the front of including makes the number of declarations in the GMF to go down. But the compiler may still need to put the effort to calculate something. So the best practice may be: https://clang.llvm.org/docs/StandardCPlusPlusModules.html#providing-a-header-to-skip-parsing-redundant-headers

One other question is why the BMI contains much of anything at all in the above repro since nothing is exported or even declared in the module purviews. Though I'm guessing this must be known and is perhaps just a pending implementation issue.

Yeah, this is implementation details and I guess reduced BMI can help here.

In the end, I think the suggestion is at the first: don't (or try to reduce) introduce redeclarations in different module units, or in another word, try to reduce the duplicated include in seperate module units into a standalone module and then try to import that one instead of including the headers all the time.

kamrann · 2024-07-01T14:00:05Z

Thanks for the response and insights.

So first off, perhaps using std includes in the example was a bad idea, since indeed using import std is surely going to be much better. But no doubt this performance issue is general, and there will be many people tying to migrate to modules using non-modularized third party libraries. While it indeed seems that producing a basic module wrapper would be the best approach, I think most people would not think to do this for libraries which they depend on but the implementation of which they perhaps know very little, and will instead default to #includes in the GMF, since in most of the literature that seems to be the canonical example of what to use the GMF for. So it's unfortunate that there is potentially significant performance being left on the table which may have a trivial yet unintuitive fix. Note also that this example is contrary to the advice to avoid including headers after import - in the above case, doing so not only works, but makes things much faster.

If merely duplicating an import from the purview up to the top of the GMF can be such a big win, it would be nice if the compiler could do this automatically. But I guess an import preceding #includes could potentially change the semantics of the included text, so it's not possible for the compiler to assume it would be safe. Which is a pity because in the vast majority of such cases, it's likely that in practice there would be no semantic difference.

In this case, the compiler can see the GMF from module X, the GMF from y:a and the GMF from y:b. The compiler has no idea for the relationships between the (re)-declarations in the GMFs, so the compiler have to put additional effort to calculate the redeclarations

Perhaps this is just a distinction between my understanding of modules from a user's perspective, and clang's implementation details (and I have zero knowledge of how all this is implemented), but I thought the idea with the GMF was that it wasn't propagated to downstream importers. I would have assumed any deduplication of declarations was done for a TU, and then the deduplicated results went into the BMI, such that when the compiler came to compile an importing TU, say y:c above, it wasn't dealing with the GMFs of either y:a or y:b. But from what you're saying (and the timing results) it looks as if it's repeating the same deduplication work at every level?

Anyway I realize there are no doubt major technical difficulties implementing this and I don't have enough knowledge to say anything more specific. I just wanted to bring it up, as it's already a lot of work to migrate to modules and if people are going to encounter significant compilation time regressions while they're part way through the process, many will likely give up and revert, which would be a pity.

ChuanqiXu9 · 2024-07-02T01:53:38Z

Thanks for the response and insights.

So first off, perhaps using std includes in the example was a bad idea, since indeed using import std is surely going to be much better. But no doubt this performance issue is general, and there will be many people tying to migrate to modules using non-modularized third party libraries. While it indeed seems that producing a basic module wrapper would be the best approach, I think most people would not think to do this for libraries which they depend on but the implementation of which they perhaps know very little, and will instead default to #includes in the GMF, since in most of the literature that seems to be the canonical example of what to use the GMF for. So it's unfortunate that there is potentially significant performance being left on the table which may have a trivial yet unintuitive fix.

Yeah, generally agreed. What I tried is to put this in the clang's document.

Note also that this example is contrary to the advice to avoid including headers after import - in the above case, doing so not only works, but makes things much faster.

The reason why we don't suggest to include headers after import is about the potential compiler bugs. It is error prone. And as far as I know, MSVC has similar problems.

If merely duplicating an import from the purview up to the top of the GMF can be such a big win, it would be nice if the compiler could do this automatically. But I guess an import preceding #includes could potentially change the semantics of the included text, so it's not possible for the compiler to assume it would be safe. Which is a pity because in the vast majority of such cases, it's likely that in practice there would be no semantic difference.

Also it is true that, include headers after import may change the semantics. Here are two examples: #96147 and #96423

In this case, the compiler can see the GMF from module X, the GMF from y:a and the GMF from y:b. The compiler has no idea for the relationships between the (re)-declarations in the GMFs, so the compiler have to put additional effort to calculate the redeclarations

Perhaps this is just a distinction between my understanding of modules from a user's perspective, and clang's implementation details (and I have zero knowledge of how all this is implemented), but I thought the idea with the GMF was that it wasn't propagated to downstream importers. I would have assumed any deduplication of declarations was done for a TU, and then the deduplicated results went into the BMI, such that when the compiler came to compile an importing TU, say y:c above, it wasn't dealing with the GMFs of either y:a or y:b. But from what you're saying (and the timing results) it looks as if it's repeating the same deduplication work at every level?

The compiler have to store all the things and distinguish them. Maybe it can be cheaper but it can't be skipped. From the perspective of the specification, the entities in GMFs are reachable but not visible to consumers. So they have to live in the BMI.

Anyway I realize there are no doubt major technical difficulties implementing this and I don't have enough knowledge to say anything more specific. I just wanted to bring it up, as it's already a lot of work to migrate to modules and if people are going to encounter significant compilation time regressions while they're part way through the process, many will likely give up and revert, which would be a pity.

Agreed. Life are never easy. From the perspective of the compiler, we can't do such things automatically since it is not technical correct. But I think it might be helpful to provide a script in the user level to detect such things. I think you can try it if you want.

kamrann · 2024-07-02T13:54:33Z

Sounds good. I'm trying my best to migrate in a way that maintains ability to compile the code without modules, and various combinations in between, so the aim is to eventually be able to run a series of builds with various configurations and generate comparisons of time traces. If I get there, I may have some more useful insights/numbers to provide.

One last comment. Regarding

The reason why we don't suggest to include headers after import is about the potential compiler bugs. It is error prone. And as far as I know, MSVC has similar problems.

I was actually surprised when I came across that documented advice. I initially started the migration using MSVC, but promptly gave up when faced with that problem - since I wanted to migrate bit by bit and have non-modular third party dependencies, guaranteeing include before import was just too difficult. I switched to clang and have had no problem at all in that respect. So from my experience at least, clang has come pretty far along in dealing with that particular issue and maybe it's just some very specific corner cases that remain.

ChuanqiXu9 · 2024-07-03T01:41:13Z

I was actually surprised when I came across that documented advice. I initially started the migration using MSVC, but promptly gave up when faced with that problem - since I wanted to migrate bit by bit and have non-modular third party dependencies, guaranteeing include before import was just too difficult. I switched to clang and have had no problem at all in that respect. So from my experience at least, clang has come pretty far along in dealing with that particular issue and maybe it's just some very specific corner cases that remain.

It is a QoI problem. In clang, we have two different code paths to deal with include before import and import before include and we can't merge them. So here is the root of the problem and this is the reason why we said it is not suggested. And this disconverge makes me think it is problematic. But it is good if we can make it after all.

I'll close this one as it looks all problem are addressed. And I suggest to open such disucssions in LLVM Discourse (https://discourse.llvm.org/) to have a wider visiblity.

Also, if you have new conclusions or codes ready to be public, I'll suggest to contribute to https://arewemodulesyet.org/ to help the ecosystem.

github-actions bot added the new issue label Jun 30, 2024

EugeneZelenko added clang:modules C++20 modules and Clang Header Modules and removed new issue labels Jun 30, 2024

ChuanqiXu9 closed this as completed Jul 3, 2024

EugeneZelenko added the question A question, not bug report. Check out https://llvm.org/docs/GettingInvolved.html instead! label Jul 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Counterintuitive effects of module import location on compilation performance #97239

Counterintuitive effects of module import location on compilation performance #97239

kamrann commented Jun 30, 2024 •

edited

Loading

llvmbot commented Jun 30, 2024

ChuanqiXu9 commented Jul 1, 2024

kamrann commented Jul 1, 2024

ChuanqiXu9 commented Jul 2, 2024

kamrann commented Jul 2, 2024

ChuanqiXu9 commented Jul 3, 2024

Counterintuitive effects of module import location on compilation performance #97239

Counterintuitive effects of module import location on compilation performance #97239

Comments

kamrann commented Jun 30, 2024 • edited Loading

llvmbot commented Jun 30, 2024

ChuanqiXu9 commented Jul 1, 2024

kamrann commented Jul 1, 2024

ChuanqiXu9 commented Jul 2, 2024

kamrann commented Jul 2, 2024

ChuanqiXu9 commented Jul 3, 2024

kamrann commented Jun 30, 2024 •

edited

Loading