We use C++ modules at Waymo, inside the Google monorepo. The Google toolchain team did all the hard work, but we applied it more aggressively than any team I know of. The results have been fantastic, with our largest compilation units getting a 30-40% speedup. It doesn't make a huge difference in a clean build, as that's massively distributed. But it makes an enormous difference for iterative compilation. It also has the benefit of avoiding recompilation entirely in some cases.
Every once in a while something breaks, usually around exotic use of templates. But on the whole we love it, and we'd have to do so much ongoing refactoring to keep things workable without them.
Update: I now recall those numbers are from a partial experiment, and the full deployment was even faster, but I can't recall the exact number. Maybe a 2/3 speedup?
How much of the speedup you're seeing is modules, versus the inherent speedup of splitting and organizing your codebase/includes in a cleaner way? It doesn't sound like your project is actually compiling faster than before, but rather it is REcompiling faster, suggesting that your real problem was that too much code was being recompiled on every change (which is commonly due to too-large translation units and too many transitive dependencies between headers).
This was in place of reorganizing the codebase, which would have been the alternative. I've done such work in the past, and I've found it's a pretty rare skill set to be able to optimize compilation speed. There's just a lot less input for the compiler to look at, as the useless transitive text is dropped.
And to be clear, it also speeds up the original compilation, but that's not as noticeable because when you're compiling zillions of separate compilation units with massive parallelism, you don't notice how long any given file takes to compile.
Are those actually the C++20 modules or clang modules (-fmodules)?
Clang modules. Sorry, didn't realize the distinction!
Clang modules are nothing like what got standardized. Clang modules are basically a cleaned-up and standardized form of precompiled headers, and they absolutely speed up builds; in fact, that is primarily their function.
Were you using pre-compiled headers before?
> Yes, we can. C++20 Modules are usable in a Linux + Clang environment. There are also examples showing that C++20 Modules are usable in a Windows environment with MSVC. I have not yet heard of GCC’s C++20 Modules being used in non-trivial projects.
People keep saying this and yet I do not know of a good example from a real life project which did this which I can test. This seems very much still an experimental thing.
It is just beyond experimental now, and finally in the early-adopter phase. Those early adopters are trying things and trying to develop best practices - which is to say, as always: they will be doing things that future us will laugh at for how stupid they were.
There are still some features missing from compilers, but enough is there that you can target all 3 major compilers and still get most of modules and benefit from them. However, if you do this, remember you are an early adopter and you need to be prepared to figure out the right way to do things - including fixing the things you got wrong once you figure out what is right.
Also, if you are writing a library, you cannot benefit from modules unless you are willing to force all your consumers to adopt modules. This is not reasonable for major libraries used by many, so they will be waiting until more projects adopt modules.
Still modules need early adopters and they show great promise. If you write C++ you should spend a little time playing with them in your current project even if you can't commit anything.
Here: Raytracing in a Weekend, using modules:
https://github.com/pjmlp/RaytracingWeekend-CPP
Also shows how to use static libraries alongside modules.
lol
> can haz real life project?
> sure, here's X in a Weekend
Unfortunately I lack the Office source code to share with you.
I think it is still in a "well, technically it's possible" state. And I fear it'll remain that way for a bit longer.
A while ago I made a small example to test how it would work in an actual project that uses CMake (https://codeberg.org/JulianGmp/cpp-modules-cmake-example). And while it works™, you can't use any compiler-provided modules or header modules. Which means that:

1) you'll need includes for anything from the standard library, no import std

2) you'll also need includes for any third-party library you want to use
When I started a new project recently I was considering going with modules, but in the end I chose against it because I don't want to mix modules and includes in one project.
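To illustrate the kind of mixing I mean, here is a minimal sketch (hypothetical names, CMake wiring omitted): the module interface still pulls the standard library in through the global module fragment, and consumers still write #include next to import.

    // math_utils.cppm - hypothetical module interface unit
    module;                      // global module fragment: plain #includes go here
    #include <vector>            // no `import std;`, so fall back to headers
    #include <numeric>

    export module math_utils;    // the named module starts here

    export double mean(const std::vector<double>& xs) {
        if (xs.empty()) return 0.0;
        return std::accumulate(xs.begin(), xs.end(), 0.0) / xs.size();
    }

    // main.cpp - the consumer mixes both mechanisms
    #include <cstdio>            // standard library: still #include
    #include <vector>
    import math_utils;           // our own code: import

    int main() {
        std::vector<double> xs{1.0, 2.0, 3.0};
        std::printf("mean = %f\n", mean(xs));
    }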
For me it is simply because Qt moc does not support modules yet. This is my last comment from the tracking issue https://bugreports.qt.io/browse/QTBUG-86697
> C++26 reflection has now been voted in. This would get rid of moc entirely, but I really do not see how it will become widely available in the next 5-10+ years. It would require Qt to move to C++26, but only once compiler support is complete for all 3 compilers AND the older Linux distros that ship those compilers. For example, MSVC still has no native C++23 flag (in CMake it gets internally altered to C++latest, i.e. C++26), because they told me they will only enable it once it is considered 100% stable. So I guess we need to add modules support to moc now; waiting another 10 years is not an option for me.
We are building a document rendering tool using them. It’s a pretty large project, and there have been some really good improvements in Clang’s implementation of C++20 modules in the past few versions.
https://github.com/odoo/paper-muncher/blob/main/src/main.cpp
Did you create your own dialect of C++ along the way? I see co_try$(...) and co_trya$(...) whose definitions I can't find, and I assume they're macros that work either with or without coroutines... did you measure the performance overhead with coroutines?
These macros are defined in our framework. The coroutine code is not that hot, so we didn't measure the overhead.
https://github.com/skift-org/karm
I'd like to understand your build tool. I see references to 'cutekit'. Is this it: https://pypi.org/project/cutekit/
Very cool.
Yes it is
Looking at the examples of using Clang (I use GCC btw.):

Why is something that is supposed to make things easy and secure so complicated? I'm used to:

    g++ -o hello hello.cpp

It can use headers, or it doesn't use headers. It doesn't matter; that's the decision of the source file. To be fair, the option -std=c++20 probably won't be necessary in the future.

I recommend skimming over this issue from Meson:
https://github.com/mesonbuild/meson/issues/5024
The last few blog posts from one of Meson's developers provide some insight into why Meson doesn't support modules yet:
https://nibblestew.blogspot.com/
> Why is something that is supposed to make things easy and secure so complicated?
>
> I'm used to:
>
> g++ -o hello hello.cpp
That is simple because C++ inherited C's simplistic, primitive, and unsafe compilation and abstraction model of brute-force textual inclusion. When you scale this to a large project with hundreds of thousands of translation units, every command-line invocation becomes a huge soup of flags, and plain Makefiles become intractable.
Almost all other reasonably-recent programming languages have all of the following:
- a strong coupling of dependency management, building, installation, and publishing tools
- some description of a directed acyclic graph of dependencies, whether it be requirements.txt, cargo.toml, Maven, dotnet and Nuget .csproj files, Go modules, OPAM, PowerShell gallery, and more
- some way to describe the dependencies within the source code itself
C++20 modules are a very good thing, and enforce a stronger coupling between compiler and build tool; it's no longer just some weekend project chucking flags at g++/clang++/cl.exe but analysing source code, realising it needs a, b, c, x, y, z modules, ensuring those modules are built and export the necessary symbols, and then compiling the source at hand correctly. That is what `clang-scan-deps` does: https://clang.llvm.org/docs/StandardCPlusPlusModules.html#di...
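For a rough idea of what that scanning step looks like in practice (a sketch based on my reading of those docs - double-check the exact flags against your Clang version):

    # Emit P1689-format module dependency info for one TU; the build
    # system consumes this to decide which BMIs to build first.
    clang-scan-deps -format=p1689 -- clang++ -std=c++20 -c Hello.cppm -o Hello.o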
I concede there are two problems with C++20 modules: first, we didn't have a working and correct compiler implementation before the paper was accepted into C++20; and second, the binary module interface (BMI) format is not fixed, so BMIs aren't (yet) portable across compilers.
The Meson developer is notorious for stirring the pot with respect to both the build system competition and C++20 modules. The Reddit thread on his latest blog post provides a searing critique of why he is badly mistaken: https://www.reddit.com/r/cpp/comments/1n53mpl/we_need_to_ser...
Universal source package management would have been better time spent.
This doesn’t solve any problem that wasn’t self-inflicted.
I agree with your point about needing a working implementation before the paper was accepted; this is why C++ is a mess and will never be cleaned up. I love C++, but man, there are plenty of things like this.
Thank you.
Your comparison is not apples to apples, as in practice you would have a separation of compilation and linking in your non-modules example. The status quo is more like:
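Something along these lines, i.e. an explicit compile step plus a link step even without modules (illustrative flags, not meant to be exact):

    g++ -std=c++20 -c hello.cpp -o hello.o   # compile the translation unit
    g++ hello.o -o hello                     # link the executable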
With that said, the modules command is mostly complex due to the -fmodule-file=Hello=Hello.pcm argument.

When modules were being standardized, there was a discussion on whether there should be any sort of implicit mapping between modules and files. This was rejected, so the build system must supply the information about which module is contained in which file. The result is more flexible, but also more complex and maybe less efficient.
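Concretely, the Clang flow from the documentation linked elsewhere in the thread looks roughly like this (treat the exact flags as approximate; they have shifted between Clang versions):

    clang++ -std=c++20 Hello.cppm --precompile -o Hello.pcm                        # build the module's BMI
    clang++ -std=c++20 use.cpp -fmodule-file=Hello=Hello.pcm Hello.pcm -o hello    # compile the consumer and link

That -fmodule-file=Hello=Hello.pcm argument is exactly the module-name-to-file mapping the build system has to supply.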
Thanks :)
> I'm used to:
>
> g++ -o hello hello.cpp
That is an unfair comparison. Nobody uses the direct compiler command-line approach to compile anything more than a simple hello-world test project.
have you heard of linux?
Because hello is so simple, you don't need complicated. When you are doing something complicated, though, you have to accept that it is complicated. I could concatenate all 20 million lines of C++ (round number) that I work with into one file, and building would be as simple as your hello example - but that simple build comes at great cost (you try working with a 20-million-line file, and then merging it with changes someone else made), and so I'm willing to accept more complex builds.
Thank you. That's right, and that's where issues usually arise and tools get challenged. If the hello-world case already starts out that complicated, I'm going to be careful.
I'm eager to gather info, but the weak spots of headers (and macros) are obvious. I'll probably hold a waiting position for an undefined time, at least as long as Meson doesn't support them.
Wikipedia contains false info about the tooling: https://en.wikipedia.org/wiki/Modules_(C%2B%2B)#Tooling_supp...
Meson doesn't support modules as of 2025-09-11.
PS: I'm into new stuff when it looks stable and the benefits are obvious. But this looks complicated, and backing out of complicated stuff is painful when it becomes necessary.
> If the hello-world case already starts out that complicated, I'm going to be careful.
If the tool is intended for complex things, I'm not sure I agree. It is nice when hello is simple, but if you can make the complex cases a little easier at the expense of making the simple things nobody does anyway a bit harder, I don't know if I care. (Note that the example needed the -o parameter on the command line - gcc doesn't have a good default... maybe it should?)
> The data I have obtained from practice ranges from 25% to 45%, excluding the build time of third-party libraries, including the standard library.
> Online, this number varies widely. The most exaggerated figure I recall is a 26x improvement in project compilation speed after a module-based refactoring.
> Furthermore, if a project uses extensive template metaprogramming and stores constexpr variable values in Modules, the compilation speed can easily increase by thousands of times, though we generally do not discuss such cases.
> Apart from these more extreme claims, most reports on C++20 Modules compilation speed improvements are between 10% and 50%.
I'd like to see references to those claims and experiments, the size of the codebase, etc. I find the figures hard to believe, since the bottleneck in large codebases is not compute (e.g. header preprocessing) but memory bandwidth.
> since the bottleneck in large codebases is not compute (e.g. header preprocessing) but memory bandwidth.
SSD bandwidth: 4-10 GB/s. RAM bandwidth: 5-10x that, say 40 GB/s.
If compute was not a bottleneck, the entire linux kernel should compile in less than 1 second.
This is making the assumption that source is read once and that there is no intermediate data to write and read. Unless the working set fits in cache, you'll have I/O and can be I/O bound.
On a 40-core or 64-core machine there's more compute than you will ever need for a compilation process. Compilation is a heavy I/O workload, not a heavy compute workload, in most of the cases where it actually matters.
Linux is ~1.5GB of source text and the output is typically a binary less than 100MB. That should take a few hundred milliseconds to read in from an SSD or be basically instant from RAM cache, and then a few hundred ms to write out the binary.
So why does it take minutes to compile?
Compilation is entirely compute bound; the inputs and outputs are minuscule, on the order of megabytes for typical projects - maybe gigabytes for multi-million-line projects, but that is still only a second or two from an SSD.
I don't build Linux from source, but in my tests with large machines (and my C++ work project with more than 10 million lines of code), somewhere between 40 and 50 cores compile speed starts decreasing as you add more cores. When I moved my source files to a ramdisk the speed got even worse, so I know disk I/O isn't the issue (there was a lot of RAM on this machine, so I don't expect to run low on RAM even with that many cores in use). I don't know how to find the truth, but all signs point to memory bandwidth being the issue.
Of course the above is specific to the machines I did my testing on. A different machine may have other differences from my setup. Still my experience matches the claim: at 40 cores memory bandwidth is the bottleneck not CPU speed.
Most people don't have 40+ core machines to play with, and so will not see those results. The machines I tested on cost > $10,000 so most would argue that is not affordable.
One of the biggest reasons people see so much compilation speed improvement on Apple M chips is the massive memory bandwidth improvement compared to other machines, even some older servers: ~100 GB/s of main-memory bandwidth from a single core. It doesn't scale linearly as you add more and more cores to the workload (due to L3 contention, I'd say), but it goes up to ~200 GB/s IIRC.
> So why does it take minutes to compile?
I’m not claiming anything about it being I/O or compute bound, but you are missing some sources of I/O:
- the compiler reads many source files (e.g. headers) multiple times
- the compiler writes and then reads lots of intermediate data
- the OS may have to swap out memory
Also, there may be resource contention that makes the system do neither I/O nor compute for part of the build.
Tried building sqlite amalgamation just now.
Input: single .c file 8.5MB.
Output: 1.8MB object file.
Debug build took 1.5s.
Release build (O2) took about 6s.
That is about 3 orders of magnitude slower than what this machine is capable of in terms of I/O from disk.
The fact that something doesn’t scale past X cores doesn’t mean that it is I/O bound! For most C++ toolchains, any given translation unit can only be compiled on a single core. So if you have a big project, but there’s a few files that alone take 1+ minute to compile, the entire compilation can’t possibly take any less than 1 minute even if you had infinite cores. That’s not even getting into linking, which is also usually at least partially if not totally a serial process. See also https://en.m.wikipedia.org/wiki/Amdahl%27s_law
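Spelling out the law for this case: with p the parallelizable fraction of the build and N the number of cores,

    speedup(N) = 1 / ((1 - p) + p / N)

so a 10-minute serial build with one unavoidable 1-minute translation unit (p = 0.9) tops out at 10x no matter how many cores you add.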
The output may be 100 MB, but the process of compilation accumulates magnitudes more data. Evidence is the constant memory pressure you see on 32 GB, 64 GB, or even 128 GB systems. Given that compilation on even such high-end systems takes a non-trivial amount of time, tens of minutes, how much data do you think bounces in and out of memory? It adds up to a lot more than what you suggest.
This is just wildly wrong.
On an older 2 socket workstation, with relatively poor memory bandwidth, I ran a linux kernel compile.
A topdown analysis with perf indicates that memory bandwidth is not a bottleneck; fetch latency, branch mispredicts and the frontend are.

I also measured the memory bandwidth during the build, and it never gets anywhere close to the bandwidth the system can trivially utilize (it barely reaches the bandwidth a single core can utilize). iostat indicates there are pretty much no reads/writes happening on the relevant disks.
Every core is 100% busy.
> This is just wildly wrong.
Indeed! Compilation is notorious for being a classic pointer-chasing load that is hard to brute-force, and a good way to benchmark overall single-thread core performance. It is more likely to be memory latency bound than memory bandwidth bound.
It is not wildly wrong; please be more respectful, since I am speaking from my own experience. Nowhere in my comment have I used the Linux kernel as an example. It's not a great example either, since it's mostly trivial to compile in comparison to the projects I have experience with.
A core can be 100% busy, but since, as I see, you're a database kernel developer, you must surely know that this can be an artifact of stalls in the memory backend of the CPU. I rest my case.
> Nowhere in my comment have I used the Linux kernel as an example. It's not a great example either, since it's mostly trivial to compile in comparison to the projects I have experience with.
It's true across a wide range of projects. I build a lot of stuff from source and I routinely look at performance counters and other similar metrics to see what the bottlenecks are (I'm almost clinically impatient).
Building e.g. LLVM, a project with much longer per-translation-unit build times, shows that memory bandwidth is even less of a bottleneck, whereas fetch latency grows as one.
> A core can be 100% busy, but since, as I see, you're a database kernel developer, you must surely know that this can be an artifact of stalls in the memory backend of the CPU. I rest my case.
Hence my reference to doing a topdown analysis with perf. That provides you with a high-level analysis of what the actual bottlenecks are.
Typical compiler work (with typical compiler design) has lots of random memory accesses. Due to access latencies being what they are, that prevents you from actually doing enough memory accesses to reach a particularly high memory bandwidth.
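Back-of-the-envelope, assuming ~100 ns per dependent cache miss and 64-byte cache lines (round numbers, not measurements):

    1 miss / 100 ns   ->  ~10 million misses per second per core
    10M * 64 bytes    ->  ~640 MB/s of DRAM traffic from a purely pointer-chasing core

That is roughly two orders of magnitude below what a modern memory controller can deliver, which is why a latency-bound workload never shows up as saturated bandwidth.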
How many cores on that workstation? The claim is you need 40 cores to observe that - very few people have access to such a thing - they exist, but they are expensive.
That workstation has 2x10 cores / 20 threads. I also executed the test on a newer workstation with 2x24 cores with similar results, but I thought the older workstation is more interesting, as the older workstation has a much worse memory bandwidth.
Sorry, but compilation is simply not memory bandwidth bound. There are significant memory latency effects, but bandwidth != latency.
I doubt you can saturate the bandwidth with a dual-socket configuration of only 10 cores per socket. Perhaps if you had very recent cores, which I believe you don't, but Intel's design hasn't been that good. What you're also measuring in your experiment, and what needs to be taken into account, is the latency across NUMA nodes, which is ridiculously high, 1.5x to 2x that of the local node, usually amounting to ~130ns. Because of this, in NUMA configurations you usually need more (Intel) cores to saturate the bandwidth. I know because I have one sitting at my desk. Memory bandwidth saturation usually begins at ~20 cores with Intel designs that are roughly ~5 years old. I might be off with that number, but it's roughly something like that. Other cores, if you have them burning cycles, are just sitting there waiting in line for the bus to become free.
At 48 cores you are right about at the point where memory bandwidth becomes the limit. I suspect you are over the line, but by so little that it is impossible to measure with all the other noise. Get a larger machine and report back.
On the 48 core system, building linux peaks at about 48GB/s; LLVM peaks at something like 25GB/s.
The system has well over 450GB/s of memory bandwidth.
> On the 48 core system, building linux peaks at about 48GB/s; LLVM peaks at something like 25GB/s
The LLVM peak is suspiciously low, since building LLVM is heavier than the kernel? Anyway, on my machine, which is a dual-socket 2x22-core Skylake-X, for a pure release build without debug symbols (less memory pressure), I get ~60GB/s.
For a release build with debug symbols, which is much heavier and is what I normally use during development (so my experience is probably more biased towards that workload), it is >50% larger - ~98GB/s. I repeated the experiment with the Linux kernel, and I get almost the same figure as you do - ~48GB/s.

Now, this was the accumulated peak, but I was also interested in the single highest read/write bandwidth measured. For an LLVM/clang release build with debug symbols I get ~32GB/s write bandwidth and ~52GB/s read bandwidth. This is btw very close to what my socket can handle: store bandwidth is ~40GB/s, load bandwidth is ~80GB/s, and combined load-store bandwidth is ~65GB/s.

So I think it is not unreasonable to say that there are compiler workloads that can be limited by memory bandwidth. I have for sure worked with codebases even heavier than LLVM, and even though I did not do measurements back then, my gut feeling was that the bandwidth was being consumed. Some translation units would literally sit there "compiling" for a few minutes with no apparent progress.
I agree that random memory access patterns, and the latency those patterns incur, are also a cost that needs to be added to this cost function.
My initial comment on this topic was that I don't really believe the bottleneck in compilation of larger codebases (not on _any_ given machine, of course) is on the compute side, and therefore I don't see how modules are going to fix any of this.
> I'd like to see references to those claims and experiments, the size of the codebase, etc. I find the figures hard to believe, since the bottleneck in large codebases is not compute (e.g. header preprocessing) but memory bandwidth.
Edit: I think I misunderstood what you meant by memory bandwidth at first?

Modules reduce the amount of work done by the compiler in parsing and interpreting C++ code (think constexpr). Even if your compilation infrastructure is constrained by RAM access, modules replace a compute- and RAM-heavy part with the trivial work of loading a module into compiler memory, so it's a win.
> I find the figures hard to believe, since the bottleneck in large codebases is not compute (e.g. header preprocessing) but memory bandwidth.
source? language? what exactly does memory bandwidth have to do with compilation times in your example?
Chill out. The compiler is a heavily multithreaded program that utilizes all of the cores in the C and C++ compilation model. Since each thread is doing work, it will obviously also consume memory, no? Computing 101. The total amount of data being touched (read/write) we call a dataset. The dataset for larger codebases does not fit into the cache. When the dataset does not fit into the cache, the data starts to live in main memory. Accessing data in main memory consumes the memory bandwidth of the system. Try running 64 threads on a 64-core system touching data in memory and you will see for yourself.
Compilers are typically not multithreaded. LLVM certainly isn't, although its linker is. C++ builds are usually many single-threaded compilation processes running in parallel.
You're nitpicking; that's what I meant. Many processes in parallel or many threads in parallel - the former will achieve better utilization of memory. Regardless, it doesn't invalidate what I said.
I was going to reply directly to you; but the re-reply is fine. I don't think your conclusion is wrong, but your analysis is bogus AF. Compiler transforms are usually strongly superpolynomial (quadratic or cubic or some NP-hard demon); a Knuth fast pass is going to traverse the entire IR tree under observation. The thing is, the IR tree under observation is usually pretty small; while it won't fit in the localest cache, it's almost certainly not in main memory after the first sweep. Subsequent trees will be somewhere in the far reaches of memory... but there's an awful lot of work between fetching trees.
The two main parts of a typical C++ compiler are the front-end, which handles language syntax and semantic analysis, and the back-end, which handles code generation. C++ makes it difficult to implement the front-end as a multithreaded program because it has context‑sensitive syntax (as does C). The meaning of a construct can change depending on whether a name encountered during parsing refers to an existing declaration or not.
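The textbook illustration, for anyone who hasn't hit it:

    // If T names a type at this point, this declares p as a pointer to T.
    // If T names a variable, it is the expression T multiplied by p.
    // The parser cannot decide without the symbol table built so far.
    T * p;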
As a result, parsing and semantic analysis cannot be easily divided into independent parts to run in parallel, so they must be performed serially. A modern implementation will typically carry out semantic analysis in phases, for example binding names first, then analyzing types, and so on, before lowering the resulting representation to a form suitable for code generation.
Generally speaking, declarations that introduce names into non‑local scopes must be compiled serially. This also makes the symbol table a limiting factor for parallelism, since it must be accessed in a mutually exclusive manner. _Some_ constructs can be compiled in parallel, such as function bodies and function template instantiations, but given that build systems already implement per‑translation‑unit parallelism, the additional effort is often not worthwhile.
In contrast, a language like C# is designed with context‑free syntax. This allows a top‑level fast parse to break up the source file (there are no #include's in C#) into declarations that can, in principle, be processed in parallel. There will still be dependencies between declarations, and these will limit parallelism. But given that C# source files are a tiny fraction of the size of a typical C++ translation unit, even here parallel compilation is probably not a big win.
The C++ back-end can take advantage of multithreading far more than the front end. Once global optimizations are complete, the remaining work can be queued in parallel for code generation. MSVC works in exactly this way and provides options to control this parallelism. However, parallelism is limited by Amdahl’s Law, specifically the need to read in the IR generated by the front-end and to perform global optimizations.
He's not nitpicking at all. Every workload is different. How do you even know the compiler is memory bound like you say it is? You're espousing general wisdom that doesn't apply in specific cases.
It isn't memory utilization, it is bandwidth. The CPU can only get so many bytes in and out of main memory and only has so much cache. Eventually the cores are fighting each other for access to the main memory they need. There is plenty of memory in the system; the CPU just can't get at enough of it.
NUMA (non-uniform memory access - basically give each CPU a separate bank of RAM, and if you need something that is in the other CPU's bank of RAM you have to ask the other CPU) exists because of this. I don't have access to a NUMA machine to see how they compare. My understanding (which could be wrong) is that OS designers are still trying to figure out how to use them well, and they are not expected to do well for all problems.