It's an interesting exploration of ideas, but there are some issues with this article. Worth noting that it does describe its approach as "simple and naive", so take my comments below as corrections and/or pointers into the practical and complex issues on this topic.
- The article says adjacency matrices are "usually dense", but that's not true at all; most graphs are sparse to very sparse. In a social network with billions of people, the average out-degree might be 100. The internet is another example of a very sparse graph: billions of nodes, but most nodes have at most one or two direct connections.
- Storing a dense matrix means it can only work with very small graphs; a graph with one million nodes would require one million squared (10^12) memory elements, which is not possible.
- Most of the elements in the matrix would be "zero", but you're still storing them, and when you do matrix multiplication (one step in a BFS across the graph) you're still wasting energy moving, caching, and multiplying/adding mostly zeros; see the sketch after this list. It's very inefficient.
- Minor nit: it says the diagonal is empty because nodes are already connected to themselves. That isn't correct in theory; self-edges are definitely a thing. There's a reason the main diagonal is called "the identity".
- Not every graph algebra uses the numeric "zero" as its zero: for tropical (min/max) algebras the additive identity is positive/negative infinity, and zero is a valid value in those algebras.
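To make the waste concrete, here's a minimal sketch (my own illustration, not the article's code) of one BFS step as a boolean matrix-vector product over a dense, flat-stored adjacency matrix:

```cpp
#include <cstddef>
#include <vector>

// One BFS step over a dense n x n adjacency matrix stored flat
// (A[j * n + i] != 0 means an edge j -> i). next[i] becomes true if
// any frontier node has an edge to i. Each active node still scans a
// full row of n entries, which are mostly zeros in a sparse graph.
std::vector<bool> bfs_step(const std::vector<char>& A, std::size_t n,
                           const std::vector<bool>& frontier) {
    std::vector<bool> next(n, false);
    for (std::size_t j = 0; j < n; ++j)          // every potential source
        if (frontier[j])
            for (std::size_t i = 0; i < n; ++i)  // full row scan, mostly zeros
                if (A[j * n + i])
                    next[i] = true;
    return next;
}
```

Even with the frontier check short-circuiting inactive rows, each step is O(active nodes x n) work regardless of how few edges actually exist.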
I don't mean to diss the idea; it's a good way to dip a toe into the math and computer science behind algebraic graph theory. But in production, or for anything but the smallest (and densest) graphs, a sparse graph algebra library like SuiteSparse would be the most appropriate.
SuiteSparse is used in MATLAB (A .* B calls SuiteSparse), FalkorDB, python-graphblas, OneSparse (a Postgres extension), and many other libraries. Its author, Tim Davis of TAMU, is a leading expert in this field of research.
(I'm a GraphBLAS contributor and author of OneSparse)
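Since I'm plugging it anyway, here's roughly what a level-synchronous BFS looks like with the GraphBLAS C API (a sketch from memory, so check the SuiteSparse docs for exact signatures):

```cpp
#include <GraphBLAS.h>

// Sketch: level-synchronous BFS. Each GrB_vxm advances the frontier one
// hop over the boolean (OR, AND) semiring; the complemented `visited`
// mask plus the "replace" descriptor drop already-seen nodes.
void bfs(GrB_Matrix A, GrB_Index n, GrB_Index source) {
    GrB_Vector frontier, visited;
    GrB_Vector_new(&frontier, GrB_BOOL, n);
    GrB_Vector_new(&visited, GrB_BOOL, n);
    GrB_Vector_setElement_BOOL(frontier, true, source);

    GrB_Index active = 1;
    while (active > 0) {
        // visited<frontier> = frontier: mark the current frontier as seen.
        GrB_Vector_assign(visited, frontier, NULL, frontier, GrB_ALL, n, NULL);
        // frontier<!visited, replace> = frontier * A over (OR, AND).
        GrB_vxm(frontier, visited, NULL, GrB_LOR_LAND_SEMIRING_BOOL,
                frontier, A, GrB_DESC_RC);
        GrB_Vector_nvals(&active, frontier);
    }
    GrB_Vector_free(&frontier);
    GrB_Vector_free(&visited);
}
```

The point of the semiring abstraction is that swapping the boolean semiring for a min-plus (tropical) one turns the same loop into a shortest-paths sweep.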
You seem to do a lot of work on sparse graphs, as most people do, but if you re-read the opening line carefully:
> In graph theory, an adjacency matrix is a square matrix used to represent a finite (and usually dense) graph.
Many of these issues evaporate on the realization that the article sets out to speak to the use of adjacency matrices for dense graphs, which it's trying to point out are the ones you'd commonly use an adjacency matrix for, rather than claiming that all graphs are dense so you should always use an adjacency matrix.
E.g. a dense graph of 1,000,000 nodes would usually be considered "a pretty damn large dense graph", and so on. These are probably good things to have mentioned here, though, as pulling an article about adjacency matrices into a conversation without the context of why you'd use one in the first place can lead to bad conclusions.
You're right, I did read the article before commenting, but I see your point that I didn't completely understand the intent.
This is very much a nitpick, but a million million bits is about 116 GiB, and you can squeeze 192GB of RAM into a desktop these days, let alone a workstation or a server. (Even if mdspan forces one byte per element, you can still fit a million^2 elements into a server.)
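For anyone double-checking the arithmetic:

```cpp
#include <cstdio>

int main() {
    // One bit per cell of a 10^6 x 10^6 dense boolean adjacency matrix.
    constexpr double bits = 1e6 * 1e6;                   // 10^12 bits
    constexpr double gib  = bits / 8 / (1024.0 * 1024 * 1024);
    std::printf("%.1f GiB\n", gib);                      // prints 116.4
}
```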
Fair enough, showing my age with "impossible".
But it's still true that dense storage grows quadratically, not linearly, in the number of nodes.
One important thing to keep in mind when using std::mdspan: There is no stable version of GCC with official support. Not even version 15.2. You need to use the latest trunk. I discovered this after I had already written a significant amount of code using std::mdspan that compiled in Clang and MSVC.
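One workaround is to key off the feature-test macro and fall back to the Kokkos reference implementation; a sketch (the Kokkos header path and namespace may differ between versions):

```cpp
#include <version>  // pulls in the library feature-test macros

#if defined(__cpp_lib_mdspan) && __cpp_lib_mdspan >= 202207L
  #include <mdspan>
  namespace md = std;
#else
  #include <mdspan/mdspan.hpp>  // Kokkos reference implementation
  namespace md = Kokkos;        // namespace may vary by version
#endif

// Usage: md::mdspan<double, md::dextents<std::size_t, 2>> m(ptr, rows, cols);
```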
Everything moves pretty slowly:
https://en.cppreference.com/w/cpp/compiler_support.html#cpp2...
It's kind of shocking when you finally find the features you heard about in conferences used in the wild.
These approaches may be nice for demonstrating the concept in brief, but I'm a bit sad the article didn't take the opportunity to go into a design that stores only the triangular data, since it's pretty trivial to overload operators in C++ (a sketch below). If this is meant to be a demonstration of the performance advantage of mdspan over nested vector creation (which is certainly the case for large multidimensional arrays), it'd be good to dial that up.
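For the curious, the triangular design is only a handful of lines; a rough sketch for an undirected graph (my own illustration, not from the article):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Store only the upper triangle (i <= j, diagonal included) of an
// undirected adjacency matrix: n*(n+1)/2 elements instead of n*n.
class TriangularAdjacency {
public:
    explicit TriangularAdjacency(std::size_t n)
        : n_(n), data_(n * (n + 1) / 2, 0) {}

    // (i, j) and (j, i) share one slot; operator() hides the mapping.
    int& operator()(std::size_t i, std::size_t j) {
        if (i > j) std::swap(i, j);  // canonicalize to the upper triangle
        // Row i starts at i*n - i*(i-1)/2; column j is (j - i) further in.
        return data_[i * n_ - i * (i - 1) / 2 + (j - i)];
    }

private:
    std::size_t n_;
    std::vector<int> data_;
};
```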
On a related note, there's currently a lot of work towards adding graph structures to the upcoming c++ standard. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/
When the article mentioned “more efficient”, I was expecting some actual measurements.
Instead, it seems to just assert that allocating (dense) matrices in a big block is better than the usual array of pointers that you would get in older C/C++.
FWIW it definitely is, and that's why in older C/C++ you'd usually just see a function or macro converting coordinates into a single-dimension offset.
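The pattern looks roughly like this (an illustration, not any specific codebase's macro):

```cpp
#include <cstddef>

// One contiguous allocation plus a tiny index helper, instead of an
// array of row pointers. In C it was usually a macro:
#define IDX(i, j, ncols) ((std::size_t)(i) * (ncols) + (j))

// In C++ an inline function does the same job with type checking:
inline std::size_t idx(std::size_t i, std::size_t j, std::size_t ncols) {
    return i * ncols + j;
}

// double* m = new double[rows * cols];
// m[idx(r, c, cols)] = 1.0;  // instead of m[r][c]
```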
It depends on hardware, but I've seen it run 100 times faster. There's data available for a number of examples if you search.
Are the adjacency matrices in graph theory really usually dense?
For a powerful sparse adjacency matrix C library, check out SuiteSparse:GraphBLAS; there are bindings for Python, Julia, and Postgres.
https://github.com/DrTimothyAldenDavis/GraphBLAS
An adjacency matrix seems like the example most likely to irritate the mathematically inclined and confuse the outsider. Nested array access isn't that unusual.
Same thing caught my eye. They are usually sparse.
Yep, for any decent-sized graph, sparse is an absolute necessity, since a dense matrix grows with the square of the node count. Sparse matrices and sparse matrix multiplication are complex, and there are multiple kernel approaches depending on density and other factors. SuiteSparse [1] handles these cases, has a kernel JIT compiler for different scenarios and graph operations, and supports CUDA as well. Worth checking out if you're into algebraic graph theory.
Using SuiteSparse and the standard GAP benchmarks, I've loaded graphs with 6 billion edges into 256GB of RAM, and can BFS that graph in under a second. [2]
[1] https://github.com/DrTimothyAldenDavis/GraphBLAS
[2] https://onesparse.com/
The article doesn't say "graphs are usually dense".
It just asserts that if you use an adjacency matrix, the graph is likely dense; it makes little sense to use one for a sparse graph.
Technically the article is saying the graphs are dense. Which might make sense, but using sparse matrices to represent sparse graphs is not unusual.
I can see a use for this. It would be nice to not have to write the typical indexing boilerplate when dealing with multidimensional data. One less area to make a mistake. Feels less kludgy.
I wonder if this has any benefit of row vs column memory access, which I always forget to bother with unless suddenly my performance crawls.
It looks like the LayoutPolicy template parameter lets you choose.
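Right; a minimal sketch of the two layouts (assuming a standard library that actually ships std::mdspan):

```cpp
#include <cstddef>
#include <mdspan>  // or the Kokkos reference implementation

// The same buffer viewed row-major (layout_right, the default) and
// column-major (layout_left). Matching your loop order to the layout
// keeps memory access contiguous.
void fill(double* data, std::size_t rows, std::size_t cols) {
    using ext = std::dextents<std::size_t, 2>;
    std::mdspan<double, ext, std::layout_right> row_major(data, rows, cols);
    std::mdspan<double, ext, std::layout_left>  col_major(data, rows, cols);

    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            row_major[i, j] = 0.0;  // rightmost index fastest: contiguous here
}
```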
Note that GCC/libstdc++ (as of v15.2) does not yet implement std::mdspan [1], so it needs to be imported from another reference implementation like Kokkos [2].
[1] Merged in for v16: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107761
[2] https://github.com/kokkos/mdspan