Seven is enough
I went down the rabbit hole once, trying to figure out why numpy's matrix multiplication is so fast. The trail ends up at a library called BLAS, and following that name reveals one axis of the answer. Hardware-side optimisations like cache-tiling, SIMD, parallelism. There's another axis tho. The algorithmic cost model itself turns out to be lower than what we were first taught. Matrix multiplication is not .
More precisely, the best algorithm of our era runs in . The era when looked like the obvious answer lasted until 1969. After that, fifty years of human work shaved roughly 0.6 off the exponent. Then in 2022, an RL agent left a fresh footprint along the trail. We still haven't reached the end of it.
This essay walks back to where the trail begins. To Volker Strassen's short 1969 discovery. The discovery was the act of removing just one multiplication, and that one removal shaped fifty years of the landscape that followed. The starting point is a handful of algebra you can verify by hand.
Say we want to multiply two matrices. By definition, each entry of the result is the sum of products from one row of and one column of .
Count the scalar multiplications and you get 4 entries . The natural answer. Open any textbook and you'll find the same number, because the definition of matrix multiplication forces it.
A short 1969 paper breaks that naturalness. Volker Strassen's Gaussian Elimination is not Optimal, three pages long. He shows how to multiply two matrices using only seven multiplications.
First he constructs seven intermediate products.
Each uses one multiplication. Additions and subtractions are cheap relative to multiplication, so we don't count them. This follows a hardware model in which multiplication is the dominant cost for large matrices.
The four entries of the result are then assembled from these using additions and subtractions.
Seven multiplications. That's it.
The first time you see these seven formulas, they look arbitrary. Why does even show up? Why pair with ? There's no natural intuition that leads here.
And yet it works. The careful cancellation across these seven products yields the correct result. That's the elegance of Strassen's discovery. It isn't intuitive but it's right. And once you have the result, you can trace why it's right by hand.
Checking the algebra
Let's actually expand one entry by hand to see if the seven formulas really produce . Take .
Substituting each by its definition.
Expanding all the terms.
Now collect like terms to see what cancels.
: appears once in line 1 → Survives
: +1 in line 1, -1 in line 3. → Cancels
: +1 in line 1, -1 in line 2 → Cancels
: +1 in line 1, -1 in line 4 → Cancels
: +1 in line 2, -1 in line 4 → Cancels
: -1 in line 3, +1 in line 4 → Cancels
: appears once in line 4 → Survives
Collecting only the survivors.
This matches the definition of exactly.
The other three entries verify the same way. What used to be expressed with eight multiplications is reconstructed exactly from combinations of seven, through addition and subtraction. Follow the hand calculation through and you start to see how this cancellation works.
But hand calculation only shows correctness. It doesn't show the path to discovery. How Strassen found these seven formulas is a separate question. There's no intuitive derivation, no accident of luck. From the tensor rank perspective, there's a theorem that the rank of the matrix multiplication tensor is exactly 7. Hopcroft and Kerr proved it in 1971, and Landsberg and others extended the result later. But knowing the rank and finding the decomposition that achieves it are different problems. The path of discovery hides behind the algebra.
Doing all four entries by hand gets tedious, so let's also confirm with code.
agree: 100 / 100
For all 100 random matrices, the result of the seven-product Strassen formula agrees with the definition. Doing it by hand showed the cancellation algebraically for one entry. The code confirms empirically over random matrices that the same cancellation holds throughout.
The weight of recursion
At the level the reduction from 8 to 7 is just one. , about 12.5% faster. Interesting, but it doesn't really change the bigger picture of matrix multiplication.
Things shift once you apply it recursively.
Cut an matrix (assume is power of 2) into four blocks.
Each and is now a smaller matrix. Apply Strassen's seven-product formulas at the block level. For example, in , the is a matrix addition between small matrices, and the product of that result with is a matrix multiplication between small matrices.
The key is that we can call Strassen recursively. Computing each requires one matrix multiplication at half the size, and that smaller multiplication can also be done by Strassen, yielding seven even smaller multiplications.
A recurrence falls out naturally. Let be the number of scalar multiplications needed to multiply matrices.
Seven smaller matrix multiplications plus additions and subtractions. Asymptotically, the dominant term is the multiplication count.
The Master theorem solves this.
Compared to the naive , the exponent drops from 3 to 2.807.
How big is this one-line change? At , the naive count is roughly multiplications, while Strassen drops it to roughly . Roughly 2.5x difference. As grows the gap widens exponentially. A single removed multiplication, amplified through recursion.
Let's actually count the multiplications in code.
n naive strassen ratio
------ ------------ ------------ --------
2 8 7 1.14x
4 64 49 1.31x
8 512 343 1.49x
16 4096 2401 1.71x
32 32768 16807 1.95x
64 262144 117649 2.23x
128 2097152 823543 2.55x
256 16777216 5764801 2.91xEach time doubles, the ratio grows by roughly . A small one-time reduction blooms across an exponential lattice. That's the weight of recursion.
One practical caveat. In real libraries, Strassen typically starts outperforming the naive algorithm only around . For smaller , the addition overhead and poor cache locality make the naive method faster. The table above counts multiplications only, not wall time. For wall time, hardware-side optimisations in BLAS like cache-tiling, SIMD, and parallelism matter more than the multiplication count.
Fifty years of chasing
After Strassen's discovery, mathematicians start chasing the matrix multiplication complexity exponent . That's the smallest exponent such that some algorithm runs in . Fifty years of tightening the upper bound.
Year | Discoverer(s) | Bound | Key technique | |
|---|---|---|---|---|
1969 | Strassen | 7 product for 2*2 | ||
1978 | Pan | 2.781 | partial matrix multiplication | |
1979 | Bini et al. | 2.78 | approximate matrix mult, border rank | |
1981 | Schönhage | 2.522 | τ-theorem, partial decomposition | |
1986 | Strassen | 2.479 | early laser method | |
1990 | Coppersmith, Winograd | 2.376 | arithmetic progressions + laser(held 30 years) | |
2014 | Le Gall | 2.3728639 | tighter laser method analysis | |
2020 | Alman, Williams | 2.3728596 | refined laser method(SODA 2021) | |
2022 | Duan, Wu, Zhou | 2.371866 | asymmetric hashing | |
2024 | Williams, Xu, Xu, Zhou | 2.371552 | new laser method variant(SODA 2024) | |
2024 | Alman, Duan, Williams, Xu, Xu, Zhou | 2.371339 | further asymmetry(SODA 2025) |
Each improvement isn't a simple subtraction like 7 → 6. It's a refinement of asymptotic analysis. Accumulated algebra about how to construct tensor decompositions better in the large limit. Coppersmith-Winograd's 2.376 holding as the best result for 24 years speaks to the depth of that analysis.
A particularly interesting moment is the silence between 1990 and 2014. The reading at the time was that Coppersmith-Winograd's laser method had given everything it could. Then Le Gall 2014 shaved off another 0.003 with a finer analysis of the same method. After 24 years of quiet, this was the discovery that more could be squeezed out within the method's limits. After that came 2020's refined laser by Alman and Williams, 2022's asymmetric hashing by Duan, Wu, and Zhou, 2024's new variant by Williams, Xu, Xu, and Zhou, and the same year's further asymmetry by Alman, Duan, and four others. Each shaved a little more, landing us at the current best of .
The mathematical grain of this march is an accumulation of asymptotic function analysis pointing toward the true value of . Not intuitive algebra, but precise tracking of limits. Fifty years of human work, poured into a 0.43 improvement in the exponent.
One note on the lower-bound side. The trivial lower bound is . Every entry has to be read at least once, so at least operations are required. Since Winograd's 1979 result, no tighter lower bound has been found. So we've been shaving the upper bound while the lower bound has stayed put for 50 years.
AlphaTensor 2022, the moment AI becomes a discoverer
In October 2022, DeepMind publishes a paper in Nature titled Discovering faster matrix multiplication algorithms with reinforcement learning.
The core idea is to reduce matrix multiplication to a tensor decomposition problem. An matrix multiplication is represented as a 3-dimensional tensor with entries, and the rank of this tensor equals the minimum number of multiplications needed for that multiplication. Strassen's seven-product discovery is exactly equivalent to the fact that the rank of the matrix multiplication tensor is at most 7.
Finding a way to make the tensor's rank lower than what's known is the same as making a faster algorithm. But finding tensor rank is NP-hard even for fixed nn. Brute-force search doesn't work because the solution space grows exponentially.
DeepMind's approach treats tensor decomposition as a game. They train an RL agent from the AlphaZero family, called AlphaTensor, to play it. Each turn, the agent picks one rank-1 component and subtracts it from the tensor. The game ends when the tensor reaches zero. Fewer moves mean a lower-rank decomposition.
Here's what AlphaTensor found, in the paper's Fig. 3 and Supplementary.
Case | Previous best | AlphaTensor | Note |
|---|---|---|---|
(standard) | 49 (Strassen recursion) | 47 | record broken after 50 years |
(standard) | 80 | 76 | |
(mod 2) | 98 | 96 | |
(mod 2) | 49 | 47 |
Three nuances to mark here.
First, mod 2 arithmetic. Matrix multiplication over the boolean field is not the same as multiplication over the reals. In booleans, , so signs disappear. Over the reals, signs matter. So some of the paper's mod 2 results can't be ported straight into practical matrix multiplication libraries. The 47-product formula works over standard arithmetic, but a few cases hold only over mod 2.
Next, AlphaTensor didn't crack the exponent itself. It found better decompositions for fixed (, , and so on), but the asymptotic upper bound on stays where humans left it in 2024, at 2.371339. What AlphaTensor found were concrete decompositions at small , not improvements to the asymptotic estimate of . Two kinds of discovery on different layers.
Finally, the discovered decompositions don't show algebraic patterns a human can read. Strassen's seven-product formula can be hand-traced for why it works, through cancellation we walked through earlier. The human-found 49-product Strassen-Smirnov is similar, with the pattern visible. AlphaTensor's 47-product formula has no such visible pattern. It's just what the RL agent found, and the intuition for why 47 is enough stays opaque. Verification still works. Plug in random mod 2 matrices and you get exactly . But the algebra of discovery is buried inside the training curve.
Still, AI broke a 50-year human record. The 49-product decomposition for dates back to Strassen 1969, his trick applied recursively, with Smirnov refining it in 1986. For over fifty years, no human found a faster way. The RL agent pinned down a 47-product formula after a few days of training. The 47 formulas themselves are open in the paper's supplementary, so anyone can verify them over mod 2.
What we don't know
The true value of stays unknown.
Lower bound: , the naive lower bound. Every entry has to be read at least once, so at least operations are required. Winograd formalised this in 1979, and no tighter lower bound has been found since.
Upper bound is
Gap is about 0.371339.
We don't know what lives inside that gap. The question of how fast matrix multiplication can intrinsically be has stayed open for over 50 years.
The march of the upper bound may converge at some point, or it may keep dropping. If turns out to be the true answer, the strong conjecture, it means every matrix multiplication runs as fast as just reading the entries. A hard conclusion to accept, but 50 years of falling upper bounds point that way.
If , meaning a strict lower bound exists, that lower bound has to be proved. We don't even know what such a proof would look like. The existing tools of circuit complexity and communication complexity don't apply to this problem directly.
AlphaTensor pinned down one thing in particular. Even at small , the decompositions we believed were best may not be optimal. The 49-product formula for was best for over fifty years, but it wasn't actually the best. We just hadn't found better. What we call the upper bound may, in fact, be lower than we know.
The path toward 's true value is no longer a human-only path. What 50 years of pursuit have shown is the shape of the path, not its endpoint. The endpoint is still somewhere we don't know.