In another post (https://alexshtf.github.io/2025/03/27/Free-Poly.html) the author fits a degree-10000 (ten thousand!) polynomial using the Legendre basis. The polynomial _doesn't overfit_, demonstrating double descent. "What happened to our overfitting from ML 101 textbooks? There is no regularization. No control of the degree. But “magically” our high degree polynomial is not that bad!"
So... are _all_ introductions to machine learning just extremely wrong here? I feel like I've seen tens of reputable books and courses that introduce overfitting and generalization using severe overfitting and terrible generalization of high-degree polynomials in the usual basis (1,x,x^2,...). Seemingly everyone warns of the dangers of high-degree polynomials, yet here the author just says "use another basis" and proves everyone wrong? Mind blown, or is there a catch?
The catch is that orthogonal polynomial bases (like Legendre) implicitly regularize by controlling the Riesz representer norm, effectively implementing a form of spectral filtering that penalizes high-frequency components.
The degree-10000 polynomial is definitely overfitting - every data point has its own little spike. The truncated curves look kind of nice, but the data points aren't shown on those plots, and the curves aren't very close to them.
There are also some enormous numbers hidden away here, with associated floating-point precision problems; the articles show that the coefficients are small, but that's because the polynomial basis functions themselves contain gigantic numbers. The Bernstein basis for degree 100 involves 100 choose 50, which is already > 10^29. You have to be careful calculating these polynomials or parts of your calculation will exceed 64-bit floating-point range, e.g. factorials of 10000, 2^10000. See the formulas and table in this section: https://en.wikipedia.org/wiki/Legendre_polynomials#Rodrigues...
You can probably calculate the value of P_n(x) using the equation:
P_n(x) = (2 - 1/n) x P_{n-1}(x) - (1 - 1/n) P_{n-2}(x)
This is numerically stable if x is in [0,1], but I haven't validated that you can get to very high n using floating point.
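A minimal numpy sketch of that recurrence (assuming the standard Legendre P_n, whose natural domain is [-1, 1]; there |P_n(x)| <= 1, so nothing overflows):

    import numpy as np

    def legendre_basis(x, degree):
        # Evaluate P_0..P_degree at x via the three-term recurrence
        # P_n(x) = (2 - 1/n) x P_{n-1}(x) - (1 - 1/n) P_{n-2}(x).
        # x is assumed to already be scaled into [-1, 1].
        x = np.asarray(x, dtype=float)
        P = np.empty((degree + 1,) + x.shape)
        P[0] = 1.0
        if degree >= 1:
            P[1] = x
        for n in range(2, degree + 1):
            P[n] = (2 - 1 / n) * x * P[n - 1] - (1 - 1 / n) * P[n - 2]
        return P

    # values stay bounded even at very high degree
    print(legendre_basis(np.linspace(-1, 1, 5), 10000)[-1])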
No. All the textbooks know that polynomials of high degree are numerically dangerous and you need to be careful when handling them.
The article's examples only work because the interval 0 to 1 (or -1 to 1) was chosen. For whatever reason the author does not point that out, or even acknowledge the fact that, had he chosen a larger interval, the limitations of floating-point arithmetic would have ruined the argument he was trying to make.
10^100 is a very large number and numerically difficult to treat. For whatever reason the author pretends this is not a valid reason to be cautious about high degree polynomials.
He seems reasonably explicit about this:
""
This means that when using polynomial features, the data must be normalized to lie in an interval. It can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid the use of polynomials on raw un-normalized features.
""
No.
This paragraph has nothing to do with numerics. It is about the fact that continuous functions can not be approximated globally by polynomials. So you need to restrict to intervals for reasons of mathematical theory. This is totally unrelated to the numerical issues, which are nowhere even acknowledged.
But what's the point in acknowledging numerical issues outside of [-1,1] if polynomials do not even work there, as author explicitly notes?
All polynomials "work" globally. That some polynomials form an orthonormal basis over certain intervals is essentially irrelevant.
The author does not address the single most important reason why high-degree polynomials are dangerous. Which is pretty insane, to be honest: obviously you have to at least mention why people tell you to be cautious about high-degree polynomials AND point out why your examples circumvent the problem. Anything else is extremely dishonest and misleading.
Is there a mathematical proof or basis to back what you’re saying?
That polynomials do not approximate continuous functions globally??
That is pretty obvious. Consider that every non-constant polynomial grows arbitrarily large, but not every continuous function does (sin x, for example, stays bounded).
Neural network training is harder when the input range is allowed to deviate from [-1, 1]. The only reason why it sometimes works for neural networks is because the first layer has a chance to normalize it.
If you have a set of points, why can't you just map them to an interval [0, 1] before fitting the polynomial?
https://arxiv.org/pdf/2503.02113
This paper shows that polynomials show most features of deep neural nets, including double descent and ability to memorize entire dataset.
It connects the dots there: the polynomials are regularized to be as simple as possible, and the author argues that the hundreds of billions of parameters in modern neural networks act as regularizers too, attenuating decisions that are "too risky."
I really enjoyed that paper, a gem that sheds light everywhere.
Well, one catch is that you're doing degree ten thousand. That alone already increases your computational costs and can run afoul of numerical error as your numbers start losing precision. You'll have to start worrying about overflow and underflow. Horner's method or another representation of your polynomial might start to become important. You'll also start to notice your CPU spiking more. In essence, this is kind of fitting more and more points until you get rid of every oscillation you don't want.
The other catch, which the author mentions, is that extrapolation is still essentially impossible with polynomials. This is easy to see. The highest-degree term will dominate all the other terms by an order of magnitude once you step out of the interval of interest. Every non-constant polynomial grows without bound.
Honestly, though, what the author is doing isn't much different than a Taylor approximation. If you're okay fitting any polynomial in a small neighbourhood of the data, then go whole hog and fit the best possible polynomial: a Taylor polynomial. You wouldn't usually need to go to that high degree to get what you want in a neighbourhood either.
> Horner's method or another representation of your polynomial might start to become important.
Horner's is the default way to evaluate a polynomial - and I think you're overstating the cost of evaluation.
> what the author is doing isn't much different than a Taylor approximation
Yes it is. The Taylor approximation is only valid around a point - and it's based on matching the nth order derivative. There are no error metrics. And it requires differentiation to find it.
Taylor is for points. For functions on an interval you want Chebyshev interpolation, and you probably want to throw in some poles to do better if L1, not L2, matters.
The original paper about the bias-variance tradeoff, which the double descent papers targeted, had some specific constraints:
1) Data availability and compute limited training set sizes.
2) They could simulate infinite datasets.
While challenging for our minds, training set sizes today make it highly likely that the patterns in your test set are similar to concept classes in your training set.
This is very different from procedurally or randomly generated test sets, both of which can lead to problems like overfitting with over-parameterized networks.
When the chances are that similar patterns exist, the cost of some memorization goes down and is actually somewhat helpful for generalization.
There are obviously more factors at play here, but go look at the double descent papers and their citations of early-'90s papers and you will see this.
The low sensitivity of transformers also dramatically helps: UHAT without CoT has only the expressiveness of TC0, and with log-space scratch space it has PTIME expressibility.
You can view this from autograd requiring a smooth manifold with the ability to approximate global gradient too if that works better for you.
But yes all intros have to simplify concepts, and there are open questions.
Numerical analysis is ALL about picking the right tool for the job. Unfortunately, there aren't super methods and you need to know how to classify problems.
> So... are _all_ introductions to machine learning just extremely wrong here?
It's more of a heuristic. Most people have their first experience in Excel, where you can fit a polynomial. Cranking up the degree will always improve R² (since Excel doesn't do a holdout), so it's a very common mistake new students make.
It's much more understandable at the beginner level to say "you'll overfit if you crank up the degree" than it is to explain regularization and bases. Later on you can introduce those, but early on it's confusing and distracting to students who might not even know how to solve an ordinary least squares model.
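For anyone following along, here is a hedged sketch of that Excel failure mode (toy data, plain numpy): training R² keeps climbing with the degree, while holdout R² eventually falls off.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 30)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
    x_test = rng.uniform(0, 1, 200)
    y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

    def r2(y_true, y_pred):
        return 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

    for deg in (1, 3, 9, 15, 20):
        coef = np.polyfit(x, y, deg)   # plain monomial basis, no regularization
        # (numpy may warn about conditioning at the higher degrees - that is part of the story)
        print(deg,
              round(r2(y, np.polyval(coef, x)), 3),             # train R² keeps climbing
              round(r2(y_test, np.polyval(coef, x_test)), 3))   # holdout R² typically drops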
It's not like they would sabotage competitors by training them wrong, as a joke.
There's something I'm fundamentally missing here--if the standard basis and the Bernstein basis describe exactly the same set of polynomials of degree n, then surely the polynomial of degree n that minimizes the mean squared error is unique (and independent of the basis--the error is the samples vs the approximation, the coefficients/basis are not involved), so both the standard basis and Bernstein basis solutions are the same (pathological, overfitted, oscillating) curve?
Like I understand how the standard basis is pathological because the higher-degree powers diverge like mad, so given "reasonable" components the Bernstein basis is more likely to give "reasonable" curves, but if you're already minimizing I don't understand how you arrive at a different curve.
What am I missing?
The minimization is regularized, meaning you add a penalty term for large coefficients. The coefficients will be different for the two bases, meaning the regularization will work differently.
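A minimal numpy sketch of that point (toy data, a made-up penalty weight, ridge via the normal equations rather than whatever solver the article uses): the same degree and the same penalty give different curves in the monomial and Bernstein bases, because the penalty acts on the coefficients and the coefficients mean different things in each basis.

    import numpy as np
    from scipy.special import comb

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 40)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)
    deg, lam = 30, 1e-3

    def monomial_design(x, n):
        return np.vander(x, n + 1, increasing=True)   # 1, x, x^2, ...

    def bernstein_design(x, n):
        k = np.arange(n + 1)
        return comb(n, k) * x[:, None] ** k * (1 - x[:, None]) ** (n - k)

    def ridge(A, y, lam):
        # penalize the squared norm of the coefficients *in this basis*
        return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

    xs = np.linspace(0, 1, 200)
    fit_mono = monomial_design(xs, deg) @ ridge(monomial_design(x, deg), y, lam)
    fit_bern = bernstein_design(xs, deg) @ ridge(bernstein_design(x, deg), y, lam)
    # same function space, same lambda, visibly different regularized fits:
    print(np.max(np.abs(fit_mono - fit_bern)))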
Ok, yeah, doing a little googling, that makes sense. I kind of feel the article author was burying the lede by saying this was about ML optimization, where apparently regularization is the norm (so to speak, lol) and basis selection is the whole ball game, indirectly, through the way it influences convex optimization.
A well-known related paper that I didn't see mentioned in the article (although Trefethen was mentioned) is "Six Myths of Polynomial Interpolation and Quadrature".
https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf
This article is much better, more informative, and factual than I'd have expected from the title. Note that it's part of an 8-article series.
Worth a read if you're ever fitting functions.
This is why I believe a numerical methods course should be a requirement for any AI majors.
I'm exactly in the category of AI majors who are not familiar with numerical methods. Can you broadly explain where the gap in AI pedagogy is and how students can fill it?
The series of articles posted here are interesting and I plan to review them in more detail. But I'm concerned about what the "unknown-unknowns" are.
Sure. The original normalizing flows used a fixed number of layers. Someone at UToronto recognized that, as the number of layers gets very deep, this is essentially an ordinary differential equation (ODE). Why?
Suppose you have n residual layers that look like:
x_0 = input
x_{i+1} = x_i + f(x_i)
x_n = output
If you replace them with an infinite number of layers, and use "time" t instead of "layer" i, you get
x(t+dt) = x(t) + dt f(x(t)) <=> x'(t) = f(x, t)
so to find the output, you just need to solve an ODE. It gets better! The goal of normalizing flows is to "flow" your probability distribution from a normal distribution to some other (e.g. image) distribution. This is usually done by trying to maximize the probability the training images should show up, according to your model, i.e.
loss(model) = product model^{-1}(training image)
Notice how you need the model to be reversible, which is pretty annoying to implement in the finite-layer case, but with some pretty lenient assumptions is guaranteed to be true for an ODE. Also, when you're inverting the model, the probabilities will change according to the derivative; since you have more than one dimension, this means you need to calculate the determinant of the Jacobian for every layer, which is decently costly in the finite-layer case. There are some tricks that can bring this down to O(layer size^2) (Hutchinson++), but the ODE case is trivial to compute (just exp(trace)).
So, turning the model into an ODE makes it blazing fast, and since you can use any ODE solver, you can train at different levels of precision based on the learning rate (i.e. the real log canonical threshold from singular learning theory). I haven't seen any papers that do this exactly, but it's common to use rougher approximations at the beginning of training. Probably the best example of this is the company Liquid AI.
Finally, this all turns out to be very similar to diffusion models. Someone realized this, and combined the two ideas into flow-matching.
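A toy sketch of the limit described above (made-up vector field f, not any real flow model): stacking more residual layers of the form x + dt*f(x, t) converges to an explicit Euler solve of x'(t) = f(x, t).

    import numpy as np

    def f(x, t):
        # toy vector field, purely for illustration
        return np.tanh(x) * np.cos(t)

    def residual_net(x, n_layers, T=1.0):
        # n residual layers: x_{i+1} = x_i + dt * f(x_i, t_i)
        dt = T / n_layers
        t = 0.0
        for _ in range(n_layers):
            x = x + dt * f(x, t)
            t += dt
        return x

    x0 = np.array([0.5, -1.0])
    for n in (4, 16, 64, 256, 1024):
        print(n, residual_net(x0, n))
    # As n grows, the outputs converge to the ODE solution at t = T,
    # i.e. the deep residual stack behaves like an explicit Euler solver.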
-----
This is one place it's super useful to know numerical methods, but here are a couple others:
1. Weight initialization --> need to know stability analysis
2. Convolutions --> the Winograd algorithm, which is similar to ideas in the FFT and quadrature
> Can you broadly explain where the gap in AI pedagogy is and how students can fill it?
Machine learning can be made much more effective and efficient, by first modeling the problem, and then optimizing that tailored representation. This is an alternative to throwing a bunch of layers of neurons, or copy pasting an architecture, and hoping something works out.
One of the most successful applications of ML, convolutional neural networks, is based on this principle. Image processing algorithms come from optical theory that can be modeled with convolution; the idea was: what if we used optimization to find those convolution kernels?
With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.
Also you need to know when a problem is NOT optimization - for example solving equations via the bisection method.
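For instance, a minimal sketch (hypothetical g, plain Python): bisection finds a root from sign changes alone, with no objective being minimized and no gradients.

    def bisect(g, lo, hi, tol=1e-12):
        # assumes g(lo) and g(hi) have opposite signs
        glo = g(lo)
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if glo * g(mid) <= 0:
                hi = mid
            else:
                lo, glo = mid, g(mid)
        return 0.5 * (lo + hi)

    # e.g. solve x^3 - 2 = 0 without framing it as an optimization problem
    print(bisect(lambda x: x**3 - 2, 0.0, 2.0))   # ~1.259921...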
>But I'm concerned about what the "unknown-unknowns" are.
Try the examples in the article with the interval 0 to 10.
That problem's a known-known for most people interested in any of this, and for anyone who's made it as far as the first picture.
It obviously is something the author lacks any understanding of, as it is the most obvious reason why polynomials of high degree are dangerous.
If you are arguing that they aren't, of course you have to at least mention that objection and point out how to circumvent it.
This part about double decent is really good: https://alexshtf.github.io/2025/03/27/Free-Poly.html#fnref:2
Right? I'm no stranger to ML, but this feels like magic, so cool! It clearly explains why normalizing features can be important (some polynomials blow up outside their range), provides a simple example of double descent and fits extremely high-degree polynomials without loss of generalization - amazing stuff!
In this case wouldn't a Fourier-type approach work better? At least there's no risk the function blows up and it possibly needs fewer parameters?
The problem with Fourier approximation is that it works terribly for relatively "simple" functions. E.g. fitting a linear relationship is extremely hard.
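A quick numpy illustration of that claim (least-squares fit of y = x on [0, 1] with a truncated Fourier feature set; the sizes are arbitrary): the error near the endpoints barely improves no matter how many harmonics you add, because the periodic extension of a line has a jump there.

    import numpy as np

    x = np.linspace(0, 1, 2001)
    y = x                                   # a plain linear relationship

    def fourier_design(x, K):
        cols = [np.ones_like(x)]
        for k in range(1, K + 1):
            cols += [np.cos(2 * np.pi * k * x), np.sin(2 * np.pi * k * x)]
        return np.stack(cols, axis=1)

    for K in (5, 50, 500):
        A = fourier_design(x, K)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        # overall max error (dominated by the endpoints) vs error on the middle half
        print(K, float(np.abs(resid).max()), float(np.abs(resid[500:1500]).max()))
    # The endpoint error stays large (Gibbs-like behaviour of the periodic extension),
    # while a degree-1 polynomial would fit y = x exactly.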
Ah good point. There ought to be a way to get the best of both worlds.
Yeah, I thought the whole point of a polynomial approximation was that it is really only useful for the first couple of powers (quick and cheap) or because you have a particular process that you know a priori has a particular form (non-convergent, non-conservative, higher powered, etc.).
The "particular form" is important--if you don't know that then there is no reason for choosing x^10000 over x^5 (other than computational complexity). But there is also no reason for choosing x^5 over x^10000! Maybe that function really is flat and maybe it isn't. You really just don't know.
If you don't have anything too weird, Fourier is pretty much optimal in the limit--if a bit expensive to calculate. In addition, since you can "bandwidth limit" it, you can very easily control overfitting and oscillation. Even more, Fourier often reflects something about the underlying process (epicycles, for example, were correct--they were pointing at the fact that orbits were ellipses).
>If you don't have anything too weird, Fourier is pretty much optimal
"Optimal" by what metric? Why is projecting on the Fourier basis "better" than projecting on a polynomial basis?
Pick a metric. Fourier is probably better in general.
With respect to machine learning, probably the fact that Fourier is bounded and gives coherent (if completely random) results even very far from the interval of interest.
However, this is like saying that an O(n log n) algorithm is better than O(n^2). Sure, it's true in the limit, but the constant terms can be quite large and O(n^2) can remain better up to remarkably useful values of n.
As I pointed out, the advantage that polynomial basis generally has is that you can be accurate enough with lower powers that are much, much faster to calculate (we use matrices of linear equations precisely for that reason). Or, you can match a particular process because you know specifically that it follows some particular function (we know that road damage follows the fourth power of load--it would be counterproductive to model that with Fourier).
Using high powers for polynomial basis is almost always worse than any other choice.
I don't disagree, but I do not think it's meaningful to call something optimal if it is clearly unusable under certain circumstances.
If you know that some relationship is close to polynomial, obviously a polynomial basis is more suitable. E.g. a line performs terribly for a Fourier transformation.
>Using high powers for polynomial basis is almost always worse than any other choice.
For some value of "high", yes.
It should also be noted that this kind of fitting is extremely closely related to integration or in other words calculating the mean.
By using the "right" kind of polynomial basis you can get a polynomial approximation which also tells you the mean and variance of the function under a random variable.
For large polynomial degree, this seems to me to be very similar to the P-spline method with a very large number of parameters.
Wanted to add my voice to the chorus of appreciation for this article (actually a series of 8). Very informative and engaging.
Great article! Very curious how this orthogonalization + regularization idea could be extended to other kinds of series, such as Fourier series.
This is the Fourier projection, just on a different basis; the mathematics are near identical, you just change the subspace.
The Fourier basis is already orthonormal.
See also https://en.wikipedia.org/wiki/Chebyshev_polynomials
They're the kind of math which is non-obvious and a bit intricate but yet if you knew the basics of how they worked and you were bored you might sit down with pencil and paper and derive everything about them. Then you wouldn't be bored anymore.
Good article, but I am very bothered by the "standard basis"... it's called canonical in math. I don't think standard is the right name in any context.
If you treat the ring of polynomials as a vector space in their coefficients, then the unit monomials 1, x, x^2, etc. form the standard basis of that vector space. Wikipedia has an example of "standard basis" in this sense [0].
[0] https://en.wikipedia.org/wiki/Standard_basis#Generalizations
I completely disagree with the conclusion of the article. The reason the examples worked so well is because of an arbitrary choice, which went completely uncommented.
The interval was chosen as 0 to 1. This single fact is what made this feasible. Had the interval been chosen as 0 to 10, a degree-100 polynomial would have to calculate 10^100, and this would have led to drastic numerical errors.
The article totally fails to give any of the legitimate and very important reasons why high-degree polynomials are dangerous. It is absurd to say that well-known numerical problems do not exist because you just found one example where they did not occur.
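A quick numerical sanity check of that point (a sketch only, with degrees kept small so the condition numbers are still printable): the monomial design matrix on [0, 10] is drastically worse conditioned than on [0, 1].

    import numpy as np

    x_small = np.linspace(0, 1, 200)
    x_large = np.linspace(0, 10, 200)
    for deg in (5, 10, 20):
        A_small = np.vander(x_small, deg + 1)
        A_large = np.vander(x_large, deg + 1)
        print(deg, np.linalg.cond(A_small), np.linalg.cond(A_large))
    # The [0, 10] design matrix is orders of magnitude worse conditioned than [0, 1],
    # and at degree 100 the raw feature 10**100 is ~1e100, so a least-squares solve
    # in that basis loses essentially all precision long before float64 overflows
    # at ~1.8e308.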
The article specifically points out that these polynomials only work well on specific intervals (emphasis copied from the article):
"The second source of their bad reputation is misunderstanding of Weierstrass’ approximation theorem. It’s usually cited as “polynomials can approximate arbitrary continuous functions”. But that’s not entrely true. They can approximate arbitrary continuous functions in an interval. This means that when using polynomial features, the data must be normalized to lie in an interval. It can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid the use of polynomials on raw un-normalized features."
As I understand it, one of the main ideas of this series of posts is that normalizing features to very specific intervals is important when fitting polynomials. I don't think this "went completely uncommented".
Yes! And the next articles in the series double down on this:
"Any polynomial basis has a “natural domain” where its approximation properties are well-known. Raw features must be normalized to that domain. The natural domain of the Bernstein basis is the interval [0,1][0,1]."
Totally irrelevant. The natural domain, if compact, can always be rescaled; it has nothing to do with the numerical problems.
Also, the Hermite polynomials have an unbounded natural domain.
The quote has absolutely nothing to do with my point.
The scaling to an interval in the quote is about formal mathematical reasons, in particular that polynomials do not approximate continuous functions globally. This is totally unrelated to numerics.
The issue is that in particular the interval 0 to 1 has to be chosen, as otherwise the numerics totally fall apart. The message of the article is that high-degree polynomials pose no danger, but that is wrong. All the examples in the article only work because of a specific choice of interval. All the major numerical issues are totally ignored, even though they would immediately invalidate the core thesis of the article. If you calculate 10^100 in 64-bit floating point you will run into trouble. The article pretends that will not be the case.
However, if you normalize your data to [0,1], you'll never have to compute 10^100 and thus never face any numerical issues. "Never" assumes no distribution shift.
Indeed, the examples work thanks to this choice of the interval, but this comes with the choice of the basis. Of course Bernstein basis functions explode outside [0,1], but I think the point is that high-degree polynomials pose no danger if you scale the data *according to the polynomial* (use [0,1] for Bernstein and [-1,1] for Chebyshev, for example). So the "magic combo" is polynomial + scaling to its interval. Otherwise all bets are off, of course.
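A small sketch of that "magic combo" (made-up data; numpy's Chebyshev helpers, which match the article's approach only in spirit): min-max scale the raw feature to [-1, 1] and a high-degree fit becomes unremarkable.

    import numpy as np

    rng = np.random.default_rng(0)
    x_raw = rng.uniform(3, 1500, 200)                 # raw, un-normalized feature
    y = np.log(x_raw) + rng.normal(0, 0.05, x_raw.size)

    # scale the data to the Chebyshev basis's natural domain [-1, 1] before fitting
    lo, hi = x_raw.min(), x_raw.max()
    x = 2 * (x_raw - lo) / (hi - lo) - 1

    coef = np.polynomial.chebyshev.chebfit(x, y, deg=100)   # high degree, basis values stay in [-1, 1]
    resid = y - np.polynomial.chebyshev.chebval(x, coef)
    print(np.sqrt(np.mean(resid ** 2)))                     # residual is at the noise level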
The article totally ignores this and does not even mention the numerical issues at all, which is pretty insane.
Surely you at least have to name THE ONE reason high-degree polynomials are dangerous. Writing an article arguing that something is not a problem, while not even acknowledging the single most important reason why people believe the problem exists, is totally disingenuous and pretty terrible scholarship.
At least include that the choice of 0 to 1 is necessary for this to work. Not including it makes the author look either clueless or malicious.
Yes and it fails to talk about boundary conditions or predicates or whatever.
Great article and clever use of linkbait!
And the discussion here is pretty good.
One thought I had while looking into regression recently is to consider the model created with a given regularization coefficient not as a line on a 2 dimensional graph but as a slice of a surface on a 3 dimensions graph where the third dimension is the regularization coefficient.
In my case the model was for logistic regression and it was the boundary lines of the classification, but the thought is largely the same: viewing it as a 3D shape formed by boundary lines, and treating hilltops as areas where entire classification boundaries disappeared once the regularization coefficient grew large enough to eliminate them. Impractical to do on models of any size and only useful when looking at two features at a time, but a fun consideration.
More on topic with the article: how well does this work when considering multiple features and the different combinations of them? Instead of sigma(n => 50) of x^n, what happens if you have sigma(n => 50) of sigma(m => 50) of (x^n)*(y^m)? Well, probably less than 50 in the second example; maybe it is fair to have n and m go to 7, so there are 49 total terms compared to the original 50, instead of 2500 terms if both go to 50.
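For what it's worth, a tiny sketch of that tensor-product construction (hypothetical helper name; this count includes the constant term, so n, m up to 7 gives 8*8 = 64 columns rather than 49):

    import numpy as np
    from itertools import product

    def tensor_poly_features(x, y, max_deg):
        # all cross terms x^n * y^m for 0 <= n, m <= max_deg
        return np.stack([x**n * y**m
                         for n, m in product(range(max_deg + 1), repeat=2)], axis=1)

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 100)
    y = rng.uniform(0, 1, 100)
    print(tensor_poly_features(x, y, 7).shape)   # (100, 64)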