From reading that, I'm not quite sure if they have anything figured out.

I actually agree, but her notes are mostly fluff with no real info in there, and I do wonder if they have anything figured out besides "collect spatial data" like ImageNet.
There are actually a lot of people trying to figure out spatial intelligence, but those groups are usually in neuroscience or computational neuroscience.

Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world, and humans have the most coordinate representations of any known living animal. I believe human-level intelligence is knowing when and how to transform these coordinate systems to extract useful information.

I wrote this before the huge LLM explosion and I still personally believe it is the path forward.
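To make the "coordinate transformation" part concrete, here is a toy sketch (my own illustration, not code from the paper): the most basic such transform maps an egocentric observation - where a landmark sits relative to the animal - into allocentric, world-frame coordinates with a rotation plus a translation. The interesting question is how the brain learns, chains, and selects many such frames; this is just the primitive operation.

```python
import numpy as np

def ego_to_allo(agent_pos, agent_heading, landmark_ego):
    """Map a landmark seen in the animal's body frame into world coordinates."""
    c, s = np.cos(agent_heading), np.sin(agent_heading)
    R = np.array([[c, -s],
                  [s,  c]])              # rotate the body frame into the world frame
    return agent_pos + R @ landmark_ego  # then translate by where the animal stands

agent = np.array([2.0, 1.0])             # animal's position in the world
heading = np.pi / 2                      # facing "north"
landmark = np.array([3.0, 0.0])          # seen 3 m straight ahead
print(ego_to_allo(agent, heading, landmark))  # -> approximately [2., 4.]
```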
> From reading that, I'm not quite sure if they have anything figured out. I actually agree, but her notes are mostly fluff with no real info in there and I do wonder if they have anything figured out besides "collect spatial data" like imagenet.
Right. I was thinking about this back in the 1990s. That resulted in a years-long detour through collision detection, physically based animation, solving stiff systems of nonlinear equations, and a way to do legged running over rough terrain. But nothing like "AI". More of a precursor to the analytical solutions of the early Boston Dynamics era.
Work today seems to throw vast amounts of compute at the problem and hope a learning system will come up with a useful internal representation of the spatial world. It's the "bitter lesson" approach. Maybe it will work. Robotic legged locomotion is pretty good now. Manipulation in unstructured situations still sucks. It's amazing how bad it is. There are videos of unstructured robot manipulation from McCarthy's lab at Stanford in the 1960s. They're not that much worse than videos today.
I used to make the comment, pre-LLM, that we needed to get to mouse/squirrel level intelligence rather than trying to get to human level abstract AI. But we got abstract AI first. That surprised me.
There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?
I share your surprise regarding LLMs. Is it fair to say that it's because language - especially formalised, written language - is a self-describing system?
A machine can infer the right (or expected) answer based on data; I'm not sure that the same is true for how living things navigate the physical world - the "right" answer, insofar as one exists for your squirrel, is arguably Darwinian: "whatever keeps the little guy alive today".
Thanks for your article. The references section was interesting.
I'll add to the discussion a 2018 Nature letter: "Vector-based navigation using grid-like representations in artificial agents" https://www.nature.com/articles/s41586-018-0102-6
and a 2024 Nature article "Modeling hippocampal spatial cells in rodents navigating in 3D environments" https://www.nature.com/articles/s41598-024-66755-x
And a simulation in Github from 2018 https://github.com/google-deepmind/grid-cells
People have been looking at spatial awareness in neurology for quite a while (at least relative to the timeframe of recent developments in LLMs).
> Here is a summary paper I wrote discussing how the entorhinal cortex, grid cells, and coordinate transformation may be the key: https://arxiv.org/abs/2210.12068 All animals are able to transform coordinates in real time to navigate their world and humans have the most coordinate representations of any known living animal. I believe human level intelligence is knowing when and how to transform these coordinate systems to extract useful information.
Yes, you and the Mosers who won the Nobel Prize all believe that grid cells are the key to animals understanding their position in the world.
https://www.nobelprize.org/prizes/medicine/2014/press-releas...
It's not enough by a long shot. Placement isn't directly related to vicarious trial and error, path integration, or sequence generation.
There's a whole giant gap between grid cells and intelligence.
>There's a whole giant gap between grid cells and intelligence.
Please check this recent article on how learning produces a state machine in the hippocampus [1]. The findings support the long-standing proposal that sparse orthogonal representations are a powerful mechanism for memory and intelligence.
[1] Learning produces an orthogonalized state machine in the hippocampus:
https://www.nature.com/articles/s41586-024-08548-w
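A quick numerical illustration of why sparse, near-orthogonal codes are attractive for memory - my toy sketch, not the paper's analysis: random sparse binary patterns barely overlap, so memories bound to them interfere far less than with dense codes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 1000, 2000

def mean_overlap(density):
    codes = (rng.random((n, dim)) < density).astype(float)  # random binary patterns
    sims = codes @ codes.T / dim                            # pairwise fractional overlap
    off_diag = sims[~np.eye(n, dtype=bool)]
    return off_diag.mean()

print("dense  (50% active):", mean_overlap(0.5))   # ~0.25
print("sparse ( 2% active):", mean_overlap(0.02))  # ~0.0004
```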
This is super cool and I want to read up more on it, as I think you are right insofar as it is the basis for reasoning. However, it does seem more complex than just that. So how do we get from coordinate-system transformations to abstract reasoning with symbolic representations?
There is research showing that the grid cells also represent abstract reasoning: https://pmc.ncbi.nlm.nih.gov/articles/PMC5248972/
Deep Mind also did a paper with grid cells a while ago: https://deepmind.google/blog/navigating-with-grid-like-repre...
> if they have anything figured out besides "collect spatial data" like imagenet
I mean, she launched her whole career with ImageNet, so you can hardly blame her for thinking that way. But on the other hand, there's something bitter-lesson-pilled about letting a model "figure out" spatial relationships just by looking at tons of data. And tbh the recent progress [1] of worldlabs.ai (Dr. Fei-Fei Li's startup) looks quite promising for a model that understands things like reflections.
[1] https://www.worldlabs.ai/blog/rtfm
> looks quite promising for a model that understands things like reflections

I got the opposite impression when trying their demo... [0]. Even in their examples some of these issues exist, like how objects stay a constant size despite moving - as if the parallax or depth information is missing. Not to mention that they show it walking on water, lol.

As for reflections, I don't get that impression either. They seem extremely brittle to movement.
[0] http://0x0.st/K95T.png
Just had a fantastic experience applying agentic coding to CAD. I needed to add threads to a few blanks in a 3D print. I used computational geometry to give the agent a way to "feel" around the model: I had it convolve a sphere of the connector's radius across the entire model. It was able to use this technique to find the precise positions of the existing ports and then add threads to them. It took a few tries to get right, but if I'd had the technique in mind beforehand it would have been very quick. The lesson for me is that the models need a way to feel. In the end, the implementation of the 3D model had to be written in code, where it's auditable. Perhaps if the agent had been able to see images directly and perfectly, I never would have made this discovery.
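For anyone curious, the gist of the sphere-probing trick looks roughly like this - a sketch assuming trimesh and a made-up filename, not the exact script the agent ended up writing: sample a grid inside the part's bounding box and flag centers where a ball of the connector's radius sits in empty space, clear of the printed material.

```python
import numpy as np
import trimesh

mesh = trimesh.load("bracket.stl")          # hypothetical model file
r = 3.0                                     # connector radius, in model units

lo, hi = mesh.bounds
xs, ys, zs = (np.arange(a, b, r / 2) for a, b in zip(lo, hi))
grid = np.array(np.meshgrid(xs, ys, zs)).reshape(3, -1).T

# trimesh convention: signed distance is positive inside the solid, negative outside.
d = trimesh.proximity.signed_distance(mesh, grid)
fits = grid[d < -r]                          # outside the material and at least r away from it

print(f"{len(fits)} candidate centers where the connector sphere fits")
```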
Generative CAD has incredible potential. I've had some decent results with OpenSCAD, but it's clear that current models don't have much "common sense" when it comes to how shapes connect.
If code-based CAD tools were more common, and we had a bigger corpus to pull from, these tools would probably be pretty usable. Without this, however, it seems like we'll need to train against simulations of the physical world.
CadQuery? A writeup of your lessons learned would be appreciated, if you're so inclined.
Thanks for sharing, I'm interested to know more about how you did this if you have a longer write up somewhere? (or are considering writing one!)
I'd love to hear more about this -- I'm messing around with a generative approach to 3D objects
Unlike with an LLM prompt, it's REALLY hard to describe the end result of a geometric object in text.
"No put the thingy over there. Not that thingy!"
I’m not really suggesting it’s the right approach for CAD but prompting UI changes using sketches or mockup images works great.
OpenSCAD or something like it?
This is essentially a simulation system for operating on narrowly constrained virtual worlds. It is pretty well-understood that these don't translate to learning non-trivial dynamics in the physical world, which is where most of the interesting applications are.
While virtual world systems and physical world systems look similar based on description, a bit like chemistry and chemical engineering, they are largely unrelated problems with limited theory overlap. A virtual world model is essentially a special trivial case that becomes tractable because it defines away most of the hard computer science problems in physical world models.
A good argument could be made that spatial intelligence is a critical frontier for AI; many open problems are reducible to it. I don't see any evidence that this company is positioned to make material progress on it.
Genie 3 (at a prototype level) achieves the goal she describes: a controllable world model with consistency and realistic physics. Its sibling Veo 3 even demonstrates some [spatial problem-solving ability](https://video-zero-shot.github.io/). Genie and Veo are definitely closer to her vision than anything World Labs has released publicly.
However, she does not mention Google's models at all. This omission makes the blog feel very much like an ad for her company rather than a good-faith guide for the field.
I think I perceive a massive bottleneck. Today's incarnation of AI learns from the web, not from the interaction with the humans it talks to. And for sure there is a lot of value in that interaction; it is just pointless for it to be lost a few hundred or thousand words of context later. For humans, their 'context' is their life and total memory capacity; that's why we learn from interaction with other, more experienced humans. It is always a two-way street. But with AI as it is, it is a one-way street, which means that your interaction and your endless corrections when it gets stuff wrong (again) are lost. Allowing for a massive personalized context would go a long way towards improving the value here; at least that way you - hopefully - only have to make the same correction once.
There was something on a possible way around that from Google Research the other day, called Nested Learning: https://research.google/blog/introducing-nested-learning-a-n...
My understanding is that at the moment you train something like ChatGPT on the web, setting weights with backpropagation until it works well, but if you then give it more information and do more backprop it can forget other stuff it has learned - so-called 'catastrophic forgetting'. The nested learning approach is to split things into a number of smaller models so you can retrain one without mucking up the others.
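To be clear, the actual Nested Learning setup is more involved than this, but the "retrain one piece without mucking up the rest" idea can be sketched in a few lines of (assumed) PyTorch: freeze everything except the sub-module you want to update.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.ModuleDict({
    "shared": nn.Sequential(nn.Linear(128, 128), nn.ReLU()),
    "task_a": nn.Linear(128, 10),
    "task_b": nn.Linear(128, 10),
})

# Freeze everything except the sub-module we want to update...
for p in model.parameters():
    p.requires_grad = False
for p in model["task_b"].parameters():
    p.requires_grad = True

# ...so gradient steps can only move task_b's weights; task_a and the shared
# trunk stay untouched, which sidesteps catastrophic forgetting for them.
opt = optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```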
> For humans their 'context' is their life and total memory capacity
And some number of billions of years of evolutionary progress.
Whatever spatial understanding we have could be thought of as a simulation at a quantum level, the size of the universe, for billions of years.
And what can we simulate completely at a quantum level today? Atoms or single cells?
Idealized atoms and very, very simplified single cells.
So Dr. Li has started writing a blog! I just subscribed to it. I cannot wait for more articles!
Spatial AI will for sure be a thing; I am just not sure it will be the next frontier.
The main problem I still see is that we do not fully understand how far we can scale the current models. How much data do we need? Do we have the data for this kind of training? Can the current models generalize to the world?
Probably before we see something really interesting we need another AI winter, where researchers can be researchers rather than soldiers for companies.
The data is out there if we at least give a robot wheels and let it bump into things, like we did when we were little. We didn't need a billion pictures or videos - only trial and error. Then we developed a mental map of our home and our immediate neighborhood, and discovered that the rest of the world obeys the same rules. Training AIs doesn't work like that now.
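As a toy version of what I mean by trial and error - purely illustrative, nothing to do with how current models are trained - a tabular Q-learning agent on a tiny grid starts out knowing nothing, bumps into walls, and ends up with a usable map of which move works where.

```python
import random

W, H, GOAL = 5, 5, (4, 4)
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
Q = {((x, y), a): 0.0 for x in range(W) for y in range(H) for a in range(4)}

for episode in range(2000):
    s = (0, 0)
    while s != GOAL:
        # mostly greedy, occasionally try something random
        a = random.randrange(4) if random.random() < 0.1 else max(range(4), key=lambda i: Q[s, i])
        dx, dy = ACTIONS[a]
        nxt = (s[0] + dx, s[1] + dy)
        bumped = not (0 <= nxt[0] < W and 0 <= nxt[1] < H)
        nxt = s if bumped else nxt                      # hitting a wall leaves you in place
        r = -1.0 if bumped else (1.0 if nxt == GOAL else -0.1)
        Q[s, a] += 0.1 * (r + 0.9 * max(Q[nxt, b] for b in range(4)) - Q[s, a])
        s = nxt

print(max(range(4), key=lambda i: Q[(0, 0), i]))  # index of the learned first move from the corner
```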
I think they want to follow the same route as LLMs: no understanding of the real world, but a brute-force approach that's good enough in the most useful scenarios. Same as airplanes: they can't fly the way birds do and they can't do bird things (land on a branch), but they are crazily useful for getting to the other side of the world in a day. They need a lot of brute force to do it.
And yes, maybe an AI winter is what is needed to have the time to stop and have some new ideas.
This article has me thinking about “the human capacity to outthink nature and the scalability of this.” The wheel is sort of the first time I think man outthought nature: nature is inherently bumpy and noisy, and while rolling is certainly a great form of locomotion, it’s not reliable on that terrain. When man figured out how to make long tracts of flat land (roads), we outthought nature. In some sense you could argue that our entire trajectory through science and technology, supported by the scientific method, is another example: nature sort of sucks at persisting high-level pattern intuition from one generation to the next, basically anything beyond genes.
I keep going back and forth on whether I think “super-intelligence” is achievable in any form other than speed-super-intelligence, but I definitely think that being able to think capably in three dimensions will be a major building block for AI outthinking man, and outthinking nature.
Sort of a shitpost.
The human body is an organised system of cells contributing to a greater whole - is there much difference between blood vessels designed for the efficient transport of key resources and messengers across the body and roads that carry key resources and messengers across a landmass?
In that sense has nature just replicated its ability to organise but at the species level on a planetary (interplanetary soon) scale?
Why are humans above nature...?
Fair, I mean I also love the argument that there’s really no difference between “the manmade world” and “the natural world” because the former is entirely composed of parts stripped from, or chemically altered from, the latter. So yes, nature has absolutely replicated its ability to organize at a species level through human ingenuity.
Humans are maybe separate from nature primarily on the basis of our attempts (of varying success) to steer, structure, and optimize the organization of nature around us, and knowing how to do so is not an explicit aspect of reality - or at least it did not make itself known to early humans, so it’s reasonable to believe it’s not explicit. By that I mean you’re not born with any inherent knowledge of the inner workings of quantum gravity, or of the Navier-Stokes equations, or any of the tooling that supports them, but clearly these models exist and evolve tangibly around us in every moment. We found something nature hid from the DNA-based biological tree of life, and exploited it to great effect.
Again, this is a colossal shitpost.
I think a lot of people are really bad at evaluating world models. Fei-Fei is right here that they are multimodal, but really they must codify a physics - I don't mean "physics" but "a physics". I also think it's naïve to think this can be done through data alone. I mean, just ask a physicist... [0].
But the reason people are really bad at evaluating them is that the details dominate. What matters here is consistency. We need invariance to some things and equivariance to others. As evaluators we tend to be hopeful, so the subtle changes frame to frame get overlooked, though that's kinda the most important part. It can't just be similar to the last frame; it needs to be exactly the same. You need equivariance to translation, yet that's still not happening in any of these models (and it's not a limitation of attention or transformers) - see the quick check below for what exact equivariance means. You're just going to have a really hard time getting all this data, even though by doing that you'll look like you're progressing because you're fitting it better. But in the end the models will need to create some compact formulation representing concepts such as motion. Or, in other words, a physics. And it's not like physicists aren't known for being detail oriented and nitpicky over nuances. That is bred into them with good reason.
[0] https://m.youtube.com/watch?v=hV41QEKiMlM
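Concretely, "equivariance to translation" means the operator commutes with shifting the input. A minimal check, assuming PyTorch; with circular padding the property holds exactly, while with the usual zero padding it already breaks at the borders - exactly the frame-to-frame detail that gets glossed over.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 1, 32, 32)
shift = lambda t: torch.roll(t, shifts=5, dims=-1)  # translate 5 pixels horizontally

a = conv(shift(x))   # translate first, then apply the operator
b = shift(conv(x))   # apply the operator, then translate
print(torch.allclose(a, b, atol=1e-6))              # True: exactly equivariant
# With the default padding_mode="zeros", the same check fails at the image borders.
```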
I think spatial tokens could help, but they're not really necessary. Lots of physics/physical tasks can be solved with pencil and paper.
On the other hand, it's amazing that a 512x512 image can be represented by 85 tokens (as in OAI's API), or 263 tokens per second for video (with Gemini). It's as if the memory vs compute tradeoff has morphed into a memory vs embedding question.
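Just to put numbers on how aggressive that compression is (plain arithmetic on the figures quoted above, nothing vendor-specific beyond them):

```python
pixels = 512 * 512          # 262,144 pixels in the image
tokens = 85
print(pixels / tokens)      # ~3,084 pixels folded into each token
print(pixels * 3 / tokens)  # ~9,253 bytes of raw 8-bit RGB per token
```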
This dichotomy reminds me of the "apple rotators - can you rotate the apple in your head" question. The spatial embeddings will likely solve dynamics questions a lot more intuitively (i.e., without extended thinking).
We're also working on this space at FlyShirley - training pilots to fly then training Shirley to fly - where we benefit from established simulation tools. Looking forward to trying Fei Fei's models!
Isn’t this what all the AI companies are doing now? This is what's needed to enable robotics with LLMs, and DeepMind and others are all actively working on it afaik.
She's done pretty important work, but since then she's been obsessed with the vague term `spatial intelligence`. What does it mean? There isn't a clear definition in the piece. It seems very intuitive and fundamental, but tbh not *rigorous*, nor particularly insightful.
I bet it's a dead end.
It's rare for one person to achieve many things. Her ImageNet was certainly HUGE. But she is a researcher, and I think the true power of researchers is to persist. I also often think that researchers get too absorbed in their topics. But that is just their purpose.
It could be a dead end for sure. I just hope that someone figures out the `spatial` part for AIs and brings us closer to better ones.
Holy marketing
I do wonder if this will meaningfully move the needle on agent assistants (coding, marketing, schedule my vacation, etc...) considering how much more compute (I would imagine) is needed for video / immersive environments during training and inference
I suspect the calculus is more favorable for robotics
I would argue that some would add time to that as well; a lot of our data is missing spatial and temporal information. But if we're able to take text2text models and add in audio/vision, then I suspect we can apply the same technique to add in spatial and temporal intelligence. However, the data for those is nearly nonexistent, unlike audio and visual data.
My take, after working on some algos to detect geometry from point clouds, is that it's solvable with current ML techniques, but we lack early-stage VC funding for startups working on this:
https://quantblog.wordpress.com/2025/10/29/digital-twins-the...
I have no doubt Fei-Fei and her well-funded team will make rapid progress.
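To give one concrete flavor of what "detect geometry from point clouds" means in practice - a sketch assuming Open3D and a synthetic cloud, since I can't share the real data - RANSAC plane fitting is roughly the primitive behind "swap the scanned white wall for a generic wall".

```python
import numpy as np
import open3d as o3d

# Fake a noisy wall: points near the plane z = 0, plus some clutter in front of it.
rng = np.random.default_rng(0)
wall = np.c_[rng.uniform(0, 4, 5000), rng.uniform(0, 3, 5000), rng.normal(0, 0.01, 5000)]
clutter = rng.uniform([0, 0, 0.2], [4, 3, 2.0], (500, 3))
pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(np.vstack([wall, clutter])))

plane, inliers = pcd.segment_plane(distance_threshold=0.02, ransac_n=3, num_iterations=1000)
a, b, c, d = plane
print(f"plane: {a:.2f}x + {b:.2f}y + {c:.2f}z + {d:.2f} = 0, {len(inliers)} inlier points")
```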
We think alike. Have you tried automatically replacing the point cloud of a white wall with a generic white wall?
We've discovered some kind of differentiable computer [1], and as with all computers, people have their own interests and hobbies they use it for. But unlike with computers, everyone pitches their interest or hobby as being the only one that matters.
[1] https://x.com/karpathy/status/1582807367988654081
Her company World Labs is at the forefront of building spatial intelligence models.
So she says
Not sure I want a robot that hallucinates around the home but okay if it folds my laundry and cleans the house and so on!
Personally, I think the direction AI will go is having an AI brain with something like an LLM at its core, augmented with various abilities like spatial intelligence, rather than models designed with spatial reasoning at their core. Human language and reasoning seems flexible enough to form some kind of spatial understanding, but I'm not so sure about the converse of having spatial intelligence derive human reasoning. Similar to how image generation models have struggled with generating the right number of fingers on hands, I would expect a world model designed to model physical space not to generalize to an understanding of simple human ideas.
> Human language and reasoning seems flexible enough to form some kind of spatial understanding, but I'm not so sure about the converse of having spatial intelligence derive human reasoning
I believe the null hypothesis would be that a model natively understanding both would work best / come closest to human intelligence (and possibly other modalities are also needed).
Also, as a complete layman: our language having several interconnections with spatial concepts would also point towards a multi-modal intelligence (topic: place, subject: lying under or near, respect/prospect: looking back/ahead, etc.). In my understanding these connections only make their way into an LLM's representations secondarily.
There's a difference between what a model is trained on and the inductive biases a model uses to generalize. It isn't as simple as saying 'train natively on everything'. All existing models have certain things they generalize well and certain things they don't, due to their architecture, and the architectures of the world models I've seen don't seem as capable of generalizing universally as LLMs are.
hype buzz malarky that is going to further lobotomise our children and return more value for shareholders
Also good context here is Friston’s Free Energy Principle: A unified theory suggesting that all living systems, from simple organisms to the brain, must minimize "surprise" to maintain their form and survive. To do this, systems act to minimize a mathematical quantity called variational free energy, which is an upper bound on surprise. This involves constantly making predictions about the world, updating internal models based on sensory data, and taking actions that reduce the difference between predictions and reality, effectively minimizing prediction errors.
Key distinction: Constant and continuous updating. I.e. feedback loops with observation, prediction, action (agency), and once more, observation.
It should have survival and preservation as a fundamental architectural feature.
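A cartoon of the perception half of that loop - my simplification, nowhere near the full variational machinery, and it leaves out the action/agency part: predict, observe, nudge the internal estimate to shrink the prediction error, repeat.

```python
import random

belief = 0.0            # internal estimate of some hidden state
true_state = 5.0
learning_rate = 0.3

for step in range(20):
    prediction = belief
    observation = true_state + random.gauss(0, 0.5)   # noisy sensory data
    error = observation - prediction                   # "surprise", loosely speaking
    belief += learning_rate * error                    # update the model to reduce it

print(round(belief, 2))  # converges near 5.0
```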
> taking actions that reduce the difference between predictions and reality, effectively minimizing prediction errors
Since you can't change reality itself, and you can only take actions to reduce variational free energy, doesn't this make everything into a self-fulfilling prophecy?
I guess there must be some base level of instinct that overrides this; in the case of "I think that sabertooth tiger is going to eat me" you want to make sure the "don't get eaten" instinct counters "minimizing prediction errors".
Yep. Essentially take risks, expand your world model, but above all, don’t die. There’s a tension there - like “what happens if I poke the bear” vs “this might get me killed.”
Sutton: Reinforcement Learning
LeCun: Energy Based Self-Supervised Learning
Chollet: Program Synthesis
Fei-Fei: ???
Are there any others with hot takes on the future architectures and techniques needed for A-not-quite-G-I?
> Fei-Fei: ???
Underrated and unsung. Fei-Fei Li first launched ImageNet way back in 2007, a hugely influential move that sparked much of the computer vision deep learning that followed. I remember jph00 saying in a lecture about 7 years ago that "text is just waiting for its ImageNet moment" -> then came the GPT explosion. Fei-Fei was massively instrumental in where we are today.
Curating a dataset is vastly different from introducing a new architectural approach. ImageNet is a database. It's not like inventing the convolutions for CNNs, or the LSTM, or a Transformer.
It's true that these are very different activities, but I think most ML researchers would agree that it was actually the creation of ImageNet that sparked the deep learning revolution. CNNs were not a novel method in 2012; the novelty was having a dataset big and sophisticated enough that it was actually possible to learn a good vision model from it without hand-engineering all the parts. Fei-Fei saw this years in advance and invested a lot of time and career capital setting up the conditions for the bitter lesson to kick in. Building the dataset was 'easy' in a technical sense, but knowing that a big dataset was what the field needed, and staking her career on it when no one else was doing or valuing this kind of work, was her unique contribution, and it took quite a bit of both insight and courage.
CNNs and Transformers are both really simple and intuitive so I don't think there is any stroke of genius in how they were devised.
Their success is due to datasets and the tooling that allowed models to be trained on large amounts of data, sufficiently fast using GPU clusters.
"CNNs and Transformers are both really simple and intuitive" and labeling a bunch of images you downloaded is not simple and intuitive? It was a team effort and I would hardly call a single dataset what drove modern ML. Most of currently deployed modern ML wasn't trained on that dataset and didn't come from models trained on it.
Exactly right. Neatly said by the author in the linked article.
> I spent years building ImageNet, the first large-scale visual learning and benchmarking dataset and one of three key elements enabling the birth of modern AI, along with neural network algorithms and modern compute like graphics processing units (GPUs).
Datasets + NNs + GPUs. Three "vastly different" advances that came together. ImageNet was THE dataset.
Getting basic trivial shit right is AI's next frontier.
I'd imagine Tesla's and Waymo's AI are at the forefront of spatial cognition... this is what has made me hesitant to dismiss the AI hype as a bubble. Once spatial cognition is solved to the extent that language has been solved, a range of applications currently unavailable will drive a tidal wave of compute demand. Beyond self-driving, think fully autonomous drone swarms... Militaries around the world certainly are thinking about this, and they're salivating.
The automotive AIs are narrow pseudo-spatial models that are good at extracting spatial features from the environment to feed fairly simple non-spatial models. They don't really reason spatially in the same sense that an animal does. A tremendous amount of human cognitive effort goes into updating the maps that these systems rely on.
Help me understand - my mental model of how automotive AIs work is that they're using neural nets to process visual information and output a decision on where to move in relation to objects in the world around them. Yes, they are moving in a constrained 2D space, but is that not fundamentally what animals do?
What you're describing is what's known as an "end to end" model that takes in image pixels and outputs steering and throttle commands. What happens in an AV is that a bunch of ML models produce input for software written by human engineers, and so the output doesn't come from an entirely ML system, it's a mix of engineered and trained components for various identifiable tasks (perception, planning, prediction, controls).
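A sketch of that mixed learned/engineered structure - every name and interface here is invented for illustration, not any vendor's actual stack - where ML fills specific boxes and hand-written code glues them together:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:                      # output of a *learned* perception model
    kind: str
    position: Tuple[float, float]

def perceive(camera_frame) -> List[Detection]:
    # In a real stack this is a neural network; stubbed here.
    return [Detection("pedestrian", (12.0, -1.5))]

def predict(dets: List[Detection]) -> List[Tuple[float, float]]:
    # Learned or engineered motion models produce short-horizon forecasts.
    return [d.position for d in dets]

def plan(forecasts, route) -> str:
    # Largely hand-written rules + optimization, consuming the ML outputs.
    return "yield" if any(x < 20.0 for x, _ in forecasts) else "proceed"

def control(decision: str) -> dict:
    # Classical controllers turn the plan into actuation; no learning involved.
    return {"throttle": 0.0, "brake": 0.3} if decision == "yield" else {"throttle": 0.2, "brake": 0.0}

print(control(plan(predict(perceive(None)), route=[])))
```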
Tesla's problems with its multi-camera, non-lidar system are precisely because it doesn't have any spatial cognition.
100% agree, but not just military. Self-driving vehicles will become the norm, along with robots that mow the lawn and clean the house, and eventually humanoids that can interact like LLMs and be functional robots helping out around the house.
Spatial cognition really means "autonomous robot," and nobody thinks Tesla or Waymo have the most advanced robots.
"Invest in my startup"
Before the music stops
Far too much marketing speak, far too little math or theory, and it completely misses the mark on the 'next frontier'. Maybe four years ago spatial reasoning was the problem to solve, but by 2022 it was solved; all that remained was scaling up. The actual next three problems to solve (in order of when they will be solved) are:
- Reinforcement Learning (2026)
- General Intelligence (2027)
- Continual Learning (2028)
EDIT: lol, funny how the idiots downvote
Combinatorial search is also a solved problem. We just need a couple of Universes to scale it up.
If there isn't a path humans know how to take with their current technology, it isn't a solved problem. It's much different than people training an image model for research purposes, and knowing that $100m in compute is probably enough for a basic video model.
Haven't RLHF, and RL with LLM feedback, been around for years now?
Large latent flow models are unbiased. On the other hand, if you purely use policy optimization, RLHF will be biased towards short horizons. If you add in a value network, the value has some bias (e.g. MSE loss on the value --> Gaussian bias). Also, most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal, which SGD then smooths incorrectly. So, basically, there are a lot of biases that show up in RL training which can make it both hard to train and, even if successful, not necessarily optimizing what you want.
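On the short-horizon point, one standard way that bias shows up (a toy illustration of discounting, not specific to any particular RLHF setup) is that the effective horizon is roughly 1/(1 - gamma), so rewards beyond it barely move the objective:

```python
gamma = 0.99
print(1 / (1 - gamma))      # effective horizon: ~100 steps
for t in (10, 100, 1000):
    print(t, gamma ** t)    # weight on a reward t steps out
# 10 -> ~0.904, 100 -> ~0.366, 1000 -> ~0.000043: the distant reward is ~20,000x weaker
```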
We might not even need RL as DPO has shown.
> if you purely use policy optimization, RLHF will be biased towards short horizons
> most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal which SGD smooths incorrectly
What do you consider "General Intelligence" to be?
A good start would be:
1. Robust to adversarial attacks (e.g. in classification models or LLM steering).
2. Solving ARC-AGI.
Current models are optimized to solve the particular problem they're presented with, not really to find the most general problem-solving techniques.
I like to think I'm generally intelligent, but I am not robust to adversarial attacks.
Edit: I'm trying arc-agi tests now and it's looking bad for me: https://arcprize.org/play?task=e3721c99
In my thinking what AI lacks is a memory system
That has been solved with RAG, OCR-ish image encoding (DeepSeek, recently), and just long context windows in general.
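For what it's worth, the RAG half of that is a pretty small pattern. Here's a toy sketch with a fake bag-of-words "embedding" (a real system would use a proper embedding model and a vector store): retrieve the closest note, prepend it to the prompt.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

notes = ["the build uses cmake 3.28",
         "deploys go through the staging cluster",
         "the API key lives in the vault under project-x"]
note_vecs = np.stack([embed(n) for n in notes])

query = "how do deploys work?"
scores = note_vecs @ embed(query)                     # cosine similarity to each note
context = notes[int(np.argmax(scores))]               # retrieve the closest note...
prompt = f"Context: {context}\n\nQuestion: {query}"   # ...and prepend it to the prompt
print(prompt)
```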
RAG is like constantly reading your notes instead of integrating experiences into your processes.
Not really. For example we still can’t get coding agents to work reliably, and I think it’s a memory problem, not a capabilities problem.
On the other hand, test-time weight updates would make model interpretability much harder.
I enjoy Fei-fei li's communication style. It's straight and to the point in a way that I find very easy to parse. She's one of my primary idols in the AI space these days.