Ultra-large skyscrapers. Credit: Author via Midjourney
I don’t mean ‘over’ as in “you won’t see a new large AI model ever again” but as in “AI companies have reasons to not pursue them as a core research goal — indefinitely.”
Don’t get me wrong. This article isn’t a critique of the past years — even if I don’t buy the “scale is all you need” argument, I acknowledge just how far scaling has advanced the field.
A parallel can be drawn between the 2020–2022 scaling race and, keeping some distance, the space race of the 1950s to 1970s. Both advanced science significantly as a byproduct of other intentions.
But there’s a key distinction.
While space exploration was innovative in nature, the quest for novelty isn’t present in the “bigger is better” AI trend: To conquer space, the US and USSR had to design novel paths toward a clear goal. In contrast, AI companies have blindly followed a predefined path without knowing why or whether it’d lead us anywhere.
You can’t put the cart before the horse.
That makes all the difference and explains why and how we’ve got here.
The scaling laws of large models
Some companies use AI to automate processes, improve efficiency, and reduce costs. Others want to advance scientific understanding or improve people’s lives and well-being. And yet others want to build the “last invention” we’ll make, or so they think.
Call it AGI, superintelligence, human-level AI, or true AI.
In any case, it’s been a recurring goal since the field’s birth in 1956. But the idea got tangible in 2012, then more in 2017, and finally exploded in 2020.
The last milestone was OpenAI’s discovery and application of the strongest version of scaling laws for large language models (LLMs).
They accepted, earlier than anyone else, that sheer model size (and thus data and computing power) was key to advancing the field. OpenAI’s faith in the scaling hypothesis was reflected in the Jan 2020 empirical paper “Scaling Laws for Neural Language Models.”
In May 2020, OpenAI announced GPT-3, a direct result of applying the scaling laws. The unprecedentedly big 175-billion-parameter LLM put OpenAI ahead of everyone else.
The belief that making models bigger would yield emergent properties, like true intelligence, was suddenly a tangible reality.
The race toward AGI… or something
From 2020 to 2022, until very recently, most high-impact, newsworthy announcements in AI were about LLMs (AlphaFold is one of the few notable exceptions).
It was during this period that phrases like “AGI is coming” and “scale is all you need” became super popular.
OpenAI set a precedent. Google, Meta, Nvidia, DeepMind, Baidu, Alibaba… the major players in the field lost no time. Their priority was surpassing GPT-3. It wasn’t a competition with OpenAI but an attempt at corroborating the rumors: Did scale work so well? Could AGI really be around the corner?
Big tech companies bought the scale argument and wanted to signal their presence in the AI race. Here’s a brief, incomplete list of how the landscape changed in one year, from mid-2021 to mid-2022 [company: model (size, release date)]:
- Google: LaMDA (137B, May 2021), and PaLM (540B, Apr 2022)
- Meta: OPT (175B, May 2022), and BlenderBot 3 (175B, Aug 2022)
- DeepMind: Gopher (280B, Dec 2021), and Chinchilla (70B, Apr 2022)
- Microsoft-Nvidia: MT-NLG (530B, Oct 2021)
- BigScience: BLOOM (176B, June 2022)
- Baidu: PCL-BAIDU Wenxin (260B, Dec 2021)
- Yandex: YaLM (100B, June 2022)
- Tsinghua: GLM (130B, July 2022)
- AI21 labs: Jurassic-1 (178B, Aug 2021)
- Aleph Alpha: Luminous (200B, Nov 2021)
Credit: Alan D. Thompson (Life Architect)
A pretty dramatic picture — cherry-picked to back my argument, yes, but quite revealing regardless. Companies were running away from small-scale AI.
But what were they looking for among hundreds of billions of parameters? They didn’t know.
Scale proved to improve performance, but were benchmark results translatable to real-world performance? They didn’t know.
Could they really reach AGI with sheer size? Could scale alone lead us to intelligence?
They also didn’t know.
Every few months a company released a new largest model. But they were rushing ahead to avoid thinking about the limitations. They didn’t have a plan. They didn’t know where they were going or, most importantly, why.
The title of “largest model ever” changed hands so often that it was hard to keep track. In April 2022 Google released PaLM. Now, 6 months later, no other model has claimed the throne.
Are they done?
The Algorithmic Bridge is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
They’re done — taking shots in the dark
At some point, the excitement of getting yet another “largest” model faded. The news felt irrelevant. Larger models yielded only incremental advances (even if the media was portraying them as breakthroughs). Building huge models to get nothing in return felt greedy, even gross.
The scaling race was intense and then stopped.
Key lessons on why large AI models aren’t the thing now stem from research conducted by those very companies. The scaling hypothesis is extremely attractive in its simplicity — but not everyone believes it.
And then there’s the bigger picture — which I so frequently try to paint in my articles.
The AI field doesn’t exist in a vacuum. Yet, one of the most prevalent aspects I notice when I read discussions on AI, scale, AGI, superintelligence, etc. is people looking exclusively at a carefully selected subset of reality.
Purely technical — and sometimes scientific — aspects matter. Everything else? Irrelevant. Key aspects that have direct influences on technological development are often ignored.
The perfect recipe for failing predictions.
Below is a broad — although likely incomplete — compilation of reasons why ultra-large AI models are over. This isn’t to say companies have stopped for these reasons, but they should consider them before resuming a pointless race.
New scaling laws
One of the true breakthroughs in large-scale AI since GPT-3 was DeepMind’s Chinchilla. It proved that OpenAI’s scaling laws, universally accepted at the time, were incomplete. Training data was as important as model size.
Compute-optimal models, like Chinchilla, needed more data, not more parameters.
At “just” 70B parameters in size, Chinchilla instantly became the second most performant model of all across benchmarks, only behind PaLM (surpassing GPT-3, Gopher, and MT-NLG).
DeepMind found that all super-large models are “significantly undertrained.” They’re unnecessarily big.
Why make models larger when there’s room for improvement at lower sizes?
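To make the “undertrained” claim concrete, here is a rough back-of-the-envelope sketch. The parameter counts come from the list earlier in the article. The training-token counts (roughly 300B tokens for GPT-3 and Gopher, 1.4T for Chinchilla) are approximate figures from the models’ papers, not from this article, and the 6 × parameters × tokens compute estimate is a common rule of thumb, so treat the output as illustrative.

```python
# Rough comparison of how much training data each model saw per parameter.
# Assumptions (not stated in the article): approximate token counts from the
# models' papers, and the common "6 * N * D" rule of thumb for training FLOPs.

MODELS = {
    # name: (parameters, training tokens)
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def tokens_per_parameter(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def approx_training_flops(params: float, tokens: float) -> float:
    """Approximate training compute using the 6 * N * D rule of thumb."""
    return 6 * params * tokens


def summarize(models=MODELS):
    """Return tokens-per-parameter and approximate FLOPs for each model."""
    return {
        name: {
            "tokens_per_param": tokens_per_parameter(n, d),
            "flops": approx_training_flops(n, d),
        }
        for name, (n, d) in models.items()
    }


if __name__ == "__main__":
    for name, row in summarize().items():
        print(f"{name:<11} {row['tokens_per_param']:5.1f} tokens/param  "
              f"~{row['flops']:.2e} FLOPs")
```

Under these assumptions, Chinchilla sees about 20 tokens per parameter while GPT-3 and Gopher see fewer than 2, at a compute budget of the same order as Gopher’s.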
Prompt engineering limitations
Benchmark performance doesn’t necessarily translate to real-life settings. A user that wants to get the most out of GPT-3 has to have great prompting skills.
And those skills don’t only depend on the effort they’ve put into learning them, but also on the collective knowledge we have about good prompting practices (people are still coming up with new methods: “let’s think step by step,” chain-of-thought prompting, ask-me-anything, self-ask, and so on).
Unless our collective prompting knowledge reaches the theoretical maximum (which we don’t know, and wouldn’t recognize even if we had already reached it), we may never know what LLMs are actually capable of.
Prompting resembles searching for an object in a dark room — when you don’t know what object you’re looking for. If we haven’t explored the latent space deep enough, why build a larger model?
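As a small illustration of how much the wording of a prompt is a design choice, the sketch below builds two prompts for the same question: a direct one, and one that appends the “let’s think step by step” phrase mentioned above. The Q:/A: template, the function names, and the sample question are illustrative assumptions; no model is called.

```python
# Illustrative prompt templates only; no language model is called here.
# The "Q: ... A:" layout and the function names are assumptions for this sketch.

STEP_BY_STEP_TRIGGER = "Let's think step by step."


def direct_prompt(question: str) -> str:
    """Plain question-answer prompt."""
    return f"Q: {question}\nA:"


def step_by_step_prompt(question: str) -> str:
    """Same question, with a trigger phrase inviting intermediate reasoning."""
    return f"Q: {question}\nA: {STEP_BY_STEP_TRIGGER}"


if __name__ == "__main__":
    # Hypothetical sample question, chosen only for illustration.
    question = "A farmer has 3 pens with 4 sheep each. How many sheep are there?"
    print(direct_prompt(question))
    print("---")
    print(step_by_step_prompt(question))
```

The same model can answer these two prompts differently, which is why a benchmark score obtained with one prompt style says little about what another style might unlock.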
Suboptimal training settings
Training LLMs is so expensive that companies have to make cost-accuracy trade-offs. This results in underoptimized models. A single training run of GPT-3 cost OpenAI $5–12M.
OpenAI and Microsoft realized they could improve GPT-3 further by using the best set of initial hyperparameters during training. They found a technique to do it at virtually no cost, simply by transferring those hyperparameters from a smaller, equivalent model.
They showed that a 6.7B version of GPT-3 performed better than its 13B sibling.
The deep learning revolution owes a lot to the gaming industry. Nvidia GPUs were intended for graphics, not convolutional operations — but they worked just fine.
As models got larger, chips’ memory became increasingly insufficient to host them.
Engineers applied parallelization techniques (data, model, pipeline) to make training work across entire hubs.
However, as models grew even further, they became almost intractable.
Parallelization is a band-aid. AI hardware companies like Cerebras Systems are tackling this, but universal solutions are yet to be found.
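A quick estimate shows why a single chip cannot host these models. The sketch below assumes 16-bit weights (2 bytes per parameter) and a hypothetical accelerator with 80 GB of memory; both numbers are assumptions for illustration, and the estimate covers weights only, ignoring gradients, optimizer state, and activations, which make training considerably more demanding.

```python
import math

# Assumptions for this sketch (not from the article): 2 bytes per parameter
# (16-bit weights) and an accelerator with 80 GB of memory.
BYTES_PER_PARAM = 2
DEVICE_MEMORY_GB = 80.0


def weights_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory needed just to store the model weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9


def min_devices_for_weights(
    num_params: float,
    device_memory_gb: float = DEVICE_MEMORY_GB,
    bytes_per_param: int = BYTES_PER_PARAM,
) -> int:
    """Smallest number of devices whose combined memory fits the weights alone."""
    return math.ceil(weights_memory_gb(num_params, bytes_per_param) / device_memory_gb)


if __name__ == "__main__":
    for name, params in [("GPT-3", 175e9), ("MT-NLG", 530e9), ("PaLM", 540e9)]:
        print(f"{name:<7} {weights_memory_gb(params):7.0f} GB of weights, "
              f"at least {min_devices_for_weights(params)} devices")
```

Even this lower bound forces the weights to be split across several devices, which is where model and pipeline parallelism come in.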
Biological neurons >>> artificial neurons
A study published in Jan 2020 in Science found that dendrites can simulate the behavior of entire ANNs — that’s two orders of magnitude greater complexity in biological neurons than artificial ones.
Another study published in Sep 2021 in Neuron proved artificial neurons are too simple. You need ~1000 of them — built into a whole neural network — to represent a biological neuron accurately.
This is evidence that the foundation on which deep learning and neural networks are built is too simplistic.
In turn, this renders pointless comparisons between large models and human brains. It’s not an apples-to-apples comparison. More like comparing apples to apple tree forests.
Let’s see how these findings change the too-typical comparison:
The human brain has ~100 billion neurons x ~10,000 connections. That’s 1,000 trillion synapses. If we accept the complexity delta of two orders of magnitude, we’d need a model with 100 quadrillion parameters to reach the scale of the human brain.
That’s roughly 185,000 times the size of PaLM (540B), the largest model in the list above, and more than 500,000 times the size of GPT-3.
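The arithmetic above can be checked with a short sketch. The neuron count, the connections per neuron, and the 100x complexity factor are the article’s own figures; the function names are illustrative. The resulting ratio depends on which model is taken as the reference, so the sketch reports it for both GPT-3 (175B) and PaLM (540B).

```python
# Figures taken from the text: ~100 billion neurons, ~10,000 connections each,
# and a two-orders-of-magnitude complexity gap between biological and
# artificial neurons. Function names are illustrative.
NEURONS = 100_000_000_000
CONNECTIONS_PER_NEURON = 10_000
COMPLEXITY_FACTOR = 100


def brain_synapses() -> int:
    """Estimated number of synapses in the human brain."""
    return NEURONS * CONNECTIONS_PER_NEURON


def brain_equivalent_parameters() -> int:
    """Parameters needed to match the brain under the complexity factor."""
    return brain_synapses() * COMPLEXITY_FACTOR


def scale_gap(model_params: int) -> float:
    """How many times larger the brain-equivalent model is than a given model."""
    return brain_equivalent_parameters() / model_params


if __name__ == "__main__":
    print(f"Synapses: {brain_synapses():.0e}")
    print(f"Brain-equivalent parameters: {brain_equivalent_parameters():.0e}")
    for name, params in [("GPT-3", 175_000_000_000), ("PaLM", 540_000_000_000)]:
        print(f"Gap vs {name}: ~{scale_gap(params):,.0f}x")
```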
Maybe pursuing AGI mindlessly through pure scaling isn’t so reasonable after all.
Wait, but why?
Science is built on falsifiable hypotheses. Applying scale as a road to AGI is a leap of faith. Rushing ahead to avoid hard questions isn’t the best approach to scientific inquiry.
Without a clear goal, and moved by dubious empirical evidence, plans break down.
What was the scientific purpose of making models larger if companies didn’t know why they were doing it or what they were looking for?
Dubious construct validity and reliability
Is benchmarking the best way to test AI’s ability?
Construct validity refers to how well a test measures a concept we want to measure when it’s not directly measurable. For instance, do IQ tests measure intelligence? Do AI language benchmarks measure AI models’ linguistic ability?
Reliability refers to the consistency of a test: whether it always gives the same results under the same conditions or not.
Because of how AI benchmarks are designed, it’s often difficult to ensure adequate validity and reliability (this isn’t true for all benchmarks).
The world is multimodal
Is it reasonable to expect AGI to emerge from LLMs, which are solely focused on language?
The world is multimodal and our brain is multisensory. Exploring other modes of information and how they interact inside neural networks makes more sense than further enlarging text-only models.
DeepMind’s generalist agent, Gato, has 1.2B parameters. That’s over 100x smaller than GPT-3.
The AI art revolution
OpenAI’s DALL·E 2, Midjourney, Stable Diffusion, Meta’s Make-A-Scene, Google’s Imagen, and Parti… AI art models comprise 2022’s AI revolution.
Yet, there’s not a single generative visual AI model that comes anywhere close to the typical LLM size (Google’s Parti is the largest at 20B). And high-quality models abound in the 1–5B parameter range (Stable Diffusion is 1.2B and DALL·E 2 is 3.5B).
This makes AI art models easier to build, train, and deploy.
As I wrote in my last article, “the attractiveness of the visual component, the easiness with which anyone could leverage the models, and the relatively smaller size compared to their language counterparts” makes them more appealing to both companies and consumers.
What is AGI anyway?
AGI, human-level AI, human-like AI, superintelligence, true AI, strong AI… Imprecise terminology is a symptom that we don’t know much about what we’re talking about.
The lack of definitions and measuring tools creates an insurmountable gap between our knowledge of reality and reality itself.
How can we draw conclusions if AI models are uninterpretable and we rely on blind prompting to prove the presence or absence of those presumed emergent properties?
The LaMDA sentience/consciousness debate made this apparent. Saying it was sentient wasn’t even wrong, as the question was empty to start with because we lack adequate definitions and tools.
Human cognitive limits
We unquestioningly assume that, if we manage to build AGI, we’ll have the cognitive capacity to know we’ve done it.
What if we’re not that intelligent?
What if human intelligence is a very narrow form of intelligence, instead of general, and our capabilities aren’t great enough?
If we assume that to be true, and assume AGI has human-level intelligence for every task it does — the obvious conclusion is that AGI would surpass us greatly.
Couldn’t it simply hide from us before we notice?
That’s precisely what some people worry about. AI becoming much more intelligent than us, too soon.
Shouldn’t we ensure that AI is friendly before attempting to build it?
If AGI turns out to be “bad” or misaligned, it may be too late for us to do anything to revert the situation.
It’s weird that it’s precisely those most concerned with existential risks who are willing to work on scaling AI models boundlessly.
Their reasoning is circular: Because everyone is working to advance AI and progress is inevitable, we should work to advance AI to get there first. That way, we can ensure it’s done correctly.
But it’s only because they follow this reasoning that they’re working more and more to advance the field, making progress inevitable.
Aligned AI, how?
Those who take existential risks seriously work in parallel on the alignment problem: how to make truly intelligent AI align with the values and wants of humanity, even after it gets much more intelligent than the whole of humanity.
If the problem sounds not just hard, but intangible, it’s because it is. Not even people working on alignment know how to do it — mainly because the issues it tries to pre-solve are so distant we can’t begin to correctly conceptualize them.
The open-source revolution
Open source is eating AI and not only in the art space (where models are smaller and therefore cheaper). EleutherAI (GPT-NeoX-20B) and BigScience (BLOOM) are examples of non-profit initiatives that have built LLMs to be available for everyone.
This disincentivizes companies from devoting many resources to training and deploying LLMs: there’s no guaranteed ROI if people can go and download a model of similar characteristics and quality.
The dark side of LLMs
LLMs can do a lot of things well. But they’re limited by the information they’re fed.
Developers train them with info scraped from the internet. The consequences? The toxic content that populates the web poisons LLMs to the point that they can become unusable.
LLMs can also generate seemingly human-written texts, which makes them great tools to generate believable misinformation.
These limitations have generated a backlash against AI companies to force them to stop training these models with the sole purpose of passing some tests.
Bad for the climate
Although the longtermists consider AI the most important problem we’ll ever face, many more people think the really important, and urgent, issue is climate change.
Building, training, and deploying LLMs is polluting. It’s not the most polluting activity we do, but among those for which we don’t have a purpose, it probably is (apart from crypto, of course).
The benefit-cost ratio is low
Millions go into training LLMs. The larger the models are (all other things being equal), the more expensive they get.
OpenAI partnered with Microsoft in 2019 for $1 billion, and even then Sam Altman thought they’d run out of money in 5 years.
OpenAI isn’t in this business to make money (they’re true believers), but other companies are. If there’s no benefit, there’s no reason to bear the cost.
GPT-3 is good enough to do most tasks people and companies may want it for.
Companies that want to create useful, valuable products and services don’t need to keep going. Larger models would only imply minor improvements, unnoticeable at the consumer level.
LLMs are useful
Large AI models aren’t useless. That’s not, in any way, the argument here.
What is pointless is pursuing a mirage of what they can become. LLMs are useful but they aren’t agents capable of reasoning and understanding in the human sense.
We’ll eventually see a larger AI model
I don’t claim we’ll never again see a new largest AI model.
In fact, I’d guess that if we eventually build an AGI, at least a part of it will be a deep learning large model larger than any of the existing ones.
Yet, the reasons not to pursue this avenue right now are very strong.
The arguments are valid even if my predictions are wrong
Even if scale turns out to be the key to AGI (i.e. scale + some other things = AGI), the arguments laid out here can be true.
There’s more than enough to gain from what we’ve already built, and more than enough to reflect on in the limitations we’ve encountered, that there’s no need to keep building bigger things for the sake of it.