Are we seeing diminishing returns on large language models?

Bla bla bla... I'll skip the introduction to LLMs since everyone knows what they are by now. Even my mom has asked me about them. The question I want to pose is: are we seeing diminishing returns on large language model scaling?

The key observation that kick-started the evolution from GPT-2 to GPT-3 was that with Transformers, model loss falls smoothly and predictably as the parameter count grows, following a power law (a straight line on a log-log plot). This is unlike earlier vision models, where doubling the parameter count would only buy a limited improvement in classification accuracy. GPT-4 basically confirmed the observation: a larger GPT is indeed better. And this is scary: if the only thing stopping us from achieving AGI is how much compute we can throw at it, then we will build AGI as a natural consequence of our technological progress. Very soon, if trends continue.
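
For reference, this is the power-law form reported in Kaplan et al.'s "Scaling Laws for Neural Language Models" (2020), where L is the test loss and N the non-embedding parameter count; the fitted constants are quoted from the paper as I remember them, so treat the exact values as approximate:

```
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076,\quad N_c \approx 8.8 \times 10^{13}
```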

But lately I find myself doubting it. It's just a gut feeling and nothing more. Some observations that I think are worth mentioning:

  • Inference-efficient architectures like Mamba and RWKV v5 gaining traction.
  • Practically everyone uses Llama-2-13B for most tasks.
  • Mistral's Mixtral 8x7B (a mixture-of-experts model) getting a lot of attention.
  • NVIDIA and other firms independently converging on FP8 inference.
  • No one is training their own models from scratch (besides the usual suspects); most use LoRA or light fine-tuning (see the sketch after this list).
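
Since almost everyone is adapting an existing base model rather than pretraining, here is a minimal LoRA sketch using Hugging Face's peft library. The base model name and hyperparameters below are illustrative assumptions on my part, not a recipe:

```
# Minimal LoRA adapter setup with Hugging Face peft.
# Model name and hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
config = LoraConfig(
    r=8,                                  # low-rank dimension of the adapters
    lora_alpha=16,                        # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # inject adapters into attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the 13B weights

# From here, `model` trains like any other transformers model,
# but only the small adapter matrices receive gradients.
```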

Basically, the entire industry has agreed that for practical applications, the compute needed to run large (~60B-parameter) models is not worth it, and smaller models are good enough for most tasks, whether it be Q&A or summarization. And that's before counting tricks like RAG and summarization-based memories, which are used to work around the limited context size. Compute on the application side is now the limiting factor: instead of diminishing returns on model performance, the diminishing returns are on the economics of running these models.
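
To make the RAG trick concrete, here is a bare-bones sketch of the idea: embed a corpus once, retrieve the most similar snippets per query, and stuff them into the prompt. The embedding model name, corpus, and `retrieve` helper are placeholders I chose, assuming the sentence-transformers library:

```
# Bare-bones retrieval-augmented generation (RAG) sketch.
# Embedding model, corpus, and helper names are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = ["Doc about GPUs...", "Doc about Transformers...", "Doc about Gemini..."]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Cosine similarity reduces to a dot product on normalized vectors.
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q)[::-1][:k]
    return [docs[i] for i in top]

# Prepend the retrieved snippets so a small-context model can still "know" them.
context = "\n".join(retrieve("How do Transformers scale?"))
prompt = f"Context:\n{context}\n\nQuestion: How do Transformers scale?\nAnswer:"
```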

My experience with today's 7B and 13B models is on a 4090-class card. A 4090 can only serve one or two users at a time, and the UX is not great since high latency is not uncommon. I suspect what happens next is similar to what happened with vision models back in 2015: the industry will focus on making the models more efficient (happening now), and cheap hardware designed specifically for running these models will follow.
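
A back-of-envelope estimate shows why one consumer card saturates so quickly: autoregressive decoding is memory-bandwidth-bound, since every generated token streams all the weights through the GPU once. The bandwidth figure is the 4090's published spec; the rest are round numbers I'm assuming:

```
# Why a 4090-class card saturates with one or two users.
params = 7e9                # a 7B model (a 13B needs quantization to fit in 24 GB)
bytes_per_param = 2         # FP16 weights -> ~14 GB streamed per decoded token
mem_bw = 1.008e12           # RTX 4090 peak memory bandwidth, ~1008 GB/s

tokens_per_sec = mem_bw / (params * bytes_per_param)  # ideal single-stream ceiling
print(f"~{tokens_per_sec:.0f} tokens/s")              # ~72 tokens/s, best case

# Split that across two interactive users (plus KV-cache traffic and kernel
# overheads) and per-user latency degrades quickly.
```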

Afterwards, I bet large models will keep getting larger, but the speed at which they grow will slow down and people will stop caring about the next release. Remember when YOLOv3 shook the world? Now YOLOv8 is out and it works wonders, but it doesn't hit the news as hard as YOLOv3 did (I know it's from a different author). I think the same will happen to LLMs. GPT-5 will be a big deal, but only marginally better than GPT-4. Heck, GPT-4 was only marginally better than GPT-3.5.

Anyway, I just wanted to share my thoughts amidst the hype. This article is more than likely wrong in the long term (which is simply years in AI time), but it is my prediction for the near term.


To be clear, I am not claiming that LLMs won't evolve into AGI or that AGI is impossible. The scaling laws are real. And the first nation or organization to build AGI will probably have a lot to gain, and thus a lot of incentive to do so.
