Introduction

  • The idea that falling model costs will save you is a fallacy: it’s the outdated models, the ones nobody uses, that get cheaper. Users will always pay for the strongest new flagship.
  • The real cost trap isn’t the price per token; it’s the growth in AI capability: the more complex the task, the more tokens it consumes, which guarantees that a flat monthly subscription will eventually be crushed.
  • The AI subscription model is a “prisoner’s dilemma”: choose pay-per-use and you lose market share; choose a flat monthly fee and you mortgage your future viability.
  • The only escapes from the burn-rate trap: either build a high-switching-cost “moat” that keeps enterprise clients locked in, or vertically integrate, treating AI as a loss leader that attracts users while the surrounding infrastructure earns the profit.

Further Reading

The True Cost of Tokens is Surging

The Myth That “Language Model Costs Will Drop by 10 Times” Won’t Save AI Subscription Services from Cost Pressures


Picture this: you’ve launched a company, and you know consumers will pay at most $20 per month. No problem, you think; this is the classic VC playbook: price at cost and trade margin for growth. You’ve worked out customer acquisition cost (CAC), customer lifetime value (LTV), all the necessary metrics. But here’s where it gets interesting: you’ve seen that widely circulated a16z chart showing the cost of large language models (LLMs) dropping 10x every year.

Source a16z

So you reason: if I break even at $20 a month today, and next year the model costs drop 10 times, my profit margin will skyrocket to 90%. Losses are only temporary; profits are inevitable.

This logic is so simple even a VC associate could grasp it:

  • Year 1: Break even at $20/month
  • Year 2: With computation costs down 10-fold, achieve a 90% profit margin
  • Year 3: Start buying yachts

This strategy seems rational: “If LLM inference costs drop 3x every six months, we’ll be fine.”

But 18 months later, margins remain at an unprecedented negative… Windsurf got torn apart, and even Claude Code had to walk back its original $200/month unlimited plan.

Your company is still bleeding money. The models really did get cheaper; GPT-3.5 costs a fraction of what it once did. Yet somehow margins got worse, not better.

There’s clearly an issue here.

Outdated Models, Like Yesterday’s Newspapers

The cost of GPT-3.5 is a tenth of what it was before. Yet, it’s about as relevant as a flip phone at an iPhone launch.

When a new state-of-the-art (SOTA) model ships, 99% of demand shifts to it instantly. And consumers expect the products they use to be built on it.

Now look at the actual pricing history of the frontier models that capture 99% of demand at any given time:

Source iaiuse.com

Noticed anything?

  • When GPT-4 launched at $60 per million tokens, everyone chose it, despite GPT-3.5 being 26 times cheaper.
  • When Claude 3 Opus launched at $60, users flocked to it, even though GPT-4 had just gotten cheaper.

Yes, costs are dropping 10x, but only for outdated models, the ones that stack up against the frontier about like a Commodore 64.

So here lies the first fatal flaw of the “costs will decrease” strategy: market demand exists solely for the strongest language model. Period. And the price of the strongest model stays roughly constant, because it reflects the current limit of inference technology.

Saying, “This 1995 Honda Civic is much cheaper now!” is misleading. Sure, that particular car has dropped in price, but the MSRP of a 2025 Toyota Camry is $30,000.

When you use AI, whether for coding, writing, or thinking, you always reach for the highest quality. No one opens Claude and thinks, “Maybe I should use the inferior version and save my boss some money.” We are cognitively greedy: we want the best brain available, especially when our precious time is on the line.

The Rate at Which Models Burn Cash is Beyond Imagination

“Okay, but this sounds manageable, right? We just need to stay around breakeven forever?”

Oh, my dear naive child.

The per-token price of each generation’s frontier models hasn’t gone up, but something worse is happening: the number of tokens they consume is growing exponentially.

ChatGPT used to answer a one-sentence question with one sentence. Now “Deep Research” spends three minutes planning, twenty minutes reading, and five more minutes writing up a report, and o3 will happily think for twenty minutes over a simple “hello.”

The explosive progress of reinforcement learning (RL) and test-time compute has produced an unexpected result: the length of task an AI can complete doubles every six months. Tasks that once returned 1,000 tokens now return 100,000.

Source METR

When you extrapolate this trend, the math gets crazy:

Today, a 20-minute “Deep Research” run costs around $1. By 2027 we will have agents that can run for 24 hours straight without losing the thread… combine that with the flat pricing of frontier models, and a single session could cost $72. Per user. Per day. And nothing stops you from running several tasks at once, asynchronously.
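To make that concrete, here is the back-of-the-envelope version in Python. Every input is one of the article’s assumptions (about $1 per 20-minute session today, flat frontier pricing, agents eventually running around the clock), not measured data:

```python
# Back-of-envelope agent cost extrapolation, using the article's assumptions:
# a 20-minute "Deep Research" session costs ~$1 today, and frontier per-token
# pricing stays roughly flat while task length keeps growing.

COST_PER_AGENT_HOUR = 1.0 / (20 / 60)  # ~$3 per hour of continuous agent work

scenarios = [
    ("2025: one Deep Research run per day", 20 / 60),
    ("2027: one 24-hour autonomous agent", 24.0),
]

for label, hours_per_day in scenarios:
    daily = hours_per_day * COST_PER_AGENT_HOUR
    monthly = daily * 30
    print(f"{label}: ${daily:,.0f}/day -> ${monthly:,.0f}/month per user")

# Output:
#   2025: one Deep Research run per day: $1/day -> $30/month per user
#   2027: one 24-hour autonomous agent: $72/day -> $2,160/month per user
# Even the 2025 case already blows through a $20/month subscription.
```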

Once we can deploy agents to run 24/7 asynchronously, we won’t give them one instruction and wait for the result. We’ll batch-schedule them: a whole fleet of AI workers grinding in parallel, burning tokens like it’s the dot-com bubble of 1999.

Clearly, and I must stress this, a $20 monthly subscription cannot even cover one user running a single $1 deep-research session per day. Yet that is exactly the future we are heading toward: every jump in model capability is matched by a real jump in the compute it consumes.

It’s as if you built a more fuel-efficient engine and then spent the savings on a monster truck. Yes, each gallon takes you further, but your total fuel burn is now fifty times higher.

This is the fundamental force that squeezed Windsurf to the brink, and it is the predicament of every startup running a “flat-rate subscription plus heavy token consumption” business model.

Anthropic’s Brave Attempt to Hedge Against “Cost Pressure”

Claude Code’s unlimited plan was one of the most ingenious attempts we’ve seen to weather this storm. They pulled every lever, and still got crushed.

Their strategy was indeed clever:

1. Pricing at 10 Times Higher

While Cursor charged $20/month, they charged $200/month, buying themselves ten times the runway before bleeding out.

2. Automatically Scaling Models Based on Load

Under heavy load, they would drop from Opus ($75/million tokens) to Sonnet ($15/million tokens), with Haiku handling read-heavy tasks. It’s AWS autoscaling, but for brains.

They likely built this behavior directly into the model weights, a paradigm shift we may see more of in the future.
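Whether it lives in the weights or in the serving stack, the routing logic itself is simple. Here is a minimal application-layer sketch: the tier ordering mirrors the article’s, while the numeric “system load” signal and the thresholds are invented purely for illustration:

```python
# Hypothetical sketch of load-based model downgrading, in the spirit of what
# the article describes (Opus at $75/M tokens, Sonnet at $15/M). The load
# signal and the thresholds are invented for illustration.

TIERS = [
    ("claude-opus",   0.5),  # flagship: only while capacity is plentiful
    ("claude-sonnet", 0.8),  # mid-tier: the default under moderate load
    ("claude-haiku",  1.0),  # cheap tier: read-heavy work and peak load
]

def pick_model(system_load: float) -> str:
    """Return the most capable tier we can afford at the current load (0..1)."""
    for name, load_ceiling in TIERS:
        if system_load <= load_ceiling:
            return name
    return TIERS[-1][0]  # past every ceiling, fall back to the cheapest tier

print(pick_model(0.3))   # claude-opus
print(pick_model(0.7))   # claude-sonnet
print(pick_model(0.95))  # claude-haiku
```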

3. Offloading Processing Tasks to User Machines

Why fire up a sandbox when users have idle CPUs ready?

Yet despite all this engineering cleverness, token consumption still went supernova.

Source Viberank

Ten billion. Ten billion tokens. That’s the equivalent of 12,500 copies of “War and Peace.” In a single month.

How is that even possible? How does a single person consume ten billion tokens, even if each run lasts ten minutes?

It turns out that runs lasting 10-20 minutes at a stretch are the perfect demonstration of the power of a for loop. Once token consumption is decoupled from how long the user actually sits in the app, the laws of physics take over: give Claude a task, let it check its own work, refactor, optimize, and repeat, until the company goes bankrupt.
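That for loop, sketched below. Everything here is a stand-in (run_agent, the token counts, and the pass rate are illustrative inventions, not any real API), but it shows how a ten-second instruction becomes an unbounded token bill:

```python
# A toy version of the loop that burned ten billion tokens. run_agent is a
# stand-in for a real agent call; the token counts and pass rate are made up.
import random
from dataclasses import dataclass

@dataclass
class AgentResult:
    tokens_used: int
    passed: bool

def run_agent(prompt: str) -> AgentResult:
    # Each iteration re-reads the codebase and rewrites chunks of it.
    return AgentResult(tokens_used=random.randint(50_000, 200_000),
                       passed=random.random() < 0.05)

def improve_codebase(task: str, max_iterations: int = 100) -> int:
    """One user instruction, unbounded machine effort: loop until 'done'."""
    tokens = 0
    for _ in range(max_iterations):
        result = run_agent(f"{task}; then check your work, refactor, optimize")
        tokens += result.tokens_used
        if result.passed:  # the agent grades itself, so this rarely fires early
            break
    return tokens

# Typed in ten seconds, billed all night:
print(f"tokens spent: {improve_codebase('port this repo to TypeScript'):,}")
```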

Users became API orchestration maestros, running a 24/7 code-conversion factory at Anthropic’s expense. The jump from chat to agents happened overnight, and consumption rose a thousandfold. A phase change, not a gradual slope.

So Anthropic canceled the unlimited plan. They could have tried charging $2,000/month, but the lesson isn’t that they didn’t charge enough; it’s that in this new world, no subscription can promise unlimited usage.

The crux: there is no workable flat subscription price anymore.

The math simply doesn’t close.

The Prisoner’s Dilemma for Everyone Else

This places all other companies in a conundrum.

Every AI company knows pay-per-use would save it. Every AI company also knows pay-per-use would kill it: while you responsibly charge $0.01 per 1K tokens, your VC-funded competitors offer unlimited usage for $20/month.

Guess where users will flock?

A typical prisoner’s dilemma:

  • Everyone pays by usage → Industry sustainability.
  • Everyone pays a fixed rate → Race towards bankruptcy.
  • You pay by usage while others pay a fixed rate → You die alone.
  • You pay a fixed rate while others pay by usage → You win (and then die later).

So everyone chooses “betrayal.” Everyone subsidizes heavy users. Everyone showcases “hockey-stick” growth curves. Ultimately, everyone issues “important pricing updates.”

Cursor, Lovable, Replit: they all understand this math. They chose growth today and margins someday, and if bankruptcy comes, well, that’s the next CEO’s problem.

Honestly? That might be the right call. In a land grab, market share is more vital than profit margins. As long as the VCs are willing to keep writing checks to cover the dismal unit economics…

Go ask Jasper what happens when the music stops.

How to Avoid Being “Forcibly Liquidated”?

Can we still avoid this “cost pressure” on tokens?

Recently, rumors circulated that Cognition is raising at a $15 billion valuation, while its publicly disclosed annual recurring revenue (ARR) is under $100 million (I suspect closer to $50 million). Contrast that with Cursor: a $10 billion valuation on $500 million of ARR. How does more than eight times the revenue earn only two-thirds of the valuation? Both are AI agents that write code. What do the VCs know about Cognition that we don’t? Has it found a way out of this death spiral? (I’ll dig into that next time.)

There are three pathways out:

1. Adopt Pay-Per-Use from Day One

No subsidies. No “acquire users first, monetize later.” Just an honest economic model. Sounds great in theory.

But here’s the problem: show me a consumer AI company that is growing explosively while charging by usage. Consumers loathe metered billing; they would rather overpay for unlimited than risk a surprise bill. Every successful consumer subscription, Netflix, Spotify, ChatGPT, is flat-rate. The moment you add a meter, growth dies.

2. High Switching Costs ⇒ High Profit Margins

This is where Devin is going all in. They recently announced partnerships with Citibank and Goldman Sachs to deploy Devin to 40,000 software engineers at each firm. At $20/month, that is roughly a $10 million-a-year deal per firm (40,000 × $20 × 12 ≈ $9.6M). But here’s the twist: would you rather book $10 million of ARR from Goldman Sachs or $500 million of ARR from individual developers?

The answer is obvious: six-month implementation cycles, compliance reviews, security audits, and tortuous procurement mean Goldman revenue is hard to win, but once won it is almost never lost. You only land a contract like that when a single decision-maker at the bank stakes their reputation on you, and from then on everyone involved will make the project succeed by any means necessary.

This is also why, apart from the largest cloud providers, the biggest software firms are the ones selling systems of record (CRM, ERP, EHR) to exactly these clients. They sustain 80-90% gross margins because the harder it is for a client to switch, the less price-sensitive that client becomes.

By the time competitors emerge, you are embedded in the client’s bureaucratic processes, and replacing you means another six-month sales cycle. It’s not that they can’t leave; it’s that the CFO would rather die than sit through another vendor evaluation.

3. Vertical Integration ⇒ Profiting off Infrastructure

This is Replit’s approach: bundle the coding agent with application hosting, database management, deployment, monitoring, logging, and the rest. Lose money on every token, but capture value at every other layer of the stack you hand to the next generation of developers… Just look at how deep Replit’s vertical integration runs.

Source mattppal

Treat AI as a loss leader that drives consumption of services competing with AWS. You’re not selling inference; you’re selling everything else, and inference is just a line in your marketing budget.

The brilliance is that code generation creates its own demand for hosting. Every application needs somewhere to run. Every database needs managing. Every deployment needs monitoring. Let OpenAI and Anthropic price-war inference down to zero margin while you own everything else.

Those still playing the “flat rate, growth at any cost” game? They’re the walking dead; their lavish funerals are simply scheduled for Q4.

The Road Ahead

I keep seeing founders clutch the phrase “models will be ten times cheaper next year!” like a lifeline. Sure, they will be. But your users’ expectations of those models will be twenty times higher. The goalposts are receding from you at breakneck speed.

Remember Windsurf? Squeezed between Cursor and its own margins, it never found a way out. Even Anthropic, owner of the most vertically integrated application layer on the planet, couldn’t make a flat-rate unlimited subscription work.

The takeaway of “Leveraged Beta Is All You Need”, that being early beats being smartest, still holds. But reckless haste just gets you to the graveyard first. There won’t always be a Google ready to write a $2.4 billion check for a negative-margin business, and there is no “we’ll figure it out later,” because “later” is when your AWS bill overtakes your total revenue.

So how do you build a business in this world? The short answer: become a “neocloud”—which will be the title of my next article.

But at least the models will be ten times cheaper next year, right?