<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[BuilderLab.ai]]></title><description><![CDATA[0→1→100 scaling, strategy, and No BS AI — for people who actually build.]]></description><link>https://www.builderlab.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!BoiJ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0e09a97-71ce-40b8-85cd-9976fe218d89_268x268.png</url><title>BuilderLab.ai</title><link>https://www.builderlab.ai</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 21:15:23 GMT</lastBuildDate><atom:link href="https://www.builderlab.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Pranav Pathak]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[buildersbriefingroom@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[buildersbriefingroom@substack.com]]></itunes:email><itunes:name><![CDATA[Pranav Pathak]]></itunes:name></itunes:owner><itunes:author><![CDATA[Pranav Pathak]]></itunes:author><googleplay:owner><![CDATA[buildersbriefingroom@substack.com]]></googleplay:owner><googleplay:email><![CDATA[buildersbriefingroom@substack.com]]></googleplay:email><googleplay:author><![CDATA[Pranav Pathak]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Pricing AI: What Actually Works]]></title><description><![CDATA[How to price AI products and features without destroying your margins or your customer relationships.]]></description><link>https://www.builderlab.ai/p/pricing-ai-what-actually-works</link><guid isPermaLink="false">https://www.builderlab.ai/p/pricing-ai-what-actually-works</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sun, 05 Apr 2026 05:01:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/33a2bf94-b820-4f97-9fca-39e4fcd81555_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>SaaS pricing was a solved problem for fifteen years. Charge per seat, per month. Predictable for the buyer, predictable for the seller, beautiful for the board deck. Gross margins north of 80%. Everyone was happy, and nobody had to think very hard.</p><p>AI broke the math.</p><p>Not the model, the math. The cost to serve a traditional SaaS user is close to zero. The cost to serve an AI user is not. Every inference call, every agentic loop, every long-context request costs real money. AI companies run gross margins of 50-60%, compared to 80-90% for traditional SaaS. And those are the <em>good</em> AI companies. Early-stage ones can run as low as 25%.</p><p>But here&#8217;s what most pricing analyses get wrong: the margin compression isn&#8217;t the hard problem. The hard problem is that AI inverts the relationship between product success and pricing power. When your AI agent does the work of five people, charging per seat penalizes you for the exact outcome your product delivers. If your product works spectacularly, the customer needs <em>fewer</em> of the thing you charge for.</p><p>That&#8217;s a business model crisis.</p><p>Seat-based pricing dropped from 21% to 15% of companies in just twelve months. 
Hybrid models surged from 27% to 41%. Credit-based models grew 126% year-over-year. The PricingSaaS 500 Index tracked over 1,800 pricing changes across the top 500 B2B and AI companies in 2025 alone. That&#8217;s 3.6 changes per company in a single year. The industry is running experiments in public, at scale, and nobody has converged on the answer.</p><p>This piece isn&#8217;t a survey of those experiments. There are good ones out there. This is a framework for thinking about the structural question underneath all of them: what is your product actually worth, who knows it, and can you measure it cleanly enough to charge for it?</p><div><hr></div><h2><strong>The Litmus Test That Matters More Than Any Framework</strong></h2><p>Before we get into models, one question separates companies that will get pricing right from companies that will iterate forever:</p><p><strong>If your product succeeds spectacularly, does the customer need fewer of the thing you charge for?</strong></p><p>If yes, you&#8217;ve picked the wrong metric. Full stop. It doesn&#8217;t matter how elegant your pricing page looks or how well your sales team can pitch it. You&#8217;re building a machine that destroys its own revenue as it improves. Every product improvement is a pricing headwind.</p><p>This is the single most important question in AI pricing, and most companies get it wrong because they inherit their pricing metric from whatever they charged for before they added AI. The seat was the unit because bodies were the input. AI changes the input. If you don&#8217;t change the unit, you&#8217;re optimizing a dashboard that&#8217;s lying to you.</p><div><hr></div><h2><strong>Four Models, and When Each One Kills You</strong></h2><h3><strong>Seats: The Bridge, Not the Destination</strong></h3><p>Per-seat pricing is comfortable. Buyers understand it. CFOs can budget it. Procurement can approve it. It worked for Salesforce, it worked for Slack, it worked for every enterprise SaaS company of the last decade.</p><p>The problem: AI inverts the logic. Per-seat pricing was magical for Salesforce because more sales reps drove more revenue. For an AI that writes the emails those reps used to write, charging per seat penalizes you for the efficiency gain you just delivered.</p><p>Microsoft Copilot charges $30 per user per month on top of an existing Microsoft 365 license. It works as a bridge because Microsoft has distribution and lock-in that makes the per-seat premium tolerable. Most companies don&#8217;t have that luxury.</p><p><strong>Use it when:</strong> AI is a lightweight enhancement to an existing workflow, usage is relatively uniform across users, and you&#8217;re adding AI features to a product that already charges per seat. Think: AI-assisted search, smart suggestions, grammar checking. The AI makes the seat more valuable without replacing the person in it.</p><p><strong>Kill it when:</strong> Your P90 user costs 10x your P50 user in compute. When that happens, you&#8217;re subsidizing power users with light users, and your margins are eroding invisibly. The thing nobody tells you about AI cost distributions is that they aren&#8217;t bell curves. They&#8217;re power laws. Your heaviest users don&#8217;t cost a little more. They cost orders of magnitude more. And you won&#8217;t see it in your averages until it&#8217;s already in your margin.</p><h3><strong>Usage-Based: Honest but Dangerous</strong></h3><p>Usage-based pricing aligns revenue with cost. That&#8217;s its strength and its trap. 
You charge for tokens, API calls, credits, compute minutes, whatever maps to the resource being consumed. By 2025, 85% of SaaS companies were using some form of usage-based pricing.</p><p>The appeal is real. It lets small customers start cheap and grow. It protects your margins because revenue scales with cost. It&#8217;s transparent.</p><p>But transparency cuts both ways, and in AI, the variance between a simple request and a complex one can be 100x. A user doesn&#8217;t know that their three-sentence prompt triggered a six-step agentic loop that burned through their monthly allocation.</p><p>Cursor is the cautionary tale everyone should study. In June 2025, Cursor switched from a flat 500-requests-per-month plan to a credit pool tied to API rates. The economics were sound. Frontier models were getting more expensive, and Cursor was eating the cost under flat pricing. They had reportedly reached over $500 million in ARR while spending 100% of revenue on AI costs. Something had to change.</p><p>But the rollout was a catastrophe. Users who expected predictable billing got surprise charges. Some burned through their monthly allocation in a few prompts. The community backlash was so severe that the CEO published a public apology and offered refunds.</p><p>The lesson isn&#8217;t that usage-based pricing is wrong. The lesson is that <em>unpredictable</em> usage-based pricing is lethal. If a customer&#8217;s bill is ever a surprise, you&#8217;ve already lost their trust. And in AI, where the customer can&#8217;t see the compute their request triggers, surprise is the default unless you build aggressively against it: usage dashboards, threshold alerts, spending caps, model-cost previews. These aren&#8217;t features. They&#8217;re survival infrastructure.</p><p><strong>Use it when:</strong> You&#8217;re selling API access, infrastructure, or developer tools where the buyer expects metered billing. Your cost scales linearly with consumption. Your customers are technical enough to understand and monitor usage.</p><p><strong>Kill it when:</strong> Your buyers are business users who budget monthly and hate surprises. Or when your product encourages the kind of exploration and experimentation that makes metered billing feel punitive. Nothing kills adoption faster than a user thinking, &#8220;should I really ask this question, or will it cost too much?&#8221;</p><h3><strong>Outcome-Based: The Ideal That Most Products Can&#8217;t Reach</strong></h3><p>Outcome-based pricing is where incentives align perfectly. The customer pays when the AI delivers something valuable. If the AI fails, the customer pays nothing. It sounds like the obvious future of AI pricing. It isn&#8217;t, for most companies. But in specific domains, it&#8217;s devastating.</p><p>Intercom&#8217;s Fin is the reference case. When Intercom launched its AI agent in 2023, they charged $0.99 per successful resolution instead of per seat. Not per conversation. Not per API call. Per resolved issue, confirmed by the customer. Fin now resolves over 56% of conversations autonomously and generates tens of millions in annual revenue. The per-resolution model doubled adoption rates compared to seat-based alternatives.</p><p>The reason Fin works on outcome-based pricing isn&#8217;t that Intercom is clever about billing. It&#8217;s that customer service has a structural property most AI use cases don&#8217;t: the outcome is binary, the attribution is clean, and the counterfactual is obvious. Either the ticket got resolved or it didn&#8217;t. 
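</p><p>That binary property makes the billing mechanics almost boring. A minimal sketch of per-resolution invoicing, with a hypothetical rate and a made-up <code>Ticket</code> record (Intercom&#8217;s real billing internals are not public):</p><pre><code>from dataclasses import dataclass

PRICE_PER_RESOLUTION = 0.99   # hypothetical rate, in dollars

@dataclass
class Ticket:
    ticket_id: str
    resolved_by_ai: bool       # confirmed by the end customer
    escalated_to_human: bool   # partial handling does not count

def monthly_invoice(tickets):
    """Bill only for outcomes: issues the AI resolved end to end."""
    billable = [t for t in tickets
                if t.resolved_by_ai and not t.escalated_to_human]
    return round(len(billable) * PRICE_PER_RESOLUTION, 2)</code></pre><p>The entire model is a filter and a multiplication. The hard part is earning a <code>resolved_by_ai</code> flag that both sides trust.</p><p>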
There&#8217;s no committee required to decide if value was delivered. No attribution model to argue over. No six-month lag between action and result.</p><p>Having worked in customer service automation, I can tell you this clarity is rare and precious. Advertiser retention (another area I have scars from) has the same structural property: either the churning advertiser came back or they didn&#8217;t. The line from AI action to business result is unmistakable. You can draw a box around the interaction, measure what went in, measure what came out, and price the delta.</p><p>That&#8217;s the prerequisite for outcome-based pricing: a binary outcome, clean attribution, and a value the customer already quantifies in their own business. Support resolutions. Completed transactions. Retained accounts. Verified lead qualifications. These domains share a structural shape that makes outcome-based pricing not just possible but natural.</p><p>Most AI products don&#8217;t operate in these domains. Most AI products operate where the outcome is soft, shared, or delayed. &#8220;Better decisions.&#8221; &#8220;Improved productivity.&#8221; &#8220;Faster workflows.&#8221; These are real, but you can&#8217;t put a price on them per instance because you can&#8217;t isolate the AI&#8217;s contribution from everything else that happened.</p><p>Salesforce learned this with Agentforce. They launched at $2 per conversation. Customers immediately started asking what counted as a &#8220;conversation.&#8221; Was a multi-turn exchange one conversation or several? What if the AI handled part and a human finished? The problem wasn&#8217;t the price. It was that &#8220;conversation&#8221; isn&#8217;t a clean outcome. It&#8217;s an activity metric dressed up as a value metric. Salesforce has since introduced three different pricing models in eighteen months: per-conversation, flex credits at $0.10 per action, and per-user licenses at $125 per month. They&#8217;re running all three simultaneously because they still haven&#8217;t found the answer.</p><p>Outcome-based pricing isn&#8217;t a maturity stage every company will eventually reach. It&#8217;s a structural fit. It depends on the measurability and attributability of the outcome in your specific domain. Kyle Poyar&#8217;s data shows that only about 5% of companies can actually pull it off. The other 95% aren&#8217;t behind. They&#8217;re in domains where the outcome can&#8217;t be cleanly isolated. And that&#8217;s fine, as long as they stop pretending they&#8217;re on a journey toward outcome-based pricing and start building the model that actually fits their product.</p><p><strong>Use it when:</strong> Your outcome is measurable, binary, and unambiguously attributable to your product. The value of that outcome should already have a dollar figure in the customer&#8217;s mind. If you have to convince them of the value, you&#8217;re not ready.</p><p><strong>Kill it when:</strong> Attribution is murky, outcomes are soft, or your model&#8217;s performance varies significantly. If your AI has bad weeks, your revenue has bad weeks. And if the customer has to <em>trust</em> that value was delivered rather than <em>see</em> it, you&#8217;re one bad quarter away from a churn event.</p><h3><strong>Hybrid: Where Most Companies Should Be (For Now)</strong></h3><p>Hybrid pricing combines a base subscription with usage or outcome tiers. 
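</p><p>The mechanics fit in a napkin function: a flat base fee, an included allocation, and a metered rate above it. A minimal sketch, every number hypothetical:</p><pre><code>BASE_FEE = 99.00           # dollars per month
INCLUDED_CREDITS = 5_000   # usage covered by the base fee
OVERAGE_RATE = 0.02        # dollars per credit beyond the allocation

def monthly_bill(credits_used):
    """Flat base subscription plus metered overage."""
    overage = max(0, credits_used - INCLUDED_CREDITS)
    return round(BASE_FEE + overage * OVERAGE_RATE, 2)

monthly_bill(3_200)    # light user: 99.00, fully covered by the base
monthly_bill(12_000)   # heavy user: 99.00 + 7,000 * 0.02 = 239.00</code></pre><p>If the design principle below holds, the first call is your common case and the second is your top quartile.</p><p>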
It&#8217;s the pragmatic middle ground: predictable for the buyer, margin-protective for the seller, and it lets you learn.</p><p>Over 60% of SaaS companies now use hybrid models. The structure is typically a monthly base fee that covers a defined level of usage, with overages or premium tiers beyond that.</p><p>The base fee gives the CFO a number to budget. The usage tier gives you upside when the customer scales. The customer feels safe starting, and you don&#8217;t subsidize power users forever.</p><p>The design principle matters: your base tier should cover 70-80% of your users comfortably. The usage tier should kick in for the top 20-30% who are getting disproportionate value. If most users hit the usage tier, your base is too low and you&#8217;re functionally usage-based with extra steps. If almost nobody hits it, your usage tier is a rounding error and you&#8217;re functionally subscription with an illusion of upside.</p><p>Anthropic&#8217;s own tiering is instructive. Free, Pro at $20, Max at $100, Team at $30 per seat. Each tier serves a genuinely different user with different behavior patterns. A casual user and a developer running Claude Code all day aren&#8217;t &#8220;light&#8221; and &#8220;heavy&#8221; versions of the same persona. They&#8217;re different products serving different jobs.</p><p><strong>Use it when:</strong> You&#8217;re uncertain about the right model. Hybrid lets you collect data on usage patterns, cost distributions, and willingness to pay while maintaining customer predictability. Start here, run it for 90 days, and let the data tell you where to evolve.</p><p><strong>Kill it when:</strong> Complexity is killing your sales cycle. If you can&#8217;t explain your pricing in one sentence, it&#8217;s too complicated. Every layer of complexity adds friction to the purchase decision, and friction compounds across the sales cycle in ways that don&#8217;t show up until your win rates start declining.</p><div><hr></div><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/p/pricing-ai-what-actually-works?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Enjoying this? Spread the word, it helps me keep writing!</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/p/pricing-ai-what-actually-works?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.builderlab.ai/p/pricing-ai-what-actually-works?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h2><strong>Why the Margin Problem Is Structural, Not Temporary</strong></h2><p>Before you pick a model, you need to accept something uncomfortable: AI margins are structurally lower than SaaS margins, and they&#8217;re not going back up.</p><p>Traditional SaaS has near-zero marginal cost per user. AI does not. Every inference call costs money. Every agentic loop burns tokens. Every long-context request hits your GPU bill. The industry runs 50-60% gross margins versus 80-90% for traditional SaaS.</p><p>This isn&#8217;t a temporary phase that better models or cheaper compute will fix. Even as inference costs decline, product complexity increases. 
Models get more capable, which means customers use them for harder tasks, which means more compute per request. The cost curve flattens but it doesn&#8217;t collapse. Plan for 50-65% gross margins for the foreseeable future.</p><p>Three things follow from this:</p><p><strong>Track your cost distribution, not your average.</strong> What does your P90 user cost versus your P10? The average is meaningless if your distribution is skewed, and in AI, it&#8217;s almost always skewed. The cost per user you measure in month one is also meaningless because your users haven&#8217;t discovered the expensive features yet. The real cost curve doesn&#8217;t stabilize for at least 90 days.</p><p><strong>Run the litmus test.</strong> Does product success reduce the thing you charge for? If yes, change the metric before you have 10,000 customers locked into the wrong model.</p><p><strong>Model your margins at scale.</strong> If the math doesn&#8217;t work at 10 customers, it won&#8217;t work at 10,000. AI costs don&#8217;t improve as much with scale as SaaS costs did. The per-unit economics of your hundredth customer look a lot like your tenth.</p><div><hr></div><h2><strong>Adding AI to a Product That Already Has Pricing</strong></h2><p>Most of us aren&#8217;t building AI-native companies from scratch. We&#8217;re adding AI features to products that already have pricing, customers, and expectations. This is harder than greenfield pricing because you have constraints and relationships to protect.</p><p>Four options, each with a sharp edge:</p><p><strong>Absorb it.</strong> Bundle the AI feature into your existing plan at no extra charge. Google did this with Workspace. The bet: AI makes the core product stickier, reduces churn, and justifies future price increases. This works when the AI feature is lightweight, broadly used, and compute cost per user is low enough to absorb. It doesn&#8217;t work when usage varies widely. You&#8217;ll subsidize power users with light users and watch your gross margins compress without knowing why.</p><p><strong>Add-on.</strong> Sell the AI feature as a separate line item. Notion charges $8 per member per month for its AI add-on. Microsoft charges $30 per user for Copilot on top of existing licenses. This works when the AI feature is clearly differentiated and not every user needs it. It doesn&#8217;t work when the add-on feels like a tax on features that should have been included. Watch your NPS after launching an AI add-on. If it drops, you&#8217;ve mispriced the relationship, not the feature.</p><p><strong>New tier.</strong> Create a new plan tier that includes AI features. This works when AI features naturally cluster with other premium capabilities, and you can draw a line between tiers that feels like a product boundary rather than an arbitrary usage cutoff. It doesn&#8217;t work when the only difference between tiers is &#8220;more AI.&#8221; If your tiers are just usage gates with different labels, customers will see through it.</p><p><strong>Usage gate.</strong> Give everyone access, meter it, and charge for heavy usage. The freemium-to-usage pipeline. This works when you want maximum adoption and your free-to-paid conversion is healthy (above 2-3%). It doesn&#8217;t work when your free tier is too generous. OpenAI reportedly burned $8 billion annually on compute in 2025 while serving 900 million weekly free users. Most companies can&#8217;t afford that bet. 
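</p><p>The threshold isn&#8217;t arbitrary; it falls out of unit economics you can write in five lines. A back-of-envelope sketch, with every input hypothetical:</p><pre><code>def free_tier_sustainable(conversion_rate,
                          margin_per_paid_user=240.0,  # hypothetical annual gross margin per convert
                          compute_per_free_user=6.0):  # hypothetical annual inference cost per free user
    """Do the users who convert pay for the free users who never will?"""
    expected_margin = conversion_rate * margin_per_paid_user
    return expected_margin &gt;= compute_per_free_user

free_tier_sustainable(0.030)   # True: 3% conversion carries the free tier
free_tier_sustainable(0.015)   # False: with these inputs, break-even is 2.5%</code></pre><p>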
If conversion is below 2%, you&#8217;re running a charity.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">I write frameworks you can use Monday morning, insights on building with AI, and deep dives into the most successful AI companies.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>Five Things That Kill AI Pricing</strong></h2><p><strong>Pricing on cost instead of value.</strong> Your inference cost is your problem, not your customer&#8217;s. Customers don&#8217;t care what a token costs. They care what the outcome is worth. A support resolution worth $15 to the customer should be priced on that value, not on the $0.03 it cost you in compute. Cost sets your floor. Value sets your price. Most companies confuse the two.</p><p><strong>Inheriting the old metric.</strong> Per-seat pricing was designed for a world where marginal cost per user was near zero and more users meant more value created. AI products don&#8217;t live in that world. The number of people logging in tells you nothing about the value being delivered or the cost being incurred. Using the old metric because it&#8217;s familiar is how you wake up to negative margins.</p><p><strong>Surprise bills.</strong> The Cursor debacle came down to one failure: customers couldn&#8217;t predict their spend. In AI, where a single complex request can trigger a chain of expensive operations, prediction is hard. Build the transparency infrastructure before you need it. Usage dashboards, threshold alerts, spending caps. If you wait until customers complain, you&#8217;ve already lost them.</p><p><strong>Treating pricing as a launch decision.</strong> The market tracked 3.6 pricing changes per company in 2025. Credit-based models grew 126% year-over-year. The best companies are iterating quarterly. If you&#8217;re revisiting pricing once a year, you&#8217;re falling behind the companies that are treating pricing as a continuous experiment.</p><p><strong>Soft ROI positioning.</strong> Copilots that offer advice without closing the loop live in dangerous territory. &#8220;Are we really getting value?&#8221; is the question that kills renewals. As AI pilots from 2025 hit their renewal cycles, pricing must reflect actual value delivered, not potential or promise. Measure outcomes even when you don&#8217;t price on them. Build the dashboards, establish baselines, create feedback loops. This builds the trust that sustains pricing power. And when you&#8217;re ready to move to outcome-based pricing (if your domain supports it), you&#8217;ll have the data to make the case.</p><div><hr></div><h2><strong>The Agentic Frontier</strong></h2><p>Everything above applies to AI features and copilots. Agents are a different animal.</p><p>An AI agent doesn&#8217;t assist a user. It replaces a workflow. It resolves the ticket, books the meeting, drafts the contract, processes the claim. Autonomously. Agents don&#8217;t log in. They don&#8217;t hold licenses. 
They can complete thousands of tasks in the time a human completes one.</p><p>Charging per-seat for an agent is like charging per-parking-space for a self-driving car fleet. The unit of value isn&#8217;t the seat. It&#8217;s the work.</p><p>This is why &#8220;digital labor&#8221; pricing is emerging: per action, per resolution, per completed workflow. Salesforce now lets enterprises convert user licenses into flex credits and back, treating human and AI labor as interchangeable budget lines.</p><p>The market hasn&#8217;t converged on the right model for agents, and it won&#8217;t for at least another 12-18 months. If you&#8217;re building agents, price on the work they do, not on access to them. Apply the structural test: is the outcome of the work binary and measurable? If yes, you have a real shot at outcome-based pricing. If not, start with hybrid and instrument everything so you can evolve.</p><p>Nobody has this figured out. Not Salesforce, not Intercom, not your competitors. The companies that learn fastest will win.</p><div><hr></div><h2><strong>The Decision Sequence</strong></h2><p>If you take one thing from this piece, make it this sequence:</p><p><strong>First:</strong> Know your cost structure. Not the average. The distribution. What does your P90 user cost? If you don&#8217;t know, stop and go find out. You cannot price what you cannot measure.</p><p><strong>Second:</strong> Identify the value metric your customer already uses. Not what you want to charge for. What they believe they&#8217;re buying. If they&#8217;re buying resolutions, don&#8217;t charge for tokens. If they&#8217;re buying productivity, don&#8217;t charge for seats.</p><p><strong>Third:</strong> Run the litmus test. If your product succeeds, does the customer need fewer of the thing you charge for? If yes, you&#8217;ve picked the wrong metric.</p><p><strong>Fourth:</strong> Match the model to the buyer. Enterprise CFOs want predictable line items. Developers expect metered billing. SMB founders want simple monthly fees. Your pricing model must fit how your buyer purchases, not how your product works internally.</p><p><strong>Fifth:</strong> Start hybrid if uncertain. Base subscription plus usage tiers. Collect data for 90 days. Then optimize. Don&#8217;t try to be clever with pricing before you have usage data. Cleverness without data is just guessing with confidence.</p><div><hr></div><h2><strong>The Bottom Line</strong></h2><p>AI pricing forces you to confront things SaaS pricing let you ignore: real marginal costs, wildly variable usage patterns, and a value metric that might not be the one you inherited.</p><p>The industry is in the messy middle of hybrid models, quarterly iterations, and learning in public. That&#8217;s fine. The point isn&#8217;t to pick the perfect model. The point is to understand the structural question your product sits on top of: Is the outcome binary? Is attribution clean? Does the customer already quantify the value?</p><p>If yes to all three, price on outcomes. You&#8217;re in the 5% with a structural advantage.</p><p>If no, start hybrid, instrument everything, measure outcomes even if you don&#8217;t price on them, and iterate faster than your competitors.</p><p>The companies winning at pricing right now aren&#8217;t the ones who picked the perfect model on day one. 
They&#8217;re the ones who understood the structure of their problem, picked a starting point, and learned faster than everyone else.</p><p>ROI or die.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe for practical strategy on pricing, shipping, and scaling AI products.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Builder Bubble]]></title><description><![CDATA[Why the Next 2-3 Years Are the Biggest Asymmetric Bet of Your Career]]></description><link>https://www.builderlab.ai/p/the-builder-bubble</link><guid isPermaLink="false">https://www.builderlab.ai/p/the-builder-bubble</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sun, 29 Mar 2026 09:56:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/828af10e-a125-4121-87ea-e94ce78a8c8d_960x540.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you read BuilderLab, you are probably in a bubble. The bubble is your advantage.</p><p>You (or your team) shipped a feature last week that would have taken a team of three a month to build two years ago. You did it in a day, mostly by talking to an AI. And the strangest part was not the speed. It was that nobody around you seemed to notice how abnormal that is.</p><p>This is the feeling that builders have right now, and struggle to articulate. Something has shifted. Quietly. Practically. Daily. You can do things that you could not do before. Lots of things. Fast. And when you look up from your screen and talk to people outside your bubble, you realize they have no idea.</p><p>16.3% of the world&#8217;s working-age adults have used an AI tool. Of the people who use AI at work, only 5% use it in ways that actually transform their output. Meanwhile, builders in the 95th percentile of AI adoption are producing 6-17x more than the median worker using the same tools.</p><p>Completely different realities.</p><p>The conversation about AI is stuck on whether it will replace jobs. That question is years premature. The real story is simpler: a small group of people is building at a pace the general population cannot comprehend.</p><h1>The Ability Bubble</h1><p>Call it what it is: an ability bubble. A temporary period where a specific cohort of builders can do dramatically more than everyone else, because of tool adoption and workflow design.</p><p>Solo founders are shipping products that used to require teams of ten. A developer with Claude Code or Cursor is writing in a day what used to take a sprint. A product manager who never wrote code is deploying production software. 
Structural shifts in what one person can accomplish.</p><p>And the people doing this are pulling further ahead every week.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">I write about building with AI every week. Strategy, tools, frameworks, and the occasional uncomfortable truth.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Why the Bubble Expands (Not Contracts)</h2><p>Your intuition says this should not last. Early adopter advantages shrink as tools go mainstream. That is how it worked with email, spreadsheets, smartphones. Everyone catches up. The advantage normalizes.</p><p>AI does not work this way. Three forces keep the bubble expanding.</p><p><strong>AI skills compound.</strong> The distance between using ChatGPT for summarization and orchestrating multi-agent workflows across research, code, and deployment is not a weekend workshop. It is months of daily iteration. Every hour you spend with these tools teaches you something that makes the next hour more productive. The builder who started in 2024 has two years of compounded intuition that cannot be shortcut. This is like learning to code: the skill itself creates the capacity to learn more of the skill.</p><p><strong>The rest of the world faces structural barriers.</strong> The 55-year-old operations manager at a mid-market company has access to ChatGPT. What they lack is the workflow context, the prompt intuition, the tool-chaining mental models, and the organizational permission to rebuild how they work. Only 35% of enterprises have a mature AI upskilling program. The training infrastructure itself is the bottleneck.</p><p><strong>Builder tools are accelerating faster than consumer tools.</strong> The tools getting dramatically better every quarter are Claude Code, Cursor, Windsurf, Devin, Replit Agent. Coding assistants now write 41% of all code. 73% of engineering teams use AI coding tools daily, up from 18% two years ago. Every improvement makes builders more productive while doing nothing for the person who uses AI to rewrite emails.</p><p>Tool velocity times adoption intensity. Builders have both.</p><h1>The 2-3 Year Window</h1><p>The gap will close eventually. But not soon.</p><p>AI interfaces will become invisible, embedded into every tool as default behavior rather than a separate thing you choose to use. Most enterprise software still treats AI as an add-on. That transition takes 2-4 years across the long tail. Education will catch up, but most corporate L&amp;D programs are still running 2023-era prompt workshops. Organizational culture will shift from permission mode, where employees need approval to use AI, to expectation mode, where not using it requires justification. But culture change in large organizations is measured in years.</p><p>Add these timelines together: 2-3 years where builders operate at a fundamentally different level than everyone else.</p><p>After that, the advantage does not disappear. It normalizes. 
The edge shifts from &#8220;I can do what you cannot&#8221; to &#8220;I can do it slightly faster.&#8221; Slightly faster is a far less valuable position than categorically different.</p><h1>The Three Psychological Barriers Nobody Talks About</h1><p>The data says the bubble is real. The structural forces say it is expanding. The timeline says you have 2-3 years. So why are so many people inside the bubble still not building? Why are you not building - and shipping? And not monetising?</p><p>Because the biggest barriers are not technical. They are psychological. And they hit differently depending on which side of the code divide you sit on.</p><p><strong>Barrier 1: &#8220;I&#8217;m not a real developer.&#8221;</strong></p><p>If you do not code for a living, building with AI feels like trespassing. You are a product manager, a designer, a marketer, a founder with a business background. And here you are, shipping actual software. It feels like an outside-in move. Like you snuck into someone else&#8217;s domain through a side door that was not supposed to be open.</p><p>This is the primary reservation for most people who have not shipped yet. Not that they cannot figure out the tools. That they feel like they should not be allowed to. That somewhere there is a credential they are missing, a gatekeeping exam they never sat, a permission they were never granted.</p><p>I know this because it happened to me. The first time I shipped something built with AI coding tools, the dominant feeling was not pride. It was a quiet suspicion that I was getting away with something. That a &#8220;real&#8221; developer would look at what I built and see through it.</p><p>Here is the truth: nobody is checking your credentials at the deploy button. The user does not care whether you have a CS degree. The customer does not care whether you wrote the code by hand or with Claude. They care whether the product solves their problem. That is the only qualification that has ever mattered, and AI just made it accessible to a much larger group of people.</p><p>The mindset shift is simple but hard: you are not trespassing. The door is open because the walls came down.</p><p><strong>Barrier 2: &#8220;I&#8217;m cheating.&#8221;</strong></p><p>If you do code for a living, the barrier is the opposite. Using AI feels like cutting corners. You built your career on craftsmanship: understanding systems deeply, writing clean code, catching edge cases, knowing why something works and not just that it works. Now an AI writes 80% of the code and you are left wondering what you missed.</p><p>Obsessing over details becomes a pattern. Did the AI introduce a subtle bug? Is there a security hole I did not catch? Am I still a good engineer if I did not write this myself?</p><p>This is an identity crisis disguised as quality control. The anxiety is not really about bugs. It is about what it means to be a developer when the core act of writing code is no longer the bottleneck. The answer is that your value was never in the typing. It was in the judgment: knowing what to build, how to architect it, where the failure modes live, and when to override the machine. AI makes that judgment more valuable, not less.</p><p><strong>Barrier 3: &#8220;If I can build this in a week, it must be worthless.&#8221;</strong></p><p>This one is the most insidious, because it feels like logic. You built something in a week with AI tools. Your immediate thought: if it was that easy for me, anyone can do it. 
So why would anyone pay for it?</p><p>This is where the bubble shows up as a psychological blind spot. You are inside the bubble. You have the tool fluency, the workflow intuition, the domain knowledge, and the prompt engineering instincts that make a week-long build feel effortless. You forget that most people are still struggling with barriers one and two, if they have gotten that far at all. Many have not even started. And even among those who have, most lack the domain depth to know what is worth building in the first place.</p><p>The ease of your build is not evidence that the product is worthless. It is evidence that you are inside the bubble. The 16.3% number is your answer: the vast majority of the working world cannot do what you just did. Not yet. Not for another 2-3 years. And in that window, the thing you built in a week has real, defensible value to the people outside the bubble who need it.</p><p>Stop discounting your output by the effort it took. Start pricing it by the problem it solves.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/p/the-builder-bubble?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Know a builder who needs to hear this? Share this article with them. The barriers are quieter than the hype, and most people fight them alone.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/p/the-builder-bubble?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.builderlab.ai/p/the-builder-bubble?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h1>How to Capitalize: Five Moves for Builders</h1><p>The question is not whether you have an advantage. If you are reading this, you almost certainly do. The question is whether you are extracting maximum value from a temporary asymmetry. Most builders are not. They are using AI to do their existing job 30% faster instead of using AI to build things that were previously impossible.</p><p>Here are five moves that separate builders who capitalize from builders who coast.</p><p><strong>Move 1: Build products, not just productivity.</strong></p><p>The most common mistake: using AI exclusively to speed up your current work. Faster emails. Faster code reviews. Faster slide decks. This is real value, and you should capture it. But the asymmetric opportunity is somewhere else.</p><p>The asymmetric opportunity is building new things. Solo-founded startups surged from 23.7% of all new companies in 2019 to 36.3% by mid-2025. A solo founder with the right AI stack can now operate at a level that previously required a team of five to ten. Operating margins of 60-80% on a tool stack that costs $100-500 per month.</p><p>Danny Postma runs HeadshotPro solo at $3.6 million ARR. Maor Shlomo built Base44 to 250,000 users and sold it to Wix for $80 million within six months. These are not outliers of talent. They are outliers of timing and tool adoption.</p><p>You do not need to quit your job. But you should be building something on the side. A product, a tool, a service, a system. 
Something that converts your AI fluency into an asset that exists outside your employer&#8217;s org chart.</p><p><strong>Move 2: Invest in workflow design, not just tool selection.</strong></p><p>The 6x productivity gap comes from workflow architecture, not tool choice. The power users and the median employees are often using the same tools. The difference is how tasks are decomposed, how AI is chained across steps, how outputs from one model become inputs to another, and how human judgment is inserted at the right checkpoints.</p><p>The real question is: &#8220;how do I redesign this entire process so that AI handles 80% and I handle the 20% that requires judgment?&#8221;</p><p>The builders who compound fastest are the ones who spend time designing workflows, not just executing tasks. Every workflow you build is a template you can reuse, share, sell, or license. Your workflow library is your moat.</p><p><strong>Move 3: Ship publicly and build reputation now.</strong></p><p>In 2-3 years, everyone will claim AI expertise. Right now, the proof is thin. The people who are visibly building, shipping, and sharing their work in public during this window will own the credibility when the market matures.</p><p>Write about what you are building. Show the process, not just the output. Document your failures and your iterations. The builders who establish authority during the bubble will be the ones hired, funded, and followed when the rest of the market catches up.</p><p>This is about converting a temporary skill advantage into a durable reputation asset.</p><p><strong>Move 4: Build your data and distribution moats early.</strong></p><p>AI tools are commoditizing. The models get cheaper every quarter. The interfaces get easier. The competitive advantage lives in proprietary data that makes AI more valuable and distribution channels that give your AI-powered products an audience.</p><p>Every product you build should be designed to collect unique, defensible data with every use. Every workflow you automate should connect to an audience or customer base you own. The AI advantage fades. The data and distribution advantages do not.</p><p>This is the difference between someone who used AI to build a tool and someone who used AI to build a business. The tool can be copied. The business cannot.</p><p><strong>Move 5: Hire or partner to fill your gaps, while the talent is underpriced.</strong></p><p>Here is a counterintuitive implication of the bubble: the people who have AI building skills are in an expanding bubble, but the market has not fully priced this in yet. The 56% wage premium is growing but still lags the actual productivity differential. A developer who can build AI-native workflows is generating 6-17x more output, but they are not being paid 6-17x more.</p><p>If you are hiring, this is the window to lock in AI-fluent talent before the market fully prices the skill. If you are a solo builder, this is the window to partner with other builders who complement your gaps, before the talent pool gets priced up to match the productivity premium.</p><h1>The Uncomfortable Truth</h1><p>There is one more thing that does not get said enough in AI commentary, because it sounds elitist and uncomfortable.</p><p>The bubble is also a compounding inequality.</p><p>The people who benefit most from AI tools are the people who were already high-performers: educated, technically literate, in knowledge-work roles, in high-income countries, at companies that provide access and permission. 
Microsoft&#8217;s data shows it clearly: AI usage is concentrated among younger, educated, higher-income workers in the Global North.</p><p>Worth acknowledging, because ignoring it leads to bad strategy. If you are in this cohort, your advantage is real but not earned purely through merit. It is partially a function of circumstance. And the ethical response to that is not guilt. It is urgency. Use the advantage while it exists. Build things that create value beyond your personal gain. And recognize that the window where this kind of asymmetry is possible is historically rare and structurally temporary.</p><h1>The Close</h1><p>Let me return to where we started. Two populations, one timeline.</p><p>16.3% of the world&#8217;s working-age adults have used AI. 5% of employees are maximizing it. And inside that 5%, there is a smaller group still: the people who are not just using AI to be more productive, but building with it. Creating products, systems, workflows, and businesses that did not exist two years ago.</p><p>That group is in a bubble. A bubble of capability that the rest of the world has not caught up to yet, and will not catch up to for another 2-3 years.</p><p>The question is not whether you are in the bubble. The question is what you are building while you are inside it.</p><p>The tools are temporary. The models will change. The interfaces will evolve. But the products you ship, the reputation you build, the data you collect, and the distribution you own will outlast the window.</p><p>Build now. The bubble is real. And it will not last forever.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>If this piece shifted how you think about the next 2-3 years, <a href="https://www.builderlab.ai/subscribe">subscribe</a> so you don&#8217;t miss what comes next.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Most AI UX Is Lazy]]></title><description><![CDATA[And why we love to hate chatbots]]></description><link>https://www.builderlab.ai/p/most-ai-ux-is-lazy</link><guid isPermaLink="false">https://www.builderlab.ai/p/most-ai-ux-is-lazy</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sun, 08 Mar 2026 09:34:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2498cde6-09f7-4693-8f67-3d8241f10354_360x360.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every AI feature shipped in the last two years looks the same.</p><p>A chat box, a loading spinner, a disclaimer at the bottom, and a text field that says &#8220;Ask me anything.&#8221;</p><p>The problem is not the AI. The problem is that product builders don&#8217;t ask the right question. They ask &#8220;how do I add AI to our product?&#8221; They should be asking: &#8220;What user problem can I solve now that was previously unsolvable?&#8221;</p><p>When you skip that question, you get the default: a chatbot.
When you answer it, you get something much better. My most successful AI launches were not chat interfaces. Some of them were text fields that turn into checkboxes. Some were descriptions and summaries. Others were just high-performing classifiers. More on that later.</p><p>Lazy AI UX comes from lazy AI thinking. Teams start with the interface instead of the capability.</p><h3><strong>The Four Lazy Defaults</strong></h3><p><strong>The Chatbot.</strong> Everyone&#8217;s first instinct. &#8220;Chat with your data.&#8221; Notion, Salesforce, HubSpot, Shopify. They all shipped it as their flagship AI feature. Chat shifts cognitive load to the user. &#8220;Figure out what to ask&#8221; is work you&#8217;ve offloaded onto the user. Your product should already know what the user needs. Chat is one AI primitive forced onto every problem regardless of fit. When your travel app, your CRM, your project management tool, and your email client all have the same chat interface, something has gone wrong with the thinking.</p><p><strong>The Magic Wand Button.</strong> &#8220;AI Generate&#8221; dropped next to every text field. No context about what it will generate. No way to guide it. No refinement loop. Click, get a blob of text, accept or regenerate. This is another single primitive (generation) with zero UX thought applied to it. The lottery machine approach to product design. Pull the lever, see what comes out, pull it again if you don&#8217;t like it.</p><p><strong>The Loading Spinner of Trust Destruction.</strong> Eight-second wait. No progress indication. Then streaming text appears character by character. Impressive in a demo, anxiety-inducing in production. Users cannot verify something that is still being written. They sit there watching words appear, not knowing whether to trust what they are reading or wait for the next sentence to contradict it. The latency is not a performance problem. It is a trust problem.</p><p><strong>The Disclaimer.</strong> &#8220;AI-generated content may contain errors.&#8221; Plastered at the bottom of every output as a substitute for designing verification into the experience. If your UX strategy for handling AI errors is a legal footnote, you did not design UX. You designed liability coverage.</p><h3><strong>Why Teams Default to Lazy</strong></h3><p>The problem is structural. These are not bad designers making bad decisions. These are reasonable people responding to bad incentives.</p><p><strong>Demo-driven development.</strong> Chat demos beautifully in board meetings. Executive types a question. AI answers. Room applauds. The demo becomes the product strategy. But what demos well and what gets adopted are different things. A chatbot that impresses a VP in a conference room can still get low adoption from actual users who have faster ways to get the same information.</p><p><strong>Model-out thinking instead of user-back thinking.</strong> &#8220;We have GPT-4, what can we do with it?&#8221; instead of &#8220;Users struggle with X, can AI help?&#8221; The capability drives the UX instead of the problem driving the capability. This is the root of lazy AI UX. Starting with the tool instead of the user.</p><p><strong>The API makes chat easy and everything else hard.</strong> The default API interface is messages in, text out. Chat is the path of least engineering resistance. Structured output, proactive surfacing, inline suggestions: all harder to build.
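</p><p>Harder, but not prohibitively hard. A minimal sketch of the structured-output path, free text in, machine-readable filters out; <code>call_llm</code> and the schema here are stand-ins, not any specific vendor&#8217;s API:</p><pre><code>import json

FILTER_SCHEMA = '{"status": "open|closed", "min_value_usd": number, "close_by": "YYYY-MM-DD"}'

def text_to_filters(query, call_llm):
    """Free text in, the UI's existing structured filters out.
    `call_llm` stands in for whatever completion client you
    already use: prompt string in, model text out. A production
    version would validate the parse instead of trusting it."""
    prompt = ("Map this CRM search onto the schema "
              + FILTER_SCHEMA + " and return only JSON.\nSearch: " + query)
    return json.loads(call_llm(prompt))

# text_to_filters("open deals over 50k closing this quarter", llm)
# yields {"status": "open", "min_value_usd": 50000, "close_by": "2026-03-31"}</code></pre><p>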
So teams ship the easy thing and call it v1.</p><p><strong>No established patterns, so teams copy each other.</strong> When Notion shipped a chatbot, everyone shipped a chatbot. The industry is cargo-culting UX from whoever shipped first, regardless of whether it worked. Copying someone else&#8217;s lazy default does not make it less lazy.</p><h3><strong>The Right Question: What Can You Do Now That You Could Not Before?</strong></h3><p>AI gives you a set of primitives. Things that were previously impossible or prohibitively expensive to do at scale. So the question to ask is: you have a capability now that you did not have before. What problems that were previously unsolvable have now become solvable?</p><p>Finding the answer to this question requires a deep understanding of the capabilities the tech unlocks. I&#8217;m going to lay out my mental model here. Yours may be slightly different, but as long as the broad-strokes capabilities converge, you have a tool set that you can use to build a foundation on.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">BuilderLab publishes practical strategy for people who ship AI products. Frameworks, teardowns, and real examples from products in production.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3><strong>The LLM Primitives</strong></h3><h4><strong>1. Intent Detection</strong></h4><p><strong>What it unlocks:</strong> understanding what users mean, not just what they type. Interpreting unstructured input and mapping it to structured actions at scale. This was previously impossible without building thousands of hand-coded rules that broke every time someone phrased something differently.</p><p><strong>Example:</strong> Booking.com Smart Filters. User types &#8220;hotels with pool and gym.&#8221; The AI maps that to existing structured filters: Swimming Pool and Fitness Center checkboxes, pre-selected. Property type set to Hotels. 42 results. The AI detected intent and translated it to structured action. No chat. No generated text. Full story below.</p><h4><strong>2. Generation</strong></h4><p><strong>What it unlocks:</strong> creating content that did not exist, at near-zero marginal cost. First drafts, suggestions, completions, variations. Previously, every piece of content required a human to write it from scratch.</p><p><strong>Example:</strong> inline drafts with refinement loops. GitHub Copilot suggests a line of code. Tab to accept. Keep typing to reject. The suggestion disappears with zero friction. Cursor takes it further: it shows diffs you can accept or reject line by line. The output is not text to read. It is a decision to make. The user stays in control throughout.</p><h4><strong>3. Summarisation</strong></h4><p><strong>What it unlocks:</strong> compressing large volumes of information so humans do not have to process all of it.
Distilling hundreds of pages, thousands of reviews, or hours of conversation into the parts that matter.</p><p><strong>Example:</strong> contextual highlights and progressive disclosure on a set of reviews or user generated content. Surface the three things that matter. Let the user drill into detail on the ones they care about. The AI does not just compress information. It prioritises it, based on what the user is trying to do right now. A summary of a 50-page legal contract should look different for the CFO than for the head of legal.</p><h4><strong>4. Classification</strong></h4><p><strong>What it unlocks:</strong> sorting, routing, labelling, and triaging automatically. Categorising unstructured content accurately without hand-built rules for every scenario.</p><p><strong>Example:</strong> Linear&#8217;s Triage Intelligence. When a new issue comes in, the AI suggests a team, assignee, labels, and related issues. The suggestions are visible. Hover over them to see the reasoning. Accept with one click or override. The interesting thing about classification is that the lazy default is not a bad interface. It is no interface at all. Teams use classification behind the scenes but never surface it as something the user can see, trust, and refine. That is a missed opportunity. Making classification visible turns it into a feature users value.</p><h4><strong>5. Extraction</strong></h4><p><strong>What it unlocks:</strong> pulling structured data from unstructured input. Reliably extracting specific fields from messy documents, images, or conversations at scale.</p><p><strong>Example:</strong> auto-populated fields from uploaded documents. Upload an invoice, get line items extracted and pre-filled. Upload a resume, get candidate fields populated. The user reviews and corrects. They are editors, not transcribers. The shift from &#8220;type everything&#8221; to &#8220;verify and fix&#8221; is enormous. It turns a ten-minute task into a thirty-second review.</p><h4><strong>6. Memory</strong></h4><p><strong>What it unlocks</strong>: accumulating context over time and using it to make every other primitive smarter. A system with memory knows that this user always filters for pet-friendly hotels. It knows that this team routes billing bugs to the payments squad. It knows that this reader skips the methodology section.</p><p>Memory is the enabling layer. Intent detection with memory gets better at understanding what this specific user means. Classification with memory routes more accurately based on how similar items were handled before. Summarisation with memory knows what this person cares about and prioritises accordingly.</p><p><strong>Example</strong>: a support agent that already knows your order history, your past issues, your account tier. It does not ask you to explain your problem from scratch. It picks up where the last interaction left off. Every interaction makes the next one faster.</p><p>Without memory, every AI interaction starts from zero. With it, the product compounds. This is also why memory is the hardest primitive to copy. Your competitor can use the same model. They cannot replicate what your product has learned about your users.</p><p><em>These six primitives naturally combine.</em> Intent detection figures out what the user wants. Memory recalls what it already knows about them. Classification routes the request. Extraction pulls the relevant data. Summarisation compresses it. Generation produces the response. Orchestrate all six well and you get a conversation. 
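</p><p>The composition is literal. As a sketch, with every function a stand-in for a model call:</p><pre><code>def handle_turn(user_input, user_id):
    """One conversational turn chaining all six primitives.
    Every callee here is a placeholder, not a real API."""
    intent  = detect_intent(user_input)          # 1. what they want
    context = recall_memory(user_id, intent)     # 6. what we already know
    route   = classify(intent, context)          # 4. where it should go
    data    = extract(user_input, route)         # 5. the structured bits
    brief   = summarise(context, data)           # 3. only what matters
    return generate(intent, brief)               # 2. the reply itself</code></pre><p>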
<h4><strong>6. Memory</strong></h4><p><strong>What it unlocks</strong>: accumulating context over time and using it to make every other primitive smarter. A system with memory knows that this user always filters for pet-friendly hotels. It knows that this team routes billing bugs to the payments squad. It knows that this reader skips the methodology section.</p><p>Memory is the enabling layer. Intent detection with memory gets better at understanding what this specific user means. Classification with memory routes more accurately based on how similar items were handled before. Summarisation with memory knows what this person cares about and prioritises accordingly.</p><p><strong>Example</strong>: a support agent that already knows your order history, your past issues, your account tier. It does not ask you to explain your problem from scratch. It picks up where the last interaction left off. Every interaction makes the next one faster.</p><p>Without memory, every AI interaction starts from zero. With it, the product compounds. This is also why memory is the hardest primitive to copy. Your competitor can use the same model. They cannot replicate what your product has learned about your users.</p><p><em>These six primitives naturally combine.</em> Intent detection figures out what the user wants. Memory recalls what it already knows about them. Classification routes the request. Extraction pulls the relevant data. Summarisation compresses it. Generation produces the response. Orchestrate all six well and you get a conversation. That is why chat makes for impressive demos. It exercises every primitive at once. It is also why chat is the laziest default. You are using all six primitives to do a job that one or two of them could handle better.</p><h1><strong>The Agent Primitives</strong></h1><p>These are emerging. The UX patterns are still being invented.</p><h4><strong>1. Planning and Decomposition</strong></h4><p><strong>What it unlocks:</strong> the user states an outcome. The AI figures out how to get there. Breaking an ambiguous goal into concrete, executable steps.</p><p><strong>The lazy version:</strong> an agent that narrates its chain-of-thought in a chat window. &#8220;First, I will search for... Now I am analysing... Let me think about this...&#8221; The user watches the AI think out loud, step by step, for ninety seconds. This is performative reasoning. It looks impressive. It wastes the user&#8217;s time.</p><p><strong>The good version:</strong> a proposed plan the user can review, edit, and approve before execution starts. Show the steps. Let the user reorder, remove, or add to them. Then execute. The user wants to approve the plan, not watch the thinking.</p><h4><strong>2. Autonomous Execution</strong></h4><p><strong>What it unlocks:</strong> completing multi-step tasks without human involvement at every step. The AI does the work, not just the thinking.</p><p><strong>The lazy version:</strong> a chat-based agent that asks permission at every single step. &#8220;Should I do this? Okay. Should I do this next? Okay.&#8221; This turns autonomy into a slower version of doing it yourself. The whole point of an agent is that it handles the steps. If you have to approve each one, you are just operating a very slow remote control.</p><p><strong>The good version:</strong> background execution with confirmation checkpoints at meaningful decision points. The agent does the work. You approve the result. Not every intermediate step. Just the ones that matter.</p>
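<p>As control flow, the difference between the lazy version and the good version is one line: where the approval sits. A minimal sketch, with <code>propose_steps</code>, <code>review_plan</code>, and <code>execute_step</code> as hypothetical stand-ins:</p><pre><code class="language-python">def run_with_plan(goal, propose_steps, review_plan, execute_step):
    # 1. Decompose the ambiguous goal into concrete steps (one call).
    plan = propose_steps(goal)      # e.g. ["find flights", "hold seat"]
    # 2. Show the plan once. The user reorders, removes, or adds steps.
    plan = review_plan(plan)        # returns the edited plan, or []
    if not plan:
        return None                 # the user rejected the whole plan
    # 3. Execute in the background, with no step-by-step narration.
    return [execute_step(step) for step in plan]
</code></pre><p>One approval gate before execution, instead of a gate inside every step.</p>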
<h4><strong>3. Environment Interaction</strong></h4><p><strong>What it unlocks:</strong> the AI does not just suggest. It acts. It books, sends, files, updates. It reaches into external systems and changes things.</p><p><strong>The lazy version:</strong> &#8220;I have drafted an email, would you like me to send it?&#8221; Repeated for every individual action. Twelve confirmation dialogs for twelve actions. The friction of asking permission one action at a time destroys any efficiency the agent was supposed to create.</p><p><strong>The good version:</strong> batch actions with review, clear undo capability, and sandbox-then-commit patterns. This is where trust design matters most. The gap between &#8220;AI that suggests&#8221; and &#8220;AI that acts&#8221; requires UX that gives users confidence without demanding constant supervision. Show the user what will happen. Let them approve the batch. Make it reversible.</p><h4><strong>4. Monitoring and Reaction</strong></h4><p><strong>What it unlocks:</strong> the AI pays attention when you are not. Watching for conditions and acting proactively when they are met.</p><p><strong>The lazy version:</strong> notification spam. Alerting on everything because the system cannot distinguish signal from noise. Fifteen notifications a day, fourteen of which are irrelevant. The user learns to ignore all of them.</p><p><strong>The good version:</strong> smart triggers with context and action. &#8220;Your flight price dropped $200. I held the lower rate for 24 hours.&#8221; The notification is not just information. It is a completed action the user can confirm or undo. Event-driven, not attention-demanding. The AI earns trust by acting on the user&#8217;s behalf with clear, reversible actions.</p><p>The agent primitives will define the next generation of AI UX decisions. The lazy defaults are already forming. Mostly chat windows narrating agent behaviour. The teams that design better patterns now will own the category.</p>
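<p>Sketched as code, the good version of monitoring is a small event handler, not a chat loop. The $200 figure comes from the example above; <code>hold_rate</code>, <code>cancel_hold</code>, and <code>notify</code> are hypothetical callables:</p><pre><code class="language-python">def on_price_drop(old_price, new_price, hold_rate, cancel_hold, notify):
    drop = old_price - new_price
    if drop &lt; 200:  # a real threshold keeps noise out of the inbox
        return
    # Take the reversible action first, then tell the user about it.
    hold_id = hold_rate(new_price, hours=24)
    notify(
        f"Your flight price dropped ${drop:.0f}. "
        "I held the lower rate for 24 hours.",
        undo=lambda: cancel_hold(hold_id),
    )
</code></pre>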
<h1><strong>Smart Filters: Intent Detection Done Right</strong></h1><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!efi0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7d9ade-3122-4c4f-a55c-1fc240cf8ae5_3048x2136.png" width="1456" height="1020" alt=""></figure></div><p>The most successful AI feature I shipped at Booking.com has no chat, no generated text, and no disclaimer.</p><p>The experience: you are searching for hotels in Paris. On the search page, there is a Smart Filters text field. You type &#8220;hotels with pool and gym.&#8221; You tap &#8220;Find properties.&#8221; What comes back is not a paragraph of AI-generated recommendations. It is the existing filter UI. Checkboxes for Swimming Pool and Fitness Center, pre-selected. Property type set to Hotels. 42 results.</p><p>That is it. The AI understood what you meant and translated it into the structured actions the product already supports. Here is why it works:</p><p><strong>The AI does not replace the interface. It accelerates it.</strong> The filters already existed. Users already knew how to use them. The AI just fills them in faster than the user could manually. The user ends up in the exact same place, with the same controls, the same mental model. Just faster.</p><p><strong>The output is verifiable at a glance.</strong> Did it understand &#8220;pool&#8221;? Swimming Pool is checked. &#8220;Gym&#8221;? Fitness Center is checked. Verification takes one second. Compare that to a chatbot response: &#8220;Based on your preferences, I recommend these hotels...&#8221; followed by a generated list you have no way to audit for completeness or accuracy.</p><p><strong>It is editable without starting over.</strong> Wrong filter? Uncheck it. Missing one? Add it. The user stays in control without having to rephrase a question and hope the AI gets it right the second time.
Compare this to the regeneration loop of a chatbot, where your only option when the output is wrong is to try again with different words.</p><p><strong>Failure is cheap.</strong> If the AI misinterprets something, the user adjusts a checkbox. No dead end. No trust collapse. No disclaimer needed, because the output is not generated content. It is a filter selection the user can see and fix instantly.</p><p><strong>It uses the right primitive.</strong> This is intent detection. The AI&#8217;s job is to understand what the user means and translate it to structured action. A chatbot would have used dialog management and generation. Those are the wrong primitives for this problem. The user does not want a conversation about hotels. They want 42 results that match their criteria.</p><p>The question that led to this feature was not &#8220;how do we add AI to search?&#8221; It was &#8220;users know what they want, but finding the right combination of filters from a list of 200+ takes too long. Can we solve that?&#8221; The primitive (intent detection) matched the problem (translating fuzzy preferences into structured filters). The UX followed from there.</p><h1><strong>The Real Problem</strong></h1><p>The lazy UX problem is a lazy thinking problem. Teams jump from &#8220;we have AI&#8221; to &#8220;ship a chatbot&#8221; without asking what capability they have and what user problem it solves.</p><p>Ten primitives across two tiers. Each one unlocks UX patterns that were previously impossible. Each one has a lazy default that teams reach for because it is easy, not because it is right.</p><p>The builders who win will match the primitive to the problem and design the interface that primitive deserves.</p><p>A text field that turns into checkboxes. Intent detection with the right UX. No chat. No generated paragraphs. No disclaimer. Just the existing interface, faster.</p><p>Start with the user problem. Identify the primitive. Design the UX it deserves.</p>
]]></content:encoded></item><item><title><![CDATA[Growth Machines: The Cursor Story]]></title><description><![CDATA[Lessons in scale you can steal from the fastest growing SaaS ever]]></description><link>https://www.builderlab.ai/p/growth-machines-the-cursor-story</link><guid isPermaLink="false">https://www.builderlab.ai/p/growth-machines-the-cursor-story</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sun, 01 Mar 2026 14:32:04 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a824e2e1-dcaa-4198-a8b8-d7f562d3d2e0_960x540.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Four MIT students fork a code editor in 2022. Two years later, they are running the fastest-growing SaaS company in history.</p><p>$1 million in revenue in 2023. $100 million in 2024. $1.2 billion in 2025. A valuation of $29.3 billion by November 2025. An 11x increase in twelve months.</p><p>Here is the detail that should bother every startup founder reading this: they spent zero dollars on marketing to reach $100M ARR. Not &#8220;low marketing spend.&#8221; Not &#8220;efficient growth.&#8221; Zero. If you are building a product right now and your growth plan starts with &#8220;demand gen,&#8221; this story is about to rearrange your priorities.</p><p>Cursor is the most instructive growth story in software since Notion. It made five specific product decisions that any builder can study, steal, and adapt, whether you are building an AI-native startup, a developer tool, or a productivity product in any category.</p><p>This is a teardown of those five decisions. What they chose, what they rejected, and what made each choice compound into the kind of growth that makes VCs rewrite their models.</p><div><hr></div><p>Michael Truell, Sualeh Asif, Arvid Lunnemark, and Aman Sanger met studying computer science and math at MIT. They collaborated at MIT CSAIL, MIT&#8217;s Computer Science and Artificial Intelligence Laboratory, before founding Anysphere in 2022.</p><p>Everyone was saying AI is the future of coding, but their bet was more specific and more interesting: the right place to embed AI is not on top of the code editor as a plugin. It is inside the code editor itself. That distinction, on top vs. inside, is the entire story of Cursor&#8217;s product.</p><p>They raised $3.38 billion in total funding from Accel, Coatue, Thrive Capital, Andreessen Horowitz, Google, and NVIDIA. Today, Cursor has over 1 million daily active users, 360,000 paying customers, and more than 50,000 enterprise accounts. It generates nearly a billion lines of code daily. Major customers include Stripe, OpenAI, and Spotify.</p><p>The numbers are staggering.
The decisions behind them are useful.</p><h1>Decision 1: Fork, Don&#8217;t Build</h1><p>The first decision Cursor made saved them two years. They forked VS Code. They took the open-source codebase of the most widely used code editor in the world and made it theirs.</p><p>This gave them three things instantly:</p><p><strong>Muscle memory</strong>. Every developer who uses VS Code (hundreds of millions) already knew how to navigate Cursor on day one. Keyboard shortcuts, extension ecosystem, settings, themes &#8212; all familiar. The switching cost was almost zero.</p><p><strong>Maturity</strong>. VS Code is years of engineering work. The file system, the debugging tools, the terminal, the Git integration, everything that makes it work: Cursor did not have to build any of it. They inherited a production-grade foundation.</p><p><strong>Depth permission</strong>. A plugin sits on top of VS Code. It can suggest a line of code. It can autocomplete. But it cannot fundamentally change how the editor understands your project. A fork, on the other hand, can rewrite the way context flows between files. It can introduce Composer mode, a multi-file, project-aware AI that understands the relationships between your modules, your tests, and your config.</p><p>The lesson is clean: when a mature, open-source platform exists in your domain, forking it is not lazy, it is strategic. You skip the years of infrastructure work and go straight to the layer where your actual differentiation lives. Every month you spend building what already exists is a month your competitor spends building what doesn&#8217;t.</p><h1>Decision 2: Custom Models for Workflow, Foundation Models for Reasoning</h1><p>The second decision helped Cursor avoid the &#8220;API wrapper&#8221; trap.</p><p>In 2023 and 2024, thousands of startups launched products that were thin interfaces around OpenAI or Anthropic API calls. The problem: when the model provider ships a better default UI, your product becomes redundant overnight.</p><p>Cursor built custom AI models trained specifically for coding workflow tasks. Things like advanced autocomplete that understands multi-file context, predictive edits based on your recent changes, and code generation that respects your project&#8217;s patterns and conventions. These custom models handle the workflow-specific tasks.
Foundation models like GPT and Claude handle the general reasoning.</p><p>This is a deliberate architectural split: proprietary models for the 80% of interactions that are workflow-specific, foundation models for the 20% that require open-ended reasoning.</p><p>By the time Cursor 2.0 shipped, it included a multi-agent architecture and a proprietary Composer model that could orchestrate complex, multi-step coding tasks across an entire project. This was not something you could replicate by calling an API.</p><p>If your only moat is a nice UI around someone else&#8217;s model, you do not have a moat. Build custom intelligence for the workflows that are specific to your domain. Use foundation models for everything else. The custom layer is where your product becomes irreplaceable.</p>
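<p>A toy sketch of that split, not Cursor&#8217;s actual architecture: workflow-specific tasks go to an in-house model, open-ended requests go to a foundation model. The task names and model handles are placeholders:</p><pre><code class="language-python">WORKFLOW_TASKS = {"autocomplete", "predict_edit", "rename_symbol"}

def route(task, payload, workflow_model, foundation_model):
    if task in WORKFLOW_TASKS:
        # The ~80% of calls that are workflow-specific: low latency,
        # served by a small model trained on domain data.
        return workflow_model(task, payload)
    # The ~20% that need open-ended reasoning over the whole problem.
    return foundation_model(payload)
</code></pre>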
<h1>Decision 3: Serve Power Users First, Everyone Else Later</h1><p>GitHub Copilot has over 20 million users. Cursor has around 1 million. Cursor makes roughly four times the revenue.</p><p><strong>20x fewer users, 4x more revenue.</strong></p><p>How? Copilot optimized for the broadest possible audience. Single-line autocomplete suggestions work fine for casual coding. It&#8217;s a great convenience tool.</p><p>Cursor optimized for professional developers doing the hardest work. Multi-file edits. Project-wide context. Agent-mode workflows that can scaffold entire features. Composer sessions that understand the relationships between your API routes, your database schema, and your frontend components.</p><p>When you serve power users first, three things happen:</p><p><strong>They pay more</strong>. Cursor Pro at $20/month and Business at $40/month convert at higher rates because the value proposition is clear: this tool does not help you write code a little faster. It fundamentally changes how much you can ship in a day.</p><p><strong>They evangelize harder</strong>. A casual user who saves 10 minutes a day tells nobody. A senior engineer who ships a feature in an afternoon that used to take a week tells everyone on their team. And their team tells other teams. This is how you get $100M ARR with $0 marketing spend.</p><p><strong>They are harder to displace</strong>. Once a power user has rewired their workflow around your tool, switching costs are enormous because of habit, muscle memory, and accumulated context.</p><p>By October 2025, Cursor had captured approximately 40% of all AI-assisted pull requests despite having a fraction of Copilot&#8217;s user base. The power users were producing the majority of the output.</p><p>The lesson is counterintuitive for founders raised on &#8220;total addressable market&#8221; thinking: a smaller market of high-intent users is more valuable than a larger market of casual users.</p><h1>Decision 4: Let the Product Be the Marketing</h1><p>Cursor&#8217;s growth was entirely product-led. No ads. No content marketing. No sales team (until enterprise). No growth hacks. The product itself was the distribution mechanism.</p><p>Here is how this worked:</p><p>A developer tries Cursor&#8217;s free tier (2,000 monthly completions). They experience the difference between single-line autocomplete (Copilot) and project-aware AI editing (Cursor). They upgrade to Pro. They mention it in a team standup. Their teammates try it. Their manager notices the team is shipping faster. The company starts a Business plan. That company&#8217;s engineers mention it at a conference. The loop repeats.</p><p>This is not a new idea. Slack did it. Figma did it. Notion did it. But Cursor executed it in a category (developer tools) where the word-of-mouth dynamics are unusually potent. Developers are opinionated about their tools. When one of them switches editors, everyone notices and everyone asks why.</p><p>The free tier was not a lead-gen funnel in disguise. It was genuinely useful. 2,000 completions per month is enough to get real work done. This matters. A free tier that feels crippled creates resentment. A free tier that feels generous creates evangelists.</p><p>Revenue doubled approximately every two months during the hypergrowth phase. That does not come from marketing campaigns. That comes from a product that makes people want to tell other people about it.</p><p>The principle for builders: if your product requires a sales pitch, it is not ready. The product should be the pitch. The users should be the sales team.</p><h1>Decision 5: Usage-Based Pricing for the AI Layer</h1><p>In August 2025, Cursor shifted to usage-based pricing for its agent features. The logic: the more the AI does for you, the more you pay. If Cursor&#8217;s agent writes 500 lines of code that would have taken you two hours, you pay proportionally more than when it autocompletes a function name. The price reflects the value delivered, not the time spent in the editor.</p><p>This matters for two reasons.</p><p>First, it makes the economics self-correcting. As AI capabilities improve, users get more done per session, which means they pay more per session, which means revenue grows with capability, not just with headcount.</p><p>Second, it signals confidence. A usage-based pricing model says: &#8220;We believe you will use this so much that you will be happy to pay for what you consume.&#8221; A flat-rate model says: &#8220;We hope you use this enough to justify the price.&#8221;</p><p>The pricing tiers (free with 2,000 completions, Pro at $20/month, Business at $40/month, plus usage-based agent pricing) create a natural upgrade path. You start free. You hit the limit. You upgrade. You start using agent features. You pay for what you consume. At every step, the value is obvious before the payment is required.</p>
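<p>To see why the upgrade path feels natural, run the arithmetic. The tier prices are the ones above; the per-request agent rate is a made-up placeholder, since the actual metering is more involved:</p><pre><code class="language-python">TIER_PRICE = {"free": 0, "pro": 20, "business": 40}
AGENT_RATE = 0.04  # hypothetical dollars per metered agent request

def monthly_bill(tier, agent_requests):
    # The flat subscription covers the editor; agent work is metered.
    return TIER_PRICE[tier] + agent_requests * AGENT_RATE

# monthly_bill("pro", 350) -&gt; 20 + 350 * 0.04 = 34.0
# The bill grows with how much work the agent did, not with seats.
</code></pre>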
<div><hr></div><h1>The Copilot Counterpoint: What Happens When You Go Broad Instead of Deep</h1><p>No teardown is useful without a counterexample. Copilot is the counterexample.</p><p>GitHub Copilot launched with Microsoft&#8217;s distribution muscle. It is bundled with GitHub. It is integrated into VS Code as a plugin. It has the backing of OpenAI. It has over 20 million users and penetration in 90% of the Fortune 100. It generates roughly $300 million in ARR. Cursor, with 5% of the user base, generates $1.2 billion.</p><p>Why? Because Copilot is constrained by the plugin architecture. A VS Code plugin cannot do what a VS Code fork can. It cannot rewrite how the editor processes project context. It cannot introduce multi-agent workflows that operate across files. It cannot build Composer mode. The plugin model inherently limits how deeply AI can integrate with the development workflow. Copilot optimized for breadth: every developer, every use case, every company.</p><h1>The Moat Question: Is This Sustainable?</h1><p>Intellectual honesty demands we address this. Cursor&#8217;s moat is real, but it is thinner than the valuation implies.</p><p>Three threats worth watching:</p><p><strong>Microsoft&#8217;s bundling power</strong>. GitHub Copilot is already included in enterprise contracts. When your competitor can offer &#8220;free with GitHub Enterprise,&#8221; you need your product to be dramatically better, not marginally better. So far, Cursor clears that bar. But the bar moves every quarter.</p><p><strong>Agent platforms targeting non-developers</strong>. Tools like v0 and Bolt.new are not competing with Cursor for professional developers. They are competing for the &#8220;vibe coder&#8221;, the non-developer who can now build software with AI assistance. If that market becomes larger than the professional developer market, Cursor&#8217;s power-user strategy becomes a ceiling, not an advantage.</p><p><strong>Model providers building vertically</strong>. OpenAI, Google, and Anthropic all have incentives to build their own coding environments. They have direct model integration advantages that third parties cannot match. If Claude or GPT ships a native IDE experience, Cursor&#8217;s custom model layer becomes less differentiating.</p><p>The current valuation of $29.3 billion sits at roughly 24x the $1.2 billion ARR. That prices in sustained hypergrowth. If any of these three threats materializes faster than Cursor&#8217;s product evolves, that multiple compresses quickly.</p><p>There is also a sentiment signal worth noting: developer favorability toward AI coding tools dropped from over 70% in 2023-2024 to 60% in 2025, even as adoption rates rose to 91%. Developers are using these tools more but trusting them less.</p><div><hr></div><h1>The Fork-and-Deepen Playbook: What You Can Steal</h1><p>Strip away the specifics of code editors and AI models. What is the replicable playbook?</p><p><strong>At 0 to 1</strong>: Fork, don&#8217;t build. Find a mature, open-source or widely used platform in your domain. Fork it or build deeply on top of it. Skip the infrastructure years. Go straight to the differentiation layer. Your users get instant familiarity. You get instant focus on what matters.</p><p><strong>At 1 to 10</strong>: Serve the 10x user. Do not build for the average user. Build for the most demanding user in your category. The user who will restructure their workflow around your product if it is good enough. These users pay more, evangelize more, and are harder to displace. They are your growth engine.</p><p><strong>At 10 to 100</strong>: Let the product be the distribution. If you need a sales deck to explain your product&#8217;s value, the product is not ready. Build until the experience is so obviously better that users become your marketing department. Free tiers should be genuinely useful, not crippled demos.</p><p><strong>At every stage</strong>: Build custom intelligence for your domain. Do not be an API wrapper. Foundation models are commodities.
Your custom models trained on your domain&#8217;s specific workflows are your advantage.</p><p><strong>When scaling:</strong> Align pricing with value delivery. Usage-based pricing for AI features is a signal of confidence in your own product. If your AI delivers real value, you should be willing to price it based on how much value it delivers, not on a flat subscription that averages out the power users with the casual ones.</p><h1>Where You Should Exercise Caution</h1><p>One caveat worth naming: playbooks extracted from hypergrowth stories are seductive precisely because the outcomes are so clean. Cursor&#8217;s decisions look inevitable in retrospect, but each one carried real risk that happened to break their way.</p><p>Forking creates a dependency on an upstream project you do not control.</p><p>Serving power users first means saying no to revenue from a larger, easier market while you burn cash.</p><p>Product-led growth with zero marketing spend works until it doesn&#8217;t, and then you have no distribution muscle to fall back on.</p><p>Usage-based pricing can suppress adoption in cost-sensitive segments.</p><p>The value of this piece is in the underlying logic: choose depth over breadth, align your architecture with your differentiation, and price to the value you deliver. When the logic fits your context, the decisions follow naturally. When it doesn&#8217;t, copying the moves will hurt you.</p><h1>The Number That Matters Most</h1><p>Here is the data point I keep coming back to: Copilot has 20 million users. Cursor has 1 million. Cursor makes 4x the revenue. Depth beats breadth. Power users beat mass users. Integration depth beats surface-level convenience. A fork beats a plugin.</p><p>Every founder I talk to worries about total addressable market. The Cursor story says the opposite: worry about depth of value per user. The market follows the product.</p>]]></content:encoded></item><item><title><![CDATA[Your Product Now Has Two Users]]></title><description><![CDATA[And one of them is AI]]></description><link>https://www.builderlab.ai/p/your-product-now-has-two-users</link><guid isPermaLink="false">https://www.builderlab.ai/p/your-product-now-has-two-users</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sun, 22 Feb 2026 18:50:54 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/737398a9-b35f-462d-9cf2-555631f55486_2112x1188.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Julie wants running shoes for wet pavement.
Two years ago, she would have typed &#8220;best waterproof running shoes&#8221; into Google, clicked through ten blue links, bounced between review sites, and eventually bought something from the brand that ranked highest and had the most convincing landing page.</p><p>Today she asks ChatGPT. It recommends three shoes. Your product is not one of them.</p><p>Your shoe isn&#8217;t bad. It&#8217;s not that expensive either. It&#8217;s just invisible. The AI intermediary that now sits between your product and your customer could not understand what your product does, who it is for, and where it falls short.</p><p>Welcome to the era of AI-mediated discovery. Your product now has two users. The human who uses it, and the AI that decides whether the human ever sees it.</p><h1>The Numbers That Should Scare You</h1><p>This is not a hypothetical future. It is already here, and the data is brutal.</p><p>Over 80% of all searches now end without a single click. When a Google AI Overview appears, that zero-click rate climbs to 83%. Click-through rates drop from 15% to 8% when an AI Overview is present, a 47% reduction. The click, once the atomic unit of digital discovery, is becoming an endangered species.</p><p>But that data alone is not conclusive. Sure, information seekers clicking fewer links adds noise. Not everyone is looking for product recommendations, after all. Some of us just want to know if koalas are vegan. So here is more well-rounded data that proves the point:</p><p>SOCi analyzed over 350,000 locations across 2,751 multi-location brands and found that ChatGPT recommends just 1.2% of them. Google&#8217;s local three-pack shows 35.9%. That makes AI recommendation roughly 30 times more selective than traditional search.</p><p>Read that again: 30 times more selective. If traditional search was a funnel, AI-mediated discovery is a needle.</p><p>Meanwhile, the traffic that does come through AI channels is surging. AI-driven traffic to US ecommerce sites increased 758% year over year by November 2025. According to IBM&#8217;s study with NRF, 45% of consumers now turn to AI during buying journeys: 41% to research products, 33% to interpret reviews, 31% to hunt for deals. About a third of US consumers say they would let an AI make purchases for them entirely.</p><p>This is not a trend that might matter someday. This is the new distribution channel. And your product is either legible to it or invisible within it.</p><h1>The Old Playbook Is Broken</h1><p>For two decades, the discovery playbook was straightforward. Rank for keywords. Drive traffic. Convert on-site. The entire machine was built on the assumption that discovery happens through a list of links, and the winner is whoever sits highest on that list.</p><p>AI-mediated discovery breaks every piece of that machine.</p><p>Non-brand, awareness-driven B2B traffic has declined by up to 60% as AI search reduced click-through behavior. These companies watched something deeply disorienting happen: their search rankings stayed stable, but the clicks disappeared. They were still &#8220;winning&#8221; at SEO. The traffic just stopped arriving. The game changed around them while the scoreboard kept showing the old score.</p><p>When a user asks an AI a question, the AI does not return a list of links for the user to evaluate. It returns an answer. It synthesizes, filters, compares, and recommends. This is the fundamental shift: your product is no longer competing for a user&#8217;s attention.
It is competing for an AI&#8217;s comprehension.</p><h1>What AI Intermediaries Actually Reward</h1><p>The brands ChatGPT recommends are not the biggest. They are the most legible.</p><p>Recommended locations average 4.3-star ratings. AI platforms heavily weight sentiment. Not volume of reviews, but consistency of positive signal. More importantly, AI rewards cross-channel consistency over strength in a single channel. A brand that is excellent on its website but thin on Google, Yelp, and social profiles performs worse than a brand that is consistently good across all of them.</p><p>This is a fundamentally different optimization target than traditional SEO. Search engine optimization rewarded you for being strong in one place: Google&#8217;s index. AI-mediated discovery rewards you for being coherent everywhere.</p><p>The second pattern is even more counterintuitive. Constraint-focused product content outperforms feature-focused content for AI recommendations. &#8220;Best for daily commuters; water-resistant for light rain, not heavy downpours&#8221; beats &#8220;advanced waterproof technology with moisture-wicking lining.&#8221; The AI intermediary is not looking for marketing language. It is looking for specificity about what your product does, who it serves, and, critically, where it does not work.</p><p>Think about why. When a user asks an AI for a recommendation, the AI is trying to match a specific need to a specific product. Feature lists are noise. Constraints are signal. The AI needs to know what your product is not good at just as much as what it is good at, because that is how it avoids recommending the wrong thing to the wrong user.</p><h1>The Second-Order Consequences</h1><p>This shift does not stop at &#8220;optimize your product data.&#8221; The incentives underneath AI-mediated discovery are different from search, and that difference compounds.</p><p>Google monetized clicks. Large language models monetize sessions. Their economic incentive is to keep users inside the interface, deliver a high-confidence answer, and reduce friction. That means fewer options, higher precision, and lower cognitive load. A list of ten links was acceptable when the click was the business model. A single synthesized answer is required when retention is.</p><p>Right now, most AI interfaces recommend three to four products. In many categories, eight to twelve products could legitimately satisfy the user&#8217;s constraints. The compression is already severe. When sponsored placements enter the system, surface area shrinks further. If one slot is paid and one is reserved for a trusted integration, you are effectively competing for one or two organic positions. This is not SEO with better copy.
It is a high-precision, low-inventory discovery economy.</p><p>Here, discoverability becomes winner-takes-most. Marginally legible products do not receive marginal traffic. They disappear. From the AI provider&#8217;s perspective, monetization must remain tightly coupled to relevance. If AI platforms degrade recommendation quality for ad revenue, users leave. Unlike search engines, they cannot fall back on ten blue links. Trust is the product.</p><p>There is another layer. Today, structured constraint data is an advantage. Tomorrow, it becomes table stakes. When every brand expresses &#8220;best for daily commuters, not ideal for heavy rain&#8221; in clean, machine-readable form, differentiation moves elsewhere. The AI must choose among equally legible products.</p><p>What does it optimize for then? Likely the lowest probability of user regret. Consistent satisfaction signals. Return rates. Review stability. Cross-channel coherence. In that world, operational excellence beats marketing brilliance. The brands that win are not those with the loudest features, but those with the most reliable outcomes.</p><h1>The Trust Layer Underneath</h1><p>Before you rush to &#8220;optimize for AI,&#8221; there is a critical nuance.</p><p>Only 46% of shoppers fully trust AI recommendations. And 89% still check information before buying based on an AI recommendation. The AI intermediary is not a decision-maker. It is a filter. It narrows the consideration set from everything to three or four options. Then the human takes over.</p><p>This means the AI is deciding who gets into the room. But the human is still deciding who wins.</p><p>Your product needs to pass two tests now, in sequence. First, can the AI understand it well enough to recommend it? Second, once the human arrives from the AI recommendation, does the product experience close the deal?</p><h1>What This Means for Builders</h1><p>The implications are immediate. Your product&#8217;s discoverability is no longer just a marketing problem. It is a product data problem. How your product describes itself (its attributes, constraints, use cases, and limitations) determines whether it passes through the AI filter.</p><p>Does your product data describe constraints and use cases, or does it describe features and benefits? Does your product information maintain consistency across every surface it appears on, or is it fragmented across channels with different messaging on each?</p><p>If you are a startup founder, there is a genuine opportunity here. AI recommendation is more meritocratic than traditional search in one specific way: it rewards data density and accuracy over brand size and ad spend. A small brand with thorough, constraint-focused product data can outperform a Fortune 500 brand with thin product pages. You do not need a massive SEO team or a big advertising budget. You need clear, specific, honest product data that an AI can parse and match to user intent.</p><p>Here is the practical checklist.</p><p><strong>Audit your product data for AI legibility.</strong> Take your top 10 product pages. Ask ChatGPT, Gemini, and Perplexity to recommend a product in your category for a specific use case. Are you in the results? If not, look at who is and study their product data structure.</p><p><strong>Shift from features to constraints.</strong> Rewrite your product descriptions to explicitly state who the product is for, what specific problems it solves, and where it falls short. &#8220;Works best for X.
Not ideal for Y.&#8221; This feels counterintuitive. Why would you highlight limitations? Because the AI intermediary uses limitations to make accurate recommendations, and accurate recommendations build trust in the AI&#8217;s suggestions.</p><p><strong>Ensure cross-channel consistency.</strong> Your product data should tell the same story on your website, Google Business Profile, Amazon listing, social profiles, and review platforms. The AI is synthesizing across all of them.</p><p><strong>Monitor AI recommendations, not just search rankings.</strong> Your weekly check should now include &#8220;what does ChatGPT recommend for [your category] + [specific use case]?&#8221; Track this the way you track keyword rankings. Build it into your product analytics workflow.</p><p><strong>Design your product information architecture for machines, not just humans.</strong> Structured data, consistent taxonomy, explicit attribute-value pairs. The prettier your product page is for humans, the less useful it may be for AI intermediaries if the underlying data is trapped in marketing copy rather than structured fields. A sketch of what that looks like follows below.</p>
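<p>What &#8220;machine-readable&#8221; means in practice is explicit attribute-value pairs instead of prose. A sketch with invented field names; the exact schema matters less than the constraints being structured at all:</p><pre><code class="language-python"># Constraint-focused product data as a plain structure. Every field
# name here is illustrative, not a standard; publish the same facts
# consistently on every surface the AI synthesizes across.
product = {
    "name": "Example Commuter Shoe",
    "best_for": ["daily commuting", "wet pavement", "light rain"],
    "not_ideal_for": ["heavy downpours", "trail running"],
    "attributes": {
        "water_resistance": "light rain only",
        "weight_grams": 260,
    },
}
</code></pre>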
<h1>The Opportunity Behind the Disruption</h1><p>Retailers who have already adapted to this shift are seeing results. Those adopting conversational, intent-driven search internally reported a 35% drop in zero-result queries and a lift in first-click conversion. Companies implementing AI-powered discovery report conversion lifts of 20-40%.</p><p>Major retailers like Walmart, Target, and Etsy are integrating directly with ChatGPT, Gemini, and Copilot, allowing purchases within AI conversations. The transaction layer is moving into the AI intermediary&#8217;s environment. If your product is legible to these systems, you are not just getting discovered, you are getting purchased without the user ever visiting your site.</p><p>The irony is beautiful: in a world of AI intermediaries, the competitive advantage is radical clarity about your product.</p><h1>What Comes Next</h1><p>We are still early. The AI-mediated discovery layer is clumsy, inconsistent, and sometimes wrong. The 46% trust number tells you that consumers know this. The 89% verification rate tells you humans are not blindly following AI recommendations.</p><p>But the behavioral shift is locked in. The zero-click trend, the 758% traffic surge, the one-third of consumers willing to let AI buy for them. These are not numbers that reverse. The intermediary layer between your product and your user is only going to get thicker, more capable, and more selective.</p><p>The products that start designing for two users now, the human who uses the product and the AI that recommends it, will have a compounding advantage. Not because they cracked some new marketing hack. Because they did the hardest, most boring, most valuable thing in product development.</p><p>They made their product easy to understand.</p><p>That running shoe for wet pavement? The one the AI did not recommend? The product was fine. The data was not. Same quality, same reviews as the shoes that did get recommended. But those had structured, constraint-specific product data that an AI could parse, and yours had a beautiful landing page full of marketing copy that a human loved and an AI could not decode.</p><p>Your product now has two users. Design for both of them, or the one that matters most will never know you exist.</p><p>PS: My favorite electronics e-commerce brand is CoolBlue. I&#8217;ve bought all my laptops and phones from them, all my home appliances, even an entire kitchen. Most of it online. They&#8217;ve been doing the pros-and-cons list (and a fantastic product page) for ages.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!5zat!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae6ef72-d01f-47e1-8a0f-049bbe99e76d_698x880.png" width="698" height="880" alt=""></figure></div>
]]></content:encoded></item><item><title><![CDATA[ROI or Die.]]></title><description><![CDATA[Why the AI accountability gap is the most expensive problem in tech right now.]]></description><link>https://www.builderlab.ai/p/roi-or-die</link><guid isPermaLink="false">https://www.builderlab.ai/p/roi-or-die</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sun, 08 Feb 2026 06:02:19
GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4ba7d3b3-ee8e-4d2d-b5c0-fc2436546bc9_960x540.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In 2024, corporations spent $252 billion on AI. That same year, studies found that as many as 95% of enterprise AI pilots delivered zero measurable impact on the P&amp;L. Not &#8220;underperformed expectations.&#8221; Zero. No measurable bottom-line impact whatsoever.</p><p>By 2025, 42% of companies had abandoned most of their AI initiatives, up from 17% the year before. The average organization now scraps nearly half its AI proof-of-concepts before they reach production. Gartner predicted that at least 30% of generative AI projects would be abandoned after proof-of-concept by end of 2025. RAND Corporation found that AI projects fail at twice the rate of non-AI IT projects.</p><p>Read those numbers again. A quarter-trillion dollars deployed. A ninety-five percent failure rate.</p><p>And yet, at the same time, others are making AI work. Decisively.</p><p>AI-powered personalization is lifting ecommerce conversion rates by up to 23%. Contact centers using AI see 30% operational cost reductions. A Harvard Business School study found AI users completed tasks 25% faster at 40% higher quality. Early GenAI adopters report $3.70 in value for every dollar invested, with top performers hitting $10.30 per dollar. McKinsey found that the 6% of organizations they classify as &#8220;AI high performers&#8221; attribute over 5% of their entire EBIT to AI.</p><p>Developers using GitHub Copilot code 55% faster in controlled studies. 88% of accepted Copilot suggestions stay in the codebase. Teams across the software development lifecycle report 15%+ velocity gains from AI tooling, and coding has become a $4 billion AI category in 2025, up from $550 million the year before. That&#8217;s not a pilot. That&#8217;s a market.</p><p>Same technology. Same year. Same models. Radically different outcomes.</p><p>So what separates the 5% that extract millions in value from the 95% that produce nothing? It&#8217;s not the model. It&#8217;s talent, it&#8217;s data, and&#8230;</p><p>It&#8217;s accountability. And the lack of it has a name.</p><div><hr></div><h1><strong>The Lie</strong></h1><p>Two words: &#8220;strategic investment.&#8221;</p><p>This is how AI projects avoid accountability. At every altitude. In every organization. And it&#8217;s the most expensive phrase in corporate vocabulary right now.</p><p>Think about it. If a VP of engineering walked into a budget review and said &#8220;I need $3 million for a new microservices architecture, no measurable business outcome expected for 18 months, and I can&#8217;t tell you what metric it will move,&#8221; that request would be dead before lunch. The CFO would laugh. The board would never see it.</p><p>Now replace &#8220;microservices architecture&#8221; with &#8220;AI.&#8221; Same request. Same missing numbers. Different outcome. The room nods. The budget gets approved. The CFO might even smile. &#8220;We need to be investing in AI.&#8221;</p><p>This is the exemption. And it operates at every altitude.</p><p>In the boardroom, the pattern is remarkably consistent. An investor sits on four boards. Every board meeting, the CEO presents the AI roadmap. She asks about ROI.
The CEO says &#8220;it&#8217;s a strategic investment, we&#8217;ll see returns in phase two.&#8221; She doesn&#8217;t push harder because she doesn&#8217;t want to be the partner who &#8220;doesn&#8217;t get AI.&#8221; Multiply this by every firm, every fund, every seat in the portfolio. That&#8217;s how billions get deployed without a number attached.</p><p>One level down, in the executive suite, the pattern mutates but survives. A VP of Product has four AI initiatives running. One is a zombie with no measurable impact, no clear path to impact, but it has executive sponsorship and &#8220;strategic&#8221; status. Killing it means a political fight with the exec who championed it. So the zombie lives, quietly consuming budget and engineering time that should be flowing to the projects that are actually working. He&#8217;s subsidizing failure with the budget meant for success. He knows it. His team knows it. Nobody says it.</p><p>On the product team, a PM is sitting in a review. Someone asks about ROI for the AI feature. The project lead deploys the magic words: &#8220;it&#8217;s a strategic investment.&#8221; Everyone nods. The question dies. The project lives another quarter. The PM notices the pattern over time: projects with numbers get funded aggressively or killed fast. Decisions happen. &#8220;Strategic&#8221; projects just linger, never killed, never funded properly, never held accountable. They exist in a permanent state of limbo, consuming resources and producing nothing.</p><p>Meanwhile, a solo founder with eleven months of runway doesn&#8217;t get to play &#8220;strategic.&#8221; Every dollar spent on AI that doesn&#8217;t convert to revenue, retention, or reduced costs is a dollar closer to death. But he&#8217;s watching competitors raise $20 million rounds on demos and narrative. AI-native. Foundation model. Platform play. Zero revenue. He&#8217;s wondering if he&#8217;s doing it wrong. He&#8217;s not. He&#8217;s the one doing it right. He just doesn&#8217;t know it yet.</p><p><strong>&#8220;Strategic investment&#8221; is not a strategy. It&#8217;s a hiding place.</strong> And it&#8217;s getting crowded.</p><div><hr></div><h1><strong>You Already Know How to Do This</strong></h1><p>Here&#8217;s what makes the AI accountability gap so maddening: it&#8217;s not a knowledge problem. Every organization already has the tools to prevent it.</p><p>You don&#8217;t ship a feature without success criteria. You don&#8217;t approve headcount without a business case. You don&#8217;t launch a product without knowing what metric it&#8217;s supposed to move. You don&#8217;t hand an engineering team six months and a blank check and say &#8220;surprise me.&#8221; You run quarterly business reviews. You set OKRs.
You kill projects that miss milestones. You&#8217;ve been doing this for decades.</p><p>Except with AI. With AI, you do all of the things you&#8217;d never do with any other investment. You fund without milestones. You launch without success criteria. You give &#8220;strategic&#8221; projects unlimited runway. And then you&#8217;re shocked that 95% of pilots produce nothing.</p><p>IBM spent $5 billion building Watson Health. Five billion. They acquired four health data companies, employed 7,000 people, and partnered with MD Anderson Cancer Center, which spent $62 million over four years on a Watson-powered oncology advisor. The system was never used on a single actual patient. MD Anderson killed the project citing cost overruns, delays, and procurement problems. IBM eventually sold Watson Health to a private equity firm for roughly $1 billion. A $4 billion write-down on a &#8220;strategic investment&#8221; that nobody held accountable until it was too late.</p><p>Zillow trusted its algorithm to buy homes. Their Zestimate model had been valuing homes for years. Good enough for browsing, not good enough for betting the company. But they bet the company anyway. In Q3 2021, Zillow&#8217;s iBuying division lost $304 million. Total write-downs exceeded $500 million. They laid off 2,000 people, 25% of their workforce. Their stock lost $9 billion in market cap. Their competitors ran similar models in the same markets. They survived. Zillow didn&#8217;t.</p><p>The technology worked in both cases. The accountability didn&#8217;t. Normal investment discipline (milestones, kill criteria, measurable outcomes) would have caught the problem long before it became catastrophic. But the rules didn&#8217;t apply. Because AI.</p><p>Here&#8217;s the double standard made concrete. Zillow&#8217;s competitors, Opendoor and Offerpad, used similar pricing algorithms in the same markets during the same period. Opendoor reported $170 million in profits and 7.3% gross margins in the same quarter Zillow lost $304 million. The difference wasn&#8217;t the model. Opendoor had been in the iBuying business since 2014 and had built tighter risk controls, shorter feedback loops, and a healthy respect for what the algorithm could and couldn&#8217;t predict. They treated the algorithm as a tool with known limitations. Zillow treated it as an oracle and scaled aggressively on faith. Same technology, same market, but opposite discipline, opposite outcome.</p><p>IBM&#8217;s story is even more instructive. Watson Health wasn&#8217;t a bad idea. It was a badly managed investment. The MD Anderson project&#8217;s original budget was $2.4 million. It ballooned to $62 million over four years. A University of Texas audit found that contracts were deliberately structured just below the threshold for board approval, avoiding the very scrutiny that would have caught the problem. The system was built on a legacy medical records platform that MD Anderson had already replaced. Nobody stopped to ask whether the integration made sense. Nobody set a milestone that would have surfaced the misalignment. The project drifted for years under &#8220;strategic&#8221; cover until the audit blew it up. If someone had asked &#8220;what&#8217;s the dollar, and when do we see it?&#8221; at month six, the $62 million loss could have been a $5 million lesson.</p><p>Now look at the winners.
McKinsey&#8217;s 2025 AI survey tested 25 organizational attributes to find what predicts bottom-line impact from AI. One of the strongest predictors was <strong>workflow redesign with clear milestones before scaling.</strong> The 6% of organizations seeing real EBIT impact are three times more likely to have redesigned workflows and set measurable targets before they deployed. They didn&#8217;t have better technology. They had better questions. They asked &#8220;what&#8217;s the dollar?&#8221; before they wrote a line of code.</p><p>BCG&#8217;s 2025 research on AI in finance tells the same story from a different angle. The highest-ROI teams don&#8217;t just build AI, they focus relentlessly on value from day one. They prioritize quick wins over open-ended learning. They allocate dedicated budgets. They emphasize early impact. BCG found that emphasizing early impact increases the likelihood of AI project success by six percentage points. In a landscape where the baseline success rate is in the single digits, that&#8217;s enormous. The pattern is consistent: discipline first, technology second.</p><p>The discipline already exists. You just stopped applying it when someone attached the word &#8220;AI&#8221; to the budget request. Stop making the exception.</p><div><hr></div><h1><strong>The Framework</strong></h1><p>&#8220;ROI or die&#8221; isn&#8217;t a new framework. It&#8217;s a reminder to use the one you already have.</p><p>Before any AI initiative gets funded, run it through a test of value. Use the framework you use for everything else. I use PROVE-IT (<a href="https://www.builderlab.ai/p/how-to-spot-the-10-of-ai-projects">link here</a>). If it doesn&#8217;t pass, it doesn&#8217;t get built. If it does, it enters the cycle.</p><p><strong>Explore &#8594; Exploit &#8594; Platformise.</strong></p><p><strong>Explore [20% of AI budget]. </strong>Tiny projects. Small segments. Limited exposure. Fast and cheap. The goal isn&#8217;t scale, it&#8217;s signal. Did we find a dollar worth chasing? Run the experiment in weeks, not months. Most things die here. That&#8217;s not failure, that&#8217;s the system working. The ROI of Explore is learning speed. The faster you can run cheap experiments and kill the losers, the faster you find the winners. A team running four small experiments a month will outperform a team running one big &#8220;strategic&#8221; initiative every quarter. Every time.</p><p><strong>Exploit [50% of AI budget]. </strong>This is where the money is made. The winners from Explore get scaled. Broad rollout. Full integration into existing funnels. Conversion optimization, support deflection, pricing, ops efficiency, workflow acceleration. AI applied to the places you already make or spend money. ROI-positive within 90 days or kill it. No exceptions. No &#8220;phase two.&#8221; No &#8220;strategic.&#8221; Show the dollar or free up the budget for something that will. This track funds everything else. If Exploit isn&#8217;t performing, you don&#8217;t have permission to explore or invest long-term. You have to earn the right to innovate by proving you can execute.</p><p><strong>Platformise [30% of AI budget]. </strong>The proven wins from Exploit get built into infrastructure. Distilled capabilities. Reusable systems. Self-service tools that make the next cycle of Explore faster and cheaper. This is where data flywheels live, where memory systems compound, where proprietary feedback loops create defensibility.
Platformise still has ROI, it&#8217;s just measured in compounding velocity rather than quarterly P&amp;L.</p><p><strong>The cycle is the point. </strong>Explore feeds Exploit. Kill fast. Only the winners graduate. Exploit feeds Platformise. Double down. The proven wins become infrastructure. And Platformise feeds back into Explore. Better platform means faster, cheaper experiments. Each revolution of the flywheel gets faster than the last.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J_i5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J_i5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png 424w, https://substackcdn.com/image/fetch/$s_!J_i5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png 848w, https://substackcdn.com/image/fetch/$s_!J_i5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png 1272w, https://substackcdn.com/image/fetch/$s_!J_i5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J_i5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png" width="1456" height="753" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:753,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177609,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.builderlab.ai/i/187263977?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J_i5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png 424w, https://substackcdn.com/image/fetch/$s_!J_i5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png 848w, 
https://substackcdn.com/image/fetch/$s_!J_i5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png 1272w, https://substackcdn.com/image/fetch/$s_!J_i5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4594b42f-064a-4019-af2a-669afdb4505b_2080x1076.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>This is also how you prepare for the future you can&#8217;t predict. </strong>When agents arrived, teams that had been exploring weekly had the muscle to spin up an experiment in days. Teams that had put everything into Exploit were six months behind before they started. And the teams that had platformised their AI capabilities could plug new paradigms into existing infrastructure instead of building from scratch. The framework doesn&#8217;t predict the future, but it makes you fast enough that you don&#8217;t need to.</p><h2><strong>A concrete example</strong></h2><p>Take <strong>AI-powered support deflection</strong>. This is a classic use case because it touches real cost, has fast feedback loops, and fails loudly when it doesn&#8217;t work. Perfect for disciplined AI investment.</p><p><strong>Explore</strong></p><p>You don&#8217;t start by &#8220;reimagining support.&#8221; You start by isolating one narrow problem.</p><p>Pick a single issue category, say, password resets or order confirmation questions. These are high-volume, low-risk, and well understood by agents. You deploy an LLM in a constrained way: either drafting responses for agents or deflecting tickets to self service. No full automation, no broad rollout.</p><p>You cap exposure at ~5% of traffic. You run it for two weeks.
What you measure is boring and explicit:</p><ul><li><p>Resolution rate vs baseline</p></li><li><p>CSAT delta</p></li><li><p>Cost per ticket</p></li><li><p>Agent intervention rate</p></li><li><p>Escalation frequency</p></li></ul><p>You&#8217;re not trying to prove AI is &#8220;the future of support.&#8221; You&#8217;re asking a much simpler question: <em>is there a dollar here worth chasing</em>?</p><p>Most experiments die at this stage. Maybe the model hallucinates edge cases. Maybe CSAT drops. Maybe agents hate it. That&#8217;s fine. You&#8217;ve spent weeks, not quarters, and thousands, not millions. If there&#8217;s no clear signal, you stop.</p><p><strong>Exploit</strong></p><p>Now assume the signal is real. The experiment shows a 12% reduction in tickets for that category, with no CSAT degradation and lower average handling time. That&#8217;s no longer a research question. It&#8217;s a business one.</p><p>You move into Exploit. You expand coverage to the top 5&#8211;10 issue categories that share similar characteristics. You integrate the capability directly into the help flow instead of running it as a sidecar. You invest in guardrails, fallback logic, and operational monitoring. You staff it properly.</p><p>And now the bar changes.</p><p>This phase has explicit expectations:</p><ul><li><p>Weekly cost savings tracked in dollars</p></li><li><p>Clear ownership</p></li><li><p>A defined 60&#8211;90 day window to prove the numbers hold at scale</p></li></ul><p>If the deflection rate collapses, or CSAT starts to drift, you don&#8217;t debate intent. You stop, surgically fix issues, rerun, or roll back. The project doesn&#8217;t get infinite runway just because it once worked in a pilot.</p><p>This is the critical shift most teams skip. They treat pilots as proof of inevitability rather than proof of value.</p><p>If Exploit works, this becomes one of the few AI projects that actually funds others.</p><p><strong>Platformise</strong></p><p>Only now does platform thinking start to make sense. You still don&#8217;t build a platform off a single use case. You look for <strong>reuse pressure</strong>.</p><p>Your support deflection product succeeds. Separately, a sales assist product shows promise. A fraud triage model is running into similar confidence and fallback questions. A content moderation workflow needs quality thresholds.</p><p>Now you have three or four production AI systems asking the same questions:</p><ul><li><p>How do we measure quality consistently?</p></li><li><p>How do we compare model versions safely?</p></li><li><p>How do we incorporate human feedback?</p></li><li><p>How do we decide when automation is allowed vs deferred?</p></li></ul><p>Only at that point do you invest in shared evaluation tooling: common metrics, logging standards, review workflows, offline test harnesses. Not because &#8220;AI needs a platform,&#8221; but because multiple profitable systems are being slowed down by the lack of one.</p><p>Platformisation exists to <strong>compress future Explore cycles</strong>, not to justify past ones.</p><p>The result is compounding speed. The next support experiment launches in days instead of weeks. The next workflow doesn&#8217;t need to reinvent evaluation logic. 
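</p><p>If you want the gate itself to be mechanical rather than a matter of meeting-room rhetoric, it fits in a dozen lines. A minimal sketch in Python; every field name, metric, and threshold below is hypothetical, lifted from the example above rather than from any real system:</p><pre><code>from dataclasses import dataclass

@dataclass
class ExploreResult:
    """Two weeks of Explore data for one issue category (hypothetical fields)."""
    deflection_delta: float       # change in deflected tickets vs. baseline, e.g. 0.12
    csat_delta: float             # change in CSAT vs. baseline, e.g. -0.01
    cost_per_ticket_delta: float  # negative means cheaper to serve

def explore_gate(r: ExploreResult) -&gt; str:
    """Kill or promote. Thresholds are illustrative, not universal."""
    if r.csat_delta &lt; -0.02:
        return "kill"                # quality regression trumps any savings
    if r.deflection_delta &gt;= 0.10 and r.cost_per_ticket_delta &lt; 0:
        return "promote to Exploit"  # a dollar worth chasing
    return "kill"                    # no clear signal: stop, cheaply
</code></pre><p>The point is not the code. The point is that the decision is small enough to write down before the experiment starts, which removes the room for &#8220;strategic&#8221; to creep back in afterwards.</p>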
<p>This discipline creates leverage.</p><div><hr></div><h2><strong>The Rules</strong></h2><p><strong>Exploit funds Explore and Platformise. </strong>No Exploit performance, no permission to explore or invest long-term. This is how you answer the question &#8220;what are we getting for this?&#8221;</p><p><strong>Every project has a track. </strong>Untracked AI work doesn&#8217;t get funded. If someone can&#8217;t tell you whether their project is Explore, Exploit, or Platformise, they haven&#8217;t thought hard enough.</p><p><strong>Track migrations are explicit decisions, not drift. </strong>Projects get promoted, killed, or demoted, never &#8220;let&#8217;s give it one more quarter.&#8221; That sentence has killed more AI projects than bad data ever has.</p><p><strong>The 20/50/30 is representative, not a target. </strong>Unspent exploration budget is a feature, not a failure. And the ratios flex by stage. A seed-stage startup might be 80% Exploit, 20% Explore, 0% Platformise, and that&#8217;s exactly right. An enterprise with $80 million in ARR might run the full 20/50/30. The principle is constant, but the allocation can adapt.</p><p><strong>Rebalance quarterly. </strong>Review the full portfolio. Which Explore experiments found signal? Which Exploit projects hit their 90-day milestones? Which Platformise investments are actually compounding velocity? Move budget toward what&#8217;s working. Kill what&#8217;s not. This is the portfolio discipline you already apply to every other investment. Apply it here.</p><div><hr></div><h1><strong>The Transition</strong></h1><p>If you&#8217;re reading this and thinking &#8220;we already have AI projects running with no tracks, no milestones, and no kill criteria&#8221;, you&#8217;re not alone. That&#8217;s the default state of most organizations right now.</p><p>You can&#8217;t walk in Monday and announce that everything is changing. Here&#8217;s the move instead.</p><p>Start with an audit. Map every active AI initiative. For each one, answer three questions: what&#8217;s the dollar it&#8217;s chasing? What&#8217;s the milestone for this quarter? What happens if it misses? If no one can answer those questions for a given project, that project goes on a 30-day clock. Define the answers or kill it. (The audit is small enough to sketch in code; see below.)</p><p>If you&#8217;re running a portfolio of AI initiatives, this is the political cover you&#8217;ve been missing. You&#8217;re not killing someone&#8217;s pet project, you&#8217;re applying the same milestone discipline you apply to every other investment. Frame it that way. &#8220;We&#8217;re not cutting AI. We&#8217;re making AI accountable.&#8221; The zombie projects die. The budget they were consuming flows to the projects that are actually working.
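Your best teams get more resources, not less.</p><p>The audit itself can start embarrassingly small. A sketch of the three questions as code, assuming nothing beyond the standard library; the project names and fields are hypothetical:</p><pre><code>from datetime import date, timedelta

# One row per active AI initiative. Three questions: what's the dollar,
# what's this quarter's milestone, what happens if it misses?
initiatives = [
    {"name": "support-deflection", "dollar": "cost per ticket",
     "milestone": "12% deflection at scale", "if_missed": "roll back"},
    {"name": "genai-platform-vision", "dollar": None,
     "milestone": None, "if_missed": None},  # a zombie, hypothetically
]

for p in initiatives:
    missing = [k for k in ("dollar", "milestone", "if_missed") if p[k] is None]
    if missing:
        deadline = date.today() + timedelta(days=30)
        print(f"{p['name']}: no {', '.join(missing)} -&gt; define by {deadline} or kill")
</code></pre><p>If that loop prints anything, you have Monday&#8217;s agenda.</p>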
<p>If you&#8217;re about to pitch the next AI feature, this is your competitive advantage. Don&#8217;t walk into the room with &#8220;we should build an AI feature.&#8221; Walk in with &#8220;This is an Exploit-track project. Here&#8217;s the funnel. Here&#8217;s the dollar. Here&#8217;s the 90-day milestone. Here&#8217;s what we kill if it misses.&#8221; That pitch gets funded every time.</p><p>If you&#8217;re watching funded competitors burn cash on AI features with no revenue model, keep going. Your ROI obsession is correct. The discipline you&#8217;re forced into by your runway is the same discipline that separates the 6% from the 94%. You&#8217;re not the small one. You&#8217;re the smart one.</p><div><hr></div><p>Every AI project should have a dollar sign attached. Every one.</p><p>Not because innovation doesn&#8217;t matter. It does. But because the projects that change the world are the ones that survive long enough to do it. And survival requires proof. IBM Watson had a $5 billion budget. It didn&#8217;t survive. Zillow Offers had the largest real estate dataset in America. It didn&#8217;t survive. The projects that survive are the ones that can answer the question: what did we get for that?</p><p>The accountability gap is closing. The era where &#8220;we&#8217;re investing in AI&#8221; was enough to satisfy boards, impress LPs, and justify headcount is ending. The firms that imposed discipline early, that tracked ROI from day one, that killed projects at 90 days, that required every initiative to name its dollar before writing a line of code, will outperform.</p><p>The firms that let &#8220;strategic&#8221; be a hiding place will write checks they can&#8217;t explain. Some already are.</p><p>You already know how to run disciplined investments. You already know how to set milestones and kill what doesn&#8217;t work. You already know that &#8220;strategic&#8221; without a number is just another word for &#8220;we don&#8217;t know.&#8221;</p><p>Apply what you know. AI doesn&#8217;t get special treatment.</p><p><strong>ROI or die.</strong></p><div><hr></div><p><strong>Claim Verifications and Citations</strong></p><h6><em>$252 billion corporate AI spending in 2024. <br>Source: Stanford HAI AI Index 2025 Annual Report.</em></h6><h6><em>95% of enterprise AI pilots delivered zero measurable P&amp;L impact. <br>Source: MIT Project NANDA, &#8220;The GenAI Divide: State of AI in Business 2025&#8221; (July 2025).</em></h6><h6><em>42% of companies abandoned most AI initiatives, up from 17%.
<br>Source: S&amp;P Global Market Intelligence, &#8220;Voice of the Enterprise: AI &amp; ML, Use Cases 2025.&#8221;</em></h6><h6><em>Gartner: 30% of GenAI projects abandoned after POC by end of 2025. <br>Source: Gartner press release, Data &amp; Analytics Summit, Sydney (July 2024).</em></h6><h6><em>AI projects fail at twice the rate of non-AI IT projects. <br>Source: RAND Corporation, &#8220;Root Causes of Failure for AI Projects&#8221; (2024).</em></h6><h6><em>AI personalization lifting ecommerce conversion by up to 23%. <br>Source: Multiple industry reports (Envive AI, McKinsey retail research).</em></h6><h6><em>Contact centers using AI see 30% operational cost reductions. <br>Source: Multiple industry sources (Fullview, AmplifAI, McKinsey CX research).</em></h6><h6><em>AI users completed tasks 25% faster at 40% higher quality. <br>Source: Dell&#8217;Acqua et al., &#8220;Navigating the Jagged Technological Frontier,&#8221; HBS Working Paper 24-013 (Sept 2023).</em></h6><h6><em>Early GenAI adopters report $3.70 per dollar; top performers $10.30. <br>Source: Widely cited (CIO Dive, BotsCrew, Fullview).</em></h6><h6><em>6% &#8220;AI high performers&#8221; attribute 5%+ of EBIT to AI. <br>Source: McKinsey, &#8220;The State of AI in 2025&#8221; (Nov 2025).</em></h6><h6><em>GitHub Copilot: 55% faster coding in controlled studies. <br>Source: GitHub Research (2022). Controlled experiment, P=.0017.</em></h6><h6><em>88% of accepted Copilot suggestions stay in the codebase. <br>Source: GitHub usage data, confirmed across multiple compilations.</em></h6><h6><em>Coding AI: $4B category in 2025, up from $550M. <br>Source: Menlo Ventures, &#8220;2025: The State of Generative AI in the Enterprise&#8221; (Jan 2026).</em></h6><h6><em>IBM spent ~$5B building Watson Health. <br>Source: IBM acquisition costs (Truven, Phytel, Explorys, Merge). Sold to Francisco Partners for ~$1B (2022).</em></h6><h6><em>MD Anderson spent $62M; system never used on a patient. <br>Source: University of Texas System audit report (Jan 2016). Confirmed by JNCI, Medscape, Forbes.</em></h6><h6><em>MD Anderson original budget $2.4M, ballooned to $62M. <br>Source: UT audit. IBM contract extended 12 times.</em></h6><h6><em>Contracts structured below board approval threshold. <br>Source: UT audit. &#8220;Fees consistently set just below the amount that would have required Board approval.&#8221;</em></h6><h6><em>System built on legacy EHR (ClinicStation) already replaced by Epic. <br>Source: UT audit.</em></h6><h6><em>Zillow iBuying lost $304M in Q3 2021; $500M+ write-downs; 2,000 laid off; $9B market cap loss. <br>Source: Zillow Q3 2021 earnings and public filings.</em></h6><h6><em>Opendoor: $170M profits, 7.3% gross margins same quarter. <br>Source: Opendoor financial filings, same period.</em></h6><h6><em>McKinsey: workflow redesign strongest predictor of AI impact. <br>Source: McKinsey, &#8220;The State of AI in 2025.&#8221; Note: McKinsey says &#8220;one of the strongest&#8221;.</em></h6><h6><em>High performers 3x more likely to have redesigned workflows. <br>Source: McKinsey, &#8220;The State of AI in 2025.&#8221;</em></h6><h6><em>BCG: emphasizing early impact increases success likelihood by six percentage points.
Source: General thesis supported by BCG AI Radar 2025.</em></h6>]]></content:encoded></item><item><title><![CDATA[You're Not Building an ‘AI Product’.]]></title><description><![CDATA[Why model performance is a design constraint, and calibration is the real AI skill.]]></description><link>https://www.builderlab.ai/p/youre-not-building-an-ai-product</link><guid isPermaLink="false">https://www.builderlab.ai/p/youre-not-building-an-ai-product</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sun, 01 Feb 2026 13:42:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/39261fb2-31c7-46a7-885d-c548d4b54aec_1432x1428.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>You&#8217;re an ecommerce app. You have an AI model that predicts which customers will cancel their order before it ships. Three scenarios: the model has 99% precision, 80% precision, or 40% precision. What do you build with each?</em></p><p><em>Precision means: when the model flags someone, it&#8217;s right that percentage of the time.</em></p><p><em>Most people answer like this:</em></p><ol><li><p><em><strong>99%:</strong> &#8220;Automate it, cancel their order before they do.&#8221;</em></p></li><li><p><em><strong>80%:</strong> &#8220;Good start, but let&#8217;s improve the model before we ship.&#8221;</em></p></li><li><p><em><strong>40%:</strong> &#8220;Not useful. Too many false positives.&#8221;</em></p></li></ol><p><em>Wrong. Wrong. Wrong.</em></p><p><em>The right answer reveals the skill that actually matters when you&#8217;re building with AI. Almost nobody has it.</em></p><div><hr></div><h2><strong>The Answer</strong></h2><p>Notify the customer: &#8220;Confirm or cancel your order&#8221;. Same model. Same email. Different intensity.</p><p><strong>99% precision:</strong> Aggressive CTA. &#8220;Still want this? Click to confirm or we&#8217;ll cancel your order.&#8221; You&#8217;re almost certain they&#8217;ll cancel, better they do it now than you ship it and eat the return costs.</p><p><strong>80% precision:</strong> Medium CTA. &#8220;Your order ships tomorrow!&#8221; with a visible, easy cancel button. You&#8217;re right 4 out of 5 times. Give them a clear out. The 1 in 5 who weren&#8217;t going to cancel? Most of them feel informed. No harm done.</p><p><strong>40% precision:</strong> Soft CTA. &#8220;Your order is on the way!&#8221; with a subtle cancel link. You&#8217;re wrong more often than you&#8217;re right, but you&#8217;ve still got a meaningful signal over random. You can&#8217;t push hard. But you can quietly surface the option for those who need it.</p><p>One model. Three scenarios. Three calibrations.</p><p>The precision is the constraint. The intervention intensity is the variable.</p><p>This is the entire game. And most builders don&#8217;t even know they&#8217;re playing it.</p><div><hr></div><h2><strong>The Skill Nobody Teaches</strong></h2><p>Most builders ask: <em>Is the model good enough?</em></p><p>The rare ones ask: <em>Good enough for what?</em></p><p>That&#8217;s the skill gap. Not &#8220;how do I handle uncertainty.&#8221; But &#8220;how do I calibrate the product to the confidence level I actually have?&#8221;</p><p>Netflix understood this early. Their recommendation algorithm doesn&#8217;t need to be right, it needs to surface options you&#8217;ll scroll past if they&#8217;re wrong. 80% of Netflix watch time comes from recommendations. 
Not because they&#8217;re perfectly accurate, but because the cost of a bad recommendation is close to zero. You see a title you don&#8217;t want. You scroll. You pick something else. Maybe you hover, decide not to click, move on. The algorithm watches all of it: the hovers, the skips, the half-watched episodes, and learns. Netflix has said this saves them an estimated $1 billion per year in reduced churn. Not by being right. By making &#8220;wrong&#8221; cheap.</p><p>Spotify understood this. Discover Weekly gives you 30 songs you&#8217;ve never heard. Some you&#8217;ll hate. But the product isn&#8217;t &#8220;songs you&#8217;ll love.&#8221; It&#8217;s &#8220;songs you <em>might</em> love.&#8221; The intervention is calibrated perfectly: skipping costs 30 seconds and a thumb tap. No wasted money. No broken workflow. No angry customer. And your skips train the algorithm just as much as your saves do. Failure is feedback. The product gets better specifically because it&#8217;s allowed to be wrong.</p><p>GitHub Copilot understood this at a deeper level. Roughly 30% suggestion acceptance rate. Think about that. Seven out of ten suggestions Copilot makes are rejected. By any traditional software metric, that&#8217;s a broken feature. But Copilot is one of the fastest-growing developer tools in history. Why? Because a wrong suggestion costs you nothing. Gray text appears, you keep typing, it vanishes. A right suggestion saves you minutes. Tab to accept, keep typing to reject. 88% of accepted code stays in the codebase. Users report coding 55% faster. The product works <em>because</em> failure is free, not in spite of its error rate.</p><p>The pattern is consistent across all three. Winners don&#8217;t wait for perfect precision. They calibrate the intervention to the confidence they have. They design the product around the error rate, not despite it.</p><h2><strong>The Cost of Getting It Wrong</strong></h2><p>Air Canada didn&#8217;t understand this.</p><p>Their chatbot gave a customer wrong information about bereavement fares. The customer relied on it, booked a flight, and applied for the discount. Air Canada refused the refund, then argued in a tribunal that the chatbot was &#8220;a separate legal entity responsible for its own actions.&#8221;</p><p>The tribunal called this a &#8220;remarkable submission.&#8221; Ordered Air Canada to pay $650. The chatbot was pulled from the website.</p><p>The product failed because nobody asked &#8220;what&#8217;s our precision, and what intervention is appropriate for that precision?&#8221; A chatbot that gives direct answers acts like a 99% precision product. Air Canada didn&#8217;t have 99% precision.
They had a confident-sounding model and no fallback.</p><p>McDonald&#8217;s made the same mistake differently. They spent years building an AI drive-thru system with IBM. It misinterpreted orders, adding bacon to ice cream, ringing up 260 chicken nuggets. Discontinued in 2024 across more than 100 locations.</p><p>The precision might have been fine for a soft suggestion. &#8220;Did you mean a medium combo?&#8221; But they built a hard automation. Drive-thru failure isn&#8217;t like Netflix failure. You can&#8217;t scroll past a wrong order. The customer sees it. The kitchen makes it. Wasted food, frustrated customers, operational chaos. They mismatched the intervention to the precision.</p><p>Klarna made the boldest version of this mistake. In 2024, they replaced 700 customer service agents with AI. The CEO bragged publicly that he hadn&#8217;t hired a human in a year. By 2025, they were reassigning software engineers to handle customer calls because they couldn&#8217;t rehire fast enough. The CEO now says Klarna wants to be &#8220;best at offering a human to speak to.&#8221;</p><p>The pattern is just as consistent on the losing side. Every one of these companies built interventions that were too aggressive for their precision. No fallback. No calibration. No recovery path.</p><div><hr></div><h2><strong>The Frame</strong></h2><p>This is the divide:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X-fd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X-fd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png 424w, https://substackcdn.com/image/fetch/$s_!X-fd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png 848w, https://substackcdn.com/image/fetch/$s_!X-fd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png 1272w, https://substackcdn.com/image/fetch/$s_!X-fd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X-fd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png" width="1350" height="694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1350,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136274,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.builderlab.ai/i/186497963?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X-fd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png 424w, https://substackcdn.com/image/fetch/$s_!X-fd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png 848w, https://substackcdn.com/image/fetch/$s_!X-fd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png 1272w, https://substackcdn.com/image/fetch/$s_!X-fd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd24571e8-06a7-4981-a4d9-d8408c12dd93_1350x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The precision didn&#8217;t determine success. The match between precision and intervention did.</p><p>Every winner on this list has a cheap failure mode. Every loser has an expensive one. That&#8217;s not an accident. 
<h2><strong>What This Looks Like in Practice</strong></h2><p>Way back at Booking.com, I had a model that detected the topic of customer service requests. Billing, cancellations, property issues, check-in problems.</p><p>The goal was routing. If you know what the customer needs, you send them to the right agent on the first try. No transfers. Transfers are expensive. Every handoff costs time, frustration, and money. A customer who gets transferred once is measurably less satisfied. Transferred twice and you&#8217;ve probably lost them. And you&#8217;ve wasted the time of three customer service agents.</p><p>So we tuned for high precision. When the model predicted a topic, it had to be right. And it was. But recall was 5%. The model barely fired. Out of every hundred customer requests, it confidently classified about five. The other ninety-five got the same routing they always had.</p><p>It sat there, almost perfectly correct and almost perfectly useless. Didn&#8217;t move metrics. Couldn&#8217;t move metrics. A model that&#8217;s right 95% of the time but only makes a prediction 5% of the time has almost no product impact.</p><p>Same model. We lowered the precision threshold. Recall jumped to 80%+. Now it caught most incoming requests, but it was wrong more often. Meaningfully wrong.</p><p>The old intervention, routing, couldn&#8217;t tolerate that. If you send a customer with a billing issue to the cancellation team, you&#8217;ve created the exact problem you were trying to solve. The customer gets transferred. They wait again. They&#8217;re angrier than when they started. Wrong routing is worse than no routing.</p><p>So we changed the intervention.</p><p>Instead of routing to an agent, we surfaced self-service options based on the predicted topic. <em>Before</em> the customer ever reached a human. &#8220;It looks like you might need help with your cancellation. Here are some options.&#8221;</p><p>Prediction right? Customer solves it themselves. Never reaches an agent. Cost to serve: near zero.</p><p>Prediction wrong? Customer ignores the suggestions, closes the panel, and connects to an agent anyway. No harm done.</p><p>30% reduction in contact rates. Same model. Same underlying capability, at a different operating point.</p><p>The precision didn&#8217;t change the outcome. The match between precision and intervention did. We didn&#8217;t make the model better. We made the product smarter about how it used the model.</p>
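<p>The mechanics of that trade are worth seeing once. A toy sketch; the scores and labels are invented for illustration, not Booking.com data:</p><pre><code># Sweep the confidence threshold: precision trades against recall.
scores = [0.95, 0.90, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]  # model confidence
labels = [1, 1, 1, 0, 1, 0, 0, 0]                          # true topic match

def precision_recall(threshold):
    flagged = [s &gt;= threshold for s in scores]
    tp = sum(f and y for f, y in zip(flagged, labels))
    fp = sum(f and not y for f, y in zip(flagged, labels))
    fn = sum(not f and y for f, y in zip(flagged, labels))
    return (tp / (tp + fp) if tp + fp else 0.0,
            tp / (tp + fn) if tp + fn else 0.0)

for t in (0.9, 0.5, 0.2):
    p, r = precision_recall(t)  # high threshold: right but nearly silent
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
</code></pre><p>High threshold: almost always right, almost never fires, the routing version. Low threshold: broad coverage, more mistakes, survivable only because the intervention became a suggestion instead of a transfer.</p>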
<div><hr></div><h2><strong>The 70% Mindset</strong></h2><p>The rare builders share a mental model I call the 70% mindset.</p><p>They internalize that the model will be wrong some percentage of the time, and they calibrate the intervention to that percentage. Not hide it. Not wait to fix it. <em>Design for it.</em> They treat the error rate as a first-class input to product design, not as a bug to be squashed before launch.</p><p>The simulation they run:</p><ol><li><p>&#8220;I have 70% precision. What intervention is appropriate for that?&#8221;</p></li><li><p>NOT &#8220;I need 95% precision. How do I get there?&#8221;</p></li></ol><p>The difference sounds subtle. It&#8217;s not. The first question leads to a product you can ship this quarter. The second leads to a roadmap that never converges.</p><p>They start with the constraint. They calibrate inside it.</p><p>Gmail Smart Compose works maybe 30% of the time. But it&#8217;s explicitly designed not to be authoritative. It&#8217;s gray text that disappears as you type. Wrong guesses vanish silently. You never feel wrong. You never feel corrected. The model&#8217;s error rate is invisible because the intervention is so soft that failure literally doesn&#8217;t exist from the user&#8217;s perspective. Soft intervention for low precision. Perfectly calibrated.</p><p>Perplexity has variable precision across topics. A question about recent events might be less reliable than a question about established science. But every claim has a numbered source. Wrong answer? One click to verify. They didn&#8217;t try to eliminate error. They built verification into the product itself. The intervention <em>assumes</em> the model will be wrong sometimes and gives the user the tool to catch it. That&#8217;s calibration.</p><p>Waze has roughly 85% precision on surface street timing. But the product isn&#8217;t &#8220;perfect ETA.&#8221; It&#8217;s &#8220;less traffic.&#8221; Wrong prediction? You still arrive. Maybe a few minutes later than expected. And your drive, the actual data from your actual route, makes the next prediction better for everyone. Soft intervention that self-corrects with feedback. The product improves <em>because</em> it ships with imperfect precision, not in spite of it.</p><p>This is the mindset shift that separates builders who ship from builders who wait. Stop asking &#8220;how do I make the model better?&#8221; Start asking &#8220;what product can I build with the model I have?&#8221;</p><p>A startup with 40% precision can beat an incumbent with 95%, if the startup calibrates the intervention correctly and the incumbent doesn&#8217;t. This happens more often than you think.</p><div><hr></div><h2><strong>Why Almost Nobody Gets This Right</strong></h2><p>Nothing in conventional product training develops this skill. And it&#8217;s not because training is bad, it&#8217;s because the fundamental paradigm changed and the training hasn&#8217;t caught up.</p><p>Traditional software is deterministic. The spec says X, the engineer builds X, and X either works or it doesn&#8217;t. The button either saves the file or it throws an error.
The search either returns results or it returns nothing. There&#8217;s no &#8220;the button saves your file 80% of the time.&#8221; You don&#8217;t ship that. You fix the bug.</p><p>This is how every PM, designer, and engineer has been trained to think. Define the requirement. Build to the requirement. Test against the requirement. Ship when it passes.</p><p>AI is probabilistic. The model says X, and X is right some percentage of the time. There is no &#8220;fix the bug.&#8221; The error rate <em>is</em> the product. The intervention has to match the confidence. &#8220;Ship it&#8221; isn&#8217;t enough. &#8220;Ship it with the right calibration&#8221; is the actual job, and nothing in the traditional playbook prepares you for that.</p><p>The entire pipeline (CS education, PM bootcamps, design programs, agile certifications) produces builders who think in certainties. Then it hands them probabilistic tools and expects them to figure it out. Most don&#8217;t. They either wait for certainty that will never come, or they ship with certainty they don&#8217;t have.</p><p>DPD&#8217;s chatbot started swearing at customers and writing poems mocking the company after a system update. Disabled immediately. Nobody had asked: &#8220;What&#8217;s our precision on edge cases, and what intervention is appropriate when the model goes off the rails?&#8221; They built a product that assumed the model would behave predictably. The model didn&#8217;t. There was no fallback, no graceful degradation, no circuit breaker.</p><p>That&#8217;s not an AI failure. That&#8217;s a calibration failure. The skill that would have prevented it isn&#8217;t technical, it&#8217;s judgment. Matching intervention intensity to model confidence. Designing the failure state as carefully as the success state.</p><p>My team at Booking.com reviews 2,000 to 2,500 applications per year. We hire 3, maybe 5. That&#8217;s roughly a 0.2% acceptance rate. As selective as Google, more selective than Y Combinator, 18x more selective than Harvard.</p><p>The rare ones don&#8217;t have better AI knowledge. They have better calibration instincts. They hear &#8220;40% precision&#8221; and immediately start designing interventions. Everyone else hears &#8220;40% precision&#8221; and says &#8220;not good enough.&#8221; The rare ones ask &#8220;good enough for what?&#8221; Everyone else asks &#8220;how do we get to 95%?&#8221;</p><p>That&#8217;s the gap. And no amount of &#8220;AI PM&#8221; certifications will close it.</p><div><hr></div><h2><strong>The Calibration Checklist</strong></h2><p>Before you build anything, answer these six questions:</p><ol><li><p><strong>What&#8217;s the precision?</strong> Not what you hope for, what you actually have right now. When the model flags something, how often is it right?</p></li><li><p><strong>What intervention does that precision support?</strong> 99% supports hard automation. 80% supports clear suggestions. 40% supports soft nudges. Different precisions demand different intervention intensities.</p></li><li><p><strong>What happens when it&#8217;s wrong?</strong> Describe the failure state as vividly as you describe the success state. If you can&#8217;t, you don&#8217;t understand your product yet.</p></li><li><p><strong>Who pays the cost of failure?</strong> The user? The company? A third party? The answer changes your risk tolerance entirely.</p></li><li><p><strong>How do you make failure cheap?</strong> Soft intervention? Easy reversal? Built-in verification?
The best AI products are designed so that being wrong costs almost nothing.</p></li><li><p><strong>How do you detect failure?</strong> For high-stakes decisions, you need individual flags. For low-stakes, aggregate metrics. Match your monitoring to your risk.</p></li></ol><p>If you can answer all six, you might have a product.</p><p>If you can only answer the first two, you have a demo.</p><div><hr></div><h2><strong>The Exercise</strong></h2><p>Here&#8217;s how to develop this skill: run the 99/80/40 exercise on your own product.</p><p>Take whatever your model does, recommendations, predictions, classifications, generation. Now imagine three scenarios:</p><ol><li><p><strong>99% precision.</strong> What would you build? How aggressive would the intervention be? At this level, you can automate hard. You can take action on behalf of the user. You can make decisions that are expensive to reverse.</p></li><li><p><strong>80% precision.</strong> What changes? What do you pull back? At this level, you can suggest confidently. You can surface options prominently. But you need the user in the loop. You need a clear &#8220;no thanks&#8221; path.</p></li><li><p><strong>40% precision.</strong> What do you build now? Most people say &#8220;nothing.&#8221; The right answer is almost never nothing. At this level, you can still nudge. You can still surface. You can still learn. The intervention just has to be soft enough that being wrong is invisible.</p></li></ol><p>Then ask yourself: where is my product right now? And is my intervention calibrated to that precision&#8212;or am I pushing too hard?</p><p>If you&#8217;re building a 99%-confidence intervention on an 80%-precision model, you&#8217;re building the next Air Canada chatbot. If you&#8217;re waiting for 95% precision before shipping anything, you&#8217;re leaving the Copilot opportunity on the table. Both mistakes come from the same place: not treating the precision as a design input.</p><p>Start with the constraint. Calibrate inside it. Ship.</p><div><hr></div><h2><strong>The Point</strong></h2><p>You&#8217;re not building an AI product.</p><p>You&#8217;re building a product that has a capability which is right some percentage of the time. That percentage isn&#8217;t a limitation, it&#8217;s a design constraint. And like every design constraint, it produces better products when you respect it than when you fight it.</p><p>The question isn&#8217;t &#8220;how do I make the model more precise?&#8221; The question is &#8220;what intervention does this precision support?&#8221;</p><p>The 99% model and the 40% model aren&#8217;t better or worse. They&#8217;re <em>different</em>. Different precisions, different interventions, different products. A 40% model with the right intervention beats a 95% model with the wrong one. Every time. Copilot at 30% acceptance rate generates more revenue than McDonald&#8217;s AI drive-thru at 95% accuracy. Netflix at 80% relevance saves more money than Air Canada&#8217;s chatbot at 95% confidence. The precision doesn&#8217;t determine the outcome. The calibration does.</p><p>The AI arms race is misunderstood. It&#8217;s not about who has the best model. It&#8217;s about who matches the intervention to the precision. That&#8217;s a product skill, not an AI skill. And it&#8217;s the reason the &#8220;AI PM&#8221; job title is fake. The skill that matters here, calibrating intervention to confidence, is product judgment. It&#8217;s design sense. 
It&#8217;s understanding what happens when things go wrong and building for that from day one.</p><p>Saying you&#8217;re building an &#8220;AI product&#8221; is like saying you&#8217;re building a &#8220;C++ product.&#8221; The technology is an implementation detail. The market doesn&#8217;t care what&#8217;s under the hood. Users don&#8217;t care what&#8217;s under the hood. Your CFO definitely doesn&#8217;t care what&#8217;s under the hood.</p><p>What matters is what the product <em>does</em>.</p><p>Lead with that. &#8220;20% faster. 30% cheaper. 10x more scale.&#8221;</p><p>When they ask how, <em>then</em> you say AI.</p>]]></content:encoded></item><item><title><![CDATA[Growth Machines: The Notion Story]]></title><description><![CDATA[The growth framework that turned near-death into $10B.]]></description><link>https://www.builderlab.ai/p/growth-machines-the-notion-story</link><guid isPermaLink="false">https://www.builderlab.ai/p/growth-machines-the-notion-story</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sun, 25 Jan 2026 16:28:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2893e37e-79aa-421c-a7e3-b0340b97e23b_944x944.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In 2015, Notion was dead. Out of money, team laid off, product too confusing for anyone to use.</p><p>Ten years later: 100 million users, $10B valuation, and a playbook every founder wants to copy.</p><p>The common explanation, &#8220;templates went viral&#8221;, misses the point. Notion won because they solved a problem most builders ignore: <strong>how to make a product get deeper with use, not just wider with distribution.</strong></p><p>Most startups chase distribution first: more users, more channels, more reach. Notion did the opposite. They refused to scale until they&#8217;d solved a different problem: whether users would do <em>more</em> with the product over time, not just try it once.</p><p>That&#8217;s the difference between depth and distribution. Depth means users replace other tools, customize their setup, and eventually share what they&#8217;ve built, not because you incentivized them, but because the artifact is useful on its own. Distribution just brings people to the door.
Depth determines whether they stay.</p><p>Notion&#8217;s insight: <strong>distribution only compounds when depth is already rising.</strong> Without depth, growth is just accelerated churn. With it, every new user makes the product stickier for the next.</p><p>Here&#8217;s how they built that depth, and the framework for knowing when you&#8217;re ready to scale.</p><h2><strong>The first Notion failed on legibility, not ambition</strong></h2><p>Early Notion was ambitious in exactly the way builders admire. It was flexible, expressive, and theoretically powerful. The problem wasn&#8217;t quality; it was interpretation. New users were confronted with too much possibility and too little guidance. The product demanded imagination before it delivered value, which meant only a narrow slice of users could succeed without help. Most didn&#8217;t churn because the product disappointed them. They churned because they didn&#8217;t know what to do.</p><p>Ivan Zhao later reflected on this failure: &#8220;We focused too much on what we wanted to bring to the world. We needed to pay attention to what the world wanted from us&#8221;[1]. That sentence matters because it reframes the problem. This wasn&#8217;t about marketing reach or onboarding polish. It was about legibility, about whether a user could look at the product and immediately recognize how it fit into their life.</p><p>The deeper realization came later: &#8220;People don&#8217;t want to build apps. They don&#8217;t think about programming, they simply want something that solves their problems quickly and efficiently&#8221;[2]. The first Notion had been built for toolmakers. Toolmakers, it turned out, are rare.</p><p>Legibility is a growth constraint, one often ignored. If users can&#8217;t see themselves in the product quickly, no amount of distribution fixes that. It only accelerates confusion.</p><h2><strong>The rebuild was about first value, not feature power</strong></h2><p>When Notion ran out of money, Ivan Zhao and Simon Last made a brutal calculation. They laid off their entire team of four, sublet their San Francisco office, and moved to Kyoto [3]. Kyoto housing was larger and cheaper, the cost of living was dramatically lower, and they could focus without distraction. In a rented two-story house with paper walls, no heating, and bedrooms separated only by shoji screens, they spent the next year rebuilding everything from scratch [4].</p><p>&#8220;Neither of us spoke Japanese and nobody there spoke English,&#8221; Ivan later said, &#8220;so all we did was code in our underwear all day&#8221; [5].</p><p>The central question wasn&#8217;t how to relaunch bigger. It was how to relaunch clearer. What does meaningful value look like in the first hour of use? That question forced a different kind of rebuild. The goal was no longer to showcase flexibility, but to reduce cognitive load. Every interaction was redesigned so the product felt usable before it felt powerful.</p><p>During this period, Ivan appeared at the top of Figma&#8217;s most active user list, spending 18+ hours a day iterating on the interface [6]. It was a war against ambiguity. As Ivan later put it, &#8220;Designers spend too much time on edge cases. Most of the time, what matters is the dumbest path&#8221;[7]. 
In Kyoto, they found that path.</p><p>This was Notion&#8217;s first major growth decision: scale didn&#8217;t matter until first value was obvious.</p><h2><strong>Templates were not a growth hack</strong></h2><p>When Notion relaunched, they needed traction fast. Ivan asked early investor Naval Ravikant to wake up at midnight, when Product Hunt&#8217;s daily list refreshed, and tweet to his followers to vote. Naval had, in Ivan&#8217;s words, &#8220;a bazillion Twitter followers&#8221; [8]. The gambit worked. Notion hit #1 for the day, week, and month.</p><p>But the product they launched wasn&#8217;t a blank canvas. It shipped with a set of internally created templates: project trackers, team wikis, meeting notes, reading lists. These templates are often described as a clever growth tactic. That misses the point. Templates were Notion&#8217;s solution to the legibility problem.</p><p>A blank page asks users to imagine value. A working system demonstrates it. Templates allowed users to start inside something that already worked, rather than designing from scratch. This collapsed time-to-value and removed the need for explanation. Users didn&#8217;t have to ask what Notion could be. They could see it operating in a familiar context.</p><p>Ivan later described this as &#8220;sugar-coated broccoli.&#8221; The original vision of letting anyone build their own software tools was the broccoli. Most people don&#8217;t want broccoli. But wrap it in something they already care about, like notes, docs, project management, and they&#8217;ll consume the underlying capability without resistance [9].</p><p>The critical move came next. Notion allowed users to publish and share templates themselves. This turned onboarding into a distributed system. New users increasingly encountered Notion not as a product, but as a solution, as a specific system someone else was already using. They didn&#8217;t arrive asking &#8220;what is Notion?&#8221; They arrived asking &#8220;how do I use this?&#8221;</p><p>Over two years, templates from Notion&#8217;s early gallery were downloaded over one million times [10].</p><h2><strong>Growth became possible once usage became shareable</strong></h2><p>At this point, Notion&#8217;s growth stopped being fragile. A user would customize a template, embed it into their workflow, and eventually share it because the artifact had value on its own. Sharing wasn&#8217;t incentivized, it was a byproduct of real work.</p><p>This is a crucial inversion. Most products try to engineer virality on top of shallow usage. Notion did the opposite. It engineered deep usage first, then let sharing emerge naturally. That&#8217;s why the template ecosystem could exist outside the company.
Creators were selling on Gumroad, in newsletters, on YouTube, without collapsing into spam. The artifacts were useful even if you never signed up. Some creators have built entire businesses selling Notion templates, with top sellers earning over $1 million without writing a line of code [11].</p><h2><strong>Community amplified depth; it didn&#8217;t manufacture it</strong></h2><p>Notion&#8217;s community is often cited as the second reason for its growth. That gets causality backward. Community only worked because the product was already embedded in people&#8217;s workflows.</p><p>Ivan&#8217;s early behavior wasn&#8217;t a branding exercise. It was product development disguised as customer support. &#8220;In the early days, I spent a lot of time replying one-on-one to users on Twitter, making sure they knew there was a human behind Notion,&#8221; he explained. &#8220;We also logged every piece of feedback we got and tagged it so we could use it to develop features&#8221; [12].</p><p>When creators like Marie Poulin shared early frustrations with the product&#8217;s learning curve, Ivan didn&#8217;t just acknowledge the feedback. He watched how they used the product and shipped changes to remove friction [13]. That responsiveness turned critics into evangelists, because users could see their fingerprints on the product.</p><p>The results compounded. One ambassador now runs Notion&#8217;s subreddit with hundreds of thousands of members. Others built large regional communities in Korea, the Middle East, and Europe [14]. These were power users who became invested enough to build ecosystems around the tool.</p><p>Community didn&#8217;t create enthusiasm from nothing. It amplified usage density that already existed. Without that depth, community would have been noise.</p><h2><strong>Why Notion looked slow, and why that was the point</strong></h2><p>For years, Notion didn&#8217;t look impressive by startup standards. There were no viral spikes, no aggressive funnels, no growth-at-all-costs playbook. From the outside, it looked like restraint. Internally, it was discipline.</p><p>That discipline came at a cost. While Notion was rebuilding in Kyoto, Coda launched with $60M in funding. Airtable was scaling aggressively into enterprise. Slack had already become a verb. Investors knocked so persistently that Notion eventually moved offices and removed itself from Google listings. Ivan declined most conversations [15].</p><p>When Notion finally raised its Series A in 2020, it reached a $2B valuation with roughly 30 employees [16]. That ratio tells you what they were optimizing for: depth, not headcount.</p><p>And then that depth nearly broke them. During COVID, as millions of users flooded the platform, Notion&#8217;s infrastructure buckled. For years, the entire product ran on a single Postgres instance, upgraded again and again. Eventually, there were no larger machines left. There was a literal doomsday clock counting down to when the database would run out of space and the product would shut down [17].</p><p>They froze feature development. Every engineer pivoted to infrastructure. It was, in Ivan&#8217;s words, a close call [18].</p><p>They survived. And when Notion finally scaled, it did so cleanly because the system underneath could absorb load. 
Each cohort customized more deeply, relied on the product more centrally, and replaced more tools than the one before.</p><h2><strong>Notion and the AI wave: sticking to core principles</strong></h2><p>When Notion launched AI features in February 2023, they faced a strategic choice: build a standalone AI tool, or embed intelligence into the workspace users had already built.</p><p>They chose depth over distribution again.</p><p>Notion AI is embedded directly into pages, databases, and workflows where users already work [19]. This matters because Notion had spent years getting users to build systems inside the product: project trackers, team wikis, meeting notes, personal dashboards. That accumulated structure became AI&#8217;s context layer.</p><p>A standalone AI tool knows nothing about your work. Notion AI knows your page relationships, your database schemas, your team&#8217;s writing style, your project history. When you ask it to summarize last week&#8217;s meetings or draft a document in your company&#8217;s voice, it draws on depth that already exists [20].</p><p>Ivan described this as &#8220;consolidate before you automate&#8221; [21]. The years Notion spent getting users to replace fragmented tools with a unified workspace were AI groundwork. The more deeply users embedded their work in Notion, the more valuable AI assistance became, because context compounds.</p><p>This is sugar-coated broccoli, version two. The original vision of letting anyone build software tools was hidden inside productivity features people already wanted. The AI vision, giving everyone a reasoning engine for their work, is now hidden inside the workspace they&#8217;ve already built. Users don&#8217;t adopt &#8220;AI.&#8221; They just notice that Notion got smarter.</p><p>The pattern repeats: <strong>depth created the surface area for the next capability to land.</strong></p><p>Most companies bolt AI onto shallow products and wonder why adoption stalls. Notion waited until users had built something worth augmenting. That patience is compounding again.</p><h2><strong>The real lesson: depth earns distribution</strong></h2><p>Notion&#8217;s story collapses into a single growth law:</p><p><strong>If the product doesn&#8217;t get deeper with use, distribution just buys churn.</strong></p><p>Templates, community, and word of mouth worked because users were already doing more with the product over time. Most builders invert this order. They chase reach before depth, then wonder why growth feels fragile. The Notion story is instructive. But stories don&#8217;t scale. Frameworks do.</p><div><hr></div><h2><strong>From story to system: the Usage Density Framework</strong></h2><p>The Notion story collapses into a single constraint:</p><blockquote><p><strong>Growth compounds only when usage deepens faster than reach expands.</strong></p></blockquote><p>Everything that follows exists to help you answer one question honestly:</p><p><strong>When a new user shows up, does the product become clearer or more fragile?</strong></p><p>The Usage Density Framework answers that question using two lenses:</p><ol><li><p>A <strong>diagnostic</strong> (what the system looks like under load)</p></li><li><p>A <strong>depth ladder</strong> (why some users become distribution and others become churn)</p></li></ol><h3>1.
Usage Density Diagnostic</h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!zArL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0ec5ba-80ff-49c5-beb4-c0b5ed2cef2d_1406x410.png" width="1406" height="410" alt="Usage Density Diagnostic table: core feature usage, sharing behavior, and time-to-value, from low density to high density"></figure></div><p>This table is a <strong>stress test</strong>.</p><p>Each row is asking the same underlying question in a different way:</p><blockquote><p><strong>When new users arrive, do they converge toward a clear way of using the product, or do they fragment into edge cases?</strong></p></blockquote><p>If any row is firmly in the <em>Low Density</em> column, scaling acquisition will increase fragility.</p><p>Let&#8217;s unpack what each dimension actually means in practice.</p><h4><strong>Core feature usage</strong></h4><p><strong>Fragmented</strong> usage looks like engagement. Users click around, touch many features, and explore the surface area. This often shows up as healthy DAU/MAU, but it&#8217;s a false signal. Fragmentation usually means users are uncertain about what the product is for.</p><p><strong>Converging</strong> usage means retained users increasingly rely on the same subset of functionality. This is the product teaching users its intended shape.</p><p><strong>Dominant</strong> usage means there is an obvious default mode of use. This becomes the product&#8217;s center of gravity.</p><p><strong>How to test this</strong></p><ul><li><p>Take your top 20% most active users</p></li><li><p>Look at which features they use weekly</p></li><li><p>If you can&#8217;t describe a default usage pattern in one sentence, you&#8217;re still fragmented</p></li></ul><p><strong>Notion example<br></strong>Retained users converged on wikis, docs, and project systems, not every block type.</p><h4><strong>Sharing behavior</strong></h4><p>This dimension distinguishes <strong>earned distribution</strong> from <strong>forced distribution</strong>. <strong>Incentivized</strong> sharing requires prompts, rewards, or nudges. It scales poorly and degrades trust. <strong>Occasional</strong> sharing happens when users see value but don&#8217;t yet rely on it. <strong>Natural</strong> sharing happens because the artifact itself is useful.
The user shares because it helps <em>them</em>, not you.</p><p><strong>How to test this</strong></p><ul><li><p>Look at shares that happen without prompts</p></li><li><p>Identify what&#8217;s being shared (pages, reports, templates)</p></li><li><p>Ask users why they shared, and if the answer is &#8220;because you asked,&#8221; it doesn&#8217;t count</p></li></ul><p><strong>Notion example<br></strong>Users shared templates because the template <em>was the value</em>.</p><h4><strong>Time-to-value</strong></h4><p>This row tells you how much imagination your product requires. <strong>Long</strong> time-to-value means users must configure, design, or understand before benefiting. This kills density. <strong>Improving</strong> means onboarding and defaults are doing some work. <strong>Immediate</strong> means success is obvious in the first session.</p><p><strong>How to test this</strong></p><ul><li><p>Measure time to first meaningful action (not signup)</p></li><li><p>Watch first-session recordings</p></li><li><p>Ask: <em>Could someone succeed without reading docs?</em></p></li></ul><p><strong>Rule<br></strong> If your product requires explanation before value, don&#8217;t scale acquisition.</p><h4><strong>How to interpret the diagnostic as a whole</strong></h4><p>You don&#8217;t need perfection across all rows.<br> You <em>do</em> need <strong>convergence</strong>.</p><blockquote><p><strong>Good growth makes users behave more similarly over time, not less.</strong></p></blockquote><p>If growth increases variance, confusion, or support load, you&#8217;re scaling uncertainty.</p><h3>2. The Usage Depth Ladder</h3><p>The diagnostic tells you <strong>whether</strong> depth exists. The ladder explains <strong>why some users become distribution and others become churn</strong>.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!g41I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F543563c9-a3bb-4228-9ff1-de2ff334285a_1210x292.png" width="1210" height="292" alt="Usage Depth Ladder: Adopt, Embed, Export"></figure></div><h4><strong>Layer 1: Adopt</strong></h4><p><strong>What it means<br></strong>The user starts from something that already works. This is where templates, defaults, and opinionated setups matter.</p><p><strong>Failure mode<br></strong>Users start from a blank state and freeze.</p><p><strong>What to measure</strong></p><ul><li><p>% of users who start from a default/template</p></li><li><p>% who complete setup in first session</p></li></ul><p>If users don&#8217;t adopt a working system, everything downstream fails.</p><h4><strong>Layer 2: Embed</strong></h4><p><strong>What it means<br></strong>The product becomes part of a recurring routine. Weekly planning. Daily notes. Team updates.</p><p><strong>Failure mode<br></strong>Usage depends on reminders, novelty, or nudges.</p><p><strong>What to measure</strong></p><ul><li><p>Weekly active usage</p></li><li><p>Regular cadence actions</p></li><li><p>Drop-off when notifications stop</p></li></ul><p><strong>Rule<br></strong> If usage collapses without reminders, you don&#8217;t have embedding.</p><h4><strong>Layer 3: Export</strong></h4><p><strong>What it means<br></strong>The user creates something worth sharing. This is where depth turns into distribution.</p><p><strong>Failure mode<br></strong>Sharing requires incentives or explicit prompts.</p><p><strong>What to measure</strong></p><ul><li><p>Shares per retained user</p></li><li><p>Invites without referral rewards</p></li><li><p>Artifact reuse by others</p></li></ul><p><strong>Key insight<br></strong>Export is not virality.
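It&#8217;s <em>proof of depth</em>.</p><p>As a minimal sketch of how the ladder might be instrumented, here is one way to reduce each signup cohort to three layer rates. The event names (<code>adopted_template</code>, <code>weekly_routine</code>, <code>unprompted_share</code>) are invented for illustration, not a real analytics schema:</p><pre><code class="language-python">from collections import defaultdict

# Hypothetical ladder events, one per layer: Adopt, Embed, Export.
LADDER = ["adopted_template", "weekly_routine", "unprompted_share"]

def ladder_rates(users):
    """Share of each cohort that reached each layer of the depth ladder."""
    reached = defaultdict(lambda: defaultdict(int))
    sizes = defaultdict(int)
    for u in users:
        sizes[u["cohort"]] += 1
        for layer in LADDER:
            if layer in u["events"]:
                reached[u["cohort"]][layer] += 1
    return {
        cohort: {layer: reached[cohort][layer] / n for layer in LADDER}
        for cohort, n in sizes.items()
    }

users = [
    {"cohort": "Q1", "events": {"adopted_template", "weekly_routine", "unprompted_share"}},
    {"cohort": "Q1", "events": {"adopted_template"}},
    {"cohort": "Q2", "events": {"adopted_template", "weekly_routine"}},
    {"cohort": "Q2", "events": set()},
]
print(ladder_rates(users))
# {'Q1': {'adopted_template': 1.0, 'weekly_routine': 0.5, 'unprompted_share': 0.5},
#  'Q2': {'adopted_template': 0.5, 'weekly_routine': 0.5, 'unprompted_share': 0.0}}
</code></pre><p>If a newer cohort&#8217;s rates sit below an older cohort&#8217;s at any layer, that is the signal the next rule is built on.</p>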
<h2><strong>The operating rule (this is the whole point)</strong></h2><p>Here is the rule that ties both tables together:</p><blockquote><p><strong>You are allowed to scale acquisition only if newer cohorts climb the ladder more reliably than older ones.</strong></p></blockquote><p>If newer cohorts:</p><ul><li><p>adopt less</p></li><li><p>embed less</p></li><li><p>export less</p></li></ul><p>then growth is degrading the system.</p><p>Stop. Fix depth. Resume later.</p><h2><strong>Why this matters</strong></h2><p>Most teams use dashboards to justify growth. These tables exist to <strong>prevent premature growth</strong>.</p><p>They force you to answer one uncomfortable question honestly:</p><blockquote><p><strong>Will more users make this product clearer, or will they expose what&#8217;s still broken?</strong></p></blockquote><p>Notion waited until the answer was &#8220;clearer.&#8221;</p><p>That patience is why their growth compounded.</p><h5>References</h5><h6>[1]: Figma Blog, &#8220;Design on a deadline: How Notion pulled itself back from the brink of failure,&#8221; March 2019. https://www.figma.com/blog/design-on-a-deadline-how-notion-pulled-itself-back-from-the-brink-of-failure/</h6><h6>[2]: Kitrum, &#8220;The Phenomenal Journey of Ivan Zhao, Notion&#8217;s Founder,&#8221; August 2025. https://kitrum.com/blog/the-phenomenal-journey-of-ivan-zhao-notions-founder/</h6><h6>[3]: Sequoia Capital, &#8220;Notion CEO Ivan Zhao: Augmenting Human Intellect,&#8221; April 2024. https://sequoiacap.com/article/notion-spotlight/</h6><h6>[4]: Sequoia Capital, ibid. (&#8220;a rented two-story house so small that only a traditional Shoji screen separated their bedrooms&#8221;)</h6><h6>[5]: Figma Blog, ibid.</h6><h6>[6]: Figma Blog, ibid. (&#8220;He suddenly popped to the top of our most active user list&#8212;spending upwards of 18+ hours a day in our design tool.&#8221;)</h6><h6>[7]: DigidAI, &#8220;Ivan Zhao: Notion&#8217;s $10B Productivity Vision,&#8221; November 2025. https://digidai.github.io/2025/11/23/ivan-zhao-notion-ceo-augmenting-human-intellect-deep-dive/</h6><h6>[8]: Sequoia Capital, ibid. (&#8220;And he has a bazillion Twitter followers&#8221;)</h6><h6>[9]: The Software Report, &#8220;The Kyoto Reboot: How Ivan Zhao Rebuilt Notion,&#8221; 2025. https://www.thesoftwarereport.com/the-kyoto-reboot-how-ivan-zhao-rebuilt-notion/</h6><h6>[10]: Notion Blog, &#8220;A brand new gallery for Notion templates and community creations,&#8221; November 2021. https://www.notion.com/blog/new-template-gallery</h6><h6>[11]: Simple.ink, &#8220;Notion Statistics: Growth, Revenue &amp; More,&#8221; 2024.
https://www.simple.ink/blog/notion-stats</h6><h6>[12]: Ness Labs, &#8220;Building the world&#8217;s most customizable workspace with Ivan Zhao,&#8221; April 2023. https://nesslabs.com/notion-featured-tool</h6><h6>[13]: Sequoia Capital, ibid. (&#8220;They found a small group of really die-hard fans that carried the product with them.&#8221;)</h6><h6>[14]: Ness Labs, ibid. (&#8220;One of our ambassadors runs our subreddit with nearly 100,000 subscribers. Another runs our Facebook Group in Korea with over 20,000 folks. Another has an Arabic community with over 30,000 folks.&#8221;)</h6><h6>[15]: Nira, &#8220;How Ivan Zhao&#8217;s Notion Is Going After Atlassian and Why It Just Might Win,&#8221; March 2020. https://nira.com/notion-history/</h6><h6>[16]: The Launcher, &#8220;Notion - Reaching 1 million users with 18 employees,&#8221; May 2021. https://thelauncher.substack.com/p/notion-reaching-1-million-users-with</h6><h6>[17]: Lenny&#8217;s Podcast, &#8220;Notion&#8217;s lost years, its near collapse during Covid,&#8221; March 2025. https://www.lennysnewsletter.com/p/inside-notion-ivan-zhao (transcript: &#8220;there&#8217;s a Doomsday Clock that when we&#8217;re going to truly run out of this space to store everything in Notion&#8221;)</h6><h6>[18]: Lenny&#8217;s Podcast, ibid.</h6><h6>[19]: Notion Blog, &#8220;Introducing Notion AI,&#8221; February 2023.<a href="https://www.notion.com/releases/2023-02-22"> https://www.notion.com/releases/2023-02-22</a></h6><h6>[20]: Kipwise, &#8220;Notion AI Features &amp; Capabilities,&#8221; 2025.<a href="https://kipwise.com/blog/notion-ai-features-capabilities"> https://kipwise.com/blog/notion-ai-features-capabilities</a> (&#8220;Unlike standalone AI tools that require constant switching between applications, Notion AI understands your page structures, database relationships, and project contexts&#8221;)</h6><h6>[21]: PodCosmos, &#8220;How Notion Reimagined Productivity Tools | Ivan Zhao,&#8221; August 2025.<a href="https://www.podcosmos.com/kleiner-perkins/grit-podcast/how-notion-reimagined-productivity-tools-ivan-zhao"> https://www.podcosmos.com/kleiner-perkins/grit-podcast/how-notion-reimagined-productivity-tools-ivan-zhao</a> (&#8220;Consolidate Before You Automate - Build integrated tool foundations before attempting AI agent implementation&#8221;)</h6>]]></content:encoded></item><item><title><![CDATA[Enhancing Trust in AI Systems]]></title><description><![CDATA[Trust has quietly become the real bottleneck in AI adoption.]]></description><link>https://www.builderlab.ai/p/enhancing-trust-in-ai-systems</link><guid isPermaLink="false">https://www.builderlab.ai/p/enhancing-trust-in-ai-systems</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sun, 19 Oct 2025 09:25:13 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/83cee87f-2667-47ba-ac3e-089a879a9a46_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Trust has quietly become the real bottleneck in AI adoption. Not compute. Not model size. Not latency. Trust.</p><p>The last two years have shown what AI can do. What comes next will depend on whether people believe it will do <em>what they expect it to do</em>, and nothing more. We&#8217;ve reached the stage where capability outpaces confidence.
AI systems can reason, generate, and act, but too often they behave like overeager interns: smart, unpredictable, and hard to supervise.</p><p>If you&#8217;re building products with AI today, your biggest challenge isn&#8217;t getting better models. It&#8217;s <strong>engineering trust</strong> by design.</p><div><hr></div><h3><strong>The Real Trust Gap</strong></h3><p>When people say they don&#8217;t trust AI, they&#8217;re rarely questioning the math. They&#8217;re questioning <em>the behavior</em>.</p><p>This erratic behavior manifests in strange ways: a model writes perfect code in one moment, then refuses to recognize its own syntax five minutes later. A chatbot answers 100 customer queries flawlessly, then invents policy violations on the 101st. Or the ranking algorithm learns too aggressively, pushing clickbait and driving short-term engagement.</p><p>Technically, these systems are &#8220;working.&#8221; Statistically, their aggregate metrics look fine. Psychologically, they&#8217;ve failed.</p><p>In classical software, you can earn trust once and coast on it for years. With AI, you have to re-earn it daily. AI is probabilistic at its core. The same input doesn&#8217;t always yield the same output. To AI-native engineers, that&#8217;s normal variation. To users, it feels like instability.</p><p>One early GenAI customer-support pilot showed this perfectly. The bot handled 85 percent of questions without issue, but the remaining 15 percent were catastrophic contradictions: apologizing for errors that hadn&#8217;t happened, offering refunds that didn&#8217;t exist. The result wasn&#8217;t 85 percent satisfaction. It was <em>far lower</em>, because one unpredictable failure erases the memory of many reliable moments.</p><p>The uncomfortable truth is that <em>trust is not proportional to capability</em>. Humans trust cars more than planes, even though planes are statistically safer, because they &#8216;feel&#8217; more in control in a car. The same principle applies to AI: people are more comfortable with a less capable system that they control than with a powerful one that they don&#8217;t.</p><div><hr></div><h3><strong>What Trust Actually Looks Like</strong></h3><p>Trust in AI is built through <em>calibration</em>: the alignment between what the user expects and what the system actually does. A self-driving car doesn&#8217;t need the driver to marvel at its intelligence. It needs to react in expected ways when it sees a stop sign or a child on the street. The same is true for AI systems.</p><p><strong>Predictable beats perfect</strong>: <em>Predictability creates permission.</em> When users can anticipate the next step, they relax into the system&#8217;s rhythm.</p><p><strong>Consistent interaction patterns: </strong>The best AI interfaces teach users what to expect through repetition. They turn stochastic behavior into a stable mental model:<br>&#8220;Click here to get a different angle.&#8221;<br>&#8220;See sources here to verify.&#8221;<br>&#8220;Expect four variations.&#8221;</p><p><strong>Explain enough, not everything: </strong>Another signal of trustworthy systems: <em>selective transparency.<br></em>Anthropic&#8217;s Claude often provides a short rationale (&#8220;Here&#8217;s why I think this answer fits&#8221;) instead of an overwhelming technical trace. That small explanation helps the user calibrate confidence without drowning them in details.</p><p>Too much explanation breeds fatigue; too little breeds suspicion.
The art lies in showing just enough reasoning to make the system <em>feel steady</em>.</p><div><hr></div><h3><strong>Three Layers of Trust</strong></h3><p>When you look at systems people truly rely on - air traffic control, autopilot, or payment infrastructure - they all earn trust in layers. AI products should do the same.</p><h4>1. Predictability</h4><p>Users should be able to anticipate what happens next. The model should behave the same way for the same inputs, and make its limits clear when it can&#8217;t.</p><p>In GenAI, unpredictability usually comes from <em>context drift</em>, or small variations in prompts or hidden memory states that change output meaningfully. This is why even capable systems lose user confidence: the same request feels &#8220;off&#8221; on a different day.</p><p>The best builders fix this not by chasing perfection, but by making behavior consistent.</p><ul><li><p><strong>ChatGPT&#8217;s &#8220;Custom Instructions&#8221;</strong> feature lets users explicitly define tone and context, anchoring the model&#8217;s behavior across sessions.</p></li><li><p><strong>Midjourney&#8217;s numbered variation grid</strong> teaches users that creative exploration has boundaries. The randomness becomes <em>structured discovery</em>, not chaos.</p></li></ul><p>Predictability gives users a stable mental model. They stop testing the system and start using it.</p><h4>2. Reversibility</h4><p>People take more risks when they know they can recover.<br>A reversible system is one that says: &#8220;Try this. If it fails, you can go back.&#8221;</p><p>In GenAI, that usually means <strong>containing the blast radius</strong> of automation.<br>GitHub Copilot doesn&#8217;t push to main. Rather, it suggests code inline, where the human remains the gatekeeper. Notion AI and Canva&#8217;s &#8220;Undo&#8221; and &#8220;Restore&#8221; options let users roll back any generated content.</p><p>When autonomy is paired with reversibility, users shift from fear to experimentation. They start asking, <em>What can this system do for me?</em> instead of <em>What could it mess up?</em></p><h4>3.
Proof of Competence</h4><p>Trust comes from <em>small proofs</em>.</p><p>Before an AI system asks for full autonomy, it needs to earn credibility in low-stakes contexts.<br>Copilot earned developer trust by finishing lines, not writing files.<br>Notion AI started with rewriting and summarizing text, not running entire workflows.<br>Both systems built competence in narrow tasks, then expanded scope once reliability was visible.</p><p>Each visible success reinforces the belief that the system <em>knows its job</em>.</p><h4>Stacking the layers</h4><p>Predictability builds confidence.<br>Reversibility builds courage.<br>Proof of competence builds commitment.</p><p>Together, these layers create the scaffolding that lets intelligence scale without anxiety.</p><p>When users stop wondering <em>if</em> the AI will work and start focusing on <em>how best to use it</em>, trust has been achieved.</p><div><hr></div><h3><strong>Principles for Building Trust in AI Systems</strong></h3><p>Trust emerges from design. These principles translate directly into product choices that shape how people experience AI reliability day to day:</p><h4>1. Make behavior visible</h4><p>Users trust what they can see. Every action an AI takes, for example, what data it used, why it acted, and what it produced, should be observable. The goal isn&#8217;t radical transparency though; it&#8217;s <em>relevant transparency</em>.</p><p><strong>Example:</strong> Perplexity AI&#8217;s citation cards let users trace every claim back to a source. That single UX choice converts what would have been a &#8220;trust me&#8221; response into an evidence-based one.</p><p><strong>Insight:</strong> Visibility turns a black box into a glass box. When users can peek inside, even briefly, uncertainty drops and confidence rises.</p><h4>2. Build for reversibility</h4><p>Confidence grows when people know they can undo. Every AI action should carry an implied &#8220;escape hatch.&#8221; <br><strong>Example:</strong> In Notion AI, every edit has an immediate &#8220;Undo&#8221; state. It encourages experimentation because users know they can roll back. The same principle underpins GitHub Copilot&#8217;s inline suggestions: nothing executes without human review.</p><p><strong>Insight:</strong> Reversibility transforms fear into curiosity. When users can recover, they engage more deeply.</p><h4>3. Keep explanations human</h4><p>Clarity beats completeness.<br>Good explanations don&#8217;t unpack neural math; they help users decide what to do next. The test of a useful explanation isn&#8217;t accuracy, it&#8217;s <em>actionability</em>.</p><p><strong>Example:</strong> Anthropic&#8217;s Claude often prefaces responses with brief context (&#8220;Here&#8217;s how I approached this&#8230;&#8221;). It signals reasoning without dragging users into technical weeds.</p><p><strong>Insight:</strong> The point isn&#8217;t to prove the model&#8217;s intelligence. It&#8217;s to make its logic legible in human terms.</p><h4>4. Design safety margins</h4><p>A good system knows when to stop.<br>Overconfidence kills trust faster than errors do. When an AI isn&#8217;t sure, it should defer, clarify, or escalate.</p><p><strong>Example:</strong> Copilot for Microsoft 365 regularly prompts, &#8220;Would you like me to continue?&#8221;, a small checkpoint that prevents runaway generation. OpenAI&#8217;s function-calling APIs similarly force boundaries between thinking and doing.</p><p><strong>Insight:</strong> Predictable failure beats unpredictable success.</p><h4>5. 
Show a consistent track record</h4><p>Reliability compounds.<br>A system that behaves predictably builds more trust than one that dazzles sporadically. Users remember consistency far longer than brilliance.</p><p><strong>Example:</strong> ChatGPT&#8217;s monthly updates quietly improved memory, formatting, and speed without breaking prior behavior. That rhythm of steady improvement signals maturity, not volatility.</p><p><strong>Insight:</strong> Every stable interaction becomes a micro-deposit in the user&#8217;s trust account. Over time, those deposits turn into loyalty.</p><div><hr></div><h3><strong>The Economics of Trust</strong></h3><p>Trust is an economic principle. When users trust a system, they delegate more, churn less, and experiment faster. When they don&#8217;t, everything slows down: adoption, retention, and even internal confidence within the team.</p><p><strong>Predictability drives speed: </strong>Teams that invest in predictability ship faster.<br>Because when the system behaves consistently, engineers spend less time firefighting edge cases, and product managers spend less time negotiating stakeholder anxiety.</p><p>Predictable systems scale faster inside organizations because every function&#8212;legal, design, marketing&#8212;knows what to expect.</p><p><strong>Reversibility increases retention: </strong>Users stay longer when they can recover from mistakes. Reversibility changes how people <em>feel</em> about using AI: from cautious testing to active trust. In AI products, the ability to reverse is the permission to adopt.</p><p><strong>Reliability accelerates enterprise adoption: </strong>Enterprises don&#8217;t buy &#8220;creativity.&#8221; They buy <em>predictable ROI. </em>A model that is slightly less capable but measurably more stable will always win procurement battles. Reliability reduces perceived implementation risk, which shortens sales cycles. Reliability converts hesitation into contracts. The more visible your guardrails, the faster trust turns into revenue.</p><div><hr></div><h3><strong>Building Trust Is Building Product</strong></h3><p>In traditional software, trust is implicit. Users click a button and assume the system will do what it says. That&#8217;s because software has always lived in a world of <em>determinism</em>: same input, same output, no surprises.</p><p>AI breaks that contract.</p><p>Each new layer of autonomy - every agent that acts, every model that remembers, every chain that calls another API - reopens the negotiation between user and system. &#8220;Do I still know what this thing will do next?&#8221; becomes an everyday question.</p><p>This means that <strong>trust isn&#8217;t an afterthought in AI products; it </strong><em><strong>is</strong></em><strong> the product</strong>.</p><p>In deterministic software, you design <em>features</em>. In AI systems, you design <em>behavior</em>. Every prompt template, fallback rule, and safety margin is part of the product&#8217;s personality. And users form relationships with that personality, not the underlying model weights. When a chatbot calmly admits it doesn&#8217;t know an answer instead of hallucinating, that&#8217;s a product decision. When an image generator previews before committing compute, that&#8217;s a trust decision. When an agent asks for confirmation before executing a purchase, that&#8217;s a product <em>and</em> trust decision. 
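These moments shape the emotional contract between user and system.</p><p>As a toy sketch of that kind of moment in code, consider an agent that gates its own actions on confidence and reversibility. The thresholds, action names, and policy below are invented for illustration, not drawn from any real product:</p><pre><code class="language-python"># Illustrative action gate: one small slice of a trust stack.
# Thresholds and labels are assumptions, not a reference design.

def gate(action: str, confidence: float, reversible: bool) -> str:
    """Decide how an agent may proceed with a proposed action."""
    if confidence >= 0.95:
        return "do_and_log"         # high confidence: act, leave an audit trail
    if confidence >= 0.50 and reversible:
        return "do_with_undo"       # act, but keep the escape hatch visible
    if confidence >= 0.50:
        return "confirm_with_user"  # irreversible and uncertain: checkpoint first
    return "defer"                  # admit uncertainty; clarify or escalate

print(gate("issue_refund", confidence=0.90, reversible=False))  # confirm_with_user
print(gate("draft_reply", confidence=0.70, reversible=True))    # do_with_undo
</code></pre>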
<p>Every log, every preview, every &#8220;undo&#8221; button is part of that architecture. So are things users never see: audit trails, confidence thresholds, temperature settings, moderation layers. Together, they form what might be called a <strong>trust stack</strong> - the invisible scaffolding that lets intelligence operate safely at scale. OpenAI&#8217;s memory control panel, Anthropic&#8217;s constitutional layers, and Apple&#8217;s on-device intelligence boundaries are all versions of this same idea: transparency, containment, reversibility.<br>They differ in implementation, but not in philosophy.</p><p>Trust is a habit you earn through repetition. Every small proof of competence - a correct response, a graceful fallback, a transparent explanation - is a deposit in the trust account. Over time, those deposits accumulate into something powerful: <em>delegation</em>. Once users delegate, they stop questioning the system&#8217;s right to act.</p><h3><strong>Closing Thought</strong></h3><p>The future of AI will be won by the most <em>trusted</em> systems. Intelligence gets you in the door. But trust is what turns pilots into production, skeptics into advocates, and one-time users into people who can&#8217;t imagine working without your system.</p><p>The uncomfortable reality is that we&#8217;ve been optimizing for the wrong thing. We&#8217;ve been in an arms race for capability when the real constraint is confidence. The breakthrough products of the next decade won&#8217;t be the ones that can do more; they&#8217;ll be the ones people are willing to let do more.</p><p>That&#8217;s the unlock trust provides: permission to delegate. Delegation is how AI actually changes work. Everything else is just a demo.</p><p><strong>The systems that admit uncertainty will earn more trust than the ones that fake omniscience. </strong>We&#8217;ve spent two years teaching AI to be confident. The next two years will be about teaching it to be honest about its limits, its uncertainty, its reasoning. Not because transparency is virtuous, but because it&#8217;s the only way to build systems people can actually rely on.</p><p>Intelligence is abundant now.
What&#8217;s scarce is the courage to constrain it.</p><p>The builders who understand this, those who design for predictability over performance, who build undo buttons before building autonomy, who earn trust through small proofs rather than big promises, won&#8217;t just build better products.</p><p>They will define how the world learns to work with machines.</p>]]></content:encoded></item><item><title><![CDATA[User Memory as a Product Delighter]]></title><description><![CDATA[Remembering (and Forgetting) What Matters]]></description><link>https://www.builderlab.ai/p/user-memory-as-a-product-delighter</link><guid isPermaLink="false">https://www.builderlab.ai/p/user-memory-as-a-product-delighter</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sat, 04 Oct 2025 20:00:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e613b14b-4735-4c0a-b133-7469ce8d7a39_944x944.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Amazon attributes <strong>$100B+ in revenue</strong> to personalization. Spotify users discover <strong>1.9B new tracks every day</strong> through recommendations. TikTok&#8217;s <strong>1B+ active users</strong> swear by their personalized feed. What makes it feel magical? User memory. When memory works, products feel alive, perfectly attuned to you.</p><p>Long before GenAI, this &#8220;memory&#8221; lived in click streams, purchase histories, engagement trails, cancellations, likes, and countless other signals. These data points still drive some of the most powerful algorithms we have. Ranking and recommendation systems remain the beating heart of e-commerce and social, and will be for a while.</p><p>What&#8217;s changed is the arrival of conversational AI. Memory has found a voice. Free-form dialogue has unlocked a new feature factory, producing richer, more nuanced signals about customers than we&#8217;ve ever had before. And for the first time, users themselves can step in. They can say what to remember, what to forget, what to add, remove, or modify.</p><p>That shift makes memory visible. It&#8217;s interactive. It&#8217;s user-driven. And when done well, it feels alive. Stateful, elegant, powerful.</p><p>So the builder&#8217;s question becomes:<em><strong> how do you design memory not just as infrastructure, but as a source of delight?</strong></em><strong><br></strong></p><div><hr></div><h2><strong>Memory isn&#8217;t one thing</strong></h2><p>It&#8217;s tempting to talk about memory as if it were a single bucket: data goes in, user profile comes out. In practice, memory behaves more like layers with purpose, failure modes, and design choices.</p><p><strong>1. Episodic (in-session, short-lived) memory: </strong>The cart you just filled, the search you ran five minutes ago, the episode you left half-watched, or the conversation you&#8217;re still having. Episodic memory makes the product feel present, like it&#8217;s paying attention right now. If you stretch it too far, it becomes irritating. Think of a travel site still showing you flights long after you&#8217;ve booked them. That&#8217;s episodic memory clinging on when it should have let go.</p><p><strong>2. Medium term (weeks to months, patterns): </strong>Seasonal trips, listening habits, food orders that repeat just often enough to feel like a ritual. Medium-term memory makes personalization feel contextual. Done well, it fades gracefully, giving more weight to what&#8217;s recent while not erasing the past.
Done poorly, it becomes stale, trapping users in a loop of yesterday&#8217;s choices.</p><p><strong>3. Long term (identity anchors): </strong>Dietary preferences, languages, loyalty tiers, home cities. They don&#8217;t change often, and when they do, they need to be updated with care. Long-term memory builds trust when it&#8217;s right, and breaks it when it&#8217;s wrong. A system that keeps treating someone as an omnivore after they&#8217;ve changed diets doesn&#8217;t just feel outdated; it feels disrespectful.</p><p>Every layer has its own role. Episodic makes a product feel responsive. Medium-term makes it feel contextual. Long-term makes it feel trustworthy. Together, they form the backbone of personalization. Treat them as one, and you risk breaking the experience. Treat them as layers, and you can design memory that feels alive.</p><div><hr></div><h2><strong>How to think about memory architectures</strong></h2><p>If memory is layered, the architecture has to reflect it. It&#8217;s not enough to collect signals; you need a model for where they live, who can use them, and when they should move or fade.</p><p><strong>Storage:</strong> Episodic memory belongs in fast, short-lived stores. Medium-term memory sits in feature stores, event streams, or embeddings that decay. Long-term anchors go into profile systems that act like contracts: typed, versioned, auditable.</p><p><strong>Access:</strong> Decide who can read and write each layer. Episodic signals may feed a ranking model but shouldn&#8217;t leak beyond the session. Medium-term patterns can power recommendations or filters. Long-term anchors demand explicit user control and visibility.</p><p><strong>Transfer:</strong> Memory should move with intent. Episodic signals that repeat can become medium-term patterns. Medium-term habits that persist can graduate into long-term anchors. Just as important, old signals must decay or be pruned. Forgetting is part of the design.</p><p><strong>Governance:</strong> Metadata, lineage, and &#8220;why shown&#8221; traces make memory explainable. Consent and retention rules make it trustworthy. Without them, memory may work technically, but it won&#8217;t build confidence.</p><p>Signals flow, fade, or graduate with purpose. The system explains itself simply, and trust is built in from the start.</p><div><hr></div><h2><strong>When to remember, when to forget</strong></h2><p>Users&#8217; preferences, tastes, and wants evolve over time, sometimes even from session to session. Memory needs to evolve with this change. The design challenge is knowing when to hold on, when to promote, and when to clear the slate.</p><p><strong>Decay</strong>: Medium-term signals should fade over time so fresh behaviour outweighs stale history. Last night&#8217;s pizza order should dominate today&#8217;s suggestions, but it shouldn&#8217;t define them a week from now.</p><p><strong>Conflict resolution</strong>: When facts clash, the system needs deterministic rules. Recency can take precedence, or source authority, or versioning. What matters is consistency: the system should never be uncertain about which fact wins.</p><p><strong>Event-driven forgetting</strong>: Once a purchase is complete, the cart should clear. Once a trip is booked, stop pushing the same flights and shift to hotels or activities. Once a refund is processed, remove the grievance from active memory. These resets prevent memory from becoming noise.</p>
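<p>To make decay, deterministic conflict resolution, and event-driven resets concrete, here is a minimal sketch in Python. The half-life, the <code>authority</code> field, and the event names are illustrative assumptions, not a prescription:</p><pre><code>import time

HALF_LIFE_DAYS = 14  # assumption: tune per product and signal type

def decayed_weight(signal_ts, now=None):
    """Exponential decay so fresh behaviour outweighs stale history."""
    now = now or time.time()
    age_days = (now - signal_ts) / 86400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def resolve_conflict(facts):
    """Deterministic rule: newest fact wins, ties broken by source authority."""
    return max(facts, key=lambda f: (f["updated_at"], f["authority"]))

def on_event(memory, event):
    """Event-driven forgetting: a completed intent clears its episodic state."""
    if event["type"] == "purchase_completed":
        memory["episodic"].pop("cart", None)
    elif event["type"] == "trip_booked":
        memory["episodic"].pop("flight_searches", None)
</code></pre><p>The point is not these specific rules but that every layer has an explicit policy: nothing fades, wins, or resets by accident.</p>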
<p><strong>Explicit control</strong>: Small gestures like &#8220;forget this,&#8221; &#8220;don&#8217;t recommend again,&#8221; &#8220;dismiss,&#8221; or &#8220;clear history&#8221; build trust. Preference management for long-term anchors signals respect for identity.</p><p><strong>Layer transfers</strong>: Repeated episodic signals can be promoted into medium-term patterns. Medium-term habits that prove stable over months can become long-term anchors. Sensitive or one-off events should never be promoted. Cool-off windows can prevent recency noise from hardening into identity too quickly.</p><div><hr></div><h2><strong>User interaction with memory (conventional UX + conversational)</strong></h2><p>Memory should be mostly invisible, but offer graceful points of control when users want them.</p><p>Some of those controls are familiar. <strong>Feedback loops</strong> like thumbs up, thumbs down, star ratings, or a quick &#8220;not relevant&#8221; let people steer recommendations without heavy effort. <strong>Dismiss and skip</strong> options like &#8220;not now&#8221; or &#8220;don&#8217;t show again&#8221; give users a lightweight way to prune memory in the flow of use.</p><p>Others are about choice and transparency. <strong>Filters and controls</strong> let people actively shape medium-term memory: genres in streaming, cuisines in food apps, price or dietary tags in e-commerce. <strong>Consent and confirmation</strong> are essential for long-term anchors. An opt-in at the start, and an occasional nudge later: &#8220;Still vegetarian?&#8221; &#8220;Still want us to track direct flights?&#8221;</p><p>Sometimes people want a clean slate. <strong>Reset options</strong> like &#8220;clear history,&#8221; &#8220;reset recommendations,&#8221; or &#8220;start fresh&#8221; may be rarely used, but their presence signals respect and control. Paired with <strong>explainability</strong>, for example with simple reasons like &#8220;Because you watched&#8230;&#8221; or &#8220;Based on your last 3 orders&#8230;&#8221;, they make memory feel less like surveillance and more like a partnership.</p><p>In conversational systems, you can go further. Lightweight <strong>hooks</strong> like &#8220;remember this&#8221; or &#8220;forget that&#8221; let users treat memory as part of the dialogue itself, as in the sketch below.</p><p>Balance matters. Too much user management, and personalization becomes work. Too little, and memory feels opaque or even manipulative. The sweet spot is memory that mostly stays out of sight, but surfaces just enough for users to shape it when they want.</p>
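<p>A minimal sketch of such a conversational hook, assuming a simple list-of-facts store and naive string matching; a production system would use real intent detection and a confirmation step:</p><pre><code>def handle_memory_hook(user_msg, memory):
    """Route explicit remember/forget requests; returns None if not a hook."""
    text = user_msg.strip()
    if text.lower().startswith("remember "):
        fact = text[len("remember "):]
        memory["long_term"].append({"fact": fact, "source": "user"})
        return f"Got it. I'll remember: {fact}"
    if text.lower().startswith("forget "):
        topic = text[len("forget "):]
        memory["long_term"] = [
            f for f in memory["long_term"] if topic.lower() not in f["fact"].lower()
        ]
        return f"Done. I've forgotten anything about {topic}."
    return None  # not a memory command; continue normal dialogue handling
</code></pre>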
<div><hr></div><h2><strong>Metrics, experiments, and QA</strong></h2><p>If memory is going to be a first-class product lever, it needs to be measured like one. That means moving beyond engagement spikes and looking at how memory shapes both performance and trust.</p><p><strong>The usual product KPIs</strong>: click-through and engagement lift, conversion and repurchase rates, even measures of novelty versus repetition. Just as important are <em>negative signals</em>: complaint rates or drops in usage when personalization feels stale or clingy.</p><p><strong>Trust signals</strong>: How often do people use the memory controls you&#8217;ve provided? Do they clear history, update preferences, click on &#8220;why shown&#8221; explanations? Opt-in and opt-out rates are more than compliance artifacts. They&#8217;re direct indicators of whether users feel safe with your memory design.</p><p><strong>Quality diagnostics</strong>: A repetition index can flag when recommendations loop too tightly. A stale-rec rate shows how often expired signals are resurfacing. Post-completion spam, like pushing flights after booking, should be measured and eliminated. Conflict error rates, where the system holds contradictory facts, can be tracked and brought down.</p><p><strong>Experimentation</strong>: A/B testing decay schedules, promotion thresholds, or conflict resolution policies can reveal how subtle changes affect user experience. Explainability variants - different ways of telling users &#8220;why&#8221; something is shown - are also ripe for experimentation. For long-term anchors, running holdouts helps measure accuracy and trust over time.</p><p><strong>Rigorous testing</strong>: Unit tests for promotion and pruning logic. Synthetic edge cases that simulate rapid preference flips. Safety checks to ensure that sensitive attributes never get promoted where they don&#8217;t belong.</p><p>Memory is product behaviour. And like any behaviour, it needs KPIs, experiments, and QA baked into the process.</p><div><hr></div><h2><strong>Cases in the wild</strong></h2><p>There are products today that show memory done right.</p><p>In <strong>e-commerce</strong>, Amazon&#8217;s &#8220;Frequently bought together&#8221; is a smart use of medium-term memory. It doesn&#8217;t stop at the purchase; it pivots into the next logical step. The best platforms also remember stock status and policy versions so recommendations don&#8217;t feel out of sync.</p><p>In <strong>streaming</strong>, Spotify&#8217;s Discover Weekly is at its best when it balances novelty with anchors, expanding your taste while grounding it in what you&#8217;ve loved before. Netflix&#8217;s &#8220;Because you watched&#8230;&#8221; explanations, when clear, make recommendations feel less like a black box and more like a natural extension of your history.</p><p>In <strong>travel</strong>, loyalty systems are a bright spot. Airlines and OTAs that reliably remember your tier, preferred seat type, or favorite destinations build trust that compounds over years. These long-term anchors are exactly where memory should shine.</p><p>In <strong>productivity</strong>, small touches like Gmail surfacing &#8220;drafts you started&#8221; or Notion letting you pin active projects make episodic memory feel helpful, not intrusive. 
The best tools make it easy to focus on what&#8217;s current, while still letting you recover past context when you need it.</p><p>The gaps are just as visible. E-commerce platforms still recommend items already purchased, instead of pivoting to complements. Streaming apps often over-personalize, looping users endlessly in a single genre. Travel sites continue to surface the same flights even after booking, when the intent should reset immediately. And productivity apps too often drown you in irrelevant recall, surfacing projects that should have decayed long ago.</p><div><hr></div><h2><strong>&#129504; Summary: The Builder&#8217;s Memory Audit</strong></h2><ul><li><p>&#9989; Separate <strong>episodic, medium-term, and long-term</strong> memory.</p></li><li><p>&#9989; Design signals to <strong>decay, reset, or promote</strong> with intent (purchase clears cart, seasonal resets, stable patterns graduate).</p></li><li><p>&#9989; Resolve conflicts with a <strong>deterministic policy</strong> (recency, authority, versioning).</p></li><li><p>&#9989; Give users the ability to <strong>see, edit, reset, or confirm</strong> what&#8217;s remembered.</p></li><li><p>&#9989; Provide simple <strong>explanations</strong> (&#8220;Because you watched&#8230;&#8221;) so recommendations aren&#8217;t a black box.</p></li><li><p>&#9989; Make memory a <strong>delighter</strong>: stateful, fading elegantly, giving agency without extra work.</p></li><li><p>&#9989; Measure the <strong>right KPIs</strong>: novelty vs repetition, staleness, post-completion spam, conflict error rate, trust signals (use of controls, clear-history, opt-ins).</p></li></ul><div><hr></div><h2><strong>What&#8217;s next</strong></h2><p>My speculation: the next wave will stretch beyond today&#8217;s click streams and conversation logs.</p><p><strong>Multimodal</strong>: Text alone won&#8217;t be enough. Signals from images, voice, and location will be captured and stitched together.</p><p><strong>On-device</strong>: Users will expect faster, more private recall. Local storage, combined with federated updates, will give products the ability to personalize without shipping raw data back to the cloud. That shift will blur the line between device and service.</p><p><strong>Team-level</strong>: In collaborative tools, memory won&#8217;t stop at the individual. It will need to represent shared context like project histories, group decisions, and team preferences, with clear consent scopes. Who owns the memory becomes as important as what&#8217;s remembered.</p><p><strong>Policy-awareness</strong>: Region, age, and sensitivity rules will need to be baked into promotion logic. A one-off crisis, a protected attribute, or data from a restricted geography can&#8217;t be allowed to flow upward into long-term anchors. Governance won&#8217;t sit outside the architecture; it will live inside it.</p><p>All in all, memory will remain an evolving core system across products. It will shape user interactions, and be shaped by them. New paradigms for storage, context, and interaction will continue to augment and evolve our thinking.</p><p><em><strong>I&#8217;m curious: how do you manage memory in what you&#8217;re building? What thoughts does this evoke for you? 
Does all of this feel like an exciting design opportunity, or does it seem too complex to implement in practice?</strong></em></p>]]></content:encoded></item><item><title><![CDATA[Approximating a World Model - The Builder's Guide to Stateful AI]]></title><description><![CDATA[Yann LeCun is pushing the field towards world models.]]></description><link>https://www.builderlab.ai/p/approximating-a-world-model-the-builders</link><guid isPermaLink="false">https://www.builderlab.ai/p/approximating-a-world-model-the-builders</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sat, 27 Sep 2025 06:45:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5272c0b1-2201-4702-ad7c-7430042eba10_944x944.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Yann LeCun is pushing the field towards world models. And he is right: scaling LLMs alone will not get us to systems that understand, reason, and plan.</p><p>But most builders cannot afford to wait for JEPA v2 papers or Meta&#8217;s research breakthroughs. The question is practical: <strong>how do you approximate a world model today, with the tools already available?</strong></p><div><hr></div><h2><strong>What a World Model Really Is</strong></h2><p>Strip away the academic framing, and a world model is not mysterious. It is three ingredients:</p><ol><li><p><strong>State<br></strong> A structured representation of what the world looks like now.</p></li><li><p><strong>Transition<br></strong> A way to update that state when something changes, whether through new observations or actions.</p></li><li><p><strong>Planning<br></strong> The ability to project the state forward and evaluate possible actions against future outcomes.</p></li></ol><p>That is it. A toddler does this instinctively. Builders can do it with databases, rules, and lightweight reasoning loops. You don&#8217;t need a perfect latent predictor to get value. You just need to build the loop.</p>
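<p>A minimal skeleton of that loop, with a dictionary as state and plain functions as transition and planner. The task domain and the scoring rule are stand-ins; the shape is what matters:</p><pre><code># Minimal world-model loop: state, transition, planning. Illustrative only.
state = {"tasks": {"t1": "open", "t2": "open"}, "goal": "close_all"}

def transition(state, event):
    """Update the state when something changes (observation or action)."""
    if event["type"] == "task_done":
        state["tasks"][event["id"]] = "done"
    return state

def plan(state, candidate_actions):
    """Project each action forward and pick the one with the best outcome."""
    def score(action):
        projected = transition({"tasks": dict(state["tasks"])}, action)
        return sum(1 for s in projected["tasks"].values() if s == "done")
    return max(candidate_actions, key=score)

best = plan(state, [{"type": "task_done", "id": "t1"},
                    {"type": "task_done", "id": "t2"}])
</code></pre>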
<div><hr></div><h2><strong>Where Builders Go Wrong With LLMs</strong></h2><p>The default instinct today is to treat the LLM itself as the &#8220;world model.&#8221; Push more into the prompt, expand the context window, and hope the model &#8220;remembers&#8221; enough to simulate state.</p><p>This breaks in practice:</p><ul><li><p><strong>Memory</strong> becomes brittle. Context stuffing is expensive and inconsistent.</p></li><li><p><strong>State drift</strong> creeps in. The model hallucinates facts that no longer match reality.</p></li><li><p><strong>Actions</strong> lose grounding. The LLM proposes fluent answers that fail in a dynamic environment.</p></li></ul><p>A real-world example: task management agents that forget which tasks are complete, or booking bots that &#8220;assume&#8221; availability without checking the underlying system.</p><p>The fix is to separate <strong>language interface</strong> from <strong>world state</strong>.</p><div><hr></div><h2><strong>Approximating World Models With What You Have</strong></h2><p>Here are four practical ways to approximate a world model today:</p><p>&#128313; <strong>Entity and state memory<br></strong> <em>Keep a structured store of entities, attributes, and goals. It can be a SQL database, a key-value store, or a vector DB with typed embeddings. The key is persistence across sessions.</em></p><ul><li><p><strong>Customer support bot</strong>: Store user identity, open tickets, and previous resolutions. The system recalls a past complaint automatically instead of asking the customer to repeat it.</p></li><li><p><strong>Fitness app</strong>: Maintain state for weight, calorie intake, and workout history. When a user logs breakfast, the app updates remaining daily macros without recalculating from scratch.</p></li><li><p><strong>Travel planner</strong>: Remember destinations, dates, and budget goals across sessions. If the user returns a week later, the system still knows they were planning a 7-day trip to Spain under &#8364;2k.</p></li></ul><div><hr></div><p>&#128313; <strong>Simple transition functions<br></strong> <em>You don&#8217;t need deep learning here. Rule-based updates work. If a user confirms a booking, mark their state as &#8220;confirmed.&#8221; If an item disappears from the inventory feed, remove it from the map. These simple transitions bring stability.</em></p><ul><li><p><strong>E-commerce</strong>: When inventory hits zero, the system updates product state to &#8220;out of stock.&#8221; The bot no longer suggests it in recommendations.</p></li><li><p><strong>Banking app</strong>: When a transfer is executed, the balance is reduced immediately. The state reflects reality before the next sync with the core system.</p></li><li><p><strong>Learning app</strong>: When a student completes a module, mark the concept as &#8220;mastered&#8221; and unlock the next lesson.</p></li></ul><div><hr></div><p>&#128313; <strong>LLM-assisted reasoning<br></strong> <em>Use the LLM tactically. It can propose candidate actions, but always ground its reasoning against the state memory. Example: &#8220;Given the current state and constraints, which option satisfies the goal?&#8221; A sketch follows the examples below.</em></p><ul><li><p><strong>Logistics</strong>: The world model tracks packages in transit. The LLM reasons about re-routing: &#8220;Given a delayed flight and the customer&#8217;s deadline, what delivery route satisfies the goal?&#8221;</p></li><li><p><strong>Healthcare triage</strong>: The state model has symptoms logged. The LLM proposes next questions to narrow possibilities but only within the structured medical knowledge base.</p></li><li><p><strong>Education</strong>: State = student proficiency map. LLM reasons: &#8220;Given weak algebra and strong geometry, which exercise is the best next step?&#8221;</p></li></ul>
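<p>One way to wire the reasoning step, sketched below. <code>llm_propose</code> stands in for whatever model call you use; the point is that the LLM only proposes, and plain code validates against the real state before anything is committed:</p><pre><code>def next_action(state, llm_propose):
    """The LLM suggests candidates; we keep only those valid against state."""
    prompt = f"Given this state: {state}, propose up to 3 next actions."
    candidates = llm_propose(prompt)  # assumption: returns a list of action dicts
    valid = [a for a in candidates if is_valid(a, state)]
    return valid[0] if valid else {"type": "ask_clarification"}

def is_valid(action, state):
    """Ground-truth checks the LLM cannot be trusted to make on its own."""
    if action["type"] == "book":
        return state["inventory"].get(action["item"], 0) > 0
    return action["type"] in {"ask_clarification", "summarize"}
</code></pre>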
<div><hr></div><p>&#128313; <strong>Simulation by replay<br></strong> <em>Planning does not require generative video models. You can approximate it by rolling forward &#8220;what if&#8221; scenarios with your transition functions. Test possible actions, then choose the one that leads to the best future state.</em></p><ul><li><p><strong>Ride-hailing</strong>: Simulate assigning a driver to passenger A vs passenger B. Roll forward travel times and see which choice minimizes total wait across the system.</p></li><li><p><strong>Personal finance app</strong>: Simulate savings outcomes. If the user invests &#8364;500 a month vs &#8364;1,000, replay the balance forward 10 years and compare results.</p></li><li><p><strong>Warehouse robotics</strong>: Simulate picking orders in sequence A-B-C vs C-B-A. Replay travel distances and pick times, then choose the shortest plan.</p></li></ul><div><hr></div><h2><strong>The Builder&#8217;s Playbook</strong></h2><p>Think of this as a minimum viable world model:</p><p><strong>Step 1. Define your schema<br></strong> Decide what your world consists of. Users, items, locations, states, goals. Make it explicit.</p><p><strong>Step 2. Implement transitions<br></strong> Keep them transparent. Rules that map input to updated state are enough to start.</p><p><strong>Step 3. Insert the LLM as a reasoner<br></strong> Let the LLM interpret ambiguous inputs and propose candidate actions. But never let it rewrite the state directly.</p><p><strong>Step 4. Add a planner<br></strong> Use simple replay. Simulate a few possible next steps, score them, and pick the best.</p><p><strong>Step 5. Log surprises<br></strong> Whenever the real world diverges from the model, log it. Surprises become your training signal for improvement.</p><div><hr></div><h2><strong>Examples Across Domains</strong></h2><p>This approach works across verticals:</p><ul><li><p><strong>Customer support<br></strong> World model = user profile + issue history + open tickets. Transitions = updates from user replies or system events. Planning = suggest next best resolution.</p></li><li><p><strong>E-commerce logistics<br></strong> World model = order states + inventory + shipping nodes. Transitions = order placed, item picked, package scanned. Planning = choose optimal route.</p></li><li><p><strong>Health tracking<br></strong> World model = biometrics + habits + goals. Transitions = new measurements logged. Planning = adjust nutrition or workout plan.</p></li><li><p><strong>Education apps<br></strong> World model = learner progress + mastered concepts + weak areas. Transitions = quiz results, completed lessons. Planning = recommend next module.</p></li></ul><p>In all cases, the pattern is the same: <strong>state, transition, planning.</strong></p>
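<p>To ground the replay idea, here is the warehouse example reduced to a sketch. The distance matrix is a toy stand-in for your real transition model:</p><pre><code>from itertools import permutations

def route_length(sequence, distance):
    """Total travel for one pick order, using a distance lookup we assume exists."""
    legs = zip(("dock",) + sequence, sequence)
    return sum(distance[a][b] for a, b in legs)

def best_pick_order(items, distance):
    """Replay every ordering through the travel model and keep the shortest."""
    return min(permutations(items), key=lambda seq: route_length(seq, distance))

# Toy usage: compare A-B-C against every other sequence.
dist = {"dock": {"A": 2, "B": 5, "C": 9},
        "A": {"B": 3, "C": 6}, "B": {"A": 3, "C": 2}, "C": {"A": 6, "B": 2}}
print(best_pick_order(("A", "B", "C"), dist))
</code></pre>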
<div><hr></div><h2><strong>Why This Works</strong></h2><p>Approximating a world model gives you four practical benefits:</p><ul><li><p><strong>Persistence</strong>: users feel the system remembers them.</p></li><li><p><strong>Grounding</strong>: actions are tied to real states, not hallucinations.</p></li><li><p><strong>Adaptation</strong>: surprises trigger learning loops instead of silent failures.</p></li><li><p><strong>Separation of concerns</strong>: LLMs handle language and reasoning, while structured systems handle reality.</p></li></ul><p>This is enough to turn fragile prototypes into reliable products.</p><div><hr></div><h2><strong>The Strategic Lens</strong></h2><p>LeCun&#8217;s JEPA will eventually give us elegant latent predictors. But builders cannot wait. Approximating a world model with structured memory and simple transitions is enough to start.</p><p>The pattern is not research-lab-only. It is the same loop product teams already know from good system design: <strong>model the state, update the state, plan over the state.</strong></p><p>The opportunity is to <strong>treat this as product architecture,</strong> rather than just prompt engineering.</p><div><hr></div><h2><strong>Closing Thought</strong></h2><p>You do not need Meta&#8217;s stack to build a useful world model. You can build one today with databases, rules, and an LLM stitched in as a reasoner.</p><p>&#128073; What state schema, if it persisted and evolved over time, would make your product instantly smarter?</p>]]></content:encoded></item><item><title><![CDATA[Why LeCun is Betting on World Models (and Why Builders Should Pay Attention)]]></title><description><![CDATA[At Nvidia&#8217;s GTC 2025, Yann LeCun said something that cut through the noise:]]></description><link>https://www.builderlab.ai/p/why-lecun-is-betting-on-world-models</link><guid isPermaLink="false">https://www.builderlab.ai/p/why-lecun-is-betting-on-world-models</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sat, 20 Sep 2025 14:49:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2de77b06-ba5a-4031-9488-a9581a9dd4a3_944x944.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At Nvidia&#8217;s GTC 2025, Yann LeCun said something that cut through the noise:</p><blockquote><p><em>&#8220;I&#8217;m not so interested in LLMs anymore.&#8221;</em></p></blockquote><p><a href="https://www.youtube.com/watch?v=p1QXZHV4jkM&amp;list=WL&amp;index=8&amp;ab_channel=TheAIGRID">The AI GRID has a strong breakdown of his remarks</a>. The takeaway is clear: LLMs have reached the point of diminishing returns. 
Scaling them further is useful, but it will not unlock the next stage of intelligence.</p><div><hr></div><h2>The Four Frontiers</h2><p>LeCun outlined four capabilities that define the real challenge ahead:</p><ol><li><p><strong>Grounded understanding of the physical world</strong></p></li><li><p><strong>Persistent memory</strong></p></li><li><p><strong>Reasoning</strong></p></li><li><p><strong>Planning</strong></p></li></ol><p>These are the exact gaps builders encounter when trying to turn LLM demos into production-grade systems.</p><div><hr></div><h2>Where Tokens Break Down</h2><p>Tokens work well for language, but they do not map neatly to the continuous and unpredictable nature of the real world.</p><ul><li><p><strong>Discretization hides the richness of physical dynamics</strong><br>Breaking the world into fixed tokens oversimplifies reality. Physical processes are continuous and fluid, not neatly bucketed.<br><em>Example: A robot arm pouring liquid cannot be reduced to a finite set of &#8220;token&#8221; movements. The flow and slosh of water are continuous, and discretization misses those dynamics.</em></p></li><li><p><strong>Predicting pixels wastes resources on details that are unknowable</strong><br>Models spend capacity trying to guess noise, like the exact shape of someone&#8217;s face in the next video frame. The important structure gets lost in the effort.<br><em>Example: If a model tries to predict every pixel of a crowded street scene, it wastes energy guessing who will turn their head left or right, instead of focusing on the physics of cars moving through traffic.</em></p></li><li><p><strong>Next-token prediction does not translate into reliable next-action prediction</strong><br>Language fluency does not guarantee grounded decisions. Predicting the &#8220;right word&#8221; is not the same as predicting the &#8220;right move&#8221; in a real environment.<br><em>Example: An LLM can generate a perfect recipe description but cannot guarantee the cake will rise. The transition from text to action requires grounding in physical cause and effect.</em></p></li></ul><p>LeCun&#8217;s answer is <strong>Joint Embedding Predictive Architectures (JEPA / V-JEPA)</strong>. These architectures predict in <strong>latent space</strong>: abstract representations that capture what matters, ignore what does not, and reveal when the world violates learned patterns.</p><div><hr></div><h2>System 1 and System 2</h2><p>Psychologists describe two modes of thought:</p><ul><li><p><strong>System 1</strong>: fast, reactive, policy-driven</p></li><li><p><strong>System 2</strong>: slower, deliberate, planning-oriented</p></li></ul><p>LLMs approximate System 1. System 2 requires models that can manipulate abstract representations of the world and plan within them. 
That is the gap LeCun is targeting.</p><div><hr></div><h2>The Data Imbalance</h2><p>Consider the scale of human sensory input versus text:</p><ul><li><p>LLMs train on ~30 trillion tokens (~10^14 bytes)</p></li><li><p>A four-year-old child sees ~10^14 bytes through vision alone in just four years</p></li></ul><p>The conclusion is direct: text-only training cannot match the richness of multimodal world experience.</p><div><hr></div><h2>Implications for Builders</h2><p>This research agenda has immediate consequences for how we design products today:</p><p>&#129504; <strong>Memory as a product surface</strong><br>Create explicit, structured memory for entities, states, and goals.</p><p>&#9881;&#65039; <strong>Two operating loops</strong><br>Planning in latent space, then distilled policies for fast execution.</p><p>&#128202; <strong>Rethink evaluation</strong><br>Track latent prediction accuracy, surprise recovery, and policy efficiency instead of surface-level fluency.</p><p>&#128172; <strong>Tactical use of LLMs</strong><br>Use them where they excel: language, orchestration, and code generation. Connect them to models that represent real-world state and dynamics.</p><div><hr></div><h2>A 90-Day Blueprint</h2><p>You do not need a research lab to explore this direction.</p><p><strong>Weeks 1&#8211;2</strong><br>Select a task with real dynamics. Define a compact state representation.</p><p><strong>Weeks 3&#8211;6</strong><br>Train a self-supervised encoder and predictor to model transitions. Measure plausibility rather than pixel fidelity.</p><p><strong>Weeks 7&#8211;10</strong><br>Layer in a planner to act over the latent space. Track when and how the system recovers from surprises.</p><p><strong>Weeks 11&#8211;12</strong><br>Distill the planner into a lightweight policy. Add a &#8220;surprise guard&#8221; that triggers re-planning when prediction errors spike.</p><div><hr></div><h2>The Bigger Picture</h2><p>LeCun&#8217;s shift away from LLMs signals where the field is heading. Language interfaces will remain essential. Progress depends on systems that can model the world, remember, reason, and plan.</p><p>The open challenge for builders:<br>&#128073; Where in your product could latent-space planning outperform a larger prompt?</p><h2>How to Spot If Your Product Is Hitting LLM Limits</h2><p>If you are building with LLMs today, here are clear signs you are running into structural limits rather than prompt tuning problems:</p><ol><li><p><strong>Forgetting across sessions</strong><br>Users expect the system to remember entities, states, and goals. Your product relies on hacks like re-prompting or long context windows.</p></li><li><p><strong>Slow or expensive reasoning</strong><br>Tasks require multi-step reasoning chains, but the model burns cost and latency on re-generating context every time.</p></li><li><p><strong>Unreliable action prediction</strong><br>The model produces fluent text but fails to reliably trigger the correct action in dynamic or stateful environments.</p></li><li><p><strong>Evaluation mismatch</strong><br>Your metrics focus on fluency or token accuracy, but users care about planning quality, adaptation speed, and success rates.</p></li><li><p><strong>Scaling without leverage</strong><br>Bigger prompts and bigger models give marginal gains. What you need is structured state, not more tokens.</p></li></ol><div><hr></div><p>If you see these patterns, it is time to rethink your architecture. Keep the LLM as the language interface, but pair it with a <strong>world model</strong> that can represent state, reason in latent space, and plan actions. That is the step change LeCun is pointing towards.</p>
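<p>The &#8220;surprise guard&#8221; from the blueprint above can be sketched in a few lines. The threshold and the error measure are assumptions you would calibrate; in a latent-space system the error would be a distance between embeddings:</p><pre><code>SURPRISE_THRESHOLD = 0.3  # assumption: calibrate on held-out transitions

def step(state, predict, observe, replan):
    """Act on the fast policy, but re-plan when prediction error spikes."""
    predicted = predict(state)         # what the model expects to happen next
    actual = observe()                 # what actually happened
    if prediction_error(predicted["state"], actual) > SURPRISE_THRESHOLD:
        return replan(actual)          # slow path: deliberate re-planning
    return predicted["policy_action"]  # fast path: distilled policy

def prediction_error(expected, actual):
    """Illustrative scalar error over shared numeric state keys."""
    diffs = [abs(expected[k] - actual[k]) for k in expected if k in actual]
    return sum(diffs) / max(len(diffs), 1)
</code></pre>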
<p><strong>In my next post,</strong> I will go deeper into practical patterns: concrete examples of how to approximate world models today using four simple building blocks: entity and state memory, transition functions, LLM-assisted reasoning, and simulation by replay. Each with real use cases you can start testing immediately.</p>]]></content:encoded></item><item><title><![CDATA[Testing Something New: A Lightning Lesson on AI Strategy]]></title><description><![CDATA[A while ago I wrote that I&#8217;d be building a course on AI Strategy for Product Leaders.]]></description><link>https://www.builderlab.ai/p/testing-something-new-a-lightning</link><guid isPermaLink="false">https://www.builderlab.ai/p/testing-something-new-a-lightning</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Wed, 03 Sep 2025 06:55:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BoiJ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0e09a97-71ce-40b8-85cd-9976fe218d89_268x268.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A while ago I wrote that I&#8217;d be building a course on AI Strategy for Product Leaders. Before launching the full program, I&#8217;m starting small.</p><p>I&#8217;ve put together a short 30-minute lightning lesson to test the waters. It&#8217;s focused on a skill I think every product leader will need: evaluating which AI opportunities are actually worth pursuing.</p><p>We&#8217;ll go through the PROVE-IT framework &#8212; a structured way to score ideas on pain, reach, owned data, verifiability, execution loops, integration time, and cost. 
The goal is to help you cut through noise and focus on bets that can actually create impact.</p><p>If you&#8217;d like to join, you can sign up <a href="https://maven.com/p/6cd972/spot-the-right-ai-opportunities?utm_campaign=NTk4MjU3&amp;utm_medium=ll_share_link&amp;utm_source=instructor">here</a>.</p><p>As always, I&#8217;ll share learnings from the session afterward too.</p><p>Best,</p><p>Pranav</p><p>PS: If you know a PM who&#8217;s wrestling with AI opportunities right now, feel free to forward this to them &#8212; they might find it useful.</p>]]></content:encoded></item><item><title><![CDATA[How to Spot the 10% of AI Projects That Work]]></title><description><![CDATA[You&#8217;ve probably lived this d&#233;j&#224; vu: someone pitches a shiny new feature, the slides are filled with screenshots of chat interfaces or &#8220;copilots,&#8221; and there&#8217;s a passing reference to OpenAI or Anthropic.]]></description><link>https://www.builderlab.ai/p/how-to-spot-the-10-of-ai-projects</link><guid isPermaLink="false">https://www.builderlab.ai/p/how-to-spot-the-10-of-ai-projects</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sat, 30 Aug 2025 21:23:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9918724d-6c8d-48af-874a-19aa05fcc122_944x944.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You&#8217;ve probably lived this d&#233;j&#224; vu: someone pitches a shiny new feature, the slides are filled with screenshots of chat interfaces or &#8220;copilots,&#8221; and there&#8217;s a passing reference to OpenAI or Anthropic. Everyone nods politely. A pilot is greenlit.</p><p>Three months later - nothing.</p><p>No metric moved. No one adopted it. The project limps along because nobody wants to admit it failed.</p><p>This isn&#8217;t the exception, it&#8217;s the rule. <strong>90% of AI projects don&#8217;t actually work.</strong> Not because the models fail to generate output. They work in that narrow sense. They fail in the more important sense: they don&#8217;t change a KPI, they don&#8217;t create adoption, and they don&#8217;t survive the first quarter once the demo excitement fades.</p><p>The 10% that succeed stick. They scale. They generate incremental revenue or reduce cost in ways you can see in the P&amp;L. I&#8217;ve seen those rare wins add hundreds of millions to the bottom line.</p><p>The obvious question: <strong>how do you spot the 10% and not waste time on the 90%?</strong></p><div><hr></div><h2><strong>Why most AI projects die</strong></h2><p>The biggest killer isn&#8217;t AI performance. It&#8217;s novelty bias.</p><p>A founder sees ChatGPT draft an essay and imagines an AI chatbot for their product. 
An exec hears a competitor talk about &#8220;copilots&#8221; on an earnings call and suddenly every team is tasked with building one.</p><p>But novelty is not impact, and if you&#8217;ve ever built anything, you know this. <strong>Impact comes from solving a problem that already hurts, in a place users already live, with data you actually own, tied to a metric the business already cares about.</strong></p><p>When you flip the lens from &#8220;what&#8217;s new?&#8221; to &#8220;what moves numbers?&#8221;, the 10% stand out clearly.</p><div><hr></div><h3>The PROVE-IT Filter: 7 Rules That Predict Which AI Projects Will Fail (and Which Will Stick)</h3><p>Over the last few years, I&#8217;ve reviewed hundreds of AI product ideas, from scrappy prototypes to global roadmaps. The same pattern always shows up: the successful ones share the same DNA.</p><p>That DNA lives in a filter I call <strong>PROVE-IT</strong> (consistent with my first response to every AI pitch I hear). The filter turns hand-wavy excitement into a concrete go/no-go decision.</p><ul><li><p><strong>Pain</strong>: The problem has to be hair-on-fire. If it doesn&#8217;t connect directly to a revenue leak, an operational bottleneck, or a customer frustration that already has budget and urgency, adoption odds drop sharply.</p></li><li><p><strong>Reach</strong>: The solution must live inside the user&#8217;s existing workflow or journey. If it demands a new app, tab, or process change, expect abandonment. Impact only compounds when the AI is embedded where work already happens.</p></li><li><p><strong>Owned Data</strong>: Proprietary signals are the moat. Logs, transactions, outcomes, and labeled history that only you have access to, and that can&#8217;t be trivially replicated by a competitor or model provider, make the system defensible.</p></li><li><p><strong>Verifiability</strong>: Every strong AI project has a single KPI that can be measured weekly. That might be conversion, SLA adherence, average order value, containment, or refund rate. If success can&#8217;t be expressed in one crisp metric, it usually won&#8217;t last.</p></li><li><p><strong>Execution Loop</strong>: Good systems learn. They take user actions -- clicks, edits, corrections, returns -- and feed them back into the product so it improves week after week. Static prompts without feedback loops stall quickly.</p></li><li><p><strong>Integration Time</strong>: A project that takes six months before seeing real data rarely survives. Fast integration de-risks impact and builds confidence.</p></li><li><p><strong>Total Cost of Ownership</strong>: Unit economics need to hold as usage scales. Inference costs, retrieval layers, human review, and compliance overhead all matter. If cost grows faster than value, the project collapses under its own weight.</p></li></ul><p>Each project gets a score from <strong>0&#8211;2</strong> on every dimension. If something lands below a <strong>9</strong>, it&#8217;s weak. Anything <strong>11 or higher</strong> is worth testing in an <strong>8&#8211;12 week sprint</strong> with clear metrics.</p><h2>The PROVE&#8209;IT Field Guide</h2><p><strong>Scoring:</strong> Each line item gets <strong>0&#8211;2</strong>. Below <strong>9</strong> is weak. <strong>11+</strong> is worthy of the next 12 weeks.<br><strong>Scale:</strong> 0 = missing or hand&#8209;wavy, 1 = partial or unproven, 2 = strong, evidenced.</p>
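<p>If you want to make the scoring mechanical, here is a minimal sketch. The dimension names mirror the filter and the thresholds are the ones above; the borderline wording is my reading of the 9&#8211;10 band:</p><pre><code>DIMENSIONS = ("pain", "reach", "owned_data", "verifiability",
              "execution_loop", "integration_time", "total_cost")

def prove_it(scores):
    """Each dimension scored 0-2; below 9 is weak, 11+ earns a sprint."""
    assert set(scores) == set(DIMENSIONS), "score every dimension"
    assert all(s in (0, 1, 2) for s in scores.values()), "scores are 0, 1, or 2"
    total = sum(scores.values())
    if total >= 11:
        return total, "test in an 8-12 week sprint with clear metrics"
    if total >= 9:
        return total, "borderline: strengthen the weakest dimension first"
    return total, "weak: pass"

# Stripe Radar's example scores from later in this piece:
print(prove_it({"pain": 2, "reach": 2, "owned_data": 2, "verifiability": 2,
                "execution_loop": 2, "integration_time": 2, "total_cost": 2}))
</code></pre>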
<h3>P: Pain: &#8220;Hair&#8209;on&#8209;fire or nice&#8209;to&#8209;have?&#8221;</h3><ul><li><p><strong>What it means:</strong> Ties directly to a leak (cost), bottleneck (throughput), or urgent frustration (NPS).</p></li><li><p><strong>Score rubric:</strong></p><ul><li><p><strong>0:</strong> Vague &#8220;delight&#8221; pitch, no money or minutes at stake.</p></li><li><p><strong>1:</strong> Qualitative pain without quantified stakes.</p></li><li><p><strong>2:</strong> Quantified leak or delay with owner and budget.</p></li></ul></li><li><p><strong>How to test this week:</strong> Pull last 90 days of tickets, refunds, or cycle times. Put euro or minute signs next to each.</p></li></ul><h3>R: Reach: &#8220;Inside the user&#8217;s existing path&#8221;</h3><ul><li><p><strong>What it means:</strong> Lives where work already happens, ideally auto&#8209;invoked.</p></li><li><p><strong>Score rubric:</strong></p><ul><li><p><strong>0:</strong> New app or tab with context switching.</p></li><li><p><strong>1:</strong> Deep&#8209;link from existing tool.</p></li><li><p><strong>2:</strong> Inline in the flow, triggered by the event that creates the need.</p></li></ul></li><li><p><strong>Proof in the wild:</strong> Gmail&#8217;s <strong>Smart Reply</strong> and <strong>Smart Compose</strong> sit <em>in</em> the inbox and auto&#8209;suggest replies; at launch Smart Reply generated ~12% of mobile replies in Inbox, a classic &#8220;in&#8209;flow&#8221; win. <a href="https://workspaceupdates.googleblog.com/2017/05/saving-you-time-with-smart-reply-in-gmail.html?utm_source=chatgpt.com">Workspace Updates Blog</a></p></li></ul><h3>O: Owned Data: &#8220;Do we have a moat?&#8221;</h3><ul><li><p><strong>What it means:</strong> Proprietary signals or history others can&#8217;t copy: clickstreams, outcomes, refunds, SLAs, process docs, device or call audio.</p></li><li><p><strong>Score rubric:</strong></p><ul><li><p><strong>0:</strong> Public web + generic LLM.</p></li><li><p><strong>1:</strong> Limited first&#8209;party logs without outcomes.</p></li><li><p><strong>2:</strong> Rich first&#8209;party data with labels/outcomes and compliance clarity.</p></li></ul></li><li><p><strong>How to test this week:</strong> List top 10 tables or stores you control. For each, mark: freshness, coverage, and whether outcomes are labeled.</p></li></ul><h3>V: Verifiability: &#8220;One weekly KPI&#8221;</h3><ul><li><p><strong>What it means:</strong> A single metric the team can move and defend.</p></li><li><p><strong>Examples:</strong> AHT, CSAT, repeat&#8209;contact rate, conversion, Average Order Value, refund rate, SLA%, developer cycle time.</p></li><li><p><strong>Score rubric:</strong></p><ul><li><p><strong>0:</strong> Vanity metrics or subjective &#8220;wow&#8221;.</p></li><li><p><strong>1:</strong> Proxy metric without business linkage.</p></li><li><p><strong>2:</strong> Primary KPI with baseline, target, owner, and dashboard.</p></li></ul></li><li><p><strong>Proof in the wild:</strong> For Microsoft 365 Copilot, controlled studies and TEI analyses track <strong>time saved by task</strong> (e.g., ~20% on email writing, ~30% on search, ~34% on content creation). Whatever you think of ROI studies, the KPI is explicit and measurable weekly. 
<a href="https://tei.forrester.com/go/microsoft/M365Copilot/docs/TheTEIOfMicrosoft365Copilot.pdf?utm_source=chatgpt.com">Forrester</a></p></li></ul><h3>E: Execution Loop: &#8220;Gets better with feedback&#8221;</h3><ul><li><p><strong>What it means:</strong> Productized learning from clicks, edits, thumbs, returns, QA outcomes.</p></li><li><p><strong>Score rubric:</strong></p><ul><li><p><strong>0:</strong> Static system.</p></li><li><p><strong>1:</strong> Logging exists, but no updates.</p></li><li><p><strong>2:</strong> Regular retrains or policy updates; in&#8209;product feedback visibly improves results.</p></li></ul></li><li><p><strong>How to test this week:</strong> Ship explicit feedback affordances (accept, edit, reject). Close the loop with a weekly model or heuristic update.</p></li></ul><h3>I: Integration Time: &#8220;Shadow mode in weeks&#8221;</h3><ul><li><p><strong>What it means:</strong> Thin slice ships into <strong>shadow</strong> or <strong>assist</strong> mode fast, with human&#8209;in&#8209;the&#8209;loop.</p></li><li><p><strong>Score rubric:</strong></p><ul><li><p><strong>0:</strong> Big&#8209;bang build.</p></li><li><p><strong>1:</strong> Pilot in a new tool or subset of users.</p></li><li><p><strong>2:</strong> Shadowing real traffic in the primary surface within weeks.</p></li></ul></li><li><p><strong>Proof in the wild:</strong> Klarna launched an AI support agent that immediately shouldered real workload in&#8209;flow; shipping quickly in the primary channel mattered. <a href="https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/?utm_source=chatgpt.com">Klarna Italia</a></p></li></ul><h3>T: Total Cost of Ownership: &#8220;Unit economics survive scale&#8221;</h3><ul><li><p><strong>What it means:</strong> Costs scale sub&#8209;linearly vs. value. Model, retrieval, inference, human review, compliance, and re&#8209;training all counted.</p></li><li><p><strong>Score rubric:</strong></p><ul><li><p><strong>0:</strong> &#8220;We&#8217;ll figure out cost later.&#8221;</p></li><li><p><strong>1:</strong> Rough per&#8209;interaction math without adoption curve.</p></li><li><p><strong>2:</strong> Full P&amp;L view with sensitivity to token prices, guardrail costs, and human fallback.</p></li></ul></li></ul>
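<p>A back-of-envelope sketch of that check. The formulas mirror the value-per-action and cost-per-action breakdown in the rollout plan later in this piece; every number below is a placeholder, not a benchmark:</p><pre><code>def cost_per_accepted_outcome(token_cost, retrieval_cost, review_cost,
                              suggestions, accepted):
    """Effective cost per accepted outcome: all serving costs over acceptances."""
    total_cost = (token_cost + retrieval_cost) * suggestions + review_cost
    return total_cost / max(accepted, 1)

def value_per_action(minutes_saved, loaded_hourly_rate):
    """Value of one accepted action: minutes saved times loaded hourly rate."""
    return minutes_saved / 60 * loaded_hourly_rate

# Placeholder numbers: 10k suggestions, 6k accepted, 4 minutes saved each.
cost = cost_per_accepted_outcome(0.002, 0.001, 150.0, 10_000, 6_000)
value = value_per_action(4, 45.0)
print(f"cost EUR {cost:.3f} vs value EUR {value:.2f} per accepted action")
</code></pre>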
<div><hr></div><h2>The PROVE&#8209;IT scorecard</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!2LaG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35014d5b-4183-46a6-a39b-ef7de05834d0_1628x866.png" width="1456" height="775" alt="The PROVE-IT scorecard"></figure></div><div><hr></div><h2>Real&#8209;world snapshots mapped to PROVE&#8209;IT</h2><h3>1) Microsoft 365 Copilot in knowledge work</h3><ul><li><p><strong>What was built and why</strong>: Knowledge workers spend countless hours on repetitive tasks like email, document search, and summarization, creating a clear pain point. The solution integrates directly into Word, Outlook, and Teams, eliminating context switching and placing AI where the work already happens. Its impact is verifiable through a simple KPI [time saved per task], with studies showing consistent 20&#8211;30% efficiency gains. And because the system improves through tenant feedback and policy updates, accuracy and guardrails get stronger over time, creating a product that delivers compounding value the longer it runs.</p></li><li><p><strong>Evidence:</strong> Forrester TEI reports average time savings like <strong>~20%</strong> on email writing and <strong>~30%</strong> on search; UK public&#8209;sector trials documented meaningful daily minutes saved across thousands of employees. 
<a href="https://tei.forrester.com/go/microsoft/M365Copilot/docs/TheTEIOfMicrosoft365Copilot.pdf?utm_source=chatgpt.com">Forrester</a><a href="https://assets.publishing.service.gov.uk/media/683db42bd23a62e5d32680d0/M365_Copilot_Experiment_Findings_Report.pdf?utm_source=chatgpt.com">GOV.UK</a></p></li><li><p><strong>Example score:</strong> P2 R2 O2 V2 E1 I1 T1 = <strong>11</strong>.</p></li></ul><blockquote><p>Note on evidence quality: Microsoft&#8209; and partner&#8209;run studies skew positive; third&#8209;party reporting occasionally questions magnitude, so insist on your own baseline and measurement. </p></blockquote><h3>2) UPS ORION route optimization</h3><ul><li><p><strong>What was built and why:</strong> Delivery routes were a classic &#8220;hair-on-fire&#8221; problem [fuel waste, driver inefficiency, and delayed packages all eroded margins]. UPS developed ORION (On-Road Integrated Optimization and Navigation), an AI optimization engine embedded into driver workflows. With proprietary GPS, package, and driver data across millions of routes, UPS had an unmatched data moat. Impact was tracked through clear KPIs: miles saved, fuel reduction, CO&#8322; emissions avoided, and on-time delivery performance.</p></li><li><p><strong>Evidence:</strong> UPS reports <strong>~100M miles</strong> and <strong>~10M gallons</strong> saved annually after deployment -- hundreds of millions in operational savings. <a href="https://www.globenewswire.com/news-release/2020/01/29/1977072/0/en/UPS-To-Enhance-Orion-With-Continuous-Delivery-Route-Optimization.html?utm_source=chatgpt.com">GlobeNewswire</a><a href="https://www.supplychaindive.com/news/last-mile-sustainability-strategies-ups-amazon-disclosure/546005/?utm_source=chatgpt.com">supplychaindive.com</a></p></li><li><p><strong>Example score:</strong> P2 R2 O2 V2 E2 I0 T1 = <strong>11</strong>.</p></li></ul><h3>3) Wendy&#8217;s FreshAI for drive&#8209;thru ordering</h3><ul><li><p><strong>What was built and why:</strong> Peak-time labor shortages and slow order throughput created a visible bottleneck at Wendy&#8217;s drive-thrus. The company introduced FreshAI, a voice-ordering assistant integrated directly into the ordering lane, where customers were already speaking to place orders. The key metrics are service time, order accuracy, and upsell effectiveness. While the system benefited from fast integration into physical drive-thrus, feedback loops and consistency across accents, background noise, and menu complexity remain challenges.</p></li><li><p><strong>Evidence:</strong> Expansion plan to <strong>500&#8211;600</strong> locations with reported ~<strong>22 seconds</strong> faster service and near&#8209;<strong>99%</strong> order accuracy at pilots; industry peers show mixed outcomes, so instrumentation matters. <br><a href="https://www.businessinsider.com/wendys-expanding-ai-ordering-hundreds-more-drive-thrus-2025-2?utm_source=chatgpt.com">Business Insider</a><a href="https://www.restaurantdive.com/news/wendys-deploy-digital-menu-boards-drive-thru-ai-500-restaurants-2025/746977/?utm_source=chatgpt.com">restaurantdive.com</a></p></li><li><p><strong>Example score:</strong> P2 R2 O1 V1 E1 I1 T1 = <strong>9</strong>.</p></li></ul><blockquote><p>Counter&#8209;example, same pattern: DoorDash ended its voice&#8209;ordering pilot in 2025 -- a reminder to validate accuracy, edge cases, and unit economics in shadow before scaling. 
<a href="https://www.bloomberg.com/news/articles/2025-05-20/doordash-ends-ai-voice-ordering-product-for-restaurants?utm_source=chatgpt.com">Bloomberg</a><a href="https://restaurantbusinessonline.com/technology/doordash-scraps-its-ai-voice-ordering-business?utm_source=chatgpt.com">Restaurant Business Online</a></p></blockquote><h3>4) Stripe Radar fraud prevention</h3><ul><li><p><strong>What was built and why:</strong> Online payments are constantly exposed to fraud risk, creating both direct financial loss (chargebacks) and indirect cost (false positives blocking legitimate customers). Stripe built Radar directly into its checkout and payments flow, embedding AI where the transactions already happen. Its moat comes from proprietary cross-network fraud data, enriched by outcomes and labels at global scale. The KPI is sharp: reduction in fraud rate with controlled false positives. Radar continuously retrains on outcomes, feeding fraud decisions back into the model to improve accuracy at scale.</p></li><li><p><strong>Evidence:</strong> Stripe reports users see <strong>~42% SEPA</strong> and <strong>~20% ACH</strong> fraud reduction on average with Radar&#8217;s new models. <a href="https://stripe.com/newsroom/news/radar-for-ach-sepa?utm_source=chatgpt.com">Stripe</a></p></li><li><p><strong>Example score:</strong> P2 R2 O2 V2 E2 I2 T2 = <strong>14</strong>.</p></li></ul><h3>5) Gmail Smart Reply / Compose</h3><ul><li><p><strong>What was built and why:</strong> Email is a universal time sink, with knowledge workers spending hours drafting routine responses. Google built Smart Reply and later Smart Compose directly into Gmail, meeting users inside their existing workflow with zero friction. The moat came from proprietary training data [billions of anonymized email threads] with real response outcomes. Success was tracked through a simple KPI: adoption and usage of suggested replies. The system improved through user edits and accept/reject signals, forming a natural feedback loop that sharpened predictions over time.</p><p><strong>Evidence:</strong> At launch, Smart Reply generated ~12% of all replies in Google&#8217;s Inbox app, scaling to billions of characters saved each week once rolled into Gmail. 
Over time, Google has emphasized efficiency gains and adoption at massive scale across its user base.</p><p><strong>Example score:</strong> P1 R2 O1 V2 E2 I2 T2 = <strong>12</strong> (slightly lower on Pain and Owned Data, since time saved per email is useful but less acute than cost or revenue drivers, and training relied heavily on scale rather than uniquely owned business-critical data).<br><a href="https://workspaceupdates.googleblog.com/2017/05/saving-you-time-with-smart-reply-in-gmail.html?utm_source=chatgpt.com">Workspace Updates Blog</a><a href="https://ucalgary.ca/news/making-email-more-efficient-means-answering-more-emails-even-faster?utm_source=chatgpt.com">University of Calgary in Alberta</a></p></li></ul><div><hr></div><h2>How to run PROVE&#8209;IT in your org</h2><h3>1) Week 0&#8211;1: Baseline and thin slice</h3><ul><li><p><strong>Define the one KPI</strong> and baseline it for 4 weeks of historical data.</p></li><li><p><strong>Pick the surface you already own</strong> (e.g., search box, ticket console, compose box).</p></li><li><p><strong>Ship a thin slice in shadow</strong>: log model suggestions, human decisions, cost per suggestion, and &#8220;edit distance.&#8221;</p></li><li><p><strong>Events to instrument</strong> (sketched below): <code>suggest_shown</code>, <code>suggest_accepted</code>, <code>edit_distance</code>, <code>task_time</code>, <code>fallback_triggered</code>, <code>token_cost_eur</code>.</p></li></ul>
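<p>Here&#8217;s a minimal sketch of that shadow&#8209;mode logging in Python. The event names mirror the list above; the JSONL sink and the field choices are illustrative assumptions, not a prescribed schema.</p><pre><code># Week 0-1 shadow logging: the model suggests, the human decides, record both.
import json
import time
import uuid

LOG_PATH = "shadow_events.jsonl"  # illustrative sink; swap in your warehouse

def log_event(name: str, **fields) -> None:
    record = {"event": name, "ts": time.time(), "id": str(uuid.uuid4()), **fields}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

# One shadow interaction, end to end:
log_event("suggest_shown", surface="ticket_console", model="v0")
log_event("token_cost_eur", amount=0.018)          # cost of this suggestion
log_event("suggest_accepted", accepted=True)       # the human's decision
log_event("edit_distance", chars_changed=42)       # how much they edited it
log_event("task_time", seconds=95)                 # end-to-end task time
# log_event("fallback_triggered", reason="low_confidence")  # guardrail fired</code></pre>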
<h3>2) Week 2&#8211;4: Close the loop</h3><ul><li><p><strong>Feedback UX:</strong> thumbs, edit&#8209;before&#8209;send, &#8220;good vs safe to ship&#8221;.</p></li><li><p><strong>Weekly updates:</strong> retrain heuristics or models; publish a &#8220;what improved&#8221; changelog.</p></li><li><p><strong>Guardrails:</strong> auto&#8209;fallback rules and blocked intents.</p></li></ul><h3>3) Week 5&#8211;8: Controlled exposure</h3><ul><li><p><strong>Holdout design:</strong> user&#8209; or session&#8209;level A/B.</p></li><li><p><strong>Human&#8209;in&#8209;the&#8209;loop:</strong> queue low&#8209;confidence cases to reviewers; learn from corrections.</p></li><li><p><strong>Cost controls:</strong> cache, truncation, and retrieval; measure <strong>effective cost per accepted outcome</strong>.</p></li></ul><h3>4) Week 9&#8211;12: Scale or stop</h3><ul><li><p><strong>Unit economics review:</strong></p><ul><li><p><strong>Value per action:</strong> minutes saved &#215; loaded hourly rate, or revenue delta &#215; margin.</p></li><li><p><strong>Cost per action:</strong> model + retrieval + orchestration + review overhead.</p></li><li><p><strong>Adoption:</strong> % of eligible actions where AI is accepted.</p></li><li><p><strong>Decision:</strong> kill, iterate, or scale with hard gates, as sketched in the example below.</p></li></ul></li></ul>
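<p>And here is that review as napkin math, rolling the shadow events up into a kill/iterate/scale call. Every number is an illustrative assumption, not a benchmark, and the 2x value&#8209;over&#8209;cost gate is an example threshold you should set yourself.</p><pre><code># Week 9-12 unit economics, computed from the shadow-mode event log above.

events = {
    "suggest_shown": 10_000,       # suggestions surfaced this quarter
    "suggest_accepted": 3_200,     # accepted by the human in the loop
    "minutes_saved_per_accept": 4.0,
    "token_cost_eur": 0.02,        # avg model cost per suggestion shown
    "review_cost_eur": 0.15,       # avg human review cost per suggestion
}
LOADED_HOURLY_RATE_EUR = 60.0      # assumption: fully loaded cost of an hour

# Value accrues only on accepted actions; cost is paid on every one shown.
value_per_accept = events["minutes_saved_per_accept"] / 60.0 * LOADED_HOURLY_RATE_EUR
total_value = events["suggest_accepted"] * value_per_accept
total_cost = events["suggest_shown"] * (events["token_cost_eur"] + events["review_cost_eur"])

adoption = events["suggest_accepted"] / events["suggest_shown"]
cost_per_accepted_outcome = total_cost / events["suggest_accepted"]

print(f"adoption: {adoption:.0%}")
print(f"effective cost per accepted outcome: {cost_per_accepted_outcome:.2f} EUR")
print("decision:", "scale" if total_value > 2 * total_cost else "iterate or kill")</code></pre>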
<div><hr></div><h2><strong>Common failure modes and how PROVE&#8209;IT catches them</strong></h2><ul><li><p><strong>Chatbot in a new tab</strong> fails <strong>Reach</strong>. Expect abandonment even if the model is &#8220;good.&#8221;</p></li><li><p><strong>Cool demo with public data</strong> fails <strong>Owned Data</strong>. Anyone can copy it.</p></li><li><p><strong>Laundry list of metrics</strong> fails <strong>Verifiability</strong>. Pick one KPI and live by it.</p></li><li><p><strong>Static prompts</strong> fail <strong>Execution Loop</strong>. Improvements stall after Week 2.</p></li><li><p><strong>Quarter&#8209;long integration</strong> fails <strong>Integration Time</strong>. Push for shadow mode in the primary surface fast.</p></li><li><p><strong>Ignoring review labor</strong> fails <strong>TCO</strong>. Human&#8209;in&#8209;the&#8209;loop costs can swamp token savings.</p></li></ul><div><hr></div><h2><strong>A few playbooks you can borrow</strong></h2><ul><li><p><strong>Support assistant</strong>: Shadow&#8209;mode suggestions + containment rate + edit distance; promote to agent&#8209;assist then to partial autonomy by intent. Klarna&#8217;s trajectory shows why containment and AHT (average handle time) are the north stars.<a href="https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/?utm_source=chatgpt.com"> Klarna Italia</a><a href="https://www.customerexperiencedive.com/news/klarna-reinvests-human-talent-customer-service-AI-chatbot/747586/?utm_source=chatgpt.com">customerexperiencedive.com</a></p></li><li><p><strong>Knowledge worker copilot</strong>: Task&#8209;level time trials per artifact type; Microsoft&#8217;s task&#8209;specific minutes saved is the right granularity.<a href="https://tei.forrester.com/go/microsoft/M365Copilot/docs/TheTEIOfMicrosoft365Copilot.pdf?utm_source=chatgpt.com"> Forrester</a></p></li><li><p><strong>Ops optimization</strong>: Start with deterministic heuristics, then learn residuals; measure miles or minutes saved like ORION.<a href="https://www.globenewswire.com/news-release/2020/01/29/1977072/0/en/UPS-To-Enhance-Orion-With-Continuous-Delivery-Route-Optimization.html?utm_source=chatgpt.com"> GlobeNewswire</a></p></li><li><p><strong>Voice ordering</strong>: Gate by confidence and fallback fast; scale only when service time and accuracy match human benchmarks. The DoorDash reversal is your cautionary tale.<a href="https://www.bloomberg.com/news/articles/2025-05-20/doordash-ends-ai-voice-ordering-product-for-restaurants?utm_source=chatgpt.com"> Bloomberg</a><a href="https://restaurantbusinessonline.com/technology/doordash-scraps-its-ai-voice-ordering-business?utm_source=chatgpt.com">Restaurant Business Online</a></p></li><li><p><strong>Fraud</strong>: Optimize both containment and false positives; Stripe&#8217;s cross&#8209;network signals showcase the power of Owned Data at scale.<a href="https://stripe.com/newsroom/news/radar-for-ach-sepa?utm_source=chatgpt.com">Stripe</a></p></li></ul><div><hr></div><h2><strong>Startups vs scaleups: different lenses</strong></h2><p>The same filter works in both contexts, but what you weight changes.</p><ul><li><p><strong>Startups</strong> should bias toward <em>reach and speed.</em> If you can&#8217;t land in a user flow in &lt;1 month, you&#8217;ve lost. An e-commerce startup testing AI-generated bundle suggestions at checkout can run the experiment in two weeks and see uplift or not. That&#8217;s the advantage.</p></li><li><p><strong>Scaleups</strong> must bias toward <em>verifiability and TCO.</em> Distribution is your edge, but bad economics will crush you. A large retailer once piloted AI-generated imagery for product listings. The demo wowed. But inference + QA costs meant negative margin at scale. It was dead on arrival.</p></li></ul><div><hr></div><h2><strong>Strong bets vs weak bets: patterns across industries</strong></h2><p>Patterns repeat.</p><p><strong>Strong bets:</strong></p><ul><li><p>B2B: triage and routing, summarization into action, structured extraction.</p></li><li><p>E-commerce: search rerankers, AI-driven product bundles, fraud detection (flagging high-risk orders with proprietary behavioral data), returns prevention (AI-powered sizing and fit recommendations).</p></li></ul><p><strong>Weak bets:</strong></p><ul><li><p>&#8220;AI chat for our store.&#8221;</p></li><li><p>Copilots without a KPI.</p></li><li><p>Anything that depends on users changing their primary tool.</p></li></ul><p>Why do returns prevention and fraud detection score high? Because they tie to pain (margin leakage), live in the flow (checkout, post-purchase), use owned data (transactions, return history), and are verifiable (chargeback rate, return rate). They are boring. But they matter.</p><div><hr></div><h2><strong>The builder&#8217;s takeaway</strong></h2><p>At the end of the day, spotting the 10% is about discipline.</p><ul><li><p>Prove the pain with numbers.</p></li><li><p>Live where the user already is.</p></li><li><p>Write the ROI on a napkin.</p></li><li><p>Define the fallback.</p></li></ul><p>If you can&#8217;t, kill it. If you can, run the 12-week play.</p><p>The temptation in AI is to believe every idea is a moonshot. The reality is that the projects that work are plumbing, not poetry. They quietly move a KPI. They integrate seamlessly. They survive past the demo stage.</p><p><strong>Think big. Start small. Iterate to greatness.</strong> That&#8217;s how you avoid the 90% trap, and build the 10% that actually stick.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading BuilderLab.ai! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why I’m Building a Course on AI Strategy for Product Leaders]]></title><description><![CDATA[A few months ago, Gagan Biyani (co-founder of Maven) reached out and nudged me to consider creating a course.]]></description><link>https://www.builderlab.ai/p/why-im-building-a-course-on-ai-strategy</link><guid isPermaLink="false">https://www.builderlab.ai/p/why-im-building-a-course-on-ai-strategy</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sat, 17 May 2025 07:39:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/bfef213e-200c-4c4d-b04e-841c748ae9ee_944x944.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few months ago, Gagan Biyani (co-founder of Maven) reached out and nudged me to consider creating a course. 
I wasn&#8217;t looking to launch anything at the time, but the idea lingered. It kept resurfacing in conversations with other product leaders, in my mentoring sessions, and during internal strategy reviews.</p><p>And here&#8217;s what I&#8217;ve come to realize:</p><blockquote><p><strong>Most product leaders today are flying blind when it comes to AI.</strong></p></blockquote><p>Not because they lack intelligence or initiative. But because the AI landscape, especially with ML and GenAI, is moving too fast, and product orgs haven&#8217;t caught up. There&#8217;s a growing gap between what teams <em>could</em> do with AI, and what they <em>are</em> doing.</p><div><hr></div><h3>The Problem: We&#8217;re Drowning in AI Hype, But Starving for Strategy</h3><p>Over the past year, I&#8217;ve worked with ML teams and execs across Meta, Booking.com, Flipkart, Goldman Sachs, and several startups. Some consistent themes keep emerging:</p><ul><li><p>ML products often feel fragmented.</p></li><li><p>GenAI experiments get spun up without real alignment to user problems.</p></li><li><p>Tech debt builds fast. Strategy&#8230; not so much.</p></li></ul><p>In short, <strong>product leaders are overwhelmed, under-supported, and missing the strategic frameworks needed to build AI products that actually move the needle.</strong></p><p>And I get it. I&#8217;ve been there.</p><p>At Flipkart, I was one of very few ML PMs in the company, building during crunch-time. At Booking.com, I&#8217;ve helped lead one of the most scaled AI transformations in travel tech, spanning classical ML, ranking systems, and GenAI platforms. At Meta, I&#8217;ve built ML products at a scale so vast that the resources at my disposal seemed infinite. I&#8217;ve seen what works, <em>and</em> what falls apart.</p><div><hr></div><h3>The Vision: A Course for Impact-Driven Product Leadership in AI</h3><p>So here&#8217;s what I&#8217;m building:</p><p>A <strong>high-intensity course for product leaders</strong> who want to go beyond the buzzwords and actually drive business impact with AI.</p><p>We&#8217;ll cover things like:</p><ul><li><p><strong>Identifying high-leverage ML/GenAI opportunities</strong> in your org</p></li><li><p><strong>Designing product strategy for intelligent systems</strong>, not just interfaces</p></li><li><p><strong>Structuring AI bets across Horizon 1, 2, and 3</strong></p></li><li><p><strong>Balancing GenAI and classical ML</strong> for practical business value</p></li><li><p>And yes, the hard stuff too: fallback design, ML evaluation, experimentation, and compliance</p></li></ul><p>It&#8217;s built for people leading product at startups, scale-ups, and large tech companies, who want to push their thinking forward and build with confidence.</p><div><hr></div><h3>Help Me Shape It</h3><p>I&#8217;m still refining the curriculum, and <em>I&#8217;d love your input</em>.</p><p><strong>If you're a PM, product lead, or product-minded founder working on AI</strong>, take 2 minutes to share your thoughts in this short survey:<br>&#128073; <a href="https://maven.com/forms/341112">https://maven.com/forms/341112</a></p><p>If you leave your email I&#8217;ll also send early access invites + priority seats before we open public enrollment.</p><p>Let&#8217;s build the next generation of AI<em>-</em>native product leaders together.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[Prompt Frameworks for Unlocking ChatGPT's Deep Research Capabilities]]></title><description><![CDATA[And Exactly How to Use 
Them]]></description><link>https://www.builderlab.ai/p/prompt-frameworks-for-unlocking-chatgpts</link><guid isPermaLink="false">https://www.builderlab.ai/p/prompt-frameworks-for-unlocking-chatgpts</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Sun, 13 Apr 2025 10:20:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a3be3c1e-1385-4a6e-baf4-fc20e2be6704_944x944.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I love deep research&#8212;done right, it transforms surface-level questions into strategic clarity. After some structured experimentation, I've distilled my personal approach into 10 reusable prompt frameworks, explicitly designed to leverage ChatGPT&#8217;s deep research and search capabilities.</p><p>The frameworks reflect my own prompting style:</p><ul><li><p><strong>Goal-Oriented and Context-Rich:<br></strong>I clearly set the strategic context and intended outcome upfront&#8212;prompts become precise tools rather than generic queries.</p></li><li><p><strong>Builder&#8217;s Intent:<br></strong>Every prompt is deliberately outcome-driven, optimized to directly inform products, refine strategies, or enhance decision-making workflows.</p></li><li><p><strong>Specificity Balanced with Flexibility:<br></strong>I provide clear constraints while leaving room for nuanced, creative exploration&#8212;essential for deep research.</p></li><li><p><strong>System-Aware Prompting:<br></strong>Frameworks explicitly leverage how ChatGPT works best, maximizing deep research accuracy and insight quality.</p></li><li><p><strong>Structured for Workflow Integration:<br></strong>These prompts aren't standalone queries; they're strategically sequenced tools, designed for deliberate use at different workflow stages to progressively deepen strategic insights.</p></li></ul><p>A note before we jump in &#8212; my prompt writing style is VERY iterative. I rarely write one detailed prompt and let it do the heavy lifting. I like to see the intermediate steps and build from there. I often interrupt the LLM, making adjustments and changing the requirements on the fly. While this is a bit less efficient, I&#8217;ve found that it brings clarity to my thought process while giving me the outputs I want from the LLM.</p><h3><strong>Recommended Integration into Your Workflow:</strong></h3><ul><li><p><strong>Early-Stage Exploration: </strong>Use frameworks 1, 2, and 7 (<strong>First-Principles</strong>, <strong>Socratic Inquiry</strong>, <strong>Rapid Evidence Review</strong>) with deep research and search to quickly build foundational knowledge.</p></li><li><p><strong>Mid-Stage Refinement: </strong>Deploy frameworks 4, 5, and 6 (<strong>Scenario Futurecasting</strong>, <strong>Six Hats</strong>, <strong>Analogical Synthesis</strong>) alongside your own documents and search for rich, well-rounded strategic exploration.</p></li><li><p><strong>Late-Stage Decision &amp; Communication: </strong>Apply frameworks 3, 8, 9, and 10 (<strong>Inversion</strong>, <strong>Pros-Cons-Remedies</strong>, <strong>Role-Play Simulation</strong>, <strong>Strategy Canvas</strong>) in conjunction with your own documents, deep research, and visual canvas tools for sharp clarity, risk mitigation, and executive-ready communication.</p></li></ul><p>With that, let&#8217;s dive straight into the frameworks, starting with one of my foundational favourites&#8212;the <strong>First-Principles Breakdown</strong>.</p><div><hr></div><h2><strong>1. 
First-Principles Breakdown</strong></h2><p>The First-Principles Breakdown approach systematically deconstructs complex problems, strategies, or technologies into their fundamental truths or building blocks. It deliberately challenges existing assumptions, conventions, and perceived constraints to isolate core elements. By doing so, it empowers clear thinking, reduces cognitive biases, and reveals innovative paths that traditional, analogy-based thinking might miss. This foundational clarity then allows you to logically reconstruct robust, defensible strategies or solutions from the ground up.</p><p><strong>At What Point: </strong>Use this one for early-stage exploration or when confronting stuck or biased thinking.</p><p><strong>Where: </strong>Clarifying product ideas, complex strategies, or new technologies.</p><p><strong>How: </strong>Break down concepts to their fundamental building blocks, removing assumptions and biases, then rebuild solutions logically.</p><p><strong>Prompt Template:</strong></p><blockquote><p><em>"Break down <strong>[complex problem, idea, technology, or strategy]</strong> into its fundamental principles. Use insights from deep research and the provided supporting documents (<strong>[e.g., technical specs, market research, user feedback]</strong>) to clearly identify hidden assumptions, perceived constraints, and foundational truths. Then, logically reconstruct this into a structured, unbiased analysis, explicitly highlighting unexplored opportunities."</em></p></blockquote><p><strong>Example Prompts:</strong></p><ul><li><p>"Break down the business model of subscription-based SaaS products into their fundamental principles. Identify hidden assumptions about customer behavior and pricing strategies, then rebuild a logical model highlighting unexplored opportunities."</p></li><li><p>"Deconstruct the key technological barriers to autonomous driving using first principles. Clearly list the fundamental components and assumptions, then logically identify areas where significant innovation is still needed."</p></li><li><p>"Explain the core challenges of scalability in blockchain technologies from first principles. Highlight fundamental constraints, and propose logical pathways for overcoming these obstacles."</p></li></ul><p><strong>Use in combination with: </strong>Your own supporting documents (technical specs, market research) - add these to the prompt - and &#8216;deep research&#8217; to validate foundational truths.</p><div><hr></div><h2><strong>2. Socratic Inquiry Chain</strong></h2><p>The Socratic Inquiry Chain is an iterative approach inspired by classical Socratic dialogue. It leverages ChatGPT to progressively deepen understanding by prompting the model to continually question and refine its own explanations. Each subsequent question reveals new nuances, surfaces overlooked subtleties, and challenges superficial assumptions. This method helps quickly build thorough expertise in complex or unfamiliar subjects by ensuring no stone remains unturned in the analysis.</p><p><strong>At What Point: </strong>After initial baseline research, before diving into deeper expert-level understanding.</p><p><strong>Where: </strong>Quickly mastering unfamiliar or complex scientific/technical topics.</p><p><strong>How: </strong>Iteratively prompt ChatGPT to question its own initial explanations, progressively uncovering deeper insights.</p><p><strong>Prompt Template:</strong></p><blockquote><p><em>"Briefly explain <strong>[complex or unfamiliar topic]</strong>. 
Then, iteratively deepen this explanation by critically questioning your initial response through <strong>[number of iterative cycles, e.g., 3&#8211;5]</strong> follow-up questions and answers. In each iteration, explicitly surface hidden nuances, clarify common misconceptions, and incorporate authoritative insights verified through deep research."</em></p></blockquote><p><strong>Example Prompts:</strong></p><ul><li><p>"Briefly describe CRISPR gene editing. Now, deepen your explanation by self-questioning and answering through five iterative cycles, explicitly addressing risks, ethical issues, and technical complexities."</p></li><li><p>"Summarize the main concepts behind Large Language Models. Follow up by iteratively challenging your summary with detailed questions that address potential misunderstandings, limitations, and future directions."</p></li></ul><p><strong>Use in combination with:<br></strong>Search and deep research to verify accuracy and add authoritative context.</p><div><hr></div><h2><strong>3. Inversion (Devil&#8217;s Advocate)</strong></h2><p>Inversion, or playing the Devil&#8217;s Advocate, deliberately engages ChatGPT in skeptical, contrarian thinking to rigorously test strategic initiatives and decisions. This approach compels the model to systematically identify weaknesses, overlooked assumptions, and potential points of failure by imagining worst-case scenarios. In doing so, it helps decision-makers proactively mitigate risks, strengthen strategies, and guard against overly optimistic or biased planning.</p><p><strong>At What Point: </strong>Late-stage decision-making, immediately prior to final commitment.</p><p><strong>Where: </strong>Evaluating strategic initiatives, product launches, and significant business decisions.</p><p><strong>How: </strong>Prompt the model to adopt a skeptical viewpoint and identify critical points of failure proactively.</p><p><strong>Prompt Template:</strong></p><blockquote><p><em>"Imagine that <strong>[your strategic decision, product launch, or initiative]</strong> results in a significant failure (<strong>[specify worst-case outcome]</strong>). Using insights from deep research and the provided supporting documents (<strong>[e.g., strategic plans, market analysis, user feedback]</strong>), clearly outline the exact reasons behind this failure. Explicitly identify overlooked risks, hidden assumptions, and faulty logic. Finally, recommend proactive strategies or mitigations to prevent these points of failure."</em></p></blockquote><p><strong>Example Prompts:</strong></p><ul><li><p>"Imagine our new AI-based recommendation system completely fails within the first six months after launch. Clearly outline why this failure occurred, detailing specific blind spots, faulty assumptions, and overlooked risks."</p></li><li><p>"Consider our upcoming market entry strategy into Asia and assume it leads to significant financial loss. Analyze the exact reasons behind this outcome, emphasizing hidden assumptions or underestimated challenges."</p></li><li><p>"Assume the decision to switch our company entirely to remote work severely damages productivity and culture. 
Identify the critical assumptions we made incorrectly, and suggest ways we could mitigate these risks in advance."</p></li></ul><p><strong>Use in combination with: </strong>Your own supporting strategic planning documents and deep research for thorough risk assessment.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Finding this helpful? Subscribe to save and find more content like this.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>4. Scenario Futurecasting</strong></h2><p>Scenario Futurecasting involves generating multiple, diverse, and plausible visions of future developments related to a specific topic, technology, or market. It explicitly prompts the AI to explore optimistic, pessimistic, and moderate trajectories, enabling decision-makers to anticipate potential outcomes and their implications strategically. By visualizing and examining varied future possibilities, leaders can better prepare for uncertainties, allocate resources wisely, and create flexible strategic plans.</p><p><strong>At What Point: </strong>Mid-to-late strategic planning stages, especially when uncertainty is high.</p><p><strong>Where: </strong>Innovation foresight, long-term strategic planning, exploring market uncertainties.</p><p><strong>How: </strong>Prompt for multiple plausible future scenarios, explicitly detailing strategic implications.</p><p><strong>Prompt Template:</strong></p><blockquote><p><em>"Develop <strong>three distinct future scenarios</strong> for <strong>[specific technology, market, or trend]</strong> over <strong>[specific time horizon]</strong>:</em></p><ul><li><p><em><strong>Optimistic scenario:</strong> characterized by rapid positive developments.</em></p></li><li><p><em><strong>Pessimistic scenario:</strong> involving significant setbacks or adverse conditions.</em></p></li><li><p><em><strong>Moderate scenario:</strong> depicting incremental, steady progress.</em></p></li></ul><p><em>Clearly outline each scenario using authoritative insights from deep research and provided industry analyses (<strong>[e.g., market data, trend reports, competitive benchmarks]</strong>). For each scenario, explicitly detail the strategic implications and recommended actions our organization should consider."</em></p></blockquote><p><strong>Example Prompts:</strong></p><ul><li><p>"Create three detailed scenarios for the evolution of generative AI adoption in the enterprise over the next five years: an optimistic scenario with rapid adoption and integration, a pessimistic scenario characterized by major setbacks and regulatory hurdles, and a moderate scenario of steady, incremental growth. 
Clearly outline strategic implications and recommended actions for each."</p></li><li><p>"Envision three possible futures for electric vehicle (EV) market penetration globally by 2030&#8212;high adoption, low adoption, and moderate adoption&#8212;and explain the strategic consequences for traditional automakers."</p></li><li><p>"Develop three distinct future scenarios for blockchain's role in supply chain management by 2028, covering optimistic widespread adoption, pessimistic regulatory pushbacks, and moderate incremental progress. Outline clear strategies our organization should consider for each scenario."</p></li></ul><p><strong>Use in combination with: </strong>Deep research and your own industry analyses to ensure scenario realism. Actually, ALWAYS use this one with deep research!</p><div><hr></div><h2><strong>5. Six Thinking Hats Analysis</strong></h2><p>The Six Thinking Hats Analysis method systematically guides ChatGPT through six distinct analytical lenses&#8212;facts (white hat), emotions and intuition (red hat), risks and potential negatives (black hat), optimistic benefits (yellow hat), creative alternatives (green hat), and structured overview or process (blue hat). This structured perspective-taking ensures a holistic, balanced exploration of decisions, providing clear, nuanced, and actionable insights for complex issues.</p><p><strong>At What Point: </strong>Before making critical decisions, ensuring balanced consideration of all perspectives.</p><p><strong>Where: </strong>Comprehensive decision-making, product reviews, or market-entry assessments.</p><p><strong>How: </strong>Explicitly guide ChatGPT through sequential analytical perspectives: factual, emotional, risk-oriented, optimistic, creative, and procedural.</p><p><strong>Prompt Template:</strong></p><blockquote><p><em>"Conduct a comprehensive <strong>Six Thinking Hats analysis</strong> for <strong>[decision, strategic initiative, or idea]</strong>, systematically addressing each perspective:</em></p><ul><li><p><em><strong>White Hat (Facts):</strong> Clearly outline relevant data and factual information.</em></p></li><li><p><em><strong>Red Hat (Emotions):</strong> Summarize emotional reactions and intuitive responses of key stakeholders.</em></p></li><li><p><em><strong>Black Hat (Risks):</strong> Identify specific potential negatives, risks, and critical obstacles.</em></p></li><li><p><em><strong>Yellow Hat (Benefits):</strong> Highlight explicit, optimistic benefits and opportunities.</em></p></li><li><p><em><strong>Green Hat (Creative Alternatives):</strong> Suggest creative or innovative alternatives to improve or enhance outcomes.</em></p></li><li><p><em><strong>Blue Hat (Structured Recommendation):</strong> Provide a clear, structured summary and actionable recommendation.</em></p></li></ul><p><em>Incorporate authoritative insights from deep research and provided supporting documents (<strong>[e.g., user surveys, market data, internal reports]</strong>) to substantiate each analytical perspective."</em></p></blockquote><p><strong>Example Prompts:</strong></p><ul><li><p>"Using the Six Thinking Hats method, evaluate our company's proposal to shift entirely to subscription-based pricing. 
Include clear insights covering factual data (market benchmarks, pricing models), emotional reactions (customer perceptions), risks (potential churn), benefits (stable revenue streams), creative alternatives (tiered models), and conclude with a structured final recommendation."</p></li><li><p>"Conduct a Six Thinking Hats analysis of our planned expansion into emerging markets. Clearly structure your response by assessing the facts (market size, competitors), emotional reactions (internal team morale), risks (political instability, economic volatility), benefits (growth potential), creative strategies (joint ventures, localized branding), and finally provide a concise recommendation."</p></li><li><p>"Evaluate the introduction of AI-driven customer service bots through the Six Thinking Hats framework. Methodically cover factual analysis (expected efficiency gains), emotional insights (customer satisfaction and frustrations), risks (loss of personal touch, technical issues), benefits (24/7 availability), creative alternatives (hybrid solutions), and deliver a structured strategic recommendation."</p></li></ul><p><strong>Use in combination with: </strong>Your own supporting documents (user surveys, market data) and optionally a canvas to visualise outcomes clearly.</p><div><hr></div><h2><strong>6. Analogical Synthesis</strong></h2><p>Analogical Synthesis is a creative reasoning technique prompting ChatGPT to explicitly draw comparisons between your current strategic challenge and analogous situations from entirely unrelated fields or historical contexts. This method unlocks novel perspectives and innovative solutions by borrowing insights from outside conventional industry or domain boundaries, stimulating creativity when traditional problem-solving methods fall short.</p><p><strong>At What Point: </strong>When conventional thinking has reached a limit and fresh perspectives are needed.</p><p><strong>Where: </strong>Creative problem-solving, innovation ideation, tough strategic problems.</p><p><strong>How: </strong>Prompt for explicit cross-domain analogies to uncover novel insights and solutions.</p><p><strong>Prompt Template:</strong></p><blockquote><p><em>"Identify <strong>two clear analogies</strong> from entirely unrelated fields, industries, or historical contexts that parallel our current challenge of <strong>[describe specific strategic challenge or issue]</strong>. For each analogy, explicitly:</em></p><ul><li><p><em>Explain the parallel clearly and concretely.</em></p></li><li><p><em>Highlight key insights, lessons, or innovations from the analogous situation.</em></p></li><li><p><em>Recommend specific, actionable strategies or ideas inspired by these analogies.</em></p></li></ul><p><em>Incorporate authoritative insights verified through deep research and historical or cross-industry examples to ensure accuracy and applicability."</em></p></blockquote><p><strong>Example Prompts:</strong></p><ul><li><p>"Identify two clear analogies from historical innovations or disruptions in completely different industries that parallel our current challenge of increasing customer retention in SaaS products. Clearly explain each analogy and suggest actionable strategies inspired by these parallels."</p></li><li><p>"Our organization faces significant communication and coordination issues due to remote work. 
Provide two analogies from entirely unrelated fields (e.g., orchestras, space exploration, or military logistics) and suggest practical solutions we could adopt based on these analogies."</p></li><li><p>"Draw clear parallels between managing misinformation on social media platforms and controlling the spread of infectious diseases in public health. Outline specific strategies from epidemiology that we could effectively apply to manage misinformation."</p></li></ul><p><strong>Use in combination with: </strong>Search for historical or cross-industry examples, enhanced by deep research for accuracy.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Finding this helpful? Subscribe to save and find more content like this.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2><strong>7. Rapid Evidence Review</strong></h2><p>The Rapid Evidence Review approach directs ChatGPT to quickly and systematically summarize complex topics by highlighting key findings, central debates, research gaps, and actionable implications. It mimics a concise yet thorough literature or market review, offering rapid comprehension and actionable insights, ideal for informing strategic decisions or briefing stakeholders efficiently.</p><p><strong>At What Point: </strong>Early research phase or when needing rapid, concise insights for stakeholder briefings.</p><p><strong>Where: </strong>Quick summaries of scientific research, technological trends, or industry knowledge.</p><p><strong>How: </strong>Prompt structured summaries explicitly addressing findings, debates, gaps, and implications.</p><p><strong>Prompt Template:</strong></p><blockquote><p><em>"Rapidly summarize the current state of knowledge on <strong>[specific scientific topic, technology, trend, or issue]</strong>. Explicitly structure your response to address:</em></p><ul><li><p><em><strong>Key Findings:</strong> Clearly highlight the major established insights or breakthroughs.</em></p></li><li><p><em><strong>Ongoing Debates:</strong> Identify central discussions or controversies currently shaping the field.</em></p></li><li><p><em><strong>Research Gaps:</strong> Explicitly note significant unresolved questions or missing areas of knowledge.</em></p></li><li><p><em><strong>Strategic Implications:</strong> Clearly outline practical implications or actionable opportunities for <strong>[relevant stakeholder or business context]</strong>.</em></p></li></ul><p><em>Incorporate authoritative insights verified through deep research and recent sources to ensure accuracy and relevance."</em></p></blockquote><p><strong>Example Prompts:</strong></p><ul><li><p>"Provide a structured overview of recent developments in renewable energy storage technologies, explicitly addressing key innovations, technical and economic debates, knowledge gaps, and potential business opportunities."</p></li><li><p>"Rapidly summarize the latest research on generative AI ethics and regulation. 
Clearly identify key research findings, highlight ongoing industry and academic debates, note significant knowledge gaps, and outline strategic implications for technology firms."</p></li></ul><p><strong>Use in combination with: </strong>Deep research and search for verifying accuracy and currency of insights.</p><div><hr></div><h2><strong>8. Pros-Cons-Remedies List</strong></h2><p>The Pros-Cons-Remedies framework systematically guides ChatGPT to clearly outline advantages (pros), disadvantages (cons), and explicit practical solutions or mitigations (remedies) for identified disadvantages. This structured evaluation promotes balanced and rigorous decision-making by proactively addressing potential challenges alongside their solutions, ensuring well-informed and resilient strategic choices.</p><p><strong>At What Point: </strong>Mid-stage decision-making, after shortlisting options, before final commitment.</p><p><strong>Where: </strong>Evaluating strategic choices involving clear trade-offs, product features, or tech investments.</p><p><strong>How: </strong>Prompt balanced, actionable assessments detailing pros, cons, and mitigations explicitly.</p><p><strong>Prompt Template:</strong></p><blockquote><p><em>"Provide a structured <strong>Pros-Cons-Remedies</strong> analysis for <strong>[specific strategic decision, product feature, or technology investment]</strong>. Clearly outline:</em></p><ul><li><p><em><strong>Pros:</strong> Explicit advantages, opportunities, and potential positive outcomes.</em></p></li><li><p><em><strong>Cons:</strong> Specific disadvantages, risks, and potential challenges.</em></p></li><li><p><em><strong>Remedies:</strong> Practical, actionable solutions or mitigations for each identified disadvantage or risk.</em></p></li></ul><p><em>Incorporate authoritative insights and verify accuracy through deep research and the provided supporting documents (<strong>[e.g., internal reports, market data, technical assessments]</strong>)."</em></p></blockquote><p><strong>Example Prompts:</strong></p><ul><li><p>"List the pros and cons of integrating GPT-based chatbots into our customer support systems. Clearly provide realistic mitigation strategies for each identified disadvantage or risk."</p></li><li><p>"Evaluate adopting remote-first working permanently. Identify clear advantages, potential drawbacks, and explicitly outline effective remedies or strategies to manage each identified con."</p></li><li><p>"Analyze switching our technology infrastructure entirely to cloud-based solutions. Present a balanced list of benefits and risks, providing detailed practical solutions or mitigations for each of the risks identified."</p></li></ul><p><strong>Use in combination with: </strong>Your own supporting documentation and deep research to enhance the rigor of identified risks and remedies.</p><div><hr></div><h2><strong>9. Role-Play Simulation</strong></h2><p>Role-Play Simulation prompts ChatGPT to realistically embody and simulate interactions between multiple stakeholders, capturing their distinct perspectives, reactions, and concerns. This approach uncovers interpersonal dynamics, identifies friction points, and predicts emotional and rational reactions to new decisions or changes. 
It's particularly effective for preempting stakeholder objections, aligning internal teams, and refining communication strategies before implementation.</p><p><strong>At What Point: </strong>Late in planning stages, before rollout or communication, ensuring stakeholder readiness and minimizing pushback.</p><p><strong>Where: </strong>Anticipating stakeholder reactions, user experiences, organizational dynamics around new decisions.</p><p><strong>How: </strong>Simulate realistic dialogues or stakeholder interactions explicitly surfacing friction points.</p><p><strong>Prompt Template:</strong></p><blockquote><p><em>"Simulate a realistic dialogue involving <strong>[list specific stakeholders involved]</strong> discussing <strong>[specific decision, change, or issue]</strong>. Explicitly:</em></p><ul><li><p><em>Embody each stakeholder's distinct perspective, emotional reactions, practical concerns, and rational objections.</em></p></li><li><p><em>Clearly highlight friction points, misunderstandings, or areas of disagreement.</em></p></li><li><p><em>Surface key unresolved questions or objections requiring attention before rollout or communication.</em></p></li></ul><p><em>Ground the dialogue in authoritative insights from provided stakeholder research, customer feedback, and deep research to ensure authenticity and accuracy."</em></p></blockquote><p><strong>Example Prompts:</strong></p><ul><li><p>"Simulate a detailed conversation involving a product manager, a senior engineer, and a skeptical long-term customer discussing our decision to sunset an old but popular feature. Clearly highlight their concerns, objections, and questions."</p></li><li><p>"Imagine a dialogue between HR, team leaders, and front-line employees discussing the upcoming shift to hybrid remote work. Explicitly surface each stakeholder&#8217;s emotional reactions, practical objections, and unresolved concerns."</p></li><li><p>"Create a realistic interaction involving a privacy-conscious customer, a data scientist, and a compliance officer discussing the introduction of advanced user-data-driven personalization in our product. Identify their differing viewpoints, ethical concerns, and objections."</p></li></ul><p><strong>Use in combination with: </strong>Your own stakeholder research or customer feedback, optionally visualized using a canvas.</p><div><hr></div><h2><strong>10. Strategy Synthesis Canvas</strong></h2><p>The Strategy Synthesis Canvas guides ChatGPT to distill extensive research and analysis into a concise, structured strategic memo or executive summary. This approach clearly outlines strategic objectives, relevant market context, key insights, strategic options (each with explicit pros and cons), and concludes with a strongly justified recommendation and immediate next steps. 
It ensures clarity, coherence, and actionable guidance for executive-level decision-making and broader stakeholder alignment.</p><p><strong>At What Point: </strong>After comprehensive research, immediately before executive-level communication or broader organizational dissemination.</p><p><strong>Where: </strong>Concise strategic summaries, executive memos, translating extensive analysis into actionable strategy.</p><p><strong>How: </strong>Prompt explicitly structured strategic summaries including goals, context, insights, strategic options, and clear recommendations.</p><p><strong>Prompt Template:</strong></p><blockquote><p><em>"Convert the extensive analysis conducted on <strong>[specific strategic topic, decision, or initiative]</strong> into a concise, structured <strong>executive strategy memo</strong>. Explicitly structure your response into the following sections:</em></p><ul><li><p><em><strong>Strategic Objectives:</strong> Clearly define our core goals and intended outcomes.</em></p></li><li><p><em><strong>Relevant Context:</strong> Summarize essential market, competitive, or industry context based on deep research and supporting documentation (<strong>[specify documents]</strong>).</em></p></li><li><p><em><strong>Key Insights:</strong> Highlight the three most critical insights derived from the analysis.</em></p></li><li><p><em><strong>Strategic Options:</strong> Clearly present viable strategic alternatives, explicitly listing pros and cons for each.</em></p></li><li><p><em><strong>Recommended Strategy:</strong> Provide a strongly justified recommendation.</em></p></li><li><p><em><strong>Immediate Next Steps:</strong> Clearly outline actionable next actions and responsibilities.</em></p></li></ul><p><em>Ensure accuracy, clarity, and actionable recommendations by grounding your synthesis in the provided extensive supporting documentation, deep research, and visual summaries if available."</em></p></blockquote><p><strong>Example Prompts:</strong></p><ul><li><p>"Summarize our extensive analysis on entering the Asian market into a concise executive strategy memo. Clearly structure it around our key strategic objective, market context, the three most critical insights discovered, strategic options we considered (each with clearly listed pros and cons), the recommended strategy with rationale, and immediate next steps for implementation."</p></li><li><p>"Convert the detailed research conducted on integrating generative AI into our existing product ecosystem into a structured strategy memo suitable for executive review. Include clear sections on objectives, industry and competitive context, the top insights derived, strategic alternatives evaluated with explicit pros and cons, a robust recommendation, and clearly outlined next actions."</p></li><li><p>"Create a clear and concise strategic synthesis summarizing our recent customer research initiative. Clearly articulate our goals, customer context, key findings, strategic paths forward (detailing pros and cons for each), our recommended course of action with justification, and immediate steps for the product and marketing teams."</p></li></ul><p><strong>Use in combination with: </strong>Your own extensive supporting documents and analysis, summarized visually or structured clearly via a canvas.</p><div><hr></div><p>Structured prompting can greatly enhance your strategic thought process. 
Combine these prompt frameworks strategically with deep research, search, canvas visualisation, and your own rich internal documentation for exceptional outcomes.</p>
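<p>If you want to fold one of these frameworks into a repeatable workflow, here&#8217;s a minimal sketch of the Socratic Inquiry Chain (framework 2) as a loop. It assumes the OpenAI Python SDK with an API key in the environment; the model name and cycle count are illustrative, and any chat-style API would do.</p><pre><code># Framework 2 (Socratic Inquiry Chain) run as an iterative loop.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment;
# model name and cycle count are illustrative choices.
from openai import OpenAI

client = OpenAI()

PROBE = ("Critically question your previous answer: surface hidden nuances, "
         "correct likely misconceptions, then give a refined explanation.")

def socratic_chain(topic: str, cycles: int = 3) -> str:
    messages = [{"role": "user", "content": f"Briefly explain {topic}."}]
    answer = ""
    for _ in range(cycles + 1):    # first pass explains, later passes refine
        answer = client.chat.completions.create(
            model="gpt-4o", messages=messages,
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content": PROBE})
    return answer

print(socratic_chain("CRISPR gene editing"))</code></pre>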
<p>Which combination resonates most with your style or current project? Would love your perspective.</p>]]></content:encoded></item><item><title><![CDATA[Demystifying the Model Context Protocol]]></title><description><![CDATA[LLMs are no longer just text predictors &#8212; they&#8217;re turning into reasoning engines.]]></description><link>https://www.builderlab.ai/p/demystifying-the-model-context-protocol</link><guid isPermaLink="false">https://www.builderlab.ai/p/demystifying-the-model-context-protocol</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Fri, 04 Apr 2025 10:54:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fb90778a-8c8b-48ca-9906-e9e8cb843f2f_944x944.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>LLMs are no longer just text predictors &#8212; they&#8217;re turning into reasoning engines. But while model performance improves, <strong>our ability to manage their context has lagged behind</strong>.</p><p>As we build richer GenAI systems &#8212; personalized shopping assistants, agentic workflows, retrieval-augmented copilots &#8212; we&#8217;re increasingly duct-taping memory, tools, user state, and business logic into brittle prompt strings. It works, until it doesn't.</p><p><strong>Model Context Protocol (MCP)</strong> is a response to this architectural gap. It offers a way to <strong>define, structure, and operationalize context</strong> so that models reason more intelligently &#8212; and systems scale more reliably.</p><p>This isn&#8217;t just about prompt templates. It&#8217;s about a design philosophy that treats context as data, not prose.</p><p>In this article, I&#8217;ll break down:</p><ul><li><p>What MCP is and what it includes</p></li><li><p>Why it matters for production-grade LLM systems</p></li><li><p>A step-by-step example of MCP in a GenAI e-commerce assistant</p></li><li><p>Ownership and design implications across PM, eng, and infra teams</p></li><li><p>When and how to introduce MCP in your own stack</p></li></ul><p>Let&#8217;s get into it.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">New to BuilderLab.ai? Subscribe for free to receive new posts on AI, curated for product builders and pioneers.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>What is Model Context Protocol (MCP)?</h3><p><strong>Model Context Protocol (MCP)</strong> is a proposed <strong>standard way to define, share, and manage the context</strong> that LLMs use during inference.</p><p>It&#8217;s meant to solve a growing problem: As we build <strong>complex GenAI systems</strong> &#8212; think multi-agent workflows, RAG pipelines, or tools interacting with LLMs &#8212; we need a reliable and interpretable way to <strong>pass &#8220;context&#8221; into models</strong>. </p><p>If you&#8217;re new to all this terminology (or to get the most out of this write-up), be sure to read through some other deep dives on <a href="https://www.builderlab.ai/p/understanding-llm-agents">LLM Agents</a>, <a href="https://www.builderlab.ai/p/what-llm-technique-is-right-for-you">RAG systems, and general LLM techniques.</a></p><h4>Why is this important?</h4><p>LLMs don&#8217;t work in isolation anymore. Modern AI products involve:</p><ul><li><p>Tool use (e.g., APIs, search, plugins)</p></li><li><p>Memory (long-term, short-term)</p></li><li><p>Retrieval-augmented generation (RAG)</p></li><li><p>User state, preferences, history</p></li><li><p>Chain-of-thought reasoning</p></li><li><p>Multi-step agents</p></li></ul><p>Each component needs <strong>clear, structured context</strong> to work well. But today, most developers just&#8230; shove it all into a prompt. MCP aims to bring <strong>order and clarity</strong>.</p><h4>What counts as "context"?</h4><p>In MCP, <strong>context</strong> is any information that guides the model&#8217;s behaviour, such as:</p><ul><li><p><strong>System instructions</strong></p></li><li><p><strong>Examples / few-shot prompts</strong></p></li><li><p><strong>Documents retrieved via RAG</strong></p></li><li><p><strong>Tools available to the model</strong></p></li><li><p><strong>User history or preferences</strong></p></li><li><p><strong>Session memory</strong></p></li></ul><h4>What does MCP actually define?</h4><p>It defines a <strong>structured format</strong> (usually JSON or similar) to organize these components. Think of it as a schema or protocol &#8212; like how HTTP structures web communication, MCP structures LLM interaction.</p><p>Here&#8217;s a simplified example:</p><pre><code>{
  "system_instruction": "You are a helpful personal shopping assistant.",
  "user_goal": "Find waterproof sneakers under &#8364;150 in a minimalist style.",
  "retrieved_documents": [...],
  "tools": ["price_tracker", "style_matcher"],
  "chat_history": [...],
  "memory": {
    "name": "Pranav",
    "shoe_size": "EU 43",
    "style": ["Minimalist", "Neutral colors"]
  }
}</code></pre><p>Then, an <strong>MCP interpreter</strong> or <strong>router</strong> translates this context into a full prompt the LLM understands.</p>
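<p>To make that translation concrete, here&#8217;s a minimal interpreter sketch in Python. It assumes the simplified schema above (string tool names, a flat memory map, role/content chat turns); a production router would also handle tool schemas, truncation, and token budgets.</p><pre><code># A minimal MCP-interpreter sketch: walk the structured context and compile
# it into one prompt string. Field names follow the simplified JSON above;
# the formatting choices are illustrative, not part of any standard.

def compile_prompt(ctx: dict) -> str:
    parts = [f"System: {ctx['system_instruction']}"]
    parts.append(f"User goal: {ctx['user_goal']}")
    if ctx.get("memory"):
        facts = "; ".join(f"{k}={v}" for k, v in ctx["memory"].items())
        parts.append(f"Known about the user: {facts}")
    if ctx.get("tools"):
        parts.append("Available tools: " + ", ".join(ctx["tools"]))
    for doc in ctx.get("retrieved_documents", []):
        parts.append(f"Context document: {doc}")
    for turn in ctx.get("chat_history", []):   # assumes role/content dicts
        parts.append(f"{turn['role']}: {turn['content']}")
    return "\n".join(parts)</code></pre><p>The win is that each section of the final prompt traces back to one field of the context object, which is exactly what makes debugging and reuse tractable.</p>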
  "system_instruction": "You are a stylish personal shopping assistant. Always recommend relevant, high-quality, fashion-forward products.",
  "user_goal": "Find stylish, waterproof sneakers under &#8364;150 for city use in neutral colors and minimalist design.",
  "user_profile": {
    "name": "Pranav",
    "style": ["minimalist", "neutral", "luxury-but-practical"],
    "shoe_size": "EU 43",
    "past_purchases": ["Common Projects", "On Cloudnova", "Nike Air Max 1"],
    "preferred_price_range": "100-200 EUR"
  },
  "retrieved_documents": [
    {
      "source": "ProductCatalog",
      "query": "waterproof sneakers neutral minimalist under 150",
      "results": [
        {
          "name": "Vessi Cityscape",
          "price": "&#8364;135",
          "color": "Slate Grey",
          "features": ["Waterproof", "Lightweight", "Minimalist"],
          "image_url": "...",
          "product_link": "..."
        },
        {
          "name": "Allbirds Mizzle",
          "price": "&#8364;145",
          "color": "Natural White",
          "features": ["Water-resistant", "Wool blend", "Sustainable"],
          "image_url": "...",
          "product_link": "..."
        }
      ]
    }
  ],
  "tools": [
    {
      "name": "PriceTracker",
      "description": "Checks if the product is on sale",
      "args": ["product_link"]
    },
    {
      "name": "StyleComparer",
      "description": "Compares visual style similarity to past purchases",
      "args": ["image_url", "past_purchases"]
    }
  ],
  "chat_history": [
    {
      "role": "user",
      "content": "I'm looking for shoes I can wear in Amsterdam &#8212; rainproof but still stylish."
    },
    {
      "role": "assistant",
      "content": "Got it! I&#8217;ll prioritize waterproof features with a clean aesthetic. Any preferred brands?"
    }
  ],
  "memory": {
    "long_term": {
      "Pranav_shoe_style": "modern minimalist, neutral colors",
      "purchase_behavior": "Willing to pay premium for quality"
    }
  }
}
</code></pre><h4>Step 2: MCP Interpreter Builds Prompt</h4><p>The interpreter compiles all this into a prompt like:</p><blockquote><p><strong>System:</strong> You are a fashion-savvy personal shopping assistant helping Pranav find stylish waterproof sneakers.</p><p><strong>User Goal:</strong> Pranav is looking for waterproof, minimalist sneakers in neutral colors under &#8364;150.</p><p><strong>User Profile:</strong> Pranav wears EU 43, prefers neutral colors, and has previously bought Common Projects and On Cloudnova. He likes quality over hype.</p><p><strong>Retrieved Products:</strong></p><ul><li><p>Vessi Cityscape: Waterproof, &#8364;135, Slate Grey, minimalist</p></li><li><p>Allbirds Mizzle: Water-resistant, &#8364;145, sustainable, Natural White</p></li></ul><p><strong>Use Tools:</strong> PriceTracker, StyleComparer</p><p>Based on the above, suggest the best options. Justify your picks with reasoning and style compatibility.</p></blockquote><h4>Step 3: Tools Are Called</h4><p>The model suggests using tools:</p><ul><li><p>Call <code>StyleComparer</code> to compare product images to Pranav&#8217;s past taste</p></li><li><p>Use <code>PriceTracker</code> to flag any products on sale</p></li></ul><p>Example outputs from tools:</p><ul><li><p>Vessi: 92% style match, price stable</p></li><li><p>Allbirds: 85% style match, currently 10% off</p></li></ul><h4>Step 4: Assistant Response (LLM Output)</h4><blockquote><p>&#8220;Based on your style and past purchases, I&#8217;d recommend the <strong>Vessi Cityscape</strong> (&#8364;135). It&#8217;s highly waterproof, clean in design, and very close to your previous aesthetic.</p><p>The <strong>Allbirds Mizzle</strong> is another solid choice &#8212; a bit more casual and sustainable, and currently <strong>10% off</strong>. Want me to check sizing availability or delivery times?&#8221;</p></blockquote><h4>Step 5: Memory Updated</h4><p>Long-term memory can be updated with:</p><ul><li><p>&#8220;Pranav liked Vessi design.&#8221;</p></li><li><p>&#8220;Responded positively to sale info.&#8221;</p></li><li><p>&#8220;Might be open to sustainable materials.&#8221;</p></li></ul><h4>Why this matters (and why MCP helps)</h4><ul><li><p><strong>Structured, modular context</strong> &#8594; Better model reasoning</p></li><li><p><strong>Tool-use integration</strong> &#8594; Richer assistant behavior</p></li><li><p><strong>Memory-aware</strong> &#8594; More personalized long-term experiences</p></li><li><p><strong>Easy to debug + maintain</strong> &#8594; Each part of the context is traceable</p></li></ul>
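<p>To make Steps 1 and 2 concrete, here&#8217;s a minimal sketch of what an MCP interpreter could look like in Python. It&#8217;s illustrative only: the function name and message shapes are my assumptions, not part of any official spec. The core idea is simply walking the structured context object and compiling it into chat messages.</p><pre><code># Minimal, illustrative MCP-style interpreter (a sketch, not an official
# implementation). It compiles a structured context dict, like the JSON in
# Step 1, into system/user messages that any chat-style LLM API can consume.
import json

def compile_prompt(context: dict) -> list[dict]:
    """Flatten an MCP-style context object into chat messages."""
    system_parts = [context["system_instruction"]]

    # Surface the user profile and long-term memory to the model.
    if "user_profile" in context:
        system_parts.append("User profile: " + json.dumps(context["user_profile"]))
    if "memory" in context:
        system_parts.append("Memory: " + json.dumps(context["memory"]))

    # Describe the available tools so the model knows what it can call.
    for tool in context.get("tools", []):
        system_parts.append(f"Tool available: {tool['name']}: {tool['description']}")

    # Append retrieved documents (the RAG slice of the context).
    for doc in context.get("retrieved_documents", []):
        system_parts.append(f"Retrieved from {doc['source']}: " + json.dumps(doc["results"]))

    messages = [{"role": "system", "content": "\n".join(system_parts)}]
    messages.extend(context.get("chat_history", []))  # prior turns, verbatim
    messages.append({"role": "user", "content": context["user_goal"]})
    return messages</code></pre><p>Because every section of the context is a plain, named field, each one can be logged, diffed, or swapped independently, which is exactly the traceability point above.</p>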
<div><hr></div><h3>Actors and Actions in the MCP context</h3><p>So MCP can be quite useful to log, keep track of, and manage large LLM systems. But who&#8217;s building what? Who has input in what part of the system? Let&#8217;s break down the <strong>ownership of each part of the flow</strong>: who defines what in an <strong>MCP-driven GenAI system</strong>, especially in an e-commerce context like your shopping assistant.</p><p>We&#8217;ll break this into the main <strong>phases</strong>, then for each component, map out <strong>who is responsible</strong>:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!kYZu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44bd2777-2816-47b6-8626-249f861c7177_1540x1210.png" width="1456" height="1144" alt=""></figure></div><p><strong>MCP itself doesn&#8217;t define the values</strong> &#8212; it defines the <strong>structure</strong> and <strong>interfaces</strong> to handle these cleanly.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!da--!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5913e5a6-0c8b-4651-86aa-06724036c6e3_1510x752.png" width="1456" height="725" alt=""></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!vX8N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe2a169d-8ac5-4837-a3c4-b69a4eda349d_1514x726.png" width="1456" height="698" alt=""></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!zwlM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77340a20-f7c6-494f-916b-ea942f085b6c_1520x708.png" width="1456" height="678" alt=""></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_aSj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ca02532-cca6-41d8-8b4e-58ccb1817497_1502x718.png" width="1456" height="696" alt=""></figure></div>
<div><hr></div><h3><strong>Who is MCP useful for, why, and when</strong></h3><p><strong>MCP</strong> is especially useful for <strong>teams building complex LLM-powered systems</strong> &#8212; whether you&#8217;re working on:</p><ul><li><p>Personalized shopping assistants</p></li><li><p>Multi-step agent workflows</p></li><li><p>Retrieval-Augmented Generation (RAG) systems</p></li><li><p>Memory-aware conversational bots</p></li><li><p>LLM-powered tools and internal copilots</p></li></ul><h4><strong>Why use MCP</strong></h4><p>MCP brings <strong>clarity, structure, and scalability</strong> to LLM interactions by separating out and formalizing all the &#8220;hidden&#8221; context behind model behavior &#8212; such as goals, memory, tools, history, and retrieved documents.</p><p>Instead of shoving everything into one monolithic prompt, MCP turns context into a <strong>modular, inspectable object</strong>. That means:</p><ul><li><p>Easier debugging and observability</p></li><li><p>Better reuse across different agents and tasks</p></li><li><p>Cleaner personalization and memory updates</p></li><li><p>More consistent and maintainable prompt engineering</p></li></ul><h4><strong>When to use MCP</strong></h4><p>You should reach for MCP when:</p><ul><li><p>Your LLM use case is <strong>growing in complexity</strong> (e.g., agents, RAG, tools, long-term memory)</p></li><li><p>You&#8217;re working in a <strong>production or multi-user environment</strong></p></li><li><p>You want <strong>reliability and clarity</strong> in how context flows into your LLM</p></li><li><p>You&#8217;re building a <strong>platform</strong> or shared infra where multiple apps or agents interact with models</p></li></ul>
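<p>Even adopted informally, a lightweight schema captures much of this benefit. Here&#8217;s an illustrative sketch in Python; the field names follow the examples earlier in this post and are my assumptions, not an official standard:</p><pre><code># Illustrative MCP-style context schema (field names follow the examples
# in this post; this is a sketch, not an official standard).
from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    description: str
    args: list[str] = field(default_factory=list)

@dataclass
class MCPContext:
    system_instruction: str
    user_goal: str
    user_profile: dict = field(default_factory=dict)
    retrieved_documents: list[dict] = field(default_factory=list)
    tools: list[Tool] = field(default_factory=list)
    chat_history: list[dict] = field(default_factory=list)
    memory: dict = field(default_factory=dict)

# Each field is now typed, testable, and inspectable on its own,
# instead of being buried somewhere inside one long prompt string.
ctx = MCPContext(
    system_instruction="You are a stylish personal shopping assistant.",
    user_goal="Find waterproof sneakers under EUR 150 in a minimalist style.",
    tools=[Tool("PriceTracker", "Checks if the product is on sale", ["product_link"])],
)</code></pre>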
<div><hr></div><h3>What&#8217;s Next?</h3><p>Model Context Protocol is still in its early days, but it&#8217;s quickly becoming essential infrastructure for building scalable, modular, and intelligent GenAI systems.</p><p>If you&#8217;re working on:</p><ul><li><p>A personalized shopping assistant</p></li><li><p>A multi-agent research tool</p></li><li><p>An LLM-powered internal copilot</p></li><li><p>Or anything that combines RAG, tools, and memory...</p></li></ul><p>&#8230;consider adopting MCP principles now &#8212; even informally &#8212; to future-proof your architecture.</p><p>In upcoming posts, I&#8217;ll dive deeper into:</p><ul><li><p>MCP templates and schema design</p></li><li><p>Live examples across e-commerce, HR tech, and SaaS copilots</p></li></ul><p>Want a reusable MCP starter kit or LangChain-compatible boilerplate? Drop a comment or DM &#8212; I&#8217;m putting something together soon.</p>]]></content:encoded></item><item><title><![CDATA[What LLM technique is right for you?]]></title><description><![CDATA[Should you use RAG? Agents? Fine tuning? Or stick to system prompts?]]></description><link>https://www.builderlab.ai/p/what-llm-technique-is-right-for-you</link><guid isPermaLink="false">https://www.builderlab.ai/p/what-llm-technique-is-right-for-you</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Wed, 22 Jan 2025 21:44:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1a76634f-9693-430a-8015-1bde74ec40aa_944x944.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>With a growing suite of techniques available to product builders&#8212;from system and user prompts to fine-tuning, Retrieval-Augmented Generation (RAG), tools, and agents&#8212;it can be confusing to decide which approach to adopt for your use case. And the most performant option isn&#8217;t always the most expensive one! Each method comes with its own trade-offs, costs, and expertise requirements. Let&#8217;s break down these techniques, understand when to use each, and see whether to combine them for an effective outcome.</p><div><hr></div><h2>TL;DR: Here&#8217;s a handy decision matrix:</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!EyQc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14a1deb8-2b53-47e1-8bd6-de8e5a524e2f_1510x678.png" width="1456" height="654" alt=""></figure></div>
<h3><strong>The Foundations: System and User Prompts</strong></h3><p>The best place to get started, and often the most useful, is the simple user prompt / system prompt setup. This is pretty much just using prompts to do your bidding: when you write a prompt, split it into a system behavior definition and a user input, and that&#8217;s it.</p>
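<p>A minimal sketch of this setup, using the OpenAI Python SDK (the model name is a placeholder; any chat-style API follows the same shape):</p><pre><code># Simplest possible setup: one system prompt, one user prompt.
# Assumes the OpenAI Python SDK (pip install openai) and an API key in the
# OPENAI_API_KEY environment variable; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; swap in whichever model you use
    messages=[
        {"role": "system", "content": "You are a helpful travel assistant."},
        {"role": "user", "content": "What's the best time to visit Paris?"},
    ],
)
print(response.choices[0].message.content)

# Note: there is no built-in context retention here. For conversation
# continuity, append each user and assistant turn back into `messages`.</code></pre>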
<p><strong>What They Are:</strong> System prompts define the behavior of the LLM (e.g., &#8220;You are a helpful assistant&#8221;), while user prompts encapsulate the input provided by the end-user. Together, they form the simplest way to interact with an LLM.</p><p><strong>When to Use:</strong></p><ol><li><p><strong>Rapid Prototyping and quick first launches:</strong> Perfect for proving immediate value from LLMs, this approach is especially useful when the LLM&#8217;s world knowledge is sufficient to interact with users. Some additional context can be added into system prompts, but it&#8217;s usually limited.</p></li><li><p><strong>Limited Complexity:</strong> Best used when user needs are straightforward (e.g., answering FAQs or generating content).</p></li><li><p><strong>Minimal Budget:</strong> Ideal for use cases where cost is a major constraint, as prompts alone require no additional infrastructure or model modifications.</p></li></ol><p><strong>Use Case Examples:</strong></p><ul><li><p><strong>Travel:</strong> A chatbot answering common travel queries like &#8220;What&#8217;s the best time to visit Paris?&#8221; or &#8220;Can I carry a power bank on a flight?&#8221; - you would typically rely on the LLM&#8217;s world knowledge for this.</p></li><li><p><strong>Ecommerce:</strong> Generating product descriptions or providing instant answers to &#8220;What are the dimensions of this sofa?&#8221; - you would typically pass on the product context (the catalog page) along with the user prompt.</p></li><li><p><strong>Customer Service:</strong> Addressing FAQs such as &#8220;What is your return policy?&#8221; or &#8220;How can I track my order?&#8221; - you would typically pass topic-specific customer support workflows along with the user prompt.</p></li></ul><p><strong>Business Outlook:</strong> This approach offers a quick and cost-effective way to enhance customer experience without extensive development or infrastructure investment. However, it lacks depth for handling complex or highly personalized interactions.</p><p><strong>Advantages:</strong></p><ul><li><p><strong>Low Cost:</strong> No need for additional computational resources.</p></li><li><p><strong>Simplicity:</strong> Easy to set up and iterate on.</p></li><li><p><strong>Flexibility:</strong> Quickly adaptable to new tasks.</p></li></ul><p><strong>Limitations:</strong></p><ul><li><p><strong>No Context Retention:</strong> Each prompt is treated as a standalone interaction.
If you want conversation continuity, you need to pass the previous user and LLM messages along with the user prompt.</p></li><li><p><strong>Lack of Customization:</strong> Limited ability to tailor responses for domain-specific needs. Since all info is contained within the prompt, it&#8217;s difficult to answer from a large knowledge base, or to build recommendations based on inventory, etc.</p></li></ul><p><strong>Cost:</strong> Lowest among all methods; typically only API call costs.</p><p><strong>Expertise Needed:</strong> Basic understanding of prompt design.</p><div><hr></div><h3><strong>Retrieval-Augmented Generation (RAG)</strong></h3><p><strong>What It Is:</strong> RAG integrates external data sources into LLM responses by retrieving relevant documents or data and appending it to the prompt. For example, user manuals, support manuals, and even supply data can be used alongside the LLM. The basic premise: given a user prompt, search a dataset and retrieve the most relevant context, append it to the prompt, and pass this to the LLM to respond (see the sketch below).</p><p><strong>When to Use:</strong></p><ol><li><p><strong>Dynamic Data Needs:</strong> When the LLM must access real-time or frequently updated information (e.g., current news, weather updates).</p></li><li><p><strong>Domain-Specific Knowledge:</strong> For proprietary or highly specific datasets that the model wasn&#8217;t trained on, or if the corpus of information is exceptionally large.</p></li><li><p><strong>Regulatory Compliance:</strong> When responses need to be based on verified or auditable information.</p></li></ol><p><strong>Use Case Examples:</strong></p><ul><li><p><strong>Travel:</strong> Providing real-time flight status updates or suggesting activities based on current weather conditions.</p></li><li><p><strong>Ecommerce:</strong> Recommending products based on a user&#8217;s purchase history or inventory data.</p></li><li><p><strong>Customer Service:</strong> Answering questions like &#8220;What are the latest promotions?&#8221; or &#8220;Is my order eligible for free shipping?&#8221;</p></li></ul><p><strong>Business Outlook:</strong> RAG enables businesses to bridge the gap between static LLM knowledge and dynamic, real-world data. It&#8217;s particularly useful for enhancing personalization and ensuring responses are accurate and up-to-date, making it a strong choice for competitive differentiation.</p>
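<p>Here&#8217;s a bare-bones version of that loop, again with the OpenAI SDK. The model names are placeholders, and a real system would use a vector store (e.g., Pinecone or Weaviate) instead of this in-memory search:</p><pre><code># Bare-bones RAG: embed the query, find the closest document, append it to
# the prompt. Model names are placeholders; production systems use a vector
# database rather than this brute-force in-memory search.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

docs = ["Orders over EUR 50 ship free.", "Returns are accepted within 30 days."]
doc_vectors = [embed(d) for d in docs]  # precompute and index these in production

def answer(question: str) -> str:
    q = embed(question)
    # Cosine similarity against every document; keep the best match.
    scores = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in doc_vectors]
    best_doc = docs[int(np.argmax(scores))]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {"role": "system", "content": "Answer using this context:\n" + best_doc},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("Is my order eligible for free shipping?"))</code></pre>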
<p><strong>Advantages:</strong></p><ul><li><p><strong>Enhanced Relevance:</strong> Incorporates the latest or domain-specific data.</p></li><li><p><strong>Scalability:</strong> Handles large datasets effectively.</p></li></ul><p><strong>Limitations:</strong></p><ul><li><p><strong>Latency:</strong> Retrieval can introduce delays in response times by adding one more step before generating a response.</p></li><li><p><strong>Complexity:</strong> Requires a retrieval system (e.g., Pinecone, Weaviate) and integration with the LLM.</p></li><li><p><strong>Incomplete Data Issues:</strong> If the retrieval system fails to fetch relevant data, the LLM&#8217;s response might lack accuracy.</p></li></ul><p><strong>Cost:</strong> Medium; involves additional infrastructure for indexing and retrieval.</p><p><strong>Expertise Needed:</strong> Moderate; knowledge of retrieval systems, embeddings, and prompt tuning.</p><div><hr></div><h3><strong>Fine-Tuning</strong></h3><p><strong>What It Is:</strong> Fine-tuning involves training an LLM on a domain-specific dataset to specialize it for particular tasks or industries.</p><p><strong>When to Use:</strong></p><ol><li><p><strong>Highly Specialized Tasks:</strong> When the LLM needs deep expertise in a narrow domain (e.g., legal, medical, or scientific contexts), or when the outputs need to be structured and predictable (e.g., JSON-formatted outputs).</p></li><li><p><strong>Brand Consistency:</strong> To ensure outputs align with a specific tone or style.</p></li><li><p><strong>Improving Base Model Limitations:</strong> When the base model struggles with task performance despite prompt optimization.</p></li></ol><p><strong>Use Case Examples:</strong></p><ul><li><p><strong>Travel:</strong> Tailoring an assistant to provide expert advice on niche destinations or luxury travel planning.</p></li><li><p><strong>Ecommerce:</strong> Ensuring product descriptions align with brand guidelines and customer preferences.</p></li><li><p><strong>Customer Service:</strong> Creating a specialized bot for handling complex policy inquiries or multilingual support.</p></li></ul><p><strong>Business Outlook:</strong> Fine-tuning offers precision and long-term ROI for businesses with specialized needs, though the upfront investment can be significant. It&#8217;s ideal for companies that require high-quality, domain-specific outputs consistently.</p><p><strong>Advantages:</strong></p><ul><li><p><strong>Precision:</strong> Produces highly accurate and tailored responses.</p></li><li><p><strong>Efficiency:</strong> Reduces the need for overly verbose prompts.</p></li></ul><p><strong>Limitations:</strong></p><ul><li><p><strong>Upfront Cost:</strong> Fine-tuning requires significant computational resources.</p></li><li><p><strong>Maintenance:</strong> Updated versions of the base model require re-tuning.</p></li></ul><p><strong>Cost:</strong> High; includes computational costs and potential licensing fees for access to base models.</p><p><strong>Expertise Needed:</strong> High; requires expertise in machine learning, data preparation, and evaluation.</p><div><hr></div><h3><strong>Tools and Function Calling</strong></h3><p><strong>What They Are:</strong> Tools extend LLM capabilities by allowing them to invoke external functions, APIs, or workflows dynamically (Read more about tools and agents here). For example, an application that needs the LLM to do simple math operations may give it access to a calculator. The LLM, when it encounters a math problem, passes the relevant variables to the calculator, and then uses the output in its response.</p>
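<p>Here&#8217;s what that calculator example could look like with OpenAI-style function calling. The tool schema and model name are illustrative, and error handling is omitted:</p><pre><code># Sketch of the calculator example via OpenAI function calling.
# The tool schema and model name are illustrative; error handling omitted.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

messages = [{"role": "user", "content": "What is 17.5% of 2400?"}]
resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:  # the model chose to call the calculator
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = eval(args["expression"], {"__builtins__": {}})  # demo only; use a safe parser in production
    # Feed the tool result back so the model can finish its answer.
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": str(result)}]
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
    print(final.choices[0].message.content)</code></pre>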
<p><strong>When to Use:</strong></p><ol><li><p><strong>Task Automation:</strong> For applications that require executing specific actions, like booking tickets or checking inventory.</p></li><li><p><strong>Contextual Responses:</strong> When the LLM needs to gather information dynamically (e.g., fetching user details from a database).</p></li><li><p><strong>Interactive Workflows:</strong> Ideal for use cases requiring multiple steps (e.g., assembling reports, running calculations).</p></li></ol><p><strong>Use Case Examples:</strong></p><ul><li><p><strong>Travel:</strong> Booking flights or hotels based on user preferences and available options.</p></li><li><p><strong>Ecommerce:</strong> Checking product availability and placing orders directly from a chatbot interface.</p></li><li><p><strong>Customer Service:</strong> Initiating refunds, modifying orders, or providing account-specific details.</p></li></ul><p><strong>Business Outlook:</strong> The ability to automate actions increases operational efficiency and reduces friction in user interactions. While initial setup costs can be high, tools unlock significant value in automating repetitive or complex workflows.</p><p><strong>Advantages:</strong></p><ul><li><p><strong>Dynamic Capabilities:</strong> Extends the LLM&#8217;s usefulness beyond static text generation.</p></li><li><p><strong>Modularity:</strong> New functions can be added without retraining the model.</p></li></ul><p><strong>Limitations:</strong></p><ul><li><p><strong>Rigidity:</strong> Limited to predefined functions.</p></li><li><p><strong>Development Overhead:</strong> Requires backend integration and robust validation.</p></li></ul><p><strong>Cost:</strong> Medium to High; depends on the number of API calls and backend requirements.</p><p><strong>Expertise Needed:</strong> Moderate; familiarity with API integration and LLM prompt design.</p><div><hr></div><h3><strong>Agents</strong></h3><p><strong>What They Are:</strong> Agents are LLM-powered systems that use dynamic reasoning to plan and execute multi-step tasks, leveraging multiple tools or APIs autonomously (read more about agents here).</p><p><strong>When to Use:</strong></p><ol><li><p><strong>Complex Problem Solving:</strong> When workflows involve multiple interdependent steps.</p></li><li><p><strong>Adaptable Systems:</strong> For tasks that require reasoning and decision-making on-the-fly.</p></li><li><p><strong>End-to-End Solutions:</strong> When the LLM needs to autonomously achieve goals with minimal human input.</p></li></ol><p><strong>Use Case Examples:</strong></p><ul><li><p><strong>Travel:</strong> Creating personalized itineraries by combining weather forecasts, user preferences, and activity options.</p></li><li><p><strong>Ecommerce:</strong> Assisting users with end-to-end purchase decisions, from product discovery to delivery tracking.</p></li><li><p><strong>Customer Service:</strong> Handling escalations by dynamically pulling data, invoking tools, and proposing solutions autonomously.</p></li></ul><p><strong>Business Outlook:</strong> Agents offer transformative potential by automating complex tasks with minimal human oversight. However, output unpredictability and high operational costs mean they&#8217;re best suited for high-value or experimental use cases.</p>
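<p>A stripped-down agent loop, to make the shape concrete (the tool set and model name are illustrative; production agents need logging, guardrails, and much better error handling):</p><pre><code># Stripped-down agent loop: the model plans, calls tools, observes results,
# and repeats until it produces a final answer. Tool set and model name are
# illustrative; production agents need step limits, logging, and guardrails.
import json
from openai import OpenAI

client = OpenAI()

def search_flights(destination: str) -> str:
    return f"3 flights found to {destination} (stub data)"  # stand-in for a real API

TOOLS_IMPL = {"search_flights": search_flights}
TOOLS_SPEC = [{
    "type": "function",
    "function": {
        "name": "search_flights",
        "description": "Search available flights to a destination.",
        "parameters": {
            "type": "object",
            "properties": {"destination": {"type": "string"}},
            "required": ["destination"],
        },
    },
}]

messages = [{"role": "user", "content": "Plan me a weekend in Paris."}]
for _ in range(5):  # hard step limit to contain cost and runaway loops
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=TOOLS_SPEC)
    msg = resp.choices[0].message
    if not msg.tool_calls:  # no more tool use: the agent is done
        print(msg.content)
        break
    messages.append(msg)
    for call in msg.tool_calls:  # execute each requested tool, feed results back
        result = TOOLS_IMPL[call.function.name](**json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})</code></pre>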
<p><strong>Advantages:</strong></p><ul><li><p><strong>Autonomy:</strong> Reduces human oversight for repetitive or complex workflows.</p></li><li><p><strong>Context Retention:</strong> Maintains continuity across multi-step interactions.</p></li></ul><p><strong>Limitations:</strong></p><ul><li><p><strong>Cost:</strong> Agents may make multiple API calls, significantly increasing costs.</p></li><li><p><strong>Unpredictability:</strong> Requires rigorous error handling to avoid unexpected behaviors.</p></li></ul><p><strong>Cost:</strong> Very High; driven by complexity, API usage, and monitoring requirements.</p><p><strong>Expertise Needed:</strong> High; includes prompt engineering, tool integration, and error handling.</p><div><hr></div><h3><strong>Combining Techniques</strong></h3><p><strong>What It Is:</strong> Using multiple techniques together&#8212;e.g., combining RAG with tools or fine-tuning with agents&#8212;to maximize capabilities.</p><p><strong>When to Use:</strong></p><ol><li><p><strong>Hybrid Needs:</strong> When no single technique fully addresses your requirements.</p></li><li><p><strong>Scale and Specialization:</strong> To handle diverse user needs (e.g., personalized recommendations + real-time data retrieval).</p></li><li><p><strong>Future-Proofing:</strong> For systems that need to adapt as LLM capabilities evolve.</p></li></ol><p><strong>Use Case Examples:</strong></p><ul><li><p><strong>Travel:</strong> Combining RAG for real-time flight data with agents for personalized travel planning.</p></li><li><p><strong>Ecommerce:</strong> Using fine-tuned models for tailored product descriptions alongside tools for inventory management.</p></li><li><p><strong>Customer Service:</strong> Integrating RAG for policy retrieval with agents for handling multi-step issue resolution.</p></li></ul><p><strong>Business Outlook:</strong> Hybrid approaches deliver comprehensive solutions but come with increased complexity and costs.
They&#8217;re best for organizations seeking to build robust, scalable, and future-proof systems.</p><p><strong>Advantages:</strong></p><ul><li><p><strong>Comprehensive Solutions:</strong> Addresses multiple challenges simultaneously.</p></li><li><p><strong>Resilience:</strong> Reduces reliance on a single point of failure.</p></li></ul><p><strong>Limitations:</strong></p><ul><li><p><strong>Complexity:</strong> Increases development and maintenance efforts.</p></li><li><p><strong>Cost:</strong> Combines the costs of all involved techniques.</p></li></ul><p><strong>Cost:</strong> Highest; involves layered infrastructure and operational expenses.</p><p><strong>Expertise Needed:</strong> Very High; requires deep technical understanding across multiple domains.</p><div><hr></div><h3><strong>Decision Matrix</strong></h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!VdNR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b7d621-17df-4c53-af08-feb2eebe1de8_1510x678.png" width="1456" height="654" alt=""></figure></div>
https://substackcdn.com/image/fetch/$s_!VdNR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b7d621-17df-4c53-af08-feb2eebe1de8_1510x678.png 1272w, https://substackcdn.com/image/fetch/$s_!VdNR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b7d621-17df-4c53-af08-feb2eebe1de8_1510x678.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><p>Choosing the right technique&#8212;or combination of techniques&#8212;is critical to building impactful and cost-effective systems. Start with the simplest approach that meets your needs, and scale up as your requirements become more sophisticated. By understanding the trade-offs of each method, you can optimize for both business value and technical feasibility.</p>
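<p>To make &#8216;combining techniques&#8217; a little more concrete, here is a minimal sketch of a hybrid customer-service flow, in the spirit of the example above: a RAG step grounds the model in your policies, then an agent step acts on them. The helpers retrieve_policies() and run_llm_agent(), and the tool names, are hypothetical stand-ins rather than a specific library:</p><pre><code># Hybrid sketch: RAG supplies grounding context, an agent supplies actions.
# retrieve_policies(), run_llm_agent(), and the tool names are hypothetical.
def handle_ticket(customer_message):
    # RAG step: fetch the policy passages relevant to this ticket
    context = retrieve_policies(customer_message, top_k=3)

    # Agent step: let the LLM plan and call tools, grounded in that context
    return run_llm_agent(
        system_prompt=f"Resolve the customer's issue. Relevant policies:\n{context}",
        user_prompt=customer_message,
        tools=["check_order_status", "initiate_refund", "escalate_to_human"],
    )</code></pre><p>Each layer can be swapped independently (a better retriever, a cheaper model for the agent loop), which is exactly the resilience benefit noted in the advantages above.</p>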
<div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading BuilderLab.ai! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding LLM Agents]]></title><description><![CDATA[A product builder's perspective]]></description><link>https://www.builderlab.ai/p/understanding-llm-agents</link><guid isPermaLink="false">https://www.builderlab.ai/p/understanding-llm-agents</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Mon, 06 Jan 2025 04:45:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c938bdbb-834e-4ffe-ae87-9a2c89b6b084_944x944.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I&#8217;m assuming basic familiarity with large language models (LLMs) as I write this one. If you&#8217;d like a refresher, check out &#8216;<a href="https://foundit.substack.com/p/a-pms-introduction-to-large-language">A product builder&#8217;s introduction to large language models</a>&#8217;.</em></p><p>We&#8217;ve come quite far using LLMs&#8217; raw capabilities to build some very cool products. Better content personalisation, search, and question answering have delivered real customer value. What we haven&#8217;t seen much of in B2C production yet is LLM agents - and for good reason! This post will start with the basics: what agents are, and how to implement them.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/p/understanding-llm-agents?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading BuilderLab.ai! This post is public so feel free to share it!</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/p/understanding-llm-agents?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.builderlab.ai/p/understanding-llm-agents?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>The Agent landscape is about to change though, and I predict that in the next year or so we will all be very comfortable using LLM agents in our production systems. 
Before we jump in, let&#8217;s dive a bit deeper into how we&#8217;ve been using LLMs in production so far.</p><p>So far, we&#8217;ve seen three distinct &#8216;waves&#8217; of LLM usage, each increasing in complexity and usefulness.</p><h1>Wave 1: Basic text generation</h1><p>The first wave, starting early 2023, focused on basic text generation by leveraging the capabilities of LLMs to:</p><ol><li><p><strong>Understand text intent</strong> (what is the topic being discussed in this text, what is the sentiment, &#8230;)</p></li><li><p><strong>Summarise text</strong> (summarise this very long article into a TL;DR &#8211; pretty handy for the article you are reading now ;) )</p></li><li><p><strong>Generate new text</strong> (describe &#8216;Paris&#8217; to a young family wanting to go for the first time)</p></li><li><p><strong>Answer questions</strong> (are there long queues at the Eiffel Tower?)</p></li><li><p>Through a combination of these, <strong>have chat dialogues</strong> with users (plan a trip to Paris!).</p></li></ol><p>From a B2C industry perspective, this led to some pretty useful stuff, including enhancements to search, personalised content generation, question answering capabilities, and level 1 frontline customer support chatbots. These capabilities remain useful today, and some of the largest realised impact of LLMs throughout the industry builds on them.</p><h2>Why we need more</h2><p>One limitation these capabilities present is the LLM&#8217;s lack of real-time world knowledge - for example, LLMs by themselves cannot answer &#8220;what&#8217;s the weather in Paris today?&#8221;. Because LLMs are trained on data snapshots, real-time knowledge is simply not available to them. </p><p>The other limitation is that the LLM has world knowledge - but not the business-specific knowledge that&#8217;s proprietary to you and your business - for example, what&#8217;s your refund policy in case of a cancellation?</p><p>Many different approaches have been tried to work around the problem - including augmenting LLMs with pre-retrieved, real-time data by putting it in prompts (retrieve the weather for Paris from a weather API and make it available in the prompt), or retrieval augmentation systems to provide more bounded context (give the LLM access to your specific customer service procedures via data retrieved from your customer service manuals and added to prompts).</p><p>These approaches work fine for a number of use cases, but as the size of data increases, they begin to hit limitations. Imagine you want to respond to that weather query for any location across the globe: you&#8217;d need to retrieve the weather everywhere, at every point in time, and pass it all into the prompt - pretty unmanageable. Plus, you can&#8217;t really &#8216;do&#8217; much with this approach in terms of taking action. For our customer service use case, we can retrieve what the correct procedure is (let&#8217;s say it&#8217;s &#8216;initiate a refund&#8217;), but we can&#8217;t actually act on it.</p>
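<p>Here&#8217;s what that &#8216;pre-retrieve and stuff the prompt&#8217; workaround looks like in practice - a minimal sketch, assuming a hypothetical get_weather() client for some weather API:</p><pre><code>import openai

def answer_weather_question(city):
    # Pre-retrieve the real-time data ourselves...
    forecast = get_weather(city)  # hypothetical weather API client

    # ...and paste it into the prompt so the LLM can use it
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"Current weather in {city}: {forecast}"},
            {"role": "user", "content": f"What's the weather in {city} today?"}
        ]
    )
    return response["choices"][0]["message"]["content"]</code></pre><p>Note that the model can only talk about the data we chose to fetch upfront - it can&#8217;t decide to fetch something else, and it can&#8217;t act on anything.</p>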
<p>The second wave of using LLMs, relying on function and tool calling, is a response to these limitations.</p><h1>Wave 2: Function calling</h1><p>We&#8217;ve just defined a few &#8216;functions&#8217; or tools the LLM needs to be able to use dynamically in order to get over its limitations. For our weather use case, it needs to be able to dynamically access a weather API. For our customer service use case, the LLM needs to be able to dynamically access product flows to initiate cancellations and refunds.</p><p>With function calling, we give the LLM access to these capabilities. We rely on the LLM&#8217;s discretion to &#8216;call&#8217; these functions when necessary, by analysing what our users want and which function might generate the right response.</p><p>Ok, that sounds like a beautiful system - why haven&#8217;t we been doing this extensively for the last couple of years? Because of this: <strong>We rely on the LLM&#8217;s discretion to &#8216;call&#8217; these functions when necessary. </strong>At the dawn of the ChatGPT era, LLMs were simply not good enough at using their discretion and following structure / prompts effectively. My early attempts (as well as many others&#8217;) at function calling, way back in 2023, did not yield good enough outputs. That landscape has since drastically changed with better-performing LLMs.</p><p>So how does this work in practice? In wave 1, our LLM prompts had two components - a &#8216;system prompt&#8217; that defined the system behaviour for the LLM (&#8220;you are a helpful customer service agent&#8221;), and a &#8216;user prompt&#8217;, which was the user&#8217;s dialog / context (&#8220;can I cancel my order?&#8221;). Function calling introduces a third input - a list of functions available to the LLM. Here&#8217;s an example:</p><pre><code>import json
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[

        {"role": "system", "content": "You are a customer service assistant..."},
        {"role": "user", "content": "I want to cancel order #12345"}
    ],


    functions=[
        {
            "name": "check_order_status",
            "description": "Validates order exists and gets current status",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"}
                },
                "required": ["order_id"]
            }
        },
        # ... other function definitions ...
    ]
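
# The model doesn't execute anything itself - it returns a function_call
# payload that our backend dispatches. A sketch of that handling, assuming
# check_order_status() is implemented somewhere on our side:
message = response["choices"][0]["message"]
if message.get("function_call"):
    name = message["function_call"]["name"]
    args = json.loads(message["function_call"]["arguments"])
    result = check_order_status(**args)  # in real code, dispatch on name

    # Feed the result back (role "function") so the LLM can phrase the reply
    followup = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a customer service assistant..."},
            {"role": "user", "content": "I want to cancel order #12345"},
            message,
            {"role": "function", "name": name, "content": json.dumps(result)}
        ]
    )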
</code></pre><p>The "role": "system", "content" entry is our system prompt.</p><p>The "role": "user", "content" entry is our user prompt.</p><p>And functions is our list of functions available to the LLM - notably, in our example, the &#8216;check_order_status&#8217; function. Conceptually, it&#8217;s just as simple as normal prompt writing.</p><p>The beauty of this setup? The LLM knows exactly what tools it has (from the function definitions) and how to use them (from the prompt), but can still maintain a natural conversation flow with users. Your backend then implements these functions to actually perform the requested actions.</p><p>This approach gives you the best of both worlds: structured, predictable function calls for your backend, and natural, helpful interactions for your users.</p><p>The key to good function calling prompts is being specific about the tools available and crystal clear about how they should be used. It's like giving someone a toolbox and an instruction manual - they need to know both what tools they have and how to use them properly.</p><h2>Limitations of function calling</h2><p>The main limitation of function calling is that the LLM can only work with the functions you've defined. If a user asks "what's the best time to visit Paris?" but you haven't defined a function for seasonal tourism data, the LLM is stuck with generic knowledge from its training data.</p><p>This is why function calling, while powerful, can feel rigid. Every possible action needs to be anticipated and defined upfront. Want your assistant to help with hotel bookings too? That's a whole new set of functions to define. Want it to check local events? More functions. The system doesn't figure out new capabilities on its own - it can only use the tools you've explicitly given it.</p><p>The beauty of LLMs so far has been their non-deterministic, natural, ask-me-anything kind of outputs. Function calling, while making LLMs more useful, takes away from this non-deterministic beauty and brings back the rigidity of deterministic coding. </p><p>LLM Agents are an effective response to this limitation.</p><div><hr></div><h1>Wave 3: The era of LLM Agents</h1><p>The idea behind agents is simple. In Wave 2, we defined a list of functions and gave our LLM access to these tools. The LLM could, in our previous example, determine which tool to use. </p><p>To make the limitations of function calling clearer, let&#8217;s say we give our LLM access to two tools - a weather API (&#8216;what&#8217;s the weather like in Paris on a given day?&#8217;), and a booking API (a list of hotel bookings made by our user). In our function calling setup, we can&#8217;t answer the question &#8220;should I carry an umbrella today?&#8221;, because our weather function needs a definite location and day, and our booking API has no weather information.</p><p>An agent, however, is smarter. It will first check whether the user is already in Paris today, by calling the booking API. If it finds that the user is in Paris today, it will check the weather for Paris today. If the weather indicates rain, it will respond with &#8216;yes, it&#8217;s wise to carry an umbrella today!&#8217;</p>
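<p>In code, that reasoning loop is quite compact. Here&#8217;s a minimal sketch, reusing the API style from the earlier snippet; get_bookings() and get_weather() are hypothetical helpers, and their function definitions are assumed to be in the functions list we pass in:</p><pre><code>import json
import openai

TOOLS = {"get_bookings": get_bookings, "get_weather": get_weather}  # hypothetical helpers

def run_agent(user_message, functions, max_steps=5):
    messages = [
        {"role": "system", "content": "You are a travel assistant. Plan step by step."},
        {"role": "user", "content": user_message}
    ]
    # Cap the number of steps so a confused agent can't loop forever
    for _ in range(max_steps):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo", messages=messages, functions=functions
        )
        message = response["choices"][0]["message"]
        if not message.get("function_call"):
            return message["content"]  # the agent has an answer for the user

        # Execute whichever tool the agent chose, and feed the result back
        name = message["function_call"]["name"]
        args = json.loads(message["function_call"]["arguments"])
        result = TOOLS[name](**args)
        messages.append(message)
        messages.append({"role": "function", "name": name, "content": json.dumps(result)})

    return "Sorry, I couldn't figure that out in time."</code></pre><p>The &#8216;agent-ness&#8217; lives entirely in that loop: the model sees each tool result and decides for itself whether to call another tool or answer the user.</p>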
<p>So, Agent LLMs go beyond simple function calling by being able to:</p><ol><li><p>Plan steps to achieve a goal</p></li><li><p>Choose their own tools and approach</p></li><li><p>Learn from and adapt to results</p></li><li><p>Maintain context and memory</p></li><li><p>Make decisions about what to do next</p></li></ol><h2>Setting up an Agent</h2><p>The setup is quite uncomplicated, at least conceptually. We have the same 3 components - the system prompt, the user prompt, and the list of tools.</p><p>Agent prompts are different though - they focus on goals and reasoning rather than specific steps. Let&#8217;s look at this through an e-commerce example; here is what an agent prompt would likely look like:</p><pre><code>You are a shopping assistant helping customers find and purchase products.

Your goal is to help customers find what they want and complete purchases.

For each customer interaction:
1. Understand their needs and preferences
2. Think through the best approach to help them
3. Use available tools as needed
4. Adapt your approach based on results
5. Keep track of important details they share

Explain your thinking process and what you're doing to help them.</code></pre><p>This prompt gives the LLM a lot more agency to reason and figure out what combination of tools can help.</p><p>Agents use the same fundamental structure as function calls. The prompts make them markedly different, by enabling the agent to perform:</p><ol><li><p><strong>Dynamic Planning</strong></p><ul><li><p>Function calling: Follows preset workflows</p></li><li><p>Agents: Create their own plans based on goals</p></li></ul></li><li><p><strong>Tool Usage</strong></p><ul><li><p>Function calling: Uses tools in predefined ways</p></li><li><p>Agents: Creatively combines tools to solve problems</p></li></ul></li><li><p><strong>Memory and Context</strong></p><ul><li><p>Function calling: Handles one interaction at a time</p></li><li><p>Agents: Maintains context across conversations</p></li></ul></li><li><p><strong>Problem Solving</strong></p><ul><li><p>Function calling: Fixed decision trees</p></li><li><p>Agents: Adaptive problem solving</p></li></ul></li></ol><p>Here's how an agent handles a complex request:</p><pre><code>User: "I need gift ideas for my mom who loves gardening"

Agent Thought Process:
1. Gather more information about preferences
2. Search multiple categories
3. Consider seasonality
4. Check reviews and ratings
5. Verify availability

[Asks clarifying questions about budget and specific interests]
[Searches across multiple categories]
[Checks inventory for top options]
[Creates curated list with reasoning]
[Adapts suggestions based on user feedback]</code></pre><h2>A practical example</h2><p>Let&#8217;s build that e-commerce agent out a little more fully.</p><p>Here is how I would <strong>define the tools</strong>:</p><pre><code>{
  "tools": [
    {
      "name": "search_products",
      "description": "Search product catalog, returns matching items",
      "parameters": {...}
    },
    {
      "name": "check_inventory",
      "description": "Check current stock levels for a product",
      "parameters": {...}
    }
  ]
}</code></pre><p><strong>Here&#8217;s the prompt:</strong></p><pre><code>You are a shopping assistant that can help customers find products, check availability, and place orders.

You can:
- Search the product catalog
- Check inventory levels
- Place orders
- Calculate shipping times

You cannot:
- Modify existing orders
- Apply custom discounts
- Access customer payment details</code></pre><p><strong>Some error handling</strong></p><pre><code>{
  "error": "NOT_AUTHORIZED",
  "message": "Agent cannot access customer payment information",
  "allowed_actions": ["redirect_to_payment_page"]
}</code></pre><p>And that&#8217;s it! Here&#8217;s how the agent would behave in practice:</p><p><strong>The user prompt:</strong></p><p>User: "I need running shoes under $100 that can be delivered by next week"</p><p>The agent's thinking process might look like:</p><pre><code>Thought: I need to find suitable shoes and check delivery options. Let me break this down:
1. Search for running shoes with price filter
2. Check inventory for quick delivery
3. Verify shipping times
4. Present options to user

Action: Let me search first
[calls search_products({
    "query": "running shoes",
    "filters": {
        "price_max": 100,
        "category": "athletic/running"
    }
})]

Thought: Found several options. Need to check stock and delivery for each
[calls check_inventory({
    "product_id": "shoe123",
    "location": "nearest_warehouse"
})]

Thought: This option has stock and can be delivered in time. Let me show it to the user and ask about size preferences...</code></pre><p>As you can see in the example above, the power of agents comes from their ability to:</p><ol><li><p>Break down complex problems</p></li><li><p>Choose appropriate tools</p></li><li><p>Adapt their approach based on results</p></li><li><p>Maintain context throughout the interaction</p></li><li><p>Make decisions about what information they need</p></li></ol><p>This flexibility makes agents particularly good for tasks where the path to the solution isn't straightforward or needs to adapt based on what's discovered along the way.</p><h2>Why don&#8217;t we have agents everywhere yet?</h2><p>Despite their apparent potential, we haven't seen agents take over yet.</p><p>The main challenges fall into three buckets:</p><ol><li><p><strong>Technical Reliability</strong></p><ul><li><p>Agents are unpredictable - they might take unexpected actions or get stuck in loops. We&#8217;ve essentially got only the prompts guiding the agents, and we rely on the reasoning capabilities of the underlying models to &#8216;figure out&#8217; the problem.</p></li><li><p>Error handling is complex because agents can attempt things in ways you didn't anticipate - given the amount of freedom and agency, you have far less control over the outcome.</p></li><li><p>Cost management is tricky, since agents might make multiple API calls to figure things out - if they&#8217;re stuck in a loop or if the prompt is suboptimal, the cost will really spiral out of control.</p></li><li><p>Each tool integration needs robust validation, rate limiting, and error states - adding further complexity and need for trial-and-error, domain expertise, etc.</p></li></ul></li><li><p><strong>Product Complexity</strong></p><ul><li><p>Building good agent UX is hard - users get frustrated when agents take too long to think, and can abandon workflows or even churn away from the product. A few workarounds for the UX do exist, including providing the user greater visibility into &#8216;what&#8217; the LLM is doing.</p></li><li><p>Setting the right expectations is crucial - users either expect too much or too little with agents. A well-made agent is often underutilised because the user is not primed to explore all of its capabilities. An agent that is good at one task can often appear to overpromise because the UX / copy is not well done.</p></li><li><p>Recovery flows are complex - what happens when an agent fails halfway through a task? 
Users can get frustrated when these flows are not well managed, as they may need to start all over again without guarantees of success.</p></li><li><p>You need extensive logging and monitoring to understand what your agents are actually doing, to make prompt adjustments and debugging even remotely possible.</p></li></ul></li><li><p><strong>Business Risk</strong></p><ul><li><p>Agents can make costly mistakes if not properly constrained - imagine if your agent determines that the best possible result for every customer&#8217;s cancelled order is a full refund.</p></li><li><p>The development overhead is significant compared to simple function calling - as I mentioned, you need some degree of experience handling LLMs before venturing into agents.</p></li><li><p>Many use cases work fine with simpler approaches - while we have discussed quite a few examples using agents here, there are more pragmatic ways to start - for improving user experience as well as for gaining efficiencies.</p></li><li><p>The ROI isn't clear yet for many applications, and the costs are high. Industry estimates currently forecast that it can take an agent up to $20 to accomplish a task at human quality that a human can do for $6. The unit economics will get better, but are not there yet!</p></li></ul></li></ol><p>That said, I think we're at an interesting inflection point. As the tooling improves and we figure out the right patterns for building with agents, we'll likely see more adoption in specific use cases where the benefits outweigh the complexity.</p><p>For the next writeup, I&#8217;ll focus a bit on some examples of agents in the wild, as well as when and for what problems you should consider using agents - and also when not to use them. We will go into the details of the product building flows, the metrics and setups to use, etc, as I continue writing this series.</p><p>Meanwhile, if this was helpful, </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.builderlab.ai/subscribe?"><span>Subscribe now</span></a></p><p>If you would like to start with the basics of LLMs, check out this writeup:</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:124622697,&quot;url&quot;:&quot;https://foundit.substack.com/p/a-pms-introduction-to-large-language&quot;,&quot;publication_id&quot;:275725,&quot;publication_name&quot;:&quot;BuilderLab.ai&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79671461-6560-4243-9304-f48b9cd32557_1024x1024.png&quot;,&quot;title&quot;:&quot;A Product Builder's introduction to Large Language Models&quot;,&quot;truncated_body_text&quot;:&quot;This is the first in a series of posts covering Generative AI for product builders.&quot;,&quot;date&quot;:&quot;2023-05-31T21:33:21.373Z&quot;,&quot;like_count&quot;:25,&quot;comment_count&quot;:1,&quot;bylines&quot;:[{&quot;id&quot;:26602548,&quot;name&quot;:&quot;Pranav Pathak&quot;,&quot;handle&quot;:&quot;foundit&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/05125a0b-c0c2-4717-8e60-3e36028add07_690x690.png&quot;,&quot;bio&quot;:&quot;Product Leader @Booking.com. 
I write about my learnings in Product Management and the startup scene in Europe here: https://foundit.substack.com/ @FoundItBlog&quot;,&quot;profile_set_up_at&quot;:&quot;2022-12-01T18:43:20.651Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:257571,&quot;user_id&quot;:26602548,&quot;publication_id&quot;:275725,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:275725,&quot;name&quot;:&quot;BuilderLab.ai&quot;,&quot;subdomain&quot;:&quot;foundit&quot;,&quot;custom_domain&quot;:&quot;builderlab.ai&quot;,&quot;custom_domain_optional&quot;:true,&quot;hero_text&quot;:&quot;The AI Builder&#8217;s Toolkit &#8211; the definitive guide to ideate, design, scale, and optimize groundbreaking products with AI at their core.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79671461-6560-4243-9304-f48b9cd32557_1024x1024.png&quot;,&quot;author_id&quot;:26602548,&quot;theme_var_background_pop&quot;:&quot;#FF0000&quot;,&quot;created_at&quot;:&quot;2021-02-01T08:07:28.823Z&quot;,&quot;rss_website_url&quot;:null,&quot;email_from_name&quot;:&quot;Pranav Pathak from BuilderLab.ai&quot;,&quot;copyright&quot;:&quot;Pranav Pathak&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;is_personal_mode&quot;:false}}],&quot;twitter_screen_name&quot;:&quot;pranavVersed&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://foundit.substack.com/p/a-pms-introduction-to-large-language?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!Z74f!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79671461-6560-4243-9304-f48b9cd32557_1024x1024.png" loading="lazy"><span class="embedded-post-publication-name">BuilderLab.ai</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">A Product Builder's introduction to Large Language Models</div></div><div class="embedded-post-body">This is the first in a series of posts covering Generative AI for product builders&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">3 years ago &#183; 25 likes &#183; 1 comment &#183; Pranav Pathak</div></a></div><p>If you would like to read about writing good prompts, check out this one:</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:153460596,&quot;url&quot;:&quot;https://foundit.substack.com/p/the-elusive-art-of-prompt-writing&quot;,&quot;publication_id&quot;:275725,&quot;publication_name&quot;:&quot;BuilderLab.ai&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79671461-6560-4243-9304-f48b9cd32557_1024x1024.png&quot;,&quot;title&quot;:&quot;The Elusive Art of Prompt Writing&quot;,&quot;truncated_body_text&quot;:&quot;The quality of your prompts 
determines the success of your interactions with AI systems. Whether you're a product manager leveraging AI for ideation or a developer using it to build tools, writing effective prompts is both an art and a science. This guide will walk you through the principles of crafting high-quality prompts, with real-world examples and&#8230;&quot;,&quot;date&quot;:&quot;2024-12-21T18:37:06.017Z&quot;,&quot;like_count&quot;:11,&quot;comment_count&quot;:1,&quot;bylines&quot;:[{&quot;id&quot;:26602548,&quot;name&quot;:&quot;Pranav Pathak&quot;,&quot;handle&quot;:&quot;foundit&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/05125a0b-c0c2-4717-8e60-3e36028add07_690x690.png&quot;,&quot;bio&quot;:&quot;Product Leader @Booking.com. I write about my learnings in Product Management and the startup scene in Europe here: https://foundit.substack.com/ @FoundItBlog&quot;,&quot;profile_set_up_at&quot;:&quot;2022-12-01T18:43:20.651Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:257571,&quot;user_id&quot;:26602548,&quot;publication_id&quot;:275725,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:275725,&quot;name&quot;:&quot;BuilderLab.ai&quot;,&quot;subdomain&quot;:&quot;foundit&quot;,&quot;custom_domain&quot;:&quot;builderlab.ai&quot;,&quot;custom_domain_optional&quot;:true,&quot;hero_text&quot;:&quot;The AI Builder&#8217;s Toolkit &#8211; the definitive guide to ideate, design, scale, and optimize groundbreaking products with AI at their core.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79671461-6560-4243-9304-f48b9cd32557_1024x1024.png&quot;,&quot;author_id&quot;:26602548,&quot;theme_var_background_pop&quot;:&quot;#FF0000&quot;,&quot;created_at&quot;:&quot;2021-02-01T08:07:28.823Z&quot;,&quot;rss_website_url&quot;:null,&quot;email_from_name&quot;:&quot;Pranav Pathak from BuilderLab.ai&quot;,&quot;copyright&quot;:&quot;Pranav Pathak&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;is_personal_mode&quot;:false}}],&quot;twitter_screen_name&quot;:&quot;pranavVersed&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://foundit.substack.com/p/the-elusive-art-of-prompt-writing?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!Z74f!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79671461-6560-4243-9304-f48b9cd32557_1024x1024.png" loading="lazy"><span class="embedded-post-publication-name">BuilderLab.ai</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">The Elusive Art of Prompt Writing</div></div><div class="embedded-post-body">The quality of your prompts determines the success of your interactions with AI systems. 
Whether you're a product manager leveraging AI for ideation or a developer using it to build tools, writing effective prompts is both an art and a science. This guide will walk you through the principles of crafting high-quality prompts, with real-world examples and&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">a year ago &#183; 11 likes &#183; 1 comment &#183; Pranav Pathak</div></a></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading BuilderLab.ai! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Think Big, Start Small]]></title><description><![CDATA[&#128161;Ideation for AI Products]]></description><link>https://www.builderlab.ai/p/think-big-start-small</link><guid isPermaLink="false">https://www.builderlab.ai/p/think-big-start-small</guid><dc:creator><![CDATA[Pranav Pathak]]></dc:creator><pubDate>Mon, 30 Dec 2024 02:30:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2e194959-d863-4c8c-9a50-408b6d093602_944x944.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>LLMs are reshaping the way we work, but turning big ideas into actionable solutions can feel overwhelming. As builders, we can sometimes get stuck between ambition and execution. This post will guide you through identifying impactful areas to apply generative AI and provide a detailed exercise to spark actionable ideas for integration.</p><div><hr></div><h3><strong>&#129488; Why start small?</strong></h3><p>Simply put, the possibilities with Generative AI are overwhelming. If you&#8217;ve got an idea or are passionate about a problem, there likely exists a genAI tool that can help you build your solution. From no-code tools to LLM-driven coding interfaces to entire flows managing and orchestrating agentic LLMs, a tool exists to facilitate building your product. If you&#8217;ve never coded before, and this was your barrier to building, well, now you don&#8217;t have to, as long as you understand the fundamentals. If you&#8217;re decent at software development, LLMs - used right - can bring your coding time down to as little as 20% of what it was. From ideation to production, several new products entirely built by LLMs have now hit our app stores. If you&#8217;re leading an existing product or service, everyone, including your customers, your employers, and your employees, expects you to use AI &#8211; have you seen the newest line of AI-powered vacuum cleaners (!)?</p><p>If building excites you, all this can be overwhelming. If you&#8217;re anything like me, this feels like a time of endless possibility, but also a time of rapid paradigm shifts. Things are changing so fast it&#8217;s hard to keep up. 
There are so many ideas, so many experiments you want to try, that you&#8217;re starting things just to abandon them halfway before moving on to the next big idea. If that&#8217;s been your mental state, I&#8217;d urge you to breathe, and focus on starting (and finishing) small.</p><p>Focusing on, and iterating on, a small problem is usually a great way to develop the toolkit required to build a larger solution. It&#8217;s tempting to aim for the moon when exploring GenAI. But revolutionary change often begins with evolutionary steps. By identifying a small, impactful area to implement generative AI, you reduce the risks of overreach, speed up iteration, and build confidence in your ability to execute.</p><p>For example, an AI-driven customer support chatbot might start with FAQ generation. Once the feature demonstrates value, it could expand to handling complex customer interactions, learning from live agent interventions, and even having access to tools over time, creating incremental value at each step.</p><p>The key is to let the technology serve the problem, not the other way around. Instead of building a flashy, overcomplicated solution, focus on solving a single, meaningful pain point. Iterate, gather feedback, and expand incrementally.</p><div><hr></div><h3><strong>&#127919; Identifying Impactful Areas for Generative AI</strong></h3><p>Before diving into brainstorming, it&#8217;s crucial to understand where generative AI can make the most impact. Start by asking:</p><ol><li><p><strong>What are the repetitive tasks in this product?</strong></p><ul><li><p><em>Example:</em> In a customer support platform, generating responses to common queries can automatically reduce the workload on agents and improve response time. LLMs could generate comprehensive templates for troubleshooting, or dynamically create email drafts tailored to customer concerns based on past interactions. Beyond this, they might analyze call logs and automatically create scripts for escalation scenarios, or even invoke tools to directly guide users to finish their tasks, instead of requiring a human agent.</p></li></ul></li><li><p><strong>Where does personalization matter most?</strong></p><ul><li><p><em>Example:</em> In an e-commerce app, using generative AI to create dynamic product descriptions based on individual user preferences or past purchases can elevate the customer experience. For instance, LLMs could rewrite descriptions to highlight eco-friendly attributes for sustainability-conscious buyers, or suggest complementary products for repeat customers.</p></li></ul></li><li><p><strong>What unmet needs exist for better content creation?</strong></p><ul><li><p><em>Example:</em> In a marketing tool, LLMs can help draft social media posts or ad copy tailored to specific audience segments. They could help test variations of campaign content in real-time, learning what resonates most with each audience segment. This could be taken further by integrating performance data from past campaigns to adjust tone and style dynamically, offering iterative improvement over time.</p></li></ul></li><li><p><strong>Where could insights lead to better decisions?</strong></p><ul><li><p>In a financial analysis platform, LLMs (augmented with existing data systems and reports) can be used to summarize trends from large datasets and generate reports with actionable recommendations. 
In the product discovery phases, they can synthesise user research documents to drive insights and provide meaningful recommendations for what to build next.</p></li></ul></li></ol><p>By focusing on these areas, you&#8217;ll uncover opportunities where GenAI&#8217;s capabilities align with real-world needs.</p><div><hr></div><h3><strong>&#129504; A Comprehensive Brainstorming Exercise</strong></h3><p>Let&#8217;s take a more in-depth approach to ideation. This extended exercise will help you explore opportunities for generative AI integration through a small step, iterative approach.</p><h4><strong>Step 1: Pick a Product You&#8217;ve Thought About Deeply</strong></h4><p>Choose a product you&#8217;ve worked on, use frequently, have studied, or really want to build. It could be anything&#8212;a project management tool, an e-commerce platform, or even your favorite fitness app. If you&#8217;re unsure where to start, pick the product you interact with daily, as familiarity often leads to deeper insights.</p><h4><strong>Step 2: Map Out Core User Journeys and Supporting Workflows</strong></h4><p>Document the main workflows accomplished by the product. For example:</p><ul><li><p>For a project management tool: creating tasks, assigning resources, generating progress reports, integrating with time-tracking apps.</p></li><li><p>For an e-commerce platform: browsing items, comparing products, completing a purchase, post-purchase support, and feedback collection.</p></li><li><p>For a fitness app: tracking workouts, planning meals, analyzing progress over time, connecting with fitness communities, and sharing achievements.</p></li></ul><p>As much as possible, break these workflows into micro-tasks to reveal nuanced inefficiencies AI could address. Additionally, consider how workflows intersect&#8212;for instance, how the browse and save behavior on an ecommerce catalog can affect future purchase behaviors (spoiler &#8211; more saves = more purchases).</p><h4><strong>Step 3: Identify Pain Points or Inefficiencies</strong></h4><p>For each task, ask:</p><ul><li><p>Where do users spend unnecessary time?</p></li><li><p>Where do errors or bottlenecks occur?</p></li><li><p>What&#8217;s currently &#8220;good enough&#8221; but could be exceptional with GenAI?</p></li><li><p>How does scaling impact these workflows?</p></li></ul><p>For example, in a fitness app, LLMs could streamline planning by generating meal suggestions that align with dietary restrictions, nutritional goals, and user preferences&#8212;all while learning from logged meals to improve future suggestions. Scaling this across millions of users could offer hyper-personalized plans without additional manual effort. Additionally, the system could proactively recommend substitutions based on seasonal ingredient availability.</p><h4><strong>Step 4: Map GenAI Capabilities to Each Pain Point</strong></h4><p>Using the questions from earlier, brainstorm how generative AI could enhance or replace parts of these workflows. Here are some examples:</p><ul><li><p><strong>Content Creation:</strong> Automatically generate templates for recurring tasks. <em>Example:</em> GenAI could create interactive project timelines that adapt dynamically based on real-time input from multiple collaborators.</p></li><li><p><strong>Summarization:</strong> Summarize meeting notes or user reviews. 
<em>Example:</em> In a conferencing app, LLMs could identify patterns across multiple meetings and provide strategic insights, like recurring blockers or themes.</p></li><li><p><strong>Personalization:</strong> Suggest tailored next steps based on user behavior. <em>Example:</em> In a fitness app, AI could evolve into a virtual coach, generating not just workout plans but also motivational prompts based on detected patterns in user engagement.</p></li><li><p><strong>Prototyping:</strong> Quickly draft mockups or design ideas for review. <em>Example:</em> In a design tool, GenAI could produce end-to-end workflows based on wireframes, including user onboarding sequences.</p></li></ul><h4><strong>Step 5: Prioritize and Refine</strong></h4><p>Evaluate:</p><ul><li><p>Scalability of the GenAI solution.</p></li><li><p>Expected ROI based on user feedback and market trends.</p></li><li><p>Technical complexity and data requirements for implementation.</p></li></ul><p>Consider creating a feature backlog where each proposed AI enhancement is tagged by difficulty, data dependency, and potential user value. Use this to map out short-term experiments and long-term strategic goals.</p><div><hr></div><h3><strong>&#128640; From Ideation to Execution</strong></h3><p>Once you&#8217;ve brainstormed and prioritized ideas, it&#8217;s time to bring them to life. Here are some possible next steps:</p><ol><li><p><strong>Prototype and Test:</strong> Use existing tools or APIs to create a minimum viable version of your idea. Leverage APIs and low-code tools to generate fast but functional prototypes that can be immediately validated with real users.</p></li><li><p><strong>Learn and Iterate:</strong> Analyze feedback to refine your solution. Think about where your prototype users got &#8216;stuck&#8217;, what challenges they faced in usage, and what they found easy enough. Define a &#8216;viable&#8217; product that solves for pain points and stumbling blocks.</p></li><li><p><strong>Scale Strategically:</strong> Once validated, scale your features thoughtfully. Consider localization, compliance with regulations, and infrastructure optimizations.</p></li><li><p><strong>Monitor and Maintain:</strong> Build ongoing monitoring pipelines for your AI features to ensure quality over time. Use dashboards to track user engagement, AI-generated errors, and performance metrics.</p></li></ol><p>The transformative potential here isn&#8217;t about creating something that wows people on paper. It&#8217;s about delivering tools and experiences that feel indispensable in practice. Start small, but keep a clear vision of how your solution could scale and evolve into a broader ecosystem of AI-driven innovation.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/p/think-big-start-small?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.builderlab.ai/p/think-big-start-small?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h3><strong>&#127937; Ready to Start?</strong></h3><p>Generative AI offers a playground of possibilities for builders. By thinking big but starting small, you can tackle meaningful problems, unlock new value, and set the stage for innovation that scales. Remember that GenAI is more than a tool&#8212;it&#8217;s a mindset shift. 
The intersection of user insights, AI capabilities, and strategic execution can lead to breakthroughs that redefine industries.</p><p>So, how would you integrate GenAI into a product you know well? The answer might be the seed of your next big breakthrough.</p><p>Read these to dive deeper into building &#8212;</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:34158031,&quot;url&quot;:&quot;https://foundit.substack.com/p/my-framework-for-machine-learning-products-part-1-of-3-2354a0da63d2&quot;,&quot;publication_id&quot;:275725,&quot;publication_name&quot;:&quot;BuilderLab.ai&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79671461-6560-4243-9304-f48b9cd32557_1024x1024.png&quot;,&quot;title&quot;:&quot;The Product Builder's Guide to Machine Learning&#8202;&#8212;&#8202;Part 1 of 3&quot;,&quot;truncated_body_text&quot;:&quot;If you haven&#8217;t already, consider subscribing to FoundIt for more product building content!&quot;,&quot;date&quot;:&quot;2018-03-05T22:30:45.684Z&quot;,&quot;like_count&quot;:6,&quot;comment_count&quot;:0,&quot;bylines&quot;:[{&quot;id&quot;:26602548,&quot;name&quot;:&quot;Pranav Pathak&quot;,&quot;handle&quot;:&quot;foundit&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/05125a0b-c0c2-4717-8e60-3e36028add07_690x690.png&quot;,&quot;bio&quot;:&quot;Product Leader @Booking.com. I write about my learnings in Product Management and the startup scene in Europe here: https://foundit.substack.com/ @FoundItBlog&quot;,&quot;profile_set_up_at&quot;:&quot;2022-12-01T18:43:20.651Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:257571,&quot;user_id&quot;:26602548,&quot;publication_id&quot;:275725,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:275725,&quot;name&quot;:&quot;FoundIt&quot;,&quot;subdomain&quot;:&quot;foundit&quot;,&quot;custom_domain&quot;:&quot;builderlab.ai&quot;,&quot;custom_domain_optional&quot;:true,&quot;hero_text&quot;:&quot;The AI Builder&#8217;s Toolkit &#8211; the definitive guide to ideate, design, scale, and optimize groundbreaking products with AI at their core.&quot;,&quot;logo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/3b4def76-5112-4941-9903-47c0744ccedd_414x414.png&quot;,&quot;author_id&quot;:26602548,&quot;theme_var_background_pop&quot;:&quot;#FF0000&quot;,&quot;created_at&quot;:&quot;2021-02-01T08:07:28.823Z&quot;,&quot;rss_website_url&quot;:null,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;Pranav Pathak&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;is_personal_mode&quot;:false}}],&quot;twitter_screen_name&quot;:&quot;pranavVersed&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" 
href="https://foundit.substack.com/p/my-framework-for-machine-learning-products-part-1-of-3-2354a0da63d2?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!Z74f!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79671461-6560-4243-9304-f48b9cd32557_1024x1024.png" loading="lazy"><span class="embedded-post-publication-name">BuilderLab.ai</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">The Product Builder's Guide to Machine Learning&#8202;&#8212;&#8202;Part 1 of 3</div></div><div class="embedded-post-body">If you haven&#8217;t already, consider subscribing to FoundIt for more product building content&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">8 years ago &#183; 6 likes &#183; Pranav Pathak</div></a></div><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:32111485,&quot;url&quot;:&quot;https://foundit.substack.com/p/building-metrics-that-matter&quot;,&quot;publication_id&quot;:275725,&quot;publication_name&quot;:&quot;BuilderLab.ai&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79671461-6560-4243-9304-f48b9cd32557_1024x1024.png&quot;,&quot;title&quot;:&quot;Building Metrics that Matter&quot;,&quot;truncated_body_text&quot;:&quot;Metrics are my thing. I love thinking about, talking about, and working with metrics. I believe metrics can make or break products. When done right, they can supercharge a product and create a great user experience, but when done wrong, they can create a recipe for disaster.&quot;,&quot;date&quot;:&quot;2021-02-02T08:17:37.586Z&quot;,&quot;like_count&quot;:16,&quot;comment_count&quot;:8,&quot;bylines&quot;:[{&quot;id&quot;:26602548,&quot;name&quot;:&quot;Pranav Pathak&quot;,&quot;handle&quot;:&quot;foundit&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/05125a0b-c0c2-4717-8e60-3e36028add07_690x690.png&quot;,&quot;bio&quot;:&quot;Product Leader @Booking.com. 
I write about my learnings in Product Management and the startup scene in Europe here: https://foundit.substack.com/ @FoundItBlog&quot;,&quot;profile_set_up_at&quot;:&quot;2022-12-01T18:43:20.651Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:257571,&quot;user_id&quot;:26602548,&quot;publication_id&quot;:275725,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:275725,&quot;name&quot;:&quot;FoundIt&quot;,&quot;subdomain&quot;:&quot;foundit&quot;,&quot;custom_domain&quot;:&quot;builderlab.ai&quot;,&quot;custom_domain_optional&quot;:true,&quot;hero_text&quot;:&quot;The AI Builder&#8217;s Toolkit &#8211; the definitive guide to ideate, design, scale, and optimize groundbreaking products with AI at their core.&quot;,&quot;logo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/3b4def76-5112-4941-9903-47c0744ccedd_414x414.png&quot;,&quot;author_id&quot;:26602548,&quot;theme_var_background_pop&quot;:&quot;#FF0000&quot;,&quot;created_at&quot;:&quot;2021-02-01T08:07:28.823Z&quot;,&quot;rss_website_url&quot;:null,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;Pranav Pathak&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;is_personal_mode&quot;:false}}],&quot;twitter_screen_name&quot;:&quot;pranavVersed&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://foundit.substack.com/p/building-metrics-that-matter?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!Z74f!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79671461-6560-4243-9304-f48b9cd32557_1024x1024.png" loading="lazy"><span class="embedded-post-publication-name">BuilderLab.ai</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">Building Metrics that Matter</div></div><div class="embedded-post-body">Metrics are my thing. I love thinking about, talking about, and working with metrics. I believe metrics can make or break products. 
When done right, they can supercharge a product and create a great user experience, but when done wrong, they can create a recipe for disaster&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">5 years ago &#183; 16 likes &#183; 8 comments &#183; Pranav Pathak</div></a></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.builderlab.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Or subscribe for free to receive new posts and support our work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>