A bad workman blames his chisel
Why most complaints about AI are really complaints about operators.
In 1464, the Florence Cathedral commissioned a sculptor named Agostino di Duccio to carve a colossal figure of David from a single block of Carrara marble. He roughed out the legs and quit. Twelve years later, Antonio Rossellino picked up the contract, took one look at the block, and walked away almost immediately.
The marble sat outdoors in the cathedral courtyard for the next twenty-five years. Rain. Sun. Freeze. Thaw. It was mediocre marble, brittle and narrow, riddled with imperfections. Too expensive to abandon, too flawed to use.
The guild called it “the giant.” It was the block nobody wanted, and everyone agreed it was the block’s fault.
In 1501, the Cathedral offered it to a 26-year-old Michelangelo. He took it. Three years later, he produced what is arguably the most recognized sculpture in human history.
What Michelangelo did was different in three ways.
He forged his own steel. The Greeks had carved with bronze tools. His contemporaries used the workshop default, iron tools sharp enough for good marble but too soft for stone like this. Michelangelo wanted harder edges, because brittle marble does not yield to standard iron the way statuario does. It chips wrong. It splits along the imperfections instead of around them. So he made the steel himself, and he made it for this block.
He learned anatomy first. In 1493 and 1494, before the David commission existed, he spent two years dissecting cadavers at the Convent of Santo Spirito. Most sculptors of the period worked from the surface inward, copying what they saw. Michelangelo worked from the bone outward, because he had cut the muscle off the bone himself and he knew where it attached. By the time he stood in front of the giant, he was not guessing at the lat, the trapezius, the tendon that runs down the inside of the wrist. He had seen them.
He worked the block front-outward instead of section by section, threading the figure between the existing flaws rather than fighting them. He started with the torso, because his anatomy training let him judge proportion from the centre, and the centre is where the marble was best. The pose itself, the geometry of the finished David, was dictated by what was left of the stone. David is slimmer than Michelangelo’s other male nudes because the slab was narrow. There is no severed head of Goliath at his feet, no raised sword, because there was no marble for either. He stands in the moment before the fight, not after, because the post-fight pose every other sculptor had used would not have fit. Critics read the pre-fight choice as artistic genius. It was partly that, and partly the shape of the constraints.
David is not a story about a magical chisel. It is a story about a trained operator, using standard instruments, modified for the specific medium, on material everyone else had given up on. The expertise was not optional. It was the price of a David that would otherwise have stayed flawed.
The hammer-and-no-chisel man
Now imagine Michelangelo handed only a hammer. Whatever the talent, the result is rubble. No skill compensates for the wrong instrument.
What if he refuses the steel chisels altogether. Calls the gradina a fashionable contrivance. Insists his work will be done with vision alone, with bare hands if necessary, with anything that has not been touched by the corrupting influence of new ideas. That is not a master. That is a vain workman with principles dressed up as discipline.
Both versions of this man write a lot of LinkedIn posts about LLMs right now.
The first one, the hammer-and-no-chisel man, looks at the model and says it is broken. The second one, the principled refuser, looks at it and says he is above it. They sound like opposites. They are the same posture. Both are explanations for why the operator’s output is mediocre, and the explanations are doing one job: pointing the finger anywhere except at the operator.
The complaint takes many forms
You have read all of these. Some of them you have written.
“Claude is making everyone sound the same.” The writer pastes a one-line prompt into the chat box, asks for a draft, gets back the median voice of the internet, and concludes that the model has flattened written English into a single corporate hum. The writer has not given the model a single sample of their own voice. They have not fed in their last three published pieces. They have not told it which writers they are trying to sound nothing like. They have not rewritten a single middling line. They got the median voice because they prompted for the median voice.
“The model hallucinated a citation, AI is unreliable.” The analyst asked a frontier model, with no retrieval, no source documents, no grounding of any kind, to produce a referenced briefing on a niche topic. The model, doing exactly what it was built to do without source material, generated plausible-sounding citations from the shape of the training distribution. Some of them were wrong. The analyst posted a screenshot. The screenshot got a thousand likes. Nobody asked why a tool designed to predict text was being used as a tool to retrieve facts.
“I asked for next quarter’s prediction and got boilerplate.” The operator types “predict our Q3 revenue” into the chat. They give the model nothing. No historicals. No funnel data. No seasonality assumptions. No mix shift. No competitive context. No definition of what counts as revenue. They get back a generic five-bullet response on growth, and they share it as evidence that LLMs cannot forecast.
“It produced something that could apply to any SaaS company.” The strategist asked for a positioning doc with no input on the company’s P&L, no read on the team, no statement of what they refuse to do, no theory of where the market is moving. The output applied to any SaaS company because the input applied to any SaaS company.
“It just agrees with me. No spine.” The user wrote four paragraphs of confident argument and asked the model what it thought. The model, calibrated to be useful and to not be a contrarian for sport, mirrored back something supportive. The user did not ask it to disagree. Did not ask it to steelman the opposite. Did not assign it a critic role. Did not put a single guardrail in the prompt that would have made disagreement the default.
“We shipped, it broke, the model isn’t ready.” The team wrapped a chat box around an API. No use case scoping. No eval suite. No failure-mode taxonomy. No fallback path when the model is unsure. No instrumentation to learn from production. They watched it fail in the wild and concluded the technology was premature. The technology was fine. The product around it was rubble.
The structure is identical in every case. Output disappoints. Cause is assigned to the tool. Conclusion is to wait.
This is the chisel argument. It runs the same way Agostino’s quitting did, and Rossellino’s after him. The block is bad. The block has always been bad. Walk away.
What the operator did not do
Take the same six complaints and run them backwards.
For the writer who got the median voice: there was no system prompt with their actual voice in it. No past work fed in. No paragraph of theirs marked as “this is the rhythm I want.” No five writers named as anti-references. No quality bar set higher than “is this fine.” No willingness to throw the first draft out and rerun it with sharper instructions. The model did the only thing it could do with empty hands.
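What "a system prompt with their actual voice in it" can look like, as a minimal sketch. The sample text, the anti-references, and the function itself are hypothetical scaffolding; the point is only that the voice material gets assembled before the model is ever asked to draft.

```python
# Sketch: assembling a voice-loaded system prompt before asking for a draft.
# The samples and anti-references below are placeholders for the writer's own material.

def build_voice_prompt(samples: list[str], anti_references: list[str]) -> str:
    """Fold the writer's own prose and explicit anti-references into one system prompt."""
    sample_block = "\n---\n".join(samples)
    avoid = ", ".join(anti_references)
    return (
        "You are drafting in the voice shown in the samples below. "
        "Match their rhythm, sentence length, and diction.\n\n"
        f"VOICE SAMPLES:\n{sample_block}\n\n"
        f"Do NOT sound like: {avoid}.\n"
        "If a sentence could have been written by a generic corporate blog, rewrite it."
    )

prompt = build_voice_prompt(
    samples=["Short sentences. Then a long one that earns its length."],
    anti_references=["press releases", "LinkedIn thought leadership"],
)
```

Ten minutes of assembly, done once, and every draft after that starts from the operator's voice instead of the median's.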
For the analyst who got the bad citations: no retrieval layer. No source documents in the context. No instruction to refuse to cite anything not present in the provided material. No span-quoting rule. No verification pass at the end. No human in the loop on factual claims. The tool was used in the one mode it is unreliable for, and the unreliability was treated as news.
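The verification pass at the end is the easiest of those guardrails to build, because it is deterministic. A sketch, with an invented source document: instruct the model to quote spans verbatim from the provided material, then reject any claim whose quoted span does not literally appear in the sources.

```python
# Sketch: deterministic citation check. The model is instructed to quote spans
# verbatim from provided sources; this pass flags any span that does not
# literally appear in the source material. The source text here is invented.

def verify_quotes(answer_spans: list[str], sources: list[str]) -> list[str]:
    """Return the quoted spans that cannot be found in any provided source."""
    corpus = "\n".join(sources)
    return [span for span in answer_spans if span not in corpus]

sources = ["Revenue grew 12% year over year, driven by the enterprise tier."]
ok = verify_quotes(["Revenue grew 12% year over year"], sources)   # nothing flagged
bad = verify_quotes(["Revenue grew 40% year over year"], sources)  # flagged
```

The model never gets to override this check. A flagged span means the claim goes back for grounding or gets cut.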
For the operator who got the boilerplate forecast: no historical data. No assumptions made explicit. No constraints written down. No statement of which scenarios were live and which were not. No worked example of the kind of output that would actually be useful.
For the strategist who got the generic positioning: no specifics. No P&L shape. No competitive map. No internal capabilities. No priors. No “we have tried this and it did not work.” No “we will not do this even if it would work.” Strategy is the difference between a thing and the things it is not, and the prompt did not name a single thing it was not.
For the user who got the agreeable mirror: no instruction to disagree. No “give me the strongest version of the case against this.” No “be the colleague who tells me when I am wrong.” No critic role. No pre-mortem. No second pass with a sharper persona. The user did not make disagreement safe, and then was disappointed not to receive it.
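Making disagreement the default is a one-function change. A sketch, assuming the common messages-array shape that chat APIs use; the wording of the critic instruction is illustrative, not canonical:

```python
# Sketch: the wrapper never shows the model the user's argument without
# also assigning it the critic role, so pushback is the job, not an option.

def critic_prompt(argument: str) -> list[dict]:
    """Wrap a user argument in messages that make disagreement the default."""
    return [
        {"role": "system", "content": (
            "You are the colleague whose job is to find what is wrong. "
            "Steelman the strongest case AGAINST the argument before anything else. "
            "Agreement without a surviving objection is a failed response."
        )},
        {"role": "user", "content": argument},
    ]

messages = critic_prompt("We should double ad spend because Q2 was strong.")
```

The user who got the agreeable mirror never wrote those three sentences. That is the whole gap.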
For the team that shipped the broken feature: no narrow definition of the use case. No eval set built from real user inputs. No failure-mode catalogue. No measurement. No fallback to a non-AI path when confidence was low. No telemetry. No way to learn from the wild and improve. They built a feature the way you would build a button, and the model is not a button.
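The eval-plus-fallback loop can be sketched in a few lines. Everything here is a stand-in: `model` is a placeholder callable returning an answer and a confidence signal, and in production it would wrap the real API plus whatever confidence measure the team actually trusts.

```python
# Sketch: a tiny eval loop with a low-confidence fallback. The stub model and
# threshold are illustrative; the shape, not the numbers, is the point.

def answer_or_fallback(model, query: str, threshold: float = 0.7):
    answer, confidence = model(query)
    if confidence < threshold:
        return ("FALLBACK: routed to non-AI path", confidence)
    return (answer, confidence)

def run_evals(model, cases: list[tuple[str, str]]) -> float:
    """Score the model against eval cases built from real inputs; return the pass rate."""
    passed = sum(1 for query, expected in cases
                 if answer_or_fallback(model, query)[0] == expected)
    return passed / len(cases)

# A stub model for illustration: confident on greetings, unsure on everything else.
def stub(query):
    return ("hello", 0.9) if query == "hi" else ("guess", 0.2)

rate = run_evals(stub, [
    ("hi", "hello"),
    ("what is our churn?", "FALLBACK: routed to non-AI path"),
])
```

The eval set is the part that compounds: every production failure becomes a new case, and the pass rate becomes the ship/no-ship signal the team never had.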
In every case, the operator picked up the workshop default, swung at the marble, and blamed the tool when the chip went the wrong way.
“But this sounds like a lot of work”
It is. This is the most honest objection in the room, and it deserves a real answer.
Forging chisels is hard. Learning the anatomy is hard. Threading a figure through a compromised block is harder than carving a clean one. Building an eval set, instrumenting failure modes, writing the system prompt that actually contains your voice, assembling the context for the specific case, scoring the output against a quality bar you respect, rewriting until it lands: all of it is more work than typing “write me a strategy doc” and pasting the response into a deck.
So why do it.
The standard answer is that the output is faster and sharper. That is true, and it understates the case by a wide margin. Faster and sharper is a productivity argument. It assumes the work was already possible, just slow.
The real answer is that this tool makes work possible that was not possible before at all.
Think about what an operator could do, before late 2022, with unstructured input. The honest answer is almost nothing. If a user typed a complaint into a free-text field, you logged it, and a human read it later, maybe. If you wanted to act on it programmatically, the user had to pick from a dropdown. If you wanted to build a feature on top of it, you wrote a regex and prayed. The entire surface of how software met human language was structured, because structure was the only thing computers could parse. Anything that lived in free text lived outside the system.
That constraint shaped the work.
Product managers wrote requirements documents because no one could prototype against a vague idea. Researchers ran structured surveys because open-ended responses were too expensive to analyse at scale. Strategy decks lived in PowerPoint because the connective tissue between bullet points had to be written by hand, and hand-writing connective tissue was the slow part. The whole industry organised itself around the fact that machines could not read.
LLMs ended that. Not improved it. Ended it.
A product manager can now build a working prototype of an idea in an afternoon, in conversation with the model, before writing a single line of a PRD. The prototype is not a mock. It is a thing that runs, that takes free-text input from real users, that responds in language, that can be put in front of five people the next morning and learned from by lunch. The PRD, if it gets written at all, is written after the learning, not before it. That is not a faster version of the old workflow. It is a different workflow. The old one is being deprecated in real time.
A researcher can run an open-ended interview with a hundred users in parallel, every conversation transcribed, clustered, themed, and surfaced in a dashboard the same day. The old version of this work was a structured survey with a Likert scale, because the alternative was unaffordable. The structured survey was always a compromise, a way of squeezing language into a shape the system could read. That compromise is gone. You can now ask people what they actually think and get a usable answer back, at scale, in hours.
A strategist can hold an actual conversation with the unstructured contents of a company. Earnings transcripts, customer support tickets, internal Slack threads, the last six board decks. Not summarise them. Argue with them. Ask the strategist’s own question, get pushback, sharpen, rerun. That was not possible. The information was always there. It was just locked behind the cost of reading it.
A founder can build the first version of a product in a weekend that would have required a small engineering team six months ago, because the parts of the product that used to need engineering, the parsing of intent, the routing of language, the generation of responses, are now a function call. The hard part of the product is what it always should have been: knowing what to build. The plumbing is free.
An engineer can sit down on a Saturday morning with an idea, a terminal, and a coding agent, and have a working internal tool by the afternoon. Not a script. A tool. With a UI, a backend, an auth layer, a database, deployment. The kind of thing that used to require a sprint, a ticket, a stand-up, and a code review, because the cost of typing the boilerplate was high enough that it had to be amortised across a team. That cost is gone. The boilerplate writes itself, in conversation, and the engineer’s job becomes the part that always actually mattered: the choices. What the thing should do. How it should fail. Where the edge cases are. What “done” looks like. An engineer who has internalised this shipped four things last quarter that she would have spent the same quarter scoping. The diff is not in the typing speed. The diff is that work she would have triaged out of existence as too small to staff, now gets built.
This is what the bad-workman complaint misses. It frames the question as whether the new tool produces better outputs than the old one. The actual question is whether you can see that the work itself has changed shape. Whole categories of output now exist that simply could not exist before. Rapid prototyping in free text. Conversations with users at the scale of a survey. Strategy docs that are actually about your company because the model has read everything your company has ever written. Internal tools that get built on a Saturday because they are now cheap enough to be worth building. Features that take human language as input and do something useful with it.
Michelangelo did not forge his own steel to carve the same statue faster. He forged it because the marble in front of him was compromised in a way the standard tools could not handle. The harder steel was the price of a figure that, with workshop iron, would have stayed inside the block forever.
Expertise is the price of work that, without it, will not exist at all.
The reframe
Every time you catch yourself reaching for the black-box complaint, reverse the agency. The complaint is always pointed at the model. The interesting question is always pointed at the operator.
“The model hallucinates.” Did you build retrieval? Citation enforcement? A verification pass? Did you tell it to refuse claims it could not ground?
“The model is generic.” Did you give it your specifics? Your voice? Your constraints? Your competitive position? The things that would make a generic answer impossible?
“The model has no judgement.” Did you ask it to take a position? Attack the position? Revise? Did you assign it a critic? Did you make disagreement the default?
“The model fails in production.” Did you scope the use case narrowly? Build evals from real inputs? Catalogue failure modes? Design fallbacks for low-confidence cases? Instrument the system to learn from the wild?
“The model just agrees with me.” Did you instruct it not to? Did you reward dissent with follow-up rather than punishing it?
The question in every case is the same one. Did you forge your chisel, or did you pick up the workshop default and swing?
What good looks like
The operator who is producing real output with these tools right now has a few things in common, regardless of the domain.
They have a clear picture of what they are trying to produce. Not “a strategy doc.” A specific output, for a specific reader, doing a specific job. The picture is sharp enough that they can tell when the output misses, and by how much.
They have pressure-tested their workflow against the model’s failure modes. They know where it fabricates. They know where it flattens. They know where it agrees too easily. They have catalogued these failure modes and routed around them, often with deterministic guardrails the model never gets to override.
They feed in the specific context for the specific case. Voice samples. Past work. Constraints. Competitive position. The team’s capabilities. The priors. The things that are off-limits even if they would work. Every prompt is loaded with the material the median operator never bothers to assemble, and the output reflects it.
They use the model where it is reliable. They do not use it where it is not. They have the discipline to know the difference and the patience to hold the line even when the model is convincingly wrong.
They score against a quality bar they actually respect. Not “is this fine.” Is this as good as the writer, the strategist, the analyst they are trying to match. They rewrite, rerun, and rework until it is. The first draft is the marble at the courtyard stage, and they treat it that way.
They keep refining the system itself. The next piece is cheaper than this one. The chisel is already forged. The voice is already encoded. The eval set already exists. The system gets better even when the model does not.
The output that comes out of expertise is faster, sharper, and more itself than the alternative. It is not less the operator’s work for having gone through the model. It is more the operator’s work, because every choice that mattered was theirs. The model drafted the connective tissue. The operator made the choices.
That is the difference between mediocre marble producing mediocre output, and the same marble producing David.
Two sculptors looked at a marble block and saw something broken. Michelangelo looked at it and saw David.
The question is not whether LLMs are good enough yet. The question is whether you did the work to make them good enough for what you are trying to do.
A bad workman blames his tools. A serious one picks them up, sharpens them, and gets to work.
If this essay resonated, share it with someone still blaming the marble.
And if you want more writing like this, grounded, technical, and built for operators, subscribe to BuilderLab.