Stop Screening for Vibe Coding. Start Screening for Harness Engineering.
A polemic and a playbook. “Vibe coding” is the wrong name for the craft. Here is the right one, and how to hire for it.
There are memos that take years to reach the interview panel. The one Pete Flint published this month was about software engineering, but the company that needs it most is yours.
In an essay called AGI Is Here, Flint made a careful argument. The threshold worth arguing about is not the one philosophers keep drawing, where a machine reasons like a human across every domain. The threshold that actually matters, the one that rewires markets and dislocates labor, is behavioral. It is the moment a user stops auditing the intelligence and starts relying on it. Google Maps passed that threshold a decade ago. Nobody debates whether to use it anymore. They just tap the blue dot and drive.
That moment, for a meaningful slice of software engineering, has arrived. The engineers you have been paying for the last two years quietly rewired their workflow around AI assistants and CLI agents. They did not ask permission. They did not schedule the change. They just opened a terminal, typed the name of an agent binary, and never really went back.
The interview loop you inherited from 2018 does not know this. It is still asking candidates to invert a binary tree on a whiteboard while the job they would actually be hired to do is steer three agents across a monorepo, ground their context, and catch a hallucinated import before it ships to production.
The gap between those two has never been wider, and the language we have settled on for the new craft is actively making it harder to close.
“Vibe coding” is the wrong name for the thing.
The phrase was coined in February 2025 by Andrej Karpathy, in a single Twitter post about coding where you “fully give in to the vibes” and “forget that the code even exists.” It was a good riff. On its first anniversary, Karpathy himself described the tweet as a throwaway shower thought, which is fair. It was a throwaway. What he could not have predicted is that the phrase would be adopted as a general label for the entire new category of work, and stick.
The term stuck because the worst end of the new distribution is the loudest. People shipping brittle code. Not reading the diff. Trusting model output that halfway works. Writing three days of glue and pronouncing the problem solved.
The backlash is real, and it is loud. A blog post titled “After two years of vibecoding, I’m back to writing by hand” periodically crests eight hundred points on Hacker News. A Bram Cohen essay called “The cult of vibe coding is dogfooding run amok” cleared six hundred in a day. Alex Kondov’s “I know when you’re vibe coding” made the rounds on the strength of a single claim: that the output has a smell.
These essays are correctly describing a real phenomenon. They are also, in aggregate, naming the symptom instead of the craft. The good end of the distribution is doing something different and disciplined, and it does not deserve to share a name with the bad end.
Call it what it is. Harness engineering.
A harness is the configuration around a model that decides how much of a codebase gets loaded, which tools are available, which model answers which kind of question, how long a session can run before context needs to be pruned, and what gets persisted between sessions. Well-configured harnesses are boring to run and expensive to replace. Poorly configured harnesses are exciting, expensive, and produce code a human reviewer is reluctant to merge.
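To make the definition concrete, here is a minimal sketch of the knobs a harness pins down. This is illustrative Python pseudocode, not the actual configuration format of Claude Code, Codex CLI, or any other tool; every field name and default below is invented.

```python
# Illustrative only: not any real tool's config schema. The fields map to the
# decisions named above: context, tools, routing, pruning, persistence.
from dataclasses import dataclass, field

@dataclass
class HarnessConfig:
    # How much of the codebase gets loaded into context for a task.
    context_globs: list = field(default_factory=lambda: ["services/billing/**", "tests/billing/**"])
    # Which tools the agent is allowed to call.
    allowed_tools: list = field(default_factory=lambda: ["read_file", "edit_file", "run_tests", "git_diff"])
    # Which model answers which kind of question.
    default_model: str = "small-fast-model"      # grunt work
    escalation_model: str = "frontier-model"     # only when the problem earns it
    # How long a session can run before context gets pruned.
    max_context_tokens: int = 80_000
    summarize_after_turns: int = 20
    # What persists between sessions.
    durable_notes_file: str = "CLAUDE.md"
```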
This post is two things at once. It is a polemic, because the current vocabulary is making it harder to think clearly about what modern engineers do all day. And it is a playbook, because once you name the craft correctly, a set of clear observations about hiring follow from it.
Where engineers actually spend their day.
The leading edge of AI-heavy engineering has moved out of the editor. Engineers open Claude Code, Codex CLI, Aider, opencode, or some combination of them, describe a task, and then spend most of their attention orchestrating rather than writing. They narrow context. They route to a smaller model for grunt work and escalate to a frontier model only when the problem earns it. They review the diff. They run the failing test. They ground the agent in a specific file when it starts hallucinating. The editor is still on screen. They open a file once an hour to fix something by hand. Mostly they watch, narrate, verify, and approve.
The numbers on the shift are no longer controversial. Stack Overflow’s 2025 Developer Survey found that 92.6% of professional developers now use an AI coding assistant at least monthly, and 47% use one daily. Anthropic’s March 2026 Economic Index went further, reporting that 79% of Claude Code conversations are now classified as automation, where the AI directly performs the task, rather than augmentation, where it collaborates with a human who is still doing most of the work. GitHub’s Octoverse 2025 describes the same transition from the other end, framing advanced AI users as “strategic orchestrators” who have shifted from producing code to delegating and verifying it. This is a different shape of job than the one being screened for in most interview loops.
Inside the engineering teams we work with, the distribution looks like this. On one end, engineers spend most of their day typing into a composer pane the way they used to type into VS Code, treating the agent as a faster autocomplete. On the other end, the stronger engineers have essentially transitioned into a new job description: a semi-technical director of three or four parallel agent sessions, a meticulous reviewer of generated diffs, and an aggressive pruner of context windows. The transition is still underway even among the best. Almost all of it is self-taught.
The 10x spread.
Earlier this month I wrote a piece for Cade called Your Engineers Have a Burn Rate Now. It opens on an Anthropic Console screenshot showing that two of our engineers, doing roughly the same kind of work on the same codebase, were running an eightfold spread in AI compute spend. Same output. Different harness.
The variance is almost entirely a configuration story. A well-set-up Claude Code or Codex session uses prompt caching, loads only the files it needs, summarizes old conversation turns instead of replaying them, and reserves high-reasoning modes for problems that actually need them. A poorly-set-up session does none of that. It re-ingests the monorepo every turn. It fans out to half a dozen MCP servers that each pull thousands of tokens of tool definitions into every prompt. It leaves reasoning budgets on maximum for tasks that needed a two-line edit. The bill grows. The quality drops. Both.
A team where the average spend is $200 per engineer but half the cost comes from two people is a team paying for misconfiguration, not for work. Among the engineers you already employ, the spread between the best and worst configured is not two times. It is roughly ten times. On the best teams I have seen, it is narrower. On the worst, it is wider. Almost no CFO is modeling any of this correctly yet, and almost no hiring manager is screening for the thing that closes it.
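To see how configuration alone produces a spread like that, here is a back-of-the-envelope sketch. The only grounded figure is the roughly ninety percent discount that prompt caching gives on cache reads; the prices, turn counts, and token counts are invented for illustration.

```python
# Back-of-the-envelope arithmetic with invented numbers, not anyone's real bill.
price_per_mtok = 3.00              # $ per million input tokens, hypothetical
turns_per_day = 200                # agent turns across a day of parallel sessions

# Poorly configured: a large slice of the monorepo re-ingested, uncached, every turn.
sloppy_tokens_per_turn = 120_000
sloppy_daily = turns_per_day * sloppy_tokens_per_turn / 1e6 * price_per_mtok

# Well configured: a grounded context of only the relevant files, with most of it
# served from cache at roughly 10% of the retail input price.
grounded_tokens_per_turn = 40_000
cached_fraction = 0.8
effective_tokens = grounded_tokens_per_turn * ((1 - cached_fraction) + cached_fraction * 0.1)
careful_daily = turns_per_day * effective_tokens / 1e6 * price_per_mtok

print(f"sloppy:  ${sloppy_daily:.2f}/day")
print(f"careful: ${careful_daily:.2f}/day")
print(f"spread:  {sloppy_daily / careful_daily:.1f}x")   # lands near the ~10x seen in practice
```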
The levers that separate the spread.
I will not re-litigate the whole argument, since the Cade piece does it at length. But the observable levers are worth naming, because each one is a thing you can watch a candidate reach for, or fail to reach for, in an interview.
- Grounding discipline. Does the engineer hand the agent the specific files it needs, with a short honest task description and one or two real examples of the pattern to follow, or do they paste the whole monorepo and hope the model figures it out?
- Model routing. Do they default to a cheap model for grunt work and escalate to the frontier only when the task warrants it, or do they leave everything on the biggest model because they forgot to configure anything else?
- Prompt caching. Do they have caching on by default, taking the roughly ninety percent discount on repeated system prompts, or are they paying retail for the same context thousands of times a week?
- MCP hygiene. Do they treat MCP server installs as team-level decisions that enter the shared prompt budget of every team member, or do they install whatever a tweet told them was cool last Thursday?
- Context discipline, written down. Do they maintain a short, current CLAUDE.md or AGENTS.md that captures the hard-won rules their team has learned, or do they re-explain the same thing to the model on every new session?
- Verification instinct. Do they read the diff before approving it? Do they run the failing test? Do they catch the hallucinated import?
- Tokens per merged pull request, not dollars per month. Do they think about the ratio of spend to shipped, reviewed, merged work, or do they measure themselves on a summary number that has nothing to do with output?
None of this is exotic. It is roughly as complicated as writing a good Dockerfile. The problem is that almost nobody writes it down, and almost nobody interviews for it.
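The last lever on that list is the easiest to put a number on. A sketch, with invented names and figures:

```python
# Invented figures: two engineers with the same monthly invoice line,
# normalized by shipped, reviewed, merged work instead of by calendar month.
engineers = {
    "engineer_a": {"monthly_spend": 900.0, "merged_prs": 45},
    "engineer_b": {"monthly_spend": 900.0, "merged_prs": 6},
}

for name, e in engineers.items():
    print(f"{name}: ${e['monthly_spend']:.0f}/month, "
          f"{e['merged_prs']} merged PRs, "
          f"${e['monthly_spend'] / e['merged_prs']:.0f} per merged PR")
# Same summary number on the invoice; a 7-8x difference in the number that matters.
```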
The harness is the new build system.
The analogy I keep coming back to is structural rather than rhetorical. In April 2026 a commenter on a Hacker News thread about agent tooling made the clearest version of it I have seen.
“LLMs like Claude are like V8. Agent harnesses like Claude Code are like Node.js.”
Sit with that for a second. V8 is a frontier piece of infrastructure. It is also, by itself, almost useless for actual application development. What made Node.js mainstream was not V8 being impressive. It was the developer-facing runtime that wrapped V8 with a usable API, a module system, a package ecosystem, and an opinionated set of defaults for what you would build with it. V8 is the engine. Node.js is the product.
Agent harnesses are the same. The frontier models are V8. Claude Code, Codex CLI, Cursor, Aider, opencode, and the internal harnesses companies are quietly building behind a CLAUDE.md and a fleet of skills, those are the Node.js. The shipping unit of modern engineering is not the model. It is the harness. Which means the craft worth hiring for is not “prompt engineering,” and it is not “vibe coding.” It is the disciplined design and operation of the harness itself.
Every decade, engineering organizations acquire a new discipline that used to be optional and is now non-negotiable. Source control in the 1990s. CI/CD in the 2010s. Observability a few years later. Harness engineering is the one they are in the middle of acquiring now. The question for hiring is whether you acquire it by screening for it, or by discovering six months after you hired someone that they do not have it.
The legacy screen is testing for recall in an era that rewards judgment.
A decade of technical hiring infrastructure was built around a specific belief, which was that if you could solve a Leetcode-adjacent problem in forty-five minutes, you were a person worth hiring. The belief was always wrong, because the skill being tested bore almost no resemblance to the skill being paid for. It was tolerated because it was easy to grade, legible to non-engineers, and harder to game than a take-home.
The entire case for it has now collapsed. The Stanford 2026 AI Index reports that performance on SWE-bench Verified, the canonical coding-task benchmark, rose from roughly 60% to near 100% in a single year. Any problem solvable with a few functions in an hour is a problem a modern frontier model solves in five minutes, and the candidate knows it. HackerRank, the category leader for that style of screen, still powers the process at a significant share of large engineering organizations, and I say this as politely as I can manage: those organizations are now paying real money to screen for a skill the job itself no longer requires anyone to exercise unaided.
This is not a claim about whether engineers still need CS fundamentals. Fundamentals are the floor. You cannot evaluate an AI-generated solution for correctness if you cannot tell correct code from incorrect code. You cannot ground an agent well on a complex problem if you do not understand the problem. Fundamentals have not become less important. They have become more important, in the same way that reading has not become less important now that writing is easier. What has collapsed is the usefulness of grading the fundamentals alone, in isolation from the tools the candidate will have open the day they start.
The detection arms race is the tell.
If you want to know whether an interview format is obsolete, watch who is building tools around it. For the Leetcode-style screen, the current funding landscape looks like this. Cluely, founded by a Columbia student who was suspended over building the first version, raised a $5.3M seed and then a $15M Series A led by Andreessen Horowitz to scale a browser overlay that quietly reads a candidate’s screen and feeds them answers. Interview Coder, the original in that genre, was funded by name-brand investors. FairScreen and a half dozen others raised money to detect them. In any given week, a Hacker News thread about AI-assisted cheating surfaces a senior hiring manager describing, in detail, how they can see candidates’ eyes tracking left and right across an invisible prompt.
The data now backs up the anecdote. An analysis of 19,368 technical interviews in early 2026 found that AI-assisted cheating more than doubled in six months, from roughly 15% of candidates in mid-2025 to 35% by year end, with the rate inside technical roles specifically at 48%. Almost two thirds of those who cheat score above the pass threshold. Meanwhile 59% of hiring managers now report that they suspect AI misrepresentation in their pipeline, and one in three say they have caught a candidate using a fake identity or proxy.
The arms race is itself the admission of obsolescence. A job interview that fails the moment the candidate opens a browser tab is an interview for a job that does not exist anymore. No amount of proctoring software will roll that back. The only coherent response is to stop screening for a skill the candidate is going to use a tool for on the job, and start screening for the skill of using the tool well.
The chat-sidebar fallacy.
There is a second, quieter failure mode, which is when a platform acknowledges that AI is part of the job and then bolts a chat sidebar into its own proprietary in-browser editor and calls it done. This is the move most of the legacy assessment platforms have converged on. It is theater, for both sides of the interview.
Here is why. Your candidate already has Claude Code open in another terminal. They have Cursor or Zed on their own laptop with their own keybindings, their own CLAUDE.md, their own skills, and whatever agents they have been iterating on for the last eighteen months. The boxed chat sidebar in your vendor’s web IDE is, at best, a much weaker version of what the candidate already uses. At worst, it is a constrained toy that the candidate has to manually translate back into how they actually work. Either way, what you are observing is not how they would approach the job.
The right response to AI in hiring is not to lock it inside a proprietary box. It is to let the candidate bring the harness they actually use, in the environment they actually use it in, and watch them work.
What to actually screen for.
A harness-engineering assessment looks different from a Leetcode screen at almost every layer. You can think of it as six observable signals, each of them something you can see in a real-repo take-home plus a live follow-up in the same codebase.
- Grounding discipline. Give the candidate a non-trivial repo. Ask them to add a feature or fix a bug that requires finding the right three files out of two hundred. A disciplined engineer narrows context. A brute-force engineer hands the agent the world and hopes.
- Escalation judgment. Watch whether the candidate knows when to hand a sub-problem to an agent, when to write five lines by hand because that is faster than describing them, and when to stop and read the existing code before asking for a change.
- Verification instinct. Did they run the tests? Did they read the diff before submitting? Did they check that the agent did not invent an import? A strong candidate catches a hallucination before the reviewer does.
- Tool fluency. Git, a terminal, a CLI agent of their choice, an editor of their choice. Muscle memory should be visible. Someone who keeps asking how to open the integrated terminal has been interviewed poorly.
- Recovery from failure. At some point the agent produces something broken. A strong candidate rewinds, regrounds, and tries a different approach. A weak candidate keeps re-prompting the same broken context and wonders why it keeps failing.
- Configuration sensibility. The rudest, most clarifying interview question you can ask right now is one sentence long: “Can you show me your CLAUDE.md?” The answer, and the way they talk about it, is half the interview.
None of this requires a new category of platform. It requires an assessment shape that can actually contain these signals, and most cannot.
Why take-home plus live follow-up in a real repo is the shape that fits.
CodeSubmit’s version of this argument is plain, and I run the company, so take the bias where you like. The assessment shape that matches harness-engineering work is a repo-based take-home where the candidate uses their own tools, followed by a live interview inside the same codebase with a human reviewer who can see how the submission was built and ask about the decisions in it. This is not a gimmick. It is the only shape I know of that contains the six signals above.
A take-home in a real repository gives you grounding, verification, and escalation judgment in one artifact. An AI-assisted review highlights the structure of what was submitted, gaps in testing, and follow-up topics worth a human’s time. A live interview in the same repo (we call ours CodePair) lets you watch the candidate extend their own work in real time, with their tools, in a shared environment. And a short coding screen, if you want one (we call ours Bytes), can stay short because you are no longer trying to get all your signal out of forty-five minutes of puzzle.
Human judgment stays central. The hire decision is always the hiring team’s. What changes is what they are deciding based on.
One question, this week.
If you are running a hiring loop right now and you want to do exactly one thing differently based on this essay, do this. In the next interview, ask the candidate to share their screen and open their CLAUDE.md, or its equivalent in whatever agent they use. Do not ask them to explain it. Just let them scroll through it and narrate.
You will learn more about whether they are a modern engineer in ninety seconds than you learn in an hour of Leetcode. The ones who can do it cleanly have shaped their tools, which is the craft that matters now. The ones who cannot do not yet have the craft. Either answer is useful. The old screen cannot give you either.
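For calibration, here is the shape of a healthy answer. Everything in it is invented for illustration; the point is that it is short, current, and specific to rules the team actually learned the hard way.

```markdown
# CLAUDE.md — illustrative example, contents invented

## Project map
- The payments service lives in services/payments; its tests live in tests/payments.
- Run `make test-payments` before proposing any change to that service.

## Rules we learned the hard way
- Never edit generated files under gen/; change the .proto sources instead.
- Reuse the retry helper in lib/retry.py; do not add new retry loops.
- Keep diffs small; split refactors from behavior changes.

## Models and escalation
- Default to the cheaper model for mechanical edits.
- Escalate to the frontier model only for cross-service changes or gnarly debugging.
```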
How we built the alternative.
CodePair puts the real tools on the table. No boxed-in chatbot.
CodePair is the live interview environment we built for this exact world. Real foundation models and coding-agent workflows inside the same shared project. The candidate and the interviewer watch prompts, responses, harness actions, and accepted edits in real time. Full terminal, real repo, Git, previews. You see how the candidate grounds, routes, verifies, and decides what to own. It is closer to how engineers actually use AI than any boxed-in interview chatbot.
Written by Dominic Phillips, founder of CodeSubmit. Related reading at Cade Partners: Your Engineers Have a Burn Rate Now.