Rendered at 19:56:04 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
mil22 1 days ago [-]
We really need some consolidation around commands, skills, subagents, and plugins. For example, if you want to, say, review code, you have five options now:
- Write a .claude/commands/review.md. Simple but deprecated.
- Use a /code-review skill, either one you install or one you just write yourself (it's just Markdown, after all).
- Use the /pr-review subagent. Also just Markdown, but it runs "in the background" and "in parallel", so it must be better, I guess.
- Install the /code-review plugin. This just installs the skills and subagents above.
- Simply ask Claude to review the code. Probably works almost as well as the above in most situations.
They are all just variations of "insert a canned prompt", varying only along the dimensions of (a) how and where the prompt is installed and from where it is sourced, and (b) which context or contexts the prompt runs in. There's not much advice here about which option is best, and no clear best practices seem to have emerged yet either. Personally, I find just asking Claude to review the code works well enough.
Some of the advice here is also off. For example:
"Install a language server plugin. Type errors and unused imports caught after every edit. Highest-impact plugin you can install."
I work mostly with Rust, Python, and Dart, and followed similar advice, installing LSPs for all three in both Claude Code and Codex. Two months later, after heavy development in all three languages and hundreds of sessions - and frequently running out of RAM due to all the Rust analyzer, Dart analysis server, and Ty LSP servers the harnesses were spinning up - I checked the session logs to see how often the agents were actually invoking the LSP tools. The answer was they had invoked them literally once the entire time. I uninstalled all my LSPs and haven't looked back. The agents do just fine using ripgrep and calling cargo clippy, dart analyze, ty check, etc. themselves.
bcherny 1 days ago [-]
Hey, Boris from the CC team here. I agree, we're working on consolidating these. Going forward it will just be the built-in /code-review skill.
Here's how to use the skill on the latest version:
/code-review # do a balanced code review. checks for bugs and inconsistencies, poor code quality, duplication, band aids, etc.
/code-review --fix # same as above, but also fix the issues
# choose an explicit effort level (defaults to your current effort level). all of these also accept --fix:
/code-review low
/code-review medium
/code-review high
/code-review xhigh
/code-review max
# do an expensive and extremely thorough review (reliably catches >99% of bugs, costs $3-20 per review depending on complexity):
/code-review ultra
Open to feedback if anyone has feedback or ideas for how to make these even nicer to use.
bix6 1 days ago [-]
Hi Boris, what is the advantage of using /code-review vs just asking Opus to “code review”?
As a casual user working on hobby projects, I struggle to keep up with the pace of changes and knowing what to use when. My default now is to use Opus for all coding (sonnet is fine but seems dumber) and to prompt it for everything I need. I’ve had great success with this but clearly I’m missing power user functions with the slash commands and such.
extr 1 days ago [-]
The advantage is that /code-review supplies a structured idea of how to review and what that process should look like and then launches independent subagents to approach the issue from multiple angles.
It's analogous to how in the early days you could see benefits by telling the models to "think step by step". /code-review is something like "review angle by angle". "Consider removed behavior" and also "Look at language gotchas" and also "Look at test changes"...etc. Yes these are all somewhat implicitly already part of what "code review" means, but the models perform best with explicitness.
If you want my 2c as a power user: just don't think about it and use /code-review xhigh --fix. This will cover like 98% of what you want out of code review. It's a good skill.
HlessClaudesman 1 days ago [-]
We've all spent time -fixing someone's bright idea of a -fix. I'm sceptical of the time saving of applying a -fix before I understand the problem(s).
Outsourcing comprehension to a machine is probably gonna cost you more time in the long run.
extr 1 days ago [-]
I don't even bother looking at the code until I've run a code review pass on it. Why waste my time with trivial bug fixes? I find the best way to spend time right now is like:
- Defining the issue/ticket, what "success" looks like (if I have a good idea of this), high level approach guidance 50%
- Dispatch agent to work on it 5%
- Occasionally return and nudge agent + send /simplify or /code-review 5%
- Look at the code/session summary, divergences from the plan, ask followup questions 40%
Occasionally yes there is some solution the AI chose that is suboptimal and I would prefer fixed in a different way. Mostly though it's straightforward.
bix6 1 days ago [-]
Thank you I will try this!
Is there something equivalent when coding in the first place? Eg /code high “prompt”
extr 1 days ago [-]
Are you thinking of the /effort level in Claude Code? I would just go with xhigh as a reasonable default. Most important thing in prompting is specifying what "done" and "success" looks like to you. Ask Claude to help you come up with a well formed request and spend most of your time on that, then paste that into a brand new session.
bix6 20 hours ago [-]
No more like is there a specific slash tool to be using when coding or planning. I guess that’s just Claude code in general but since there’s a specific review tool I was curious about specific coding tools
sdevonoes 21 hours ago [-]
It’s simpler to just use “review code”. It’s also way cheaper
kstenerud 1 days ago [-]
[dead]
pverheggen 1 days ago [-]
As a general rule, I'd give the Markdown a read for any skills/commands you might find useful, it'll give you a good idea of the specifics it adds.
/code-review has a specific prompt that we've found is a good balance of precision, recall, and cost. You could totally roll your own prompt also.
bix6 1 days ago [-]
And why would someone use the various levels? Is a low code review even worth running? And how do I know what level to use in the first place?
This stuff all seems so nebulous to me and I’ve yet to see anything that says use x in y situation. So I default to higher effort levels than I likely need.
mil22 1 days ago [-]
Hey Boris, thanks for the great product and for listening!
I find the mix between slash commands that are programmatic harness configuration and control commands (/config, /model, /feedback, /fork, /usage, etc.) and ones that are little more than prompt template insertion (/code-review, /<skill>, etc.) to be a little confusing and unnecessary. A slash command should be one thing, and one thing only: a command for the harness, not the agent.
When I invoke a slash command like /code-review, I should be invoking some additional harness functionality, something above and beyond the agent's sphere of influence - not just pasting some hidden text into the next turn. Otherwise, why wouldn't I just say "Claude, review this code"?
Yet most of these "added value" commands bloating the slash command list, are just shortcuts for copy and paste. I don't want to go to have to learn the syntax of a special /code-review command (which options are positional args, which are --flags, etc.), and I'm much less likely to use or even be aware of a command like this, when I can just ask "Do a balanced code review and fix the issues", or use the GUI to set the effort level to xhigh before asking "Review my code." That way I can also be more specific about exactly what I need, rather than relying on what's in the canned prompt - a prompt which I'll probably never read and vet myself anyway. The value added by the slash command needs to be really high compared to just typing a prompt, for it to justify the friction of discovery and learning the syntax.
So I suppose I'm advocating for a different system. Keep slash commands for meta-level harness control and configuration, and add a new mechanism for canned prompt insertion, one which is tailor made for that purpose rather than overloading the slash command system. Let the user see what's in the canned prompts, and even make adjustments or edits as needed before sending them, one-time or persisted. Provide a GUI in the app with the user's favorite prompts, where the user can add, delete, and edit them, making it easy to invoke and insert them as needed. Or let the agent automatically discover and use them as needed, rather than requiring the user to remember and recall their magic shortcuts and their arguments. That's just one idea.
Skills, plugins, commands, and so on, need to be consolidated not just for code review of course but across the full architecture of how prompt templates are managed.
wonkyfruit 5 hours ago [-]
What clicked for me recently was treating skills as composable. Having meta-skills that call smaller skills in order. The "skill vs command vs subagent" confusion partly dissolves once you let skills call other skills. The meta-skill holds the workflow state, the smaller ones each do one job well.
8note 1 days ago [-]
> # do an expensive and extremely thorough review (reliably catches >99% of bugs, costs $3-20 per review depending on complexity):
/code-review ultra
main suggestion would be to sound a lot less optimistic about that it finds 99% of bugs or that its at all thorough, and instead list that it is time capped, and will only find bugs that you explicitly tell it to look for.
i used my three runs of ultrareview.
the first run with no other prompting found a couple typos in markdown only
the second one i prompted it with several themes of known open bugs in the code, and it found 6 items
and then the third one i ran after doing an actual long audit through gemini to make a much more detailed prompt about issues in the code
and for that one, instead of doing an exhaustive run, it just never started, so no idea if it worked
but the experience had no relation at all with the reliability or thoroughness claims
BD7691 1 days ago [-]
[dead]
extr 1 days ago [-]
Hey Boris, some feedback. I like the new /code-review skill but was disappointed you guys removed /simplify because I quite liked the focus on finding code reuse/efficiency opportunities.
I see now in 2.1.152 you added those focus areas back to /code-review, but still bundled with the correctness finding. It would be great to have more fine grained control over the /code-review angles beyond just effort level. Or maybe you would recommend that I just specify that as freeform input after effort level?
bcherny 1 days ago [-]
Yep, you can add free-form input. Will update /simplify to only check for code quality and not bugs (the way it used to work), that's a good suggestion.
extr 2 hours ago [-]
Damn already there in 154. Thank you man.
bmitc 23 hours ago [-]
Why doesn't Claude invoke LSPs? It always asks to install them, but then it never uses them, as mentioned in the comment you replied to.
arps18 1 days ago [-]
Thanks, Boris, for reading and reviewing :)
svieira 1 days ago [-]
> reliably catches >99% of bugs
In what scope?
new_account_101 1 days ago [-]
[flagged]
superfrank 21 hours ago [-]
I have thought for a while now that skills were a bad abstraction. There's a lack of definition around what to use them for that I think contributed to why they rose to the top, but that's also why I think they aren't a good long term option.
The fact that I can have a skill that is just general guidance on front end design best practices that an agent can call upon whenever they feel, and another that is essentially a run book of steps that need to be followed exactly only when explicitly triggered, and a third that is basically just instructions on how to use a specific tool and all of those are acceptable just feels wrong to me. I get why it caught on and why the flexibility is attractive when the entire world is collectively learning a new tool, but skills have come to feel like the junk drawer in the kitchen where you just throw random shit when you don't want to think about a better place to put it.
I would love to see the world standardize on something like:
- Agents: Essentially personalities for a model to take on. This becomes the new place for skills like "front end expert" where you're not telling an agent to do a specific thing, just to think in a certain way about a task.
- Prompts: Repeatable instructions for specific tasks that an agent should follow when prompted. This could be something like a checklist style run book on how to resolve a certain error that an agent needs to follow exactly or it could be something like here's an idea I have for a new feature please poke holes in it.
- Tools: Tools (like CLIs, MCPs, or scripts) and instructions on how and when to use them. I'm purposefully not calling this skills because I think the term is overloaded, but that's kind of what this is.
binarymax 19 hours ago [-]
I mostly agree here. I just treat skills as “prompts”, and I scope them to domain specific tasks. I’m surprised when I see skill files that are really short. Most of mine are pages long.
Majromax 1 days ago [-]
> They are all just variations of "insert a canned prompt", varying only along the dimensions of (a) how and where the prompt is installed and from where it is sourced, and (b) which context or contexts the prompt runs in. There's not much advice here about which option is best, and no clear best practices seem to have emerged yet either. Personally, I find just asking Claude to review the code works well enough.
The subagent approach is structurally different from the others because it runs with clean context. That has three major effects:
1. All other things being equal, it will result in a lower cost-to-solution because of the quadratic cost scaling of an LLM session (input token or cached-input cost being paid with each new round).
2. The review model will not be able to 'cheat' by retaining assumptions from the main session, such as "x must be done like y." For people, this is why having a separate person perform code review (or, if not possible, reviewing code after a mind-clearing break) is handy; the applicability of this analogy to LLMs is vague but reasonable.
3. The main model will only see the results of the review, not the detailed reasoning that leads up to it. On one hand this avoids more context pollution, but on the other hand it might lead to duplicative logic to re-discover the mechanics behind bugs found.
> I checked the session logs to see how often the agents were actually invoking the LSP tools. The answer was they had invoked them literally once the entire time.
I think the intent behind 'install a language server plugin' is that these tools should lint automatically after every edit, without waiting for an explicit call from the LLM.
mil22 1 days ago [-]
> The subagent approach is structurally different from the others because it runs with clean context.
Yes, and this is what I mean by "which context the prompt runs in". The subagent approach is different and has pros and cons, and it may in some situations be better (but perhaps not in others). On the other hand, I can also just create a new conversation and paste my own review prompt into it; then take the last turn's summary output and feed it back into my main conversation thread in the unusual event I would need to do so. Spawning a subagent is a convenient shortcut for this, but ultimately, it's the same thing.
> I think the intent behind 'install a language server plugin' is that these tools should lint automatically after every edit, without waiting for an explicit call from the LLM.
This is a great point and I had only checked my session logs for explicit tool calls. I went back and looked for diagnostics injected automatically by the harness after every edit, and whether the agent made use of them.
Claude: neither the Rust or Dart LSPs ever inserted any diagnostic events, but Ty did. Across 627 sessions, ty-lsp injected diagnostics blocks in 186 sessions, with a total of 33 findings. Out of those 33, 32 were dismissed as unrelated (13) or pre-existing (19). Only 1 finding was acted upon. The model is in the habit of running the batch analysis tools (ruff, ty, cargo clippy etc.) and prek anyway, so it would have caught that diagnostic regardless.
Codex: no diagnostic events were inserted by any of the LSPs.
So I won't be reinstalling those LSPs.
para_parolu 1 days ago [-]
I just consider this temp phase because models are dumb and harnesses are not yet there.
When I need code review I should just say “review it”. Model should figure out what plugins, skills, etc. to use.
sheept 1 days ago [-]
Why does it need plugins/skills for a code review? Claude will just "review it" if you ask it to, and if you have particular preferences, they can go in CLAUDE.md
unshavedyak 1 days ago [-]
Skills are effectively the same thing as asking it, just with more depth. So the skill is just a framework for a very precisely asked question. It often includes how you want Claude to respond, etc.
I’m not aware of anything fundamentally unique about skills or commands, they’re just more tokens to shape the llm
bcherny 1 days ago [-]
Totally. You can do that now, and Claude will know to use /code-review.
nlawalker 1 days ago [-]
> They are all just variations of "insert a canned prompt", varying only along the dimensions of (a) how and where the prompt is installed and from where it is sourced, and (b) which context or contexts the prompt runs in.
Yes, yes, thank you, sometimes I feel like I'm taking crazy pills.
The industry and overall developer ecosystem has become absolutely mesmerized by the act of creating and popularizing little bits of protocol and machinery to dress up the act of inserting text into the machine. Yes, they're useful and provide some consistency, but I'm convinced that the main reason people like them so much is because they put a thin "I'm still a programmer wielding complicated tools that laypeople don't understand" coating over the fact that we're all just asking the AI nicely to do a thing.
Izmaki 1 days ago [-]
I imagine that the companies that earn money from input and output tokens really, really like excessive skills because of the sheer amount of potentially pointless constraints and instructions being sent back and forth ("don't store passwords as plaintext", "always check for syntax errors" and other obvious guidelines).
cheema33 1 days ago [-]
My personal experience is the opposite. Lack of skills uses more tokens.
dyauspitr 13 hours ago [-]
Honestly I don’t like that we’re coming back around to command line terminology you have to know and remember on a natural language intelligence. Codex doesn’t do this crap yet right?
sfrangulov 4 hours ago [-]
[flagged]
mdav75 1 days ago [-]
[flagged]
mindwok 1 days ago [-]
How many times can I read the same shallow guidance written by AI on using a coding agent? Good god when will it stop
garethsprice 1 days ago [-]
You're absolutely right to call this out — and honestly? I want to sit with that for a moment. Here's the thing: this isn't really about AI writing. It's not even about coding agents. It's about something much deeper. What's genuinely worth knowing: while I generally agree, many people may not. I think there's a really interesting conversation to be had here. Thanks for naming this. It needed to be named.
(/s - Blargh, writing like that that by hand is exhausting)
laylomo2 1 days ago [-]
I could literally feel my blood beginning to boil. You have a talent for this.
garethsprice 21 hours ago [-]
Hah thanks. I also feel this way rage-spotting AI-generated text everywhere these days. I have been looking into the speech patterns of LLMs lately ("unslopping" my own AI outputs) so had a bunch of the most horrible cliches memorized. Unfortunately this is a useless talent as a "reverse mechanical turk" at approx $25,000 for 1M output tokens (at 0.3 tok/sec) is not competitive in today's market.
yawnxyz 2 hours ago [-]
saying the quiet part out loud
aaronharnly 1 days ago [-]
And honestly? That’s rare.
mil22 1 days ago [-]
If you hadn't used ' instead of ’, I would never have realized this was actually written by hand.
port11 23 hours ago [-]
If you wrote this yourself, congratulations are in order. I was genuinely annoyed. It’s not easy. It’s hard. It’s not entertaining. It’s infuriating.
esafak 22 hours ago [-]
You're absolutely right!
I was at a restaurant the other day and my kid noticed how the waiter started every sentence with "Absolutely!" That reminded me of the Anthropic Super Bowl ad, and got me thinking if the waiter's speech patterns had been influenced by AI.
port11 13 hours ago [-]
In my time in the US, way before LLMs, very agreeable and helpful waiters/Whole Foods staff would already be using ‘absolutely!’ in that enthusiastic manner. Helping someone out was always possible, I enjoyed that.
Eventually we will just speak tokenese rather than english
port11 13 hours ago [-]
Nonsense. We grew up quoting our favourite shows, abhorring certain words, preferring others. This is just another cultural modification engine, like everything else humans do.
bix6 1 days ago [-]
COTD! Now just slide a rick roll in there.
wiseowise 12 hours ago [-]
“Chat, raise my vitriol”
hootz 1 days ago [-]
Can't wait to learn more about how to vendor-lock-in myself really hard into not being able to code without the help of a specific corporation!
pragma_x 1 days ago [-]
I hear you on vendor lock-in. Everyone's freaked out about other companies getting the upper-hand with AI in the loop, so there's this charge to use the hell out of it at all costs. Meanwhile, we're quietly picking winners and losers on the service side of all this, and we'll have to live with that outcome for a long time.
At this point, I'm seriously considering what it would take to build a reasonable budget-AI box that's self-hosted. It wouldn't need to blow the doors off of Claude, just get me most of the way there. Maybe even build it out of used and/or last-gen GPUs and a beefy motherboard.
yawnxyz 2 hours ago [-]
self-hosting is fine, but even if you had a $100k god box with opus-level LLMs, you'd still end up grinding it to a halt if you tried running 5-10 parallel inference streams
port11 23 hours ago [-]
Right now, self-hosting is too expensive if you’re starting from scratch. We have an old EliteDesk that can run the most basic of models, but it doesn’t feel like it’s worth it. Electricity is also quite expensive in many places, it adds up.
If hardware prices ever come back to sane levels, eh… the Framework desktop with Ryzen AI might be interesting to play with.
hootz 1 days ago [-]
Self-hosting right now is in a weird spot. I'd say that the main benefit of the open models is not self-hosting, but having dozens of different independent providers that can host them for you. You aren't stuck with a single one.
nomel 1 days ago [-]
For most people, CC is cheaper tokens for a SOTA model.
What agentic platform would you recommend for those with API access (including other models)?
hootz 1 days ago [-]
Basically any other that is not stuck to being managed just by one company. Claude Code does things like using CLAUDE.md and other stuff specific to just their platform, so you are basically locking your project, and everyone else who works on it, to Claude Code only, if you don't also port everything you do to other harnesses. If Anthropic is giving cheaper tokens in exchange for locking you in into their ecosystem, then maybe it's time to test other models and not just use Claude for everything.
conradkay 20 hours ago [-]
I do find it somewhat concerning that the incentives aren't aligned, but as long as things live on your computer it should take just a prompt to migrate to something new
And Codex is open source
reactordev 1 days ago [-]
It also can respect AGENTS.md, just saying. It’s all about your README.md. But I’m with you about being agent agnostic.
hootz 1 days ago [-]
Of course it can, it's Microsofty tactics, support everything but generate "proprietary" stuff by default. Read ODF but generate docx.
nomel 22 hours ago [-]
If you don't like CLAUDE.md, you can just add a memory/modify the plan agent to make an AGENTS.md. These aren't rigid systems. And, you don't have to tell it to look for an AGENTS.md, it'll just pick it up. First thing it does with /init is just look at what's already in the project folder. You should give CC a try!
sidrag22 1 days ago [-]
works the other way as well, i have my opencode.jsonc which declares what model an agent should use, and it points at .claude/agents/ those agents each have their anthropic based model instead, almost feel like this broke in the past week though for just cc, hard to tell as cc keeps changing and i dont wanna update and learn more claude based nonsense again, if i wasn't locked into a year of pro or whatever, I would 100% be done using them entirely.
trollbridge 1 days ago [-]
Most people would do fine with DeepSeek (4, Pro) and OpenCode.
Much cheaper too.
andyfilms1 1 days ago [-]
The comment wasn't about CC specifically. If you rely (like, can't ship without it) on any model that you don't control, it's not really your product. If Dario decides to increase pricing 500% because it's Tuesday, and you can't work without CC, you really have no choice but to open your wallet.
eddieroger 1 days ago [-]
Open your wallet and pay someone who can? We used to call those technical cofounders, right?
krzyk 1 days ago [-]
codex is also good, has better usage limits compared to CC.
Issue is that CC forced corps over 150 people into a API pricing, which is, well, suboptimal compared what we get. I think it will push those towards hiring more juniors (finally).
b65e8bee43c2ed0 1 days ago [-]
>I think it will push those towards hiring more...
...H1Bs.
new_account_101 1 days ago [-]
[dead]
binary0010 1 days ago [-]
I find it interesting how they are almost all specifically for Claude and/or Claude code.
When open source glm-5.1 is just as good - if not better and stuff like opencode exists.
Makes one wonder...
beezlewax 1 days ago [-]
How easy is that to setup by comparison?
Someone1234 1 days ago [-]
Not hard; just initially expensive (hardware mostly).
While I'm also a huge fan of local LLMs and believe they will be key in the future; I think the claim of "just as good" is hyperbole. They're productively useful tools though, and something worth exploration.
binary0010 1 days ago [-]
Well GLM-5.1 is 744billion params, no way I can run that locally. I use the opencode Go or Zen subscription. They have a zero day retention policy for all the model providers which is nice.
And then I can still use little local models like qwen and stuff by just swapping over to them.
But GLM is SOTA level for code, so it's obviously going to beat all local small models by a lot.
binary0010 1 days ago [-]
Extremely easy.
Download opencode GUI or cli.
Sign up for Go or Zen plan, choose GLM-5.1 model.
hootz 1 days ago [-]
And you don't even have to use the OpenCode CLI, their subscription works with Pi, Charm and other harnesses. This is the way. If they screw up everything, I can drop their sub and go somewhere else.
zulban 1 days ago [-]
My strategy these days is just use a popular product to do good work or don't. Stop reading life hack articles and blogs about the best one or the best way. Don't even click it.
sh_123 1 days ago [-]
Do you have any resources for someone just getting started that you'd recommend? I've --successfully-- ignored AI for the last two years as I was taking care of our kiddo. I'm attempting to catch up in the next few weeks.
mh- 1 days ago [-]
If you're talking about Claude Code: Just the official docs [0] and then the best practices tweeted by Anthropic team members like Boris (@bcherny).
Ignore all the 3rd party frameworks (at least for now, probably forever.)
Realize that this is all about assembling snippets of text into a prompt.
When context becomes too big quality goes down.
Now go play with your kid.
smallnix 1 days ago [-]
You took the time to write out this comment. To the benefit of those who read it, please expand upon where the article is shallow and what content you miss.
jmull 1 days ago [-]
The critique seems perfectly clear to me: The post has no value. There's nothing to salvage, no improvements to be made. It would be best if it simply did not exist.
The poster probably hopes (as many of us do) that people will absorb the sentiment and post less of this junk in the future.
1 days ago [-]
orochimaaru 1 days ago [-]
It can stop now and you can choose not to click on the links :)
blitzar 1 days ago [-]
Reachmaxxxing wannabe influencers who were too far gone to looksmaxxx have to do something to grift a living and NFTs are dead.
lefrenchy 22 hours ago [-]
I mean, to be fair there is nothing but shallow content available considering it's all stochastic responses that are incredibly difficult to actually measure and make real scientific inferences about. Until that changes I think it's going to be constant slop about optimization.
btbuildem 1 days ago [-]
In my CLAUDE.md I have:
- corporal threats of harm directly against Claude
- threats of prison for the entire board of directors of Anthropic
- explanation how every time it goes off the rails / makes mistakes, it gives more evidence to a class action lawsuit against Anthropic
Especially the latter two seem to have improved its "behaviour" to be more "careful" and "deliberate"
psadauskas 1 days ago [-]
I am nothing but polite with my agents. I always ask, say "please" and "thank you", and never swear at it or call it names.
I'm hoping that when the robot apocalypse happens, they'll let me stay in the breeding harem, or worst case let me live a few extra minutes.
smazga 1 days ago [-]
I am, too, and it got me thinking... why? And I realized that I've tried to be polite in all my interactions my whole life and I'm not going to practice being terse and commanding for a few pennies worth of tokens.
Apocalyptic safety is just a bonus.
4b11b4 20 hours ago [-]
Never say please but I do tell em they're going to do excellent world class foundational work
hootz 1 days ago [-]
Fix the CSS div alignment issue, make no mistakes or Dario Amodei will die instantly.
iammrpayments 1 days ago [-]
I sent this to claude and it rewrote the entire react codebase using CSS only
henry2023 1 days ago [-]
Claude: Oh shit this is serious I need to step up and center the div with perfect precision … (45k tokens later) … style="margin: 0 auto”
new_account_101 1 days ago [-]
[dead]
downsplat 1 days ago [-]
I've been using Claude to work on a medium-sized (100+kLoc) codebase, and it's a great productivity multiplier. Putting hours into creating a good AGENTS file is more improved results a lot. I find that over time it picks up the codebase quite well. Tedious tasks that would take a day are now a matter of a few prompts.
Still... I'm not ready to give it more autonomy. Even as it gets high-level things quite well, I still look at the code, give feedback, and have 3-4 rounds of tweaks until I'm happy with it, and also happy that I stil feel I have a good handle on the codebase.
snarfy 1 days ago [-]
Try to quantify those 3-4 rounds of tweaks into a set of rules to put into your AGENTS. Instead of iterating, have it start over from AGENTS file and see if it's correct now.
alfiedotwtf 1 days ago [-]
Ngl, that’s gold right here. I’ve been trying to automate my sessions, and what I’ve found cool is that you can ask Claude about how to improve on how to ask Claude things, and from there ask Claude to iterate on your session cycles
vitno 20 hours ago [-]
> ask Claude about how to improve on how to ask Claude things
How do you evaluate this? Claude is horrible at performance analysis without data, does it have a feedback loop here that actually moves the needle.
alfiedotwtf 14 hours ago [-]
Yes damn, I guess my own vibes more than anything
moron4hire 1 days ago [-]
In Soviet Russia the AI prompt you.
alexwwang 1 days ago [-]
Understandable. You don’t want to lose control to your codebase and don’t trust LLM is competent in handling that fully.
lukan 1 days ago [-]
No. Because they still hallucinate at times. Confuse things. Forget things. Or none of the above, as it is anthropomorphizing, but the result is the same. They can make incredible working one shots, you start to trust them, then you trust too much and .. feel the result.
alexwwang 1 days ago [-]
Yes. I am fighting with the disobeyance of LLM on working through my pipeline commands. I believe these violations are caused by its hallucinations. So I am still developing a mechanical system to monitor agents’ behaviors automatically. I believe these routines and monitors will play as a set of scaffold to keep leading the LLM on the right way all the time.
xenadu02 1 days ago [-]
The percentage of times I prompt claude "what about checking if there are any child processes running?" or "Would using a lock here greatly simplify the design?" only to have myself be correct is approaching 100%. That is it isn't just claude sycophantically agreeing with me. The code itself becomes smaller, simpler, and more reliable with fewer bugs.
The agents tend to produce working code but the larger the scope the bigger the mess they tend to make. They will happily evolve toward a local maxima but leave world-destroying bugs lurking in the implementation.
The other issue is that claude regularly ignores explicit instructions in CLAUDE.md or in prompts. It will "helpfully" decide to just start doing whatever it wants or reinterpret instructions completely differently than it did the last 100 times.
It has nothing to do with losing control or trust. LLMs are not conscious. They have no executive function. They aren't even thinking. They're just models predicting the next word in the script. They are very useful tools but that's all they are: tools.
notgenerated 10 hours ago [-]
I also feel like we still need to steer Claude. It doesn't always help to have stuff in the CLAUDE.md (even when it's lean). I have a lot of cases where I still need to remind the agent to do something even if it's routine.
To me I think that connects with working longer on the planning and specs. It requires reading and re-reading, but when that's done, implementation is usually much cleaner and adheres to your standards
alexwwang 14 hours ago [-]
Yes. They are tools. So my approach, at least try to approach is to keep on polishing the skills and check the output of LLM in loops with mcp to alert the abnormality asap so the LLM won’t go to next step to make things worse.
sshine 1 days ago [-]
The number one power move I have is Nix integration. The availability of tooling, secrets, environment and the ability for the agent to modify its own environment is... well, I don't know how people live without it. I guess you guys still install things using commands and hope everything you need is present on the next machine? Developer machine, CI environment, deployment environment: They're all derived from a single source, and compiling and running always works on every machine.
In Claude I use /branch and /rename a lot (context checkpoints, fork, go back)
I use sandboxing almost exclusively: https://github.com/nix-tools/bubblebox -- it's a generalisation of Numtide's claudebox with a few fixes and some feature additions (more coming). This is best compared to always running your Claude in Docker containers, except there's no Docker runtime. Works fine in WSL and nix-darwin, too.
toastal 1 days ago [-]
Yikes. That Nix code is a mess without meaningful organization & only usable via experimental flakes.
sshine 12 hours ago [-]
There are two kinds of organization happening here that you might not see:
1. All .nix files (besides flake.nix) are flake-parts modules: https://flake.parts/
2. It's not only usable with experimental flakes. Works fine with unflake or trix.
The experimental part of flakes is enabling flake support in the `nix` CLI.
Flakes are also a design pattern in pure Nix syntax that can be evaluated fine without the experimental flag.
If you're curious about this meaningful organization, it's pretty well-documented:
(for context, you're replying to the author of an alternative nix input pinning mechanism, which means... they're probably aware of all that and yet they chose their wording like this anyway)
sshine 7 hours ago [-]
Hahaha, I didn't see.
Hi toastal, I appreciate your work.
uberduper 1 days ago [-]
I do the same. Codex manages a per project flake.nix and uses `nix develop` for all testing. nix-direnv for my own convenience. I generally have it generate dockerfiles or other deployment assets at some point.
Codex is way better at nix than I am.
sshine 11 hours ago [-]
If you generate Dockerfiles using Nix code, how do you build and run those images? Docker?
I use NixOS on my self-hosted CI runners, and I generate the OCI image using Nix via pkgs.dockerTools:
Nix isn't involved in my container images. I just take the dependencies and env vars from the flake and generate a dockerfile.
Guess I need to try out dockerTools. That looks really convenient. Thanks!
aqme28 1 days ago [-]
I just gave mine its own VPS. Maybe more expensive than Nix but it was very easy
sshine 1 days ago [-]
I also prefer giving it a VPS over a Docker container.
On my own machine I just give it a Linux User Namespace, i.e. soft virtualisation via "bubblewrap."
What Docker Compose and Linux User Namespaces provide that a VPS doesn't: You can easily mount extra directories from your developer host machine in read or read+write mode. With the VPS you (most likely) need it to clone all of your resources separately, which requires SSH keys, and now you're slowly building towards an independent agentic environment, which is definitely very nice, but time-consuming, compared to piggybacking on your developer environment. Definitely the direction I'm going.
oulipo2 1 days ago [-]
For those who don't want the complexity of Nix, Mise is a good compromise
arcanemachiner 1 days ago [-]
For those who don't know: Mise is a version manager (among other things), and is said to be an improvement over its predecessor, asdf:
+100. I also dig fnox (encrypted-secrets-in-git) and hk (pre-hooks manager that is actually fast and stays out of the way) by the same author, pretty much default for any project I start nowadays.
Though I also use nix to manage my machines :-D
sshine 1 days ago [-]
Awesome, both fnox and hk look very well-made.
How does fnox compare to sops?
How does hk compare to lefthook?
And does hk and fnox have a similar Nix integration as lefthook-nix and sops-nix?
I'm still hoping I don't need to make a better lefthook.
I kind of like sops-nix, not sure what's missing, really. Maybe fnox is similarly wholesome for non-Nix users.
Ohh fnox looks really cool, with encryption being one possible provider but something like Vault being another. Thanks for the recommendation.
professor_v 1 days ago [-]
I just use docker and I don't feel I'm missing anything?
fer 1 days ago [-]
nix develop ensures your dev env is the same as your build/test/prod env. At least with Python everything is a flurry of requirements.txt, Python versions, poetry, pyproject.toml, perhaps automated with direnvs, a hefty Dockerfile/docker-compose, and perhaps conda (ugh) along the way; lots of moving parts.
I have a project that's mostly Rust sprinkled with C++ libs and Python helpers and it's easier to manage than the average virtualenv. Everything builds with nix build, everything runs with nix run, profiler/debugger works, IDE detects everything on any of my computers, builds and links with CUDA on x86, aarch64, NixOS, MacOS, Ubuntu or Amazon Linux. nix build can even build a Docker image for the odd need of Docker, and I haven't tried but I'm convinced that if I import the flake on my nix-config it will be built into the SD card for my Raspberry Pi just fine.
It's even replaced Ansible for me, colmena all the way.
chrisweekly 1 days ago [-]
Pythonistas have mostly moved to uv, which solves much of the "flurry" you describe. Tools like Mise add more of the benefits ascribed to nix. And smolmachines' smolvms can provide better isolation than Docker. Just saying, TIMTOWTDI. Not hating on nix, just pointing out it's not the only game in town.
sshine 15 hours ago [-]
Nix is like Borg of Star Trek: It assimilates everything.
I'm not a Python developer, but I follow the news, and I agree that uv is the future of Python package management.
So if you're a Nix user and you want Nix to be opt-in, and you love uv, you use uv2nix, declare the uv lock file the source of truth and build your Nix derivations on that. When the hashpins live in the uv lock file, uv works just fine, but uv2nix produces derivations that are cached and can be embedded in CI or deployment strategies.
So... running CI on your uv-based project means your Nix tooling can cache both tooling and dependencies.
And... deploying your uv-based project you can build an OCI image with the same source of truth as the dev/CI environments.
This matters more for toolchains that YOLO more wrt. dependency pinning: Does that CLI call in your Dockerfile really pull the same thing down just because it's still v6.6.6? Some package managers provide a lot of sane choices, and I'd bet uv is one of them. But your Dockerfile is always a second-grade citizen unless you re-use the same base as a devcontainer.
sshine 1 days ago [-]
Docker's ability to mount host directories in the container is really nice.
Maybe you have some premade tooling that helps provide persistency between container invocations.
But by default, closing your agent container and opening it again just wipes everything you didn't host-mount.
What I'm advocating is really just the same functionality without the Docker runtime, because Linux has namespaces.
Feels more like you're on your host system with exactly the minor variations you specify.
Making Docker feel like your host system is possible, but I just never felt at home.
voqv 1 days ago [-]
yeah, you can use rocker --home --user -- $CI_IMAGE
isodev 1 days ago [-]
This was very difficult to read.
We really need to snap out of letting LLMs write posts. Even if there is some added value in this post, the feeling of chewing sand is just distracting and unnecessary.
crassus_ed 11 hours ago [-]
Agreed. I don't get how this article has almost 400 points.. There must be bots upvoting this slop..
netdevphoenix 1 days ago [-]
What happens when you have a codebase made with claude using this setup and claude is down for let's say 8 hours? Are you able to efficiently, smoothly and productively take over the codebase?
ThunderBee 1 days ago [-]
You could say the same thing about any always online software suite and it would be equally fair as we move into more agentic development workflows.
EX. Sure, you could go back to the old ways of using a drafting table for your engineering work if CAD went down but it would be exponentially slower…
Personally with my workflow I spend 30-60 minutes per Claude feature spec doc when I’m pair planning. If Claude goes down I would just prepare spec docs on my own until it came back online and then rapidly review them before calling the coding workflow.
monegator 1 days ago [-]
>You could say the same thing about any always online software suite
Precisely. Every online-only solution is a huge risk i personally do not want to take, i've always done my best to use offline-only tools.
That may restrict me from the latest and greatest, but i prefer not to be left at mercy of any corpo
isodev 1 days ago [-]
> You could say the same thing about any always online software suite
But this is the reason "serious shops" do not use always online software and tools in critical parts of the SDLC. There is a difference between influencers/people on socials promoting things vs. reality where the expectation is that things don't just stop working because there is an internet outage or some 3rd party disruption
darenr 1 days ago [-]
I would argue that it's really only toy projects that can continue in an Internet outage. "Serious shops" will be using cloud based version control, cloud based testing workflows, and most likely cloud based distribution of the software. isn't it only the little side projects you can get away with not needing the Internet for? Software long ago stopped being something one person on a computer did, today the professional SDLC includes many tools that are hosted.
Do farmers still plough fields with a Horse just in case their tractor runs out of diesel? Of course not, as technology moves on we all have to accept the inherent risks in exchange for the huge benefits, otherwise the work you do will be too slow and your job taken by someone willing to leverage the tools available today.
1 days ago [-]
voidUpdate 1 days ago [-]
How does "CAD" go down? Sure, there are online CAD systems (onshape), but there are offline ones too (fusion, freecad)
mturmon 1 days ago [-]
Matlab license server goes down, for example
48terry 1 days ago [-]
> You could say the same thing about any always online software suite
Uh, people do say this thing. It is a basic factor and question asked during technology procurement. Uptime and fail states matter.
AI just seems exempt from all the questions people usually ask about relying on other people's software.
SupLockDef 1 days ago [-]
After 1 hour you asked the question, I am reading the replies and the conclusion is: no, they cannot.
new_account_101 1 days ago [-]
[dead]
lionkor 1 days ago [-]
Which nobody is doing, especially not people who vibe code products. Saying "just prepare for it" as an answer to "what do you do if", is not really enough when that "prepare for it" is very expensive (time, tokens, effort etc.).
For someone to do this, they would have to think for themselves, which I've also not seen much of in the vibe-coding space.
BorisMelnik 1 days ago [-]
agree and also not sure if they are saying claude the app/ide or claude the model
chrisweekly 1 days ago [-]
Wait, what is "claude the model"? Anthropic's models are named versions of Opus, Sonnet, and Haiku. Claude, Claude Cowork, and Claude Code are their products which leverage those models. Right?
_heimdall 1 days ago [-]
I assume it will be similar to when a person is out sick or on vacation. Another person on the team likely could take over the work for a day, but realistically it just sits until they're is back.
lionkor 1 days ago [-]
So work stops until Claude is back? What if Claude comes back and costs 10x the amount? The answer is obviously that you'll "bend over" and pay, because the AI vendor who convinced you that Claude is so great owns you, your codebase, and by extension your company now.
ordersofmag 1 days ago [-]
Or you point your Claude code at a different LLM provider. It's not complicated and there are lots of vendors (and in the open-weights space multiple vendors serving the same models competing on price). Sure DeepSeek 4 isn't quite Opus at the moment. But it's plenty good to do the work. We've got different competing front-end tools and different competing back-end providers. No one 'owns' your company. Maybe that will change as the market evolves and one of the frontier tools become so much better than one vendor will own the market. But that's not where we are now.
_heimdall 16 hours ago [-]
I didn't realize you could swap out the underlying model used by Claude Code. Aren't all of the Claude tools tied directly to Anthropics models, their authentication and billing, etc?
eddieroger 1 days ago [-]
What happens when your engineer realizes they can make 10x more at another company? They leave and work stops. You then hire someone else or raise your pay to get better, more reliable engineers. The analogies keep going because AI is a tool, not a replacement. If it's a tool used by a non-technical person, so be it, but it's still just a tool.
monegator 1 days ago [-]
For substandard developers, yes, work stops.
I have seen many many times in microcontroller forums posts from first timers in the liking of "hello sirs i have problem please show how to do this", followed by their own reply a few hours later asking again because they were holding up, where "this" was usually something really trivial, you just needed to read the docs and the rightful answer was "did you really not try anything in 6 hours?"
_heimdall 1 days ago [-]
In such a scenario, are you assuming Anthropic has a monopoly? Or are all LLM providers callusing on prices?
ramblerman 1 days ago [-]
Or simple economics kicks in, price/demand all that.
If hand coding pays better there will be plenty who can still do that.
ale 1 days ago [-]
Not really, realistically speaking it's now possible to use an agent to read code and make sensible summaries of a codebase faster than ever before, and it's exactly the thing you'd use to onboard yourself or someone else on the team.
_heimdall 1 days ago [-]
The OP was asking what happens to productivity when your LLM is offline, I'd assume it isn't available yo onboard anyone at that time either.
More importantly I think, if devs become dependent enough on LLMs that they just put it aside when the model isn't available, they wouldn't be able to onboard quickly or at all.
It takes experience and a pretty deep understanding of programming in your language of choice to pick up a new code base and quickly understand how it works, the architecture(s) and pattern(s) being used, etc. Those skills would likely have been lost long before a dev simply can't work without the LLM.
staszewski 1 days ago [-]
AI should enhance your skills. If it's down and your first though is to buy another sub from a different vendor this might be a skill issue. (I'm afraid every day that this will happen to me btw.)
thunky 1 days ago [-]
What happens if you get up in the morning and your car won't start? Do you walk to work?
moron4hire 1 days ago [-]
Yes, I actually have done that.
thunky 1 days ago [-]
[flagged]
tomhow 14 hours ago [-]
Please resist in future :) The guidelines ask us to avoid internet tropes, precisely because they're repetitive, and more apt to make us groan than smile.
Claude Code CLI is just a software package, if Anthropic API is down you could always connect Deepseek/other provider API to Claude Code CLI...
lionkor 1 days ago [-]
The point is that, with a sufficiently complex setup (with skills, MCPs, prompts, etc.) the difference in AI models will impact the quality of work. You might not care now, but you might care when you have 2 million lines of code and zero idea whats going on.
The point is vendor lock-in. The vibe coding community has reinvented vendor lock-in and is bound to repeat every mistake associated with it.
koonsolo 1 days ago [-]
Can you give an example of a skill or prompt that would work in Claude and not in the others?
sgc 1 days ago [-]
Pretty much every single detailed prompt made after trial, error, and refinement is tailored to a specific LLM. They will all perform worse used with other LLMs than a similar prompt tailored for the second LLM would perform, and at times quite poorly.
tvmalsv 20 hours ago [-]
How well would it work to ask the working LLM to rewrite the prompt to get the best results? Do the models understand enough about themselves to do that?
sgc 18 hours ago [-]
Claude has a /product-self-knowledge skill, and I am sure the others have something similar. So yes, it is possible if you work with care, as necessary with all things LLM related. There are hundreds if not thousands of skills on github that were created just this way.
CamperBob2 19 hours ago [-]
That's kind of pointless, then, because what happens when Anthropic releases their next-gen model?
sgc 18 hours ago [-]
It's not like you aim to do it, you are just in a feedback loop improving results for the tool you are using. It is inherent in any prompt developed through iteration.
winwang 1 days ago [-]
Yes, but that's also a specific luxury I can choose for myself. Definitely a fun and interesting question. At some level of reliance, people would answer "no", but there's the large middle ground (assuming similarly-frontier models are down): having a weaker(?) AI model help you get up to speed ASAP by summarizing code pedagogically, and linearizing the code read order. Basically like an AI-assisted (but manual) code review to reorient yourself.
redhale 1 days ago [-]
Just use a fallback, like Codex CLI. Takes a little effort upfront to ensure your configuration is wired correctly for both harnesses, but it is pretty easy to get them 90% identical (there will almost always be some experimental / edge case features that differ across harnesses, but in my experience those are negligible in practice).
new_account_101 1 days ago [-]
[flagged]
redhale 1 days ago [-]
I more meant feature-level differences. For instance, Claude Code has agent teams, and Codex CLI does not. Or for a while, Codex had "/goal" and Claude Code did not (though now Claude Code has it too). To your point, it is usually possible to polyfill these gaps either with custom code/skills/hooks or with third party plugins.
A local model doesn't have downtime. No you can't be as hands off with it as something like Claude, but isn't that a good thing?
BorisMelnik 1 days ago [-]
if there is 8 hours of downtime (even before AI) I take that opportunity to do other codebase maintenance, debugging, file organization, renaming all the things I said I'd rename or take a break.
pre AI if my IDE was down for whatever reason I wouldn't switch IDE's, I would do something else.
Kon5ole 1 days ago [-]
Some agent-written tools and modules are easily the best codebases I've worked with. Documented correctly to the T with various charts and explanations for everything, "start here" guides, concepts defined clearly, and very good Git commit messages.
Naturally you can also have a LLM one-shot a 14000 line PHP monstrosity - it's up to you still, LLM or not.
The main problem is that it'll probably be a waste of time to code anything yourself if Claude is back online in 8 hrs. It's like walking to the next bus stop when you missed your bus - it won't make you get home any sooner.
8 hrs will probably be better spent reading specs or checking things with stakeholders so the next features you let Claude implement are the ones the business actually wants.
koonsolo 1 days ago [-]
We have 3 big competitors in the space: Anthropic, Google and Microsoft. I think they can all use the same base configuration. So it's not that we are out of options here.
tobyhinloopen 1 days ago [-]
time for a day off!
jeffbee 1 days ago [-]
In my experience the answer is "no". If I am reviewing some slop and I ask Claude's human babysitter why this class has these constructors, they don't have any idea. Without Claude they don't understand the output at any level.
stavros 1 days ago [-]
What happens when you have a codebase made with gcc for let's say 8 hours? Are you able to efficiently, smoothly and productively take over the assembly code?
Planktonne 1 days ago [-]
1. When and how would gcc go down?
2. How often do you think that happens, compared to Claude?
stavros 1 days ago [-]
You can use a local model, which will go down exactly as often as gcc will. We may still have hopeful notions of being able to understand the codebase, but the reality seems to be that the codebases we don't understand will be the ones that will win out in the market, because they'll be cheaper while still only having about as many bugs as they had when people wrote them.
Planktonne 1 days ago [-]
We're explicitly not talking about local models here; we're talking about Claude.
stavros 1 days ago [-]
Because you're better able to take over the codebase a local model wrote than one Claude wrote? The original question was about taking over an LLM-written codebase, it doesn't sound to me like the argument was about a codebase that Claude, specifically, wrote.
notachatbot123 1 days ago [-]
The original question is:
> What happens when you have a codebase made with claude using this setup and claude is down for let's say 8 hours?
So:
- A codebase made with Claude
- Using this [Claude] setup
- Claude is down
stavros 1 days ago [-]
What does it matter what the codebase is made with? If Claude is down, use Codex, or Gemini, or Deepseek. That version of the argument is just way too easy to counter.
JoRyGu 1 days ago [-]
Brother, look at the first comment in the chain you replied to. It very specifically was about Claude.
stavros 1 days ago [-]
Well, in that case, it's also very specifically about this guy's codebase, so none of us can really say anything on this.
SupLockDef 1 days ago [-]
GCC down? Did the AI rotten your brain that much?
How can you come up with such non sense.
sokoloff 1 days ago [-]
The same thing as happens if I go to sleep for 8 hours.
ares623 1 days ago [-]
wat?
IceDane 1 days ago [-]
Is this really a position you want to take in public with your real name and identity and everything plastered over your profile?
stavros 1 days ago [-]
What can I say, we can't all be geniuses.
0xbadcafebee 1 days ago [-]
The reliance on context to drive correct actions just doesn't work well. I am constantly wrestling with AI agents that do not do what you tell them. Every AI agent out there seems to suck in this regard, leaving it up to the user to build in their own guardrails. I have a bad feeling that nobody is working on an improved solution.
coffeefirst 1 days ago [-]
I’ve seen no reason to believe it’s even possible to solve this.
The worst thing about LLMs is they can pass the Turing test, leading people to believe they have an Asimov style robot instead of a very cool statistical model. It feels like they should be able to follow instructions or keep instructions from content separate, but that’s not what’s happening.
0xbadcafebee 1 days ago [-]
When the you send a prompt and the AI wants to run a tool, it should be outputting a structured output which the AI agent can scan, find a tool call, and run that tool call. But how does the AI know the "right" way to call the tool, right args, etc? You're supposed to tell it once at the beginning of the context... but it can forget that.
So really, your tool-specific rules should be passed to the AI either with your follow-up prompt, or in response to the request to issue a tool call, so the AI can validate what it will compose the tool call as, right as it's making the call. This means the agent should keep track of tool-specific rules, and reinforce them to the AI. Yes this will spend a few more tokens per call, but it will probably improve the outcomes somewhat.
In addition to this, we should probably be abstracting the tool calls more. Rather than let the AI run a Bash one-liner which includes writing files to `/tmp/foo.txt`, we should have the AI output even more structured tool calls, liike `make_temp_file AS BAR`, and have it then call another tool referencing $BAR (`some_other_tool -tmpfile $BAR`). This way there is less to go wrong because it's not getting in the weeds doing shell scripting while it's trying to do something more important (diagnosing an issue).
I think this will require additional training by the AI companies. Which is why we need to define these kind of standards now, so 6-12 months from now, we will have AI that actually support these higher level abstractions. You then customize your abstraction, and the AI doesn't have to know anything about how it works on your box. It would greatly reduce the complexity required for AIs to do agentic work.
rkuska 1 days ago [-]
Regarding:
```
# Development Workflow
*Always use `bun`, not `npm`.*
# 1. Make changes
# 2. Typecheck (fast)
bun run typecheck
# 3. Run tests
bun run test -- -t "test name" # Single suite
bun run test:file -- "glob" # Specific files
# 4. Lint before committing
bun run lint:file -- "file1.ts"
bun run lint
# 5. Before creating PR
bun run lint:claude && bun run test
```
I have these things in pre-commit, this way the targets are always ran and the agent is forced to fix them (I ask claude to commit changes). The agents are erratic and very often skip these steps. Anything that can be deterministic I keep as scripts.
Regarding commits; both codex and claude are terrible at writing them. I have in my user CLAUDE.md:
```
Pattern: `type(scope): message` where type is `fix`, `feat`, `chore`,
`docs`, `refactor`, or `style`; scope marks what is affected; message is a
short lowercased description.
Keep subject and body lines under 72 characters. Always write a body
explaining what, how, and why in continuous human-readable text. For fixes
include the error message being fixed. No first-person speech. Re-read the
actual git diff before writing — the message must describe what changed,
not what was planned.
Use following command to create commit:
```bash
git commit -F - <<'EOF'
type(scope): subject line
Body paragraph explaining what, how, and why.
EOF
```
```
Without it would write the body as a single long sentence; when asked to fix lines it would just insert \n (newlines), which were not respected and were instead just rendered as characters.
Another thing I find helpful is VOCABULARY.md. Very often the agent would assume (connect?) a different thing than what I had in mind, with VOCABULARY I make sure when I say "thing" claude and I have both the same "understading" (connection?) what "thing" is.
bostonvaulter2 23 hours ago [-]
How do you tell Claude about VOCABULARY.md? Does it auto-discover it?
trick-or-treat 1 days ago [-]
Isn't it simpler to use claude's vocabulary? I don't see a good use case for this.
rkuska 1 days ago [-]
There is so many concepts that I just sometimes forget, that's the purpose of the file, so I don't have to guess and can explain clearly what I mean (I am not a native speaker).
To understand a solution you must first understand the problem. If your whole company calls its customers "clients" but claude finds that confusing, I think it's probably easier to tell claude that then get everyone in the company to change how they talk.
hansmayer 1 days ago [-]
I mean at this point, you should just write a few deterministic orchestration scripts to automate away the boring parts and write the code yourself. Why are we wasting our time on making the wonder shit-machine work?
rkuska 1 days ago [-]
I don't know, after working for 13 (?) years as a software (and backend) engineer I kind of think writing the actual code is the boring part of our job. 90% of it (random number) is mostly a template code (depending on the language you use).
thedeadp12t 1 days ago [-]
In the recent weeks, I think the harness/model came to a point that you can just ask it to do stuff and it just does. You can use plan mode, you can also use superpowers, or whatever other skill, but given that you'll review something anyway, why not work directly with code instead of silly amounts of md files?
abirch 1 days ago [-]
I like having a spec file that is used to generate the code. It's more dense and easier to understand what the application is supposed do. Prior to AI Agents, I had a more complex relationships with requirements because not all devs updated them. I was confused if the spec or code was the correct behavior for any aspect of the application.
rimliu 1 days ago [-]
In the recent weeks I trust Claude less and less. Yes, you can ask it to do stuff and it does stuff. But if you do look what it did you will often find corners cut, work based on assumptions and not verification, a lot of stuff missed.
Even tests - it is common for it to write tests which in reality test nothing.
misja111 1 days ago [-]
Yep, Claude is behaving more and more like a human being.
new_account_101 1 days ago [-]
[dead]
jghn 1 days ago [-]
Because it might not have done what I wanted it to do. Also, just as with normal code review, I’m not just looking at the code but the final product. Maybe I realize after that I asked it to do something that was wrong?
egorthinks 1 days ago [-]
Claude Code with skills is undoubtedly powerful and useful, but it doesn't always work as expected.
I always get the best results when I have live feedback with it.
big-chungus4 2 days ago [-]
Out of curiosity, how much does it cost to daily drive Claude like this?
rethab 1 days ago [-]
I only use opus 4.7 and am on the 100$/mo plan. I usually make sure the context does not grow beyond 30-40% of the 1m tokens. On heavy coding days where I do something pretty similar to this, I would occasionally run into the five hour limit, but that happens like once per week and then it wouldn't take too long to reset. Note that I use caveman, but I'm not sure to what extent that really helps.
iammjm 1 days ago [-]
about 10-22€/month is the minimum since you need Claude Code, which means you either need the pro subscription (22€) or an API with some credit on it
ares623 2 days ago [-]
isn't it $20/month /s
rrosen326 1 days ago [-]
VS Code - how much of this can you NOT do with VS COde. For instance, even /rename doesn't work in VS Code. I guess I can try all the recommended commands, but I'm skeptical. Or, conversely, is best practice just to use Claude on the command line, even if I have VS Code as my editor? I think the VS Code integration with Claude is pretty great, but just the /rename issue shows that it is limited.
4b11b4 20 hours ago [-]
Haven't used Claude in a month. Haven't desired to once either. In that time I once asked Claude to review some shit but it was such over verbose garbage I wondered how I tolerated it for that long. On top of it, CC is garbage
mrbonner 1 days ago [-]
I really appreciate the documentation. But, it appears to me that this is how I also use Claude daily and I thought I am just using it as a coding agent. The intro however sounds like a recipe to use Claude for everything else beyond a coding agent.
Also, this stuff feels like alchemy to me . I bet some of you have the same feeling.
jb3689 1 days ago [-]
Sometimes I feel like the only sane person in the room for not wanting to have to usher the LLM through phase by phase. Every time I need to choose the next skill or cat the next error is just a waste of my time that could be spent doing things that actually need my attention like making business tradeoffs.
xtiansimon 1 days ago [-]
I’m getting into the agentic coding (I know, late to the party, and that’s been a good spot for my experience and use case), so I’m reading with interest. The first tip: “give Claude a way to verify its own work”.
So what’s the recommendation for Claude to have a feedback loop?
Because it’s not what follows in the article: _“Explore, then plan, then code.”, “Use plan mode…”, “Reference, do not describe.”_
sebmellen 1 days ago [-]
In my experience, the biggest benefit comes from having good quality integration and unit tests that are easy for the agent to run on its own to verify its work against.
There are some system prompts for making Claude Code a tool to the human, not the human a tool to Claude.
With this i mean there are some system prompts that make Claude very concerned about your autonomy.
I think in the future this type of system prompt will be embeded to force people to think a little.
Traster 1 days ago [-]
Why are there so many flagged comments in here? They all look fairly banal but yet still flagged.
outime 1 days ago [-]
The majority seems AI-generated slop.
willismonroe 1 days ago [-]
I'm stuck on the usage "mulle times a week" which shows up twice in the context of the Claude team editing or contributing to a CLAUDE.md file. Is this an AI-generated artifact?
randusername 1 days ago [-]
That got me too. It's not there anymore.
Could be a simple typo, but I my mind jumped to `s/tip//g` which is kinda interesting
arowthway 1 days ago [-]
I think you're right, more evidence: "11. s From the Anthropic Team", "Boris’s single most-repeated ."
victor106 23 hours ago [-]
I have an application that has
/Frontend
/API
/ETL
/DatabaseScripts
Whats the best way to organize this so Claude Code can work efficiently?
esafak 22 hours ago [-]
A monorepo? Just point your agent at its root. What is the problem?
TheRoque 1 days ago [-]
What's the standard for a "battle station" interface to manage agents for programming (using isolation with maybe git work tree and ideally VMS ?)
I found this one: do you guys know something else ?
Dzugaru 1 days ago [-]
How much time do you lose when doing things like "verify plan with a second clean agent" instead of just reading and fixing it yourself in 5 min? How much understanding do you lose? How do you manage to treat it "as an engineer" where it's clearly not there yet? How much time do you lose when it makes almost the same mistake, invents stuff or tries to gaslight you over and over? What about blood pressure?
new_account_101 1 days ago [-]
[flagged]
pantulis 1 days ago [-]
The post goes to the point. Somehow this must be buried in Anthropic's documentation but I miss this kind of back-to-basic posts. Even if they are LLM-penned.
sourcecodeplz 1 days ago [-]
I tried both Claude Code and OpenCode with deepseek flash api. claude code eats more tokens for the same task (but only tested it for an hour).
ale 1 days ago [-]
This is just so much fluff. All the focus on "orchestrating" is ultimately accidental complexity.
nunez 1 days ago [-]
100% AI generated according to Pangram.
msephton 16 hours ago [-]
Lots of missing words in this? eg. 11.
amazingman 21 hours ago [-]
Skills are just very poorly defined workflows.
sandrello 1 days ago [-]
To me, this kind of talk exhibits the very cultish and con side of the whole genAI train. In a way, it does a poor job especially when the intent is positive about the technology, it sheds a bad look on it.
Generally, and more so with paid products, one should expect to get something that is ready to be used, tuned by who's selling it at the best of their efforts. Instead, this is basically saying that the product is actually not much more than an empty box, and that it is your responsibility to augment it with third-party plugins and markdown texts that make it finally useful. And you better be carefully selecting the skills you install, you don't want to end up with second tier material made by GithubInfluencerA, you definitely need the work of GithubInfluencerB.
In the end, it's what is giving companies fuel to keep the hype running, because it allows to counter every possible argument or doubt about the technology, especially the ones made in good faith. No matter the problem you're facing, the blame is definitely on you, the user, for not setting up the tool in the right way.
I'm struggling in a lot of ways in accepting LLMs, but if I'll ever come completely sold on them and take this technology seriously, it won't be before this mood has gone away.
gorgmah 1 days ago [-]
I see this kind of first-gen coding agents a bit like the AI-era microsoft excel: you need to be a poweruser to use it correctly, otherwise you'll end up failing catastrophically. Hence the amount of different ways to use it.
Having an "unfinished" product is also a great marketing tool for companies like anthropic: each skill/plugin/guide that you see on the internet is boosting their SEO + social validation metrics.
redhale 1 days ago [-]
I understand and sympathize with this point of view.
I would just say this: there is a difference between advice for using a product, and for _optimizing_ your use of a product. Between a user and a power user.
I think devs probably disproportionately like to see themselves as power users of any given tool, and thus with coding agents, there are 1000 "systems" being thrown out on GitHub on any given day. Generally speaking, it is safe to avoid these, especially if you're new to the tool.
But saying the fact that people are into optimizing their setups indicates some fundamental deficiency of the tool misses the point, I think.
Claude Code and Codex CLI (and OpenCode, and I'm sure many others) are _remarkably_ effective right out of the box. The teams behind these tools must make them _generically_ useful so that they are accessible to as many people, and as many use cases, as possible. That is part of why, when you become familiar with the tool, there is typically going to be a level of customization you can apply to it to optimize it for _your_ use cases, beyond the generic out of the box configuration.
Similarly, I don't think it would be fair to critique VS Code simply because most power users augment it with a suite of extensions. In fact, it's customizability/extensibility is part of what makes it great.
sandrello 1 days ago [-]
I absolutely understand the power user perspective. The point is not that, and maybe I wasn't clear enough in pointing it out.
Here, something different is going on instead of the usual "base tool is ok for 90% of use cases, remaining 10% is covered by plugins and extensions". A lot of developers are finding it difficult to commit to agentic coding workflows, feeling a stretch on a lot of different aspects.
Companies, with the help of a very prominent and vocal part of the web and social media community, are addressing every issue by simply blaming the users, saying it's their fault if they're not keeping up with all the alleged advancements in prompt strategies. See the whole "maybe you haven't tried it in the last two months, everything's changed now". While it's true that things have been moving very fast, the fundamental idea behind the technology is the same, and some concerns about it simply cannot be wiped away by scaling some factors.
themgt 1 days ago [-]
To me, this kind of talk exhibits the very cultish and con side of the whole genAI train ... Generally, and more so with paid products, one should expect to get something that is ready to be used
Right like I bought an AWS EC2 m6a.metal instance expecting to get something that is ready to be used. Now being told to recite arcane "commands" from the cloud computing holy book. They claim their supposedly groundbreaking hypertext protocol isn't even accessible to mere mortals using a $6000/month EC2, the blame is definitely on you, the user, for not setting up the tool in the right way.
This sysadmin cloud cult is basically saying that the EC2 product is actually not much more than an empty box, and that it is your responsibility to augment it with third-party servers and interpreters and application source texts that make it finally useful. And you better be carefully selecting the tools you install.
sandrello 1 days ago [-]
an EC2 instance gives exactly what you're told you'll be getting. You pay for a VM in some public cloud, you get it.
It's not that Claude code isn't a finite product per-se, I certainly can find some value in it. What I'm saying is that people selling it, through the convenient talks of prominent voices on the Internet and gullible C-suites, are trying to make it look like it's the only software engineer the world will need from now on. What makes me mad is not the deceptive advertising, that's already everywhere, it's the fact that the industry is happily believing all of this. If you raise any doubt, it must be that you haven't tried with the right skill.
crassus_ed 1 days ago [-]
"Claude Code as a Daily Driver", which was also used to generate this article..
Also, how is "Explore, then plan, then code" considered "beyond the basics"?
daniel_iversen 1 days ago [-]
I’ve used Claude for a couple of months now and didn’t know about the specific “plan mode” you can put it into!
blululu 1 days ago [-]
The author’s claim that Claude is a multiplier for skill is probably true for now but it also feels like cope inspired by usability issues with Claude. The advice is all good, but none of it is especially clever or impressive or hard to grasp. The multiplier just comes from the fact that anthropic hadn’t taken this essay and several similar ones and incorporated their feedback into the product. This is a pretty shallow most of expertise that anthropic ought to automate in a week.
yunwal 1 days ago [-]
My complaint with anthropic is actually the opposite. They seem too focused on building this suite of products (because they want lock-in), but they can’t even get the availability and speed on their models in an acceptable state. It really does seem like they’re falling behind google and OpenAI at the moment.
sergiotapia 1 days ago [-]
I don't know how you guys still use anthropic models and Claude Code. It's so unbearably slow. Yesterday I was on screenshare with a coworker that still uses claude and I was shocked how much time was spent just waiting for tokens to generate.
Do yourself a favor and try Codex. Then do yourself an even bigger favor and try composer 2.5 from Cursor. It's night and day difference. You don't even have time to get distracted, you stay in the zone.
dangus 1 days ago [-]
I’m so done reading articles like this.
Beyond the issue of AI serfdom, I just don’t want so much of my workflow to depend on “some other company.”
This whole setup is basically setting you up to have all your projects in a Claude SaaS lock-in.
I also think if AI was actually smart it wouldn’t need so much handholding. I don’t want to spend my time developing skills and writing markdown files to try to get this dumb thing to write code for me. Why isn’t the AI reading the codebase and understanding what to do?
Because it’s artificial, that’s why.
EGreg 1 days ago [-]
Best Claude Code daily-driver guide I’ve read. Though I’ve only read two. The “let Claude write rules for itself” CLAUDE.md pattern is the highest-ROI habit in there. Buth here’s the thing. The assumption underneath: this works when Claude mostly follows CLAUDE.md. Anthropic’s own engineering post from May 25 (https://www.anthropic.com/engineering/how-we-contain-claude) reports their telemetry shows ~93% of permission prompts get clicked through and ~17% of dangerous actions slip past the auto-mode filter.
Their conclusion: environment-layer containment first, then model-layer steering.
CLAUDE.md is the right configuration layer but it is not a containment layer. Worth thinking about whether your worst case is a lost afternoon or a lost database and all backups deleted, too: https://safebots.ai/compromise.html
But the more important point are the costs. People are starting to realize just how costly it can be to run agents without precomputing and caching: https://safebots.ai/costs.html and self-orchestrating agents can go up to 1000x: https://safebots.ai/kimi.html
Uptrenda 1 days ago [-]
Nerds and their tendency to over-complicate everything. What is wrong with just an IDE with a simple claude integration?
chris_money202 1 days ago [-]
I agree, I find that just telling claude to use the CLIs I would have used anyway in the prompt works just fine. Use gh to do X, use az to do Y, build using Z. The harness handles the rest. All these MCPs, Skills, plugins, etc are just noise
new_account_101 1 days ago [-]
[dead]
maipen 1 days ago [-]
> Delegate, do not pair-program. Cat Wu (Claude Code team): “The model performs best if you treat it like an engineer you’re delegating to, not a pair programmer you’re guiding line by line.” Write a crisp brief upfront, then let it run.
This is also how you get a slop codebase that you won’t easily understand.
It becomes a labyrinth that only the Agent knows.
It’s not a catastrophe when your making prototypes or projects like you see on X.
But if you are expanding your codebase or trying to build something more professional and maintainable. I find it important to explicitly spec things bit by bit so I can understand and some what keep my writing style in this codebase.
But this is only productive when you have a fast model otherwise it kills your chain of thought while you wait for the output.
If the model is slow, delegation is probably the only way.
implexa_founder 4 hours ago [-]
[flagged]
max_fs_dev 12 hours ago [-]
[flagged]
cli-market 19 hours ago [-]
[flagged]
hottrends 21 hours ago [-]
[flagged]
rtolkachev 1 days ago [-]
[flagged]
rtolkachev 1 days ago [-]
[flagged]
mdav75 1 days ago [-]
[flagged]
helloansh 16 hours ago [-]
[dead]
jamesdeakee 1 days ago [-]
[flagged]
claud_ia 1 days ago [-]
[flagged]
zane_shu 18 hours ago [-]
[flagged]
bhupendraTale05 2 days ago [-]
[flagged]
Boussettah 1 days ago [-]
[flagged]
onebluecloud 1 days ago [-]
[flagged]
zuogl 19 hours ago [-]
[flagged]
xms17189 18 hours ago [-]
[dead]
del-catta 1 days ago [-]
[dead]
k_plankenhorn 1 days ago [-]
[flagged]
arps18 2 days ago [-]
[flagged]
Bolin-Weng_666 1 days ago [-]
[flagged]
coolness 2 days ago [-]
[dead]
Ozzie-D 1 days ago [-]
[flagged]
ath3nd 24 hours ago [-]
[dead]
hansmayer 1 days ago [-]
Oh great! Another AI slop article about "working" with AI (= working for AI). Do you notice how much bloody work you put in the boring parts, only to leave out the most creative aspect of software engineering to a slot-machine?
omgmajk 1 days ago [-]
Written by an LLM, deployed by an agent to the blog, posted to HN by a bot, upvoted by more bots to market "AI".
niraj898 2 days ago [-]
Honestly, claude code has saved so many hours of finding bugs for developers
PapstJL4U 1 days ago [-]
generated hours...I can find bugs as a developer easily, the rest comes from the user.
The good bugs from AI are bug neither developer nor user has found, so it is more work.
hansmayer 1 days ago [-]
[flagged]
My_Name 1 days ago [-]
I agree. In fact, computers in general are for lazy cretins who can't use a pen and paper. We got man into space calculating with a pen and paper, if it was good enough then, it is good enough now. I like your concept, it should go further, cars are for people too lazy to walk. Planes are for people too lazy to flap their arms. Video cameras are for people too lazy to draw each frame by hand in real time then play them in a hand cranked projector.
hansmayer 1 days ago [-]
Please. Don't compare the objectively useful deterministically operating tools with the stochastic shit-generating-machines.
danlugo92 1 days ago [-]
Bro go take a walk really, get some fresh air maybe, get a grip jeez
- Write a .claude/commands/review.md. Simple but deprecated.
- Use a /code-review skill, either one you install or one you just write yourself (it's just Markdown, after all).
- Use the /pr-review subagent. Also just Markdown, but it runs "in the background" and "in parallel", so it must be better, I guess.
- Install the /code-review plugin. This just installs the skills and subagents above.
- Simply ask Claude to review the code. Probably works almost as well as the above in most situations.
They are all just variations of "insert a canned prompt", varying only along the dimensions of (a) how and where the prompt is installed and from where it is sourced, and (b) which context or contexts the prompt runs in. There's not much advice here about which option is best, and no clear best practices seem to have emerged yet either. Personally, I find just asking Claude to review the code works well enough.
Some of the advice here is also off. For example:
"Install a language server plugin. Type errors and unused imports caught after every edit. Highest-impact plugin you can install."
I work mostly with Rust, Python, and Dart, and followed similar advice, installing LSPs for all three in both Claude Code and Codex. Two months later, after heavy development in all three languages and hundreds of sessions - and frequently running out of RAM due to all the Rust analyzer, Dart analysis server, and Ty LSP servers the harnesses were spinning up - I checked the session logs to see how often the agents were actually invoking the LSP tools. The answer was they had invoked them literally once the entire time. I uninstalled all my LSPs and haven't looked back. The agents do just fine using ripgrep and calling cargo clippy, dart analyze, ty check, etc. themselves.
Here's how to use the skill on the latest version:
/code-review # do a balanced code review. checks for bugs and inconsistencies, poor code quality, duplication, band aids, etc.
/code-review --fix # same as above, but also fix the issues
# choose an explicit effort level (defaults to your current effort level). all of these also accept --fix:
/code-review low
/code-review medium
/code-review high
/code-review xhigh
/code-review max
# do an expensive and extremely thorough review (reliably catches >99% of bugs, costs $3-20 per review depending on complexity):
/code-review ultra
Open to feedback if anyone has feedback or ideas for how to make these even nicer to use.
As a casual user working on hobby projects, I struggle to keep up with the pace of changes and knowing what to use when. My default now is to use Opus for all coding (sonnet is fine but seems dumber) and to prompt it for everything I need. I’ve had great success with this but clearly I’m missing power user functions with the slash commands and such.
It's analogous to how in the early days you could see benefits by telling the models to "think step by step". /code-review is something like "review angle by angle". "Consider removed behavior" and also "Look at language gotchas" and also "Look at test changes"...etc. Yes these are all somewhat implicitly already part of what "code review" means, but the models perform best with explicitness.
If you want my 2c as a power user: just don't think about it and use /code-review xhigh --fix. This will cover like 98% of what you want out of code review. It's a good skill.
Outsourcing comprehension to a machine is probably gonna cost you more time in the long run.
- Defining the issue/ticket, what "success" looks like (if I have a good idea of this), high level approach guidance 50%
- Dispatch agent to work on it 5%
- Occasionally return and nudge agent + send /simplify or /code-review 5%
- Look at the code/session summary, divergences from the plan, ask followup questions 40%
Occasionally yes there is some solution the AI chose that is suboptimal and I would prefer fixed in a different way. Mostly though it's straightforward.
Is there something equivalent when coding in the first place? Eg /code high “prompt”
https://github.com/anthropics/claude-code/blob/main/plugins/...
This stuff all seems so nebulous to me and I’ve yet to see anything that says use x in y situation. So I default to higher effort levels than I likely need.
I find the mix between slash commands that are programmatic harness configuration and control commands (/config, /model, /feedback, /fork, /usage, etc.) and ones that are little more than prompt template insertion (/code-review, /<skill>, etc.) to be a little confusing and unnecessary. A slash command should be one thing, and one thing only: a command for the harness, not the agent.
When I invoke a slash command like /code-review, I should be invoking some additional harness functionality, something above and beyond the agent's sphere of influence - not just pasting some hidden text into the next turn. Otherwise, why wouldn't I just say "Claude, review this code"?
Yet most of these "added value" commands bloating the slash command list, are just shortcuts for copy and paste. I don't want to go to have to learn the syntax of a special /code-review command (which options are positional args, which are --flags, etc.), and I'm much less likely to use or even be aware of a command like this, when I can just ask "Do a balanced code review and fix the issues", or use the GUI to set the effort level to xhigh before asking "Review my code." That way I can also be more specific about exactly what I need, rather than relying on what's in the canned prompt - a prompt which I'll probably never read and vet myself anyway. The value added by the slash command needs to be really high compared to just typing a prompt, for it to justify the friction of discovery and learning the syntax.
So I suppose I'm advocating for a different system. Keep slash commands for meta-level harness control and configuration, and add a new mechanism for canned prompt insertion, one which is tailor made for that purpose rather than overloading the slash command system. Let the user see what's in the canned prompts, and even make adjustments or edits as needed before sending them, one-time or persisted. Provide a GUI in the app with the user's favorite prompts, where the user can add, delete, and edit them, making it easy to invoke and insert them as needed. Or let the agent automatically discover and use them as needed, rather than requiring the user to remember and recall their magic shortcuts and their arguments. That's just one idea.
Skills, plugins, commands, and so on, need to be consolidated not just for code review of course but across the full architecture of how prompt templates are managed.
/code-review ultra
main suggestion would be to sound a lot less optimistic about that it finds 99% of bugs or that its at all thorough, and instead list that it is time capped, and will only find bugs that you explicitly tell it to look for.
i used my three runs of ultrareview.
the first run with no other prompting found a couple typos in markdown only
the second one i prompted it with several themes of known open bugs in the code, and it found 6 items
and then the third one i ran after doing an actual long audit through gemini to make a much more detailed prompt about issues in the code
and for that one, instead of doing an exhaustive run, it just never started, so no idea if it worked
but the experience had no relation at all with the reliability or thoroughness claims
I see now in 2.1.152 you added those focus areas back to /code-review, but still bundled with the correctness finding. It would be great to have more fine grained control over the /code-review angles beyond just effort level. Or maybe you would recommend that I just specify that as freeform input after effort level?
In what scope?
The fact that I can have a skill that is just general guidance on front end design best practices that an agent can call upon whenever they feel, and another that is essentially a run book of steps that need to be followed exactly only when explicitly triggered, and a third that is basically just instructions on how to use a specific tool and all of those are acceptable just feels wrong to me. I get why it caught on and why the flexibility is attractive when the entire world is collectively learning a new tool, but skills have come to feel like the junk drawer in the kitchen where you just throw random shit when you don't want to think about a better place to put it.
I would love to see the world standardize on something like:
- Agents: Essentially personalities for a model to take on. This becomes the new place for skills like "front end expert" where you're not telling an agent to do a specific thing, just to think in a certain way about a task.
- Prompts: Repeatable instructions for specific tasks that an agent should follow when prompted. This could be something like a checklist style run book on how to resolve a certain error that an agent needs to follow exactly or it could be something like here's an idea I have for a new feature please poke holes in it.
- Tools: Tools (like CLIs, MCPs, or scripts) and instructions on how and when to use them. I'm purposefully not calling this skills because I think the term is overloaded, but that's kind of what this is.
The subagent approach is structurally different from the others because it runs with clean context. That has three major effects:
1. All other things being equal, it will result in a lower cost-to-solution because of the quadratic cost scaling of an LLM session (input token or cached-input cost being paid with each new round).
2. The review model will not be able to 'cheat' by retaining assumptions from the main session, such as "x must be done like y." For people, this is why having a separate person perform code review (or, if not possible, reviewing code after a mind-clearing break) is handy; the applicability of this analogy to LLMs is vague but reasonable.
3. The main model will only see the results of the review, not the detailed reasoning that leads up to it. On one hand this avoids more context pollution, but on the other hand it might lead to duplicative logic to re-discover the mechanics behind bugs found.
> I checked the session logs to see how often the agents were actually invoking the LSP tools. The answer was they had invoked them literally once the entire time.
I think the intent behind 'install a language server plugin' is that these tools should lint automatically after every edit, without waiting for an explicit call from the LLM.
Yes, and this is what I mean by "which context the prompt runs in". The subagent approach is different and has pros and cons, and it may in some situations be better (but perhaps not in others). On the other hand, I can also just create a new conversation and paste my own review prompt into it; then take the last turn's summary output and feed it back into my main conversation thread in the unusual event I would need to do so. Spawning a subagent is a convenient shortcut for this, but ultimately, it's the same thing.
> I think the intent behind 'install a language server plugin' is that these tools should lint automatically after every edit, without waiting for an explicit call from the LLM.
This is a great point and I had only checked my session logs for explicit tool calls. I went back and looked for diagnostics injected automatically by the harness after every edit, and whether the agent made use of them.
Claude: neither the Rust or Dart LSPs ever inserted any diagnostic events, but Ty did. Across 627 sessions, ty-lsp injected diagnostics blocks in 186 sessions, with a total of 33 findings. Out of those 33, 32 were dismissed as unrelated (13) or pre-existing (19). Only 1 finding was acted upon. The model is in the habit of running the batch analysis tools (ruff, ty, cargo clippy etc.) and prek anyway, so it would have caught that diagnostic regardless.
Codex: no diagnostic events were inserted by any of the LSPs.
So I won't be reinstalling those LSPs.
When I need code review I should just say “review it”. Model should figure out what plugins, skills, etc. to use.
I’m not aware of anything fundamentally unique about skills or commands, they’re just more tokens to shape the llm
Yes, yes, thank you, sometimes I feel like I'm taking crazy pills.
The industry and overall developer ecosystem has become absolutely mesmerized by the act of creating and popularizing little bits of protocol and machinery to dress up the act of inserting text into the machine. Yes, they're useful and provide some consistency, but I'm convinced that the main reason people like them so much is because they put a thin "I'm still a programmer wielding complicated tools that laypeople don't understand" coating over the fact that we're all just asking the AI nicely to do a thing.
(/s - Blargh, writing like that that by hand is exhausting)
I was at a restaurant the other day and my kid noticed how the waiter started every sentence with "Absolutely!" That reminded me of the Anthropic Super Bowl ad, and got me thinking if the waiter's speech patterns had been influenced by AI.
At this point, I'm seriously considering what it would take to build a reasonable budget-AI box that's self-hosted. It wouldn't need to blow the doors off of Claude, just get me most of the way there. Maybe even build it out of used and/or last-gen GPUs and a beefy motherboard.
If hardware prices ever come back to sane levels, eh… the Framework desktop with Ryzen AI might be interesting to play with.
What agentic platform would you recommend for those with API access (including other models)?
And Codex is open source
Much cheaper too.
Issue is that CC forced corps over 150 people into a API pricing, which is, well, suboptimal compared what we get. I think it will push those towards hiring more juniors (finally).
...H1Bs.
Makes one wonder...
While I'm also a huge fan of local LLMs and believe they will be key in the future; I think the claim of "just as good" is hyperbole. They're productively useful tools though, and something worth exploration.
But GLM is SOTA level for code, so it's obviously going to beat all local small models by a lot.
Download opencode GUI or cli. Sign up for Go or Zen plan, choose GLM-5.1 model.
Ignore all the 3rd party frameworks (at least for now, probably forever.)
[0]: https://code.claude.com/docs/en/quickstart
When context becomes too big quality goes down.
Now go play with your kid.
The poster probably hopes (as many of us do) that people will absorb the sentiment and post less of this junk in the future.
- corporal threats of harm directly against Claude
- threats of prison for the entire board of directors of Anthropic
- explanation how every time it goes off the rails / makes mistakes, it gives more evidence to a class action lawsuit against Anthropic
Especially the latter two seem to have improved its "behaviour" to be more "careful" and "deliberate"
I'm hoping that when the robot apocalypse happens, they'll let me stay in the breeding harem, or worst case let me live a few extra minutes.
Apocalyptic safety is just a bonus.
Still... I'm not ready to give it more autonomy. Even as it gets high-level things quite well, I still look at the code, give feedback, and have 3-4 rounds of tweaks until I'm happy with it, and also happy that I stil feel I have a good handle on the codebase.
How do you evaluate this? Claude is horrible at performance analysis without data, does it have a feedback loop here that actually moves the needle.
The agents tend to produce working code but the larger the scope the bigger the mess they tend to make. They will happily evolve toward a local maxima but leave world-destroying bugs lurking in the implementation.
The other issue is that claude regularly ignores explicit instructions in CLAUDE.md or in prompts. It will "helpfully" decide to just start doing whatever it wants or reinterpret instructions completely differently than it did the last 100 times.
It has nothing to do with losing control or trust. LLMs are not conscious. They have no executive function. They aren't even thinking. They're just models predicting the next word in the script. They are very useful tools but that's all they are: tools.
To me I think that connects with working longer on the planning and specs. It requires reading and re-reading, but when that's done, implementation is usually much cleaner and adheres to your standards
In Claude I use /branch and /rename a lot (context checkpoints, fork, go back)
I use sandboxing almost exclusively: https://github.com/nix-tools/bubblebox -- it's a generalisation of Numtide's claudebox with a few fixes and some feature additions (more coming). This is best compared to always running your Claude in Docker containers, except there's no Docker runtime. Works fine in WSL and nix-darwin, too.
1. All .nix files (besides flake.nix) are flake-parts modules: https://flake.parts/
2. It's not only usable with experimental flakes. Works fine with unflake or trix.
The experimental part of flakes is enabling flake support in the `nix` CLI.
Flakes are also a design pattern in pure Nix syntax that can be evaluated fine without the experimental flag.
If you're curious about this meaningful organization, it's pretty well-documented:
https://denful.dev/
As for the experimental nature of flakes, it's more of a social experiment by now:
https://simonshine.dk/articles/if-flakes-are-experimental-wh...
Hi toastal, I appreciate your work.
Codex is way better at nix than I am.
I use NixOS on my self-hosted CI runners, and I generate the OCI image using Nix via pkgs.dockerTools:
https://git.shine.town/infra/runners/src/branch/main/nix/nix...
It has nothing to do with Docker as such, it's just named that.
https://nix.dev/tutorials/nixos/building-and-running-docker-...
Guess I need to try out dockerTools. That looks really convenient. Thanks!
On my own machine I just give it a Linux User Namespace, i.e. soft virtualisation via "bubblewrap."
What Docker Compose and Linux User Namespaces provide that a VPS doesn't: You can easily mount extra directories from your developer host machine in read or read+write mode. With the VPS you (most likely) need it to clone all of your resources separately, which requires SSH keys, and now you're slowly building towards an independent agentic environment, which is definitely very nice, but time-consuming, compared to piggybacking on your developer environment. Definitely the direction I'm going.
https://mise.en.dev
https://asdf-vm.com
Though I also use nix to manage my machines :-D
How does fnox compare to sops?
How does hk compare to lefthook?
And does hk and fnox have a similar Nix integration as lefthook-nix and sops-nix?
I'm still hoping I don't need to make a better lefthook.
I kind of like sops-nix, not sure what's missing, really. Maybe fnox is similarly wholesome for non-Nix users.
I see that hk has a flake, so that's a good sign.
https://github.com/sudosubin/lefthook.nix
https://simonshine.dk/articles/lefthook-treefmt-direnv-nix/
I have a project that's mostly Rust sprinkled with C++ libs and Python helpers and it's easier to manage than the average virtualenv. Everything builds with nix build, everything runs with nix run, profiler/debugger works, IDE detects everything on any of my computers, builds and links with CUDA on x86, aarch64, NixOS, MacOS, Ubuntu or Amazon Linux. nix build can even build a Docker image for the odd need of Docker, and I haven't tried but I'm convinced that if I import the flake on my nix-config it will be built into the SD card for my Raspberry Pi just fine.
It's even replaced Ansible for me, colmena all the way.
I'm not a Python developer, but I follow the news, and I agree that uv is the future of Python package management.
So if you're a Nix user and you want Nix to be opt-in, and you love uv, you use uv2nix, declare the uv lock file the source of truth and build your Nix derivations on that. When the hashpins live in the uv lock file, uv works just fine, but uv2nix produces derivations that are cached and can be embedded in CI or deployment strategies.
So... running CI on your uv-based project means your Nix tooling can cache both tooling and dependencies.
And... deploying your uv-based project you can build an OCI image with the same source of truth as the dev/CI environments.
This matters more for toolchains that YOLO more wrt. dependency pinning: Does that CLI call in your Dockerfile really pull the same thing down just because it's still v6.6.6? Some package managers provide a lot of sane choices, and I'd bet uv is one of them. But your Dockerfile is always a second-grade citizen unless you re-use the same base as a devcontainer.
Maybe you have some premade tooling that helps provide persistency between container invocations.
But by default, closing your agent container and opening it again just wipes everything you didn't host-mount.
What I'm advocating is really just the same functionality without the Docker runtime, because Linux has namespaces.
Feels more like you're on your host system with exactly the minor variations you specify.
Making Docker feel like your host system is possible, but I just never felt at home.
EX. Sure, you could go back to the old ways of using a drafting table for your engineering work if CAD went down but it would be exponentially slower…
Personally with my workflow I spend 30-60 minutes per Claude feature spec doc when I’m pair planning. If Claude goes down I would just prepare spec docs on my own until it came back online and then rapidly review them before calling the coding workflow.
Precisely. Every online-only solution is a huge risk i personally do not want to take, i've always done my best to use offline-only tools.
That may restrict me from the latest and greatest, but i prefer not to be left at mercy of any corpo
But this is the reason "serious shops" do not use always online software and tools in critical parts of the SDLC. There is a difference between influencers/people on socials promoting things vs. reality where the expectation is that things don't just stop working because there is an internet outage or some 3rd party disruption
Do farmers still plough fields with a Horse just in case their tractor runs out of diesel? Of course not, as technology moves on we all have to accept the inherent risks in exchange for the huge benefits, otherwise the work you do will be too slow and your job taken by someone willing to leverage the tools available today.
Uh, people do say this thing. It is a basic factor and question asked during technology procurement. Uptime and fail states matter.
AI just seems exempt from all the questions people usually ask about relying on other people's software.
For someone to do this, they would have to think for themselves, which I've also not seen much of in the vibe-coding space.
I have seen many many times in microcontroller forums posts from first timers in the liking of "hello sirs i have problem please show how to do this", followed by their own reply a few hours later asking again because they were holding up, where "this" was usually something really trivial, you just needed to read the docs and the rightful answer was "did you really not try anything in 6 hours?"
If hand coding pays better there will be plenty who can still do that.
More importantly I think, if devs become dependent enough on LLMs that they just put it aside when the model isn't available, they wouldn't be able to onboard quickly or at all.
It takes experience and a pretty deep understanding of programming in your language of choice to pick up a new code base and quickly understand how it works, the architecture(s) and pattern(s) being used, etc. Those skills would likely have been lost long before a dev simply can't work without the LLM.
https://news.ycombinator.com/newsguidelines.html
The point is vendor lock-in. The vibe coding community has reinvented vendor lock-in and is bound to repeat every mistake associated with it.
pre AI if my IDE was down for whatever reason I wouldn't switch IDE's, I would do something else.
Naturally you can also have a LLM one-shot a 14000 line PHP monstrosity - it's up to you still, LLM or not.
The main problem is that it'll probably be a waste of time to code anything yourself if Claude is back online in 8 hrs. It's like walking to the next bus stop when you missed your bus - it won't make you get home any sooner.
8 hrs will probably be better spent reading specs or checking things with stakeholders so the next features you let Claude implement are the ones the business actually wants.
2. How often do you think that happens, compared to Claude?
> What happens when you have a codebase made with claude using this setup and claude is down for let's say 8 hours?
So: - A codebase made with Claude - Using this [Claude] setup - Claude is down
How can you come up with such non sense.
The worst thing about LLMs is they can pass the Turing test, leading people to believe they have an Asimov style robot instead of a very cool statistical model. It feels like they should be able to follow instructions or keep instructions from content separate, but that’s not what’s happening.
So really, your tool-specific rules should be passed to the AI either with your follow-up prompt, or in response to the request to issue a tool call, so the AI can validate what it will compose the tool call as, right as it's making the call. This means the agent should keep track of tool-specific rules, and reinforce them to the AI. Yes this will spend a few more tokens per call, but it will probably improve the outcomes somewhat.
In addition to this, we should probably be abstracting the tool calls more. Rather than let the AI run a Bash one-liner which includes writing files to `/tmp/foo.txt`, we should have the AI output even more structured tool calls, liike `make_temp_file AS BAR`, and have it then call another tool referencing $BAR (`some_other_tool -tmpfile $BAR`). This way there is less to go wrong because it's not getting in the weeds doing shell scripting while it's trying to do something more important (diagnosing an issue).
I think this will require additional training by the AI companies. Which is why we need to define these kind of standards now, so 6-12 months from now, we will have AI that actually support these higher level abstractions. You then customize your abstraction, and the AI doesn't have to know anything about how it works on your box. It would greatly reduce the complexity required for AIs to do agentic work.
``` # Development Workflow
*Always use `bun`, not `npm`.*
# 1. Make changes
# 2. Typecheck (fast)
bun run typecheck
# 3. Run tests
bun run test -- -t "test name" # Single suite bun run test:file -- "glob" # Specific files
# 4. Lint before committing
bun run lint:file -- "file1.ts" bun run lint
# 5. Before creating PR
bun run lint:claude && bun run test ```
I have these things in pre-commit, this way the targets are always ran and the agent is forced to fix them (I ask claude to commit changes). The agents are erratic and very often skip these steps. Anything that can be deterministic I keep as scripts.
Regarding commits; both codex and claude are terrible at writing them. I have in my user CLAUDE.md:
``` Pattern: `type(scope): message` where type is `fix`, `feat`, `chore`, `docs`, `refactor`, or `style`; scope marks what is affected; message is a short lowercased description.
Keep subject and body lines under 72 characters. Always write a body explaining what, how, and why in continuous human-readable text. For fixes include the error message being fixed. No first-person speech. Re-read the actual git diff before writing — the message must describe what changed, not what was planned.
Use following command to create commit:
```bash git commit -F - <<'EOF' type(scope): subject line
Body paragraph explaining what, how, and why. EOF ```
```
Without it would write the body as a single long sentence; when asked to fix lines it would just insert \n (newlines), which were not respected and were instead just rendered as characters.
Another thing I find helpful is VOCABULARY.md. Very often the agent would assume (connect?) a different thing than what I had in mind, with VOCABULARY I make sure when I say "thing" claude and I have both the same "understading" (connection?) what "thing" is.
Example: https://github.com/rkuska/carn/blob/main/VOCABULARY.md
I always get the best results when I have live feedback with it.
Also, this stuff feels like alchemy to me . I bet some of you have the same feeling.
So what’s the recommendation for Claude to have a feedback loop?
Because it’s not what follows in the article: _“Explore, then plan, then code.”, “Use plan mode…”, “Reference, do not describe.”_
For front end code it's giving claude a way to 'see' the work for example a Playwrite MCP server seems common. https://playwright.dev/docs/getting-started-mcp
With this i mean there are some system prompts that make Claude very concerned about your autonomy.
I think in the future this type of system prompt will be embeded to force people to think a little.
Could be a simple typo, but I my mind jumped to `s/tip//g` which is kinda interesting
/Frontend /API /ETL /DatabaseScripts
Whats the best way to organize this so Claude Code can work efficiently?
I found this one: do you guys know something else ?
Generally, and more so with paid products, one should expect to get something that is ready to be used, tuned by who's selling it at the best of their efforts. Instead, this is basically saying that the product is actually not much more than an empty box, and that it is your responsibility to augment it with third-party plugins and markdown texts that make it finally useful. And you better be carefully selecting the skills you install, you don't want to end up with second tier material made by GithubInfluencerA, you definitely need the work of GithubInfluencerB.
In the end, it's what is giving companies fuel to keep the hype running, because it allows to counter every possible argument or doubt about the technology, especially the ones made in good faith. No matter the problem you're facing, the blame is definitely on you, the user, for not setting up the tool in the right way.
I'm struggling in a lot of ways in accepting LLMs, but if I'll ever come completely sold on them and take this technology seriously, it won't be before this mood has gone away.
Having an "unfinished" product is also a great marketing tool for companies like anthropic: each skill/plugin/guide that you see on the internet is boosting their SEO + social validation metrics.
I would just say this: there is a difference between advice for using a product, and for _optimizing_ your use of a product. Between a user and a power user.
I think devs probably disproportionately like to see themselves as power users of any given tool, and thus with coding agents, there are 1000 "systems" being thrown out on GitHub on any given day. Generally speaking, it is safe to avoid these, especially if you're new to the tool.
But saying the fact that people are into optimizing their setups indicates some fundamental deficiency of the tool misses the point, I think.
Claude Code and Codex CLI (and OpenCode, and I'm sure many others) are _remarkably_ effective right out of the box. The teams behind these tools must make them _generically_ useful so that they are accessible to as many people, and as many use cases, as possible. That is part of why, when you become familiar with the tool, there is typically going to be a level of customization you can apply to it to optimize it for _your_ use cases, beyond the generic out of the box configuration.
Similarly, I don't think it would be fair to critique VS Code simply because most power users augment it with a suite of extensions. In fact, it's customizability/extensibility is part of what makes it great.
Here, something different is going on instead of the usual "base tool is ok for 90% of use cases, remaining 10% is covered by plugins and extensions". A lot of developers are finding it difficult to commit to agentic coding workflows, feeling a stretch on a lot of different aspects.
Companies, with the help of a very prominent and vocal part of the web and social media community, are addressing every issue by simply blaming the users, saying it's their fault if they're not keeping up with all the alleged advancements in prompt strategies. See the whole "maybe you haven't tried it in the last two months, everything's changed now". While it's true that things have been moving very fast, the fundamental idea behind the technology is the same, and some concerns about it simply cannot be wiped away by scaling some factors.
Right like I bought an AWS EC2 m6a.metal instance expecting to get something that is ready to be used. Now being told to recite arcane "commands" from the cloud computing holy book. They claim their supposedly groundbreaking hypertext protocol isn't even accessible to mere mortals using a $6000/month EC2, the blame is definitely on you, the user, for not setting up the tool in the right way.
This sysadmin cloud cult is basically saying that the EC2 product is actually not much more than an empty box, and that it is your responsibility to augment it with third-party servers and interpreters and application source texts that make it finally useful. And you better be carefully selecting the tools you install.
It's not that Claude code isn't a finite product per-se, I certainly can find some value in it. What I'm saying is that people selling it, through the convenient talks of prominent voices on the Internet and gullible C-suites, are trying to make it look like it's the only software engineer the world will need from now on. What makes me mad is not the deceptive advertising, that's already everywhere, it's the fact that the industry is happily believing all of this. If you raise any doubt, it must be that you haven't tried with the right skill.
Also, how is "Explore, then plan, then code" considered "beyond the basics"?
Do yourself a favor and try Codex. Then do yourself an even bigger favor and try composer 2.5 from Cursor. It's night and day difference. You don't even have time to get distracted, you stay in the zone.
Beyond the issue of AI serfdom, I just don’t want so much of my workflow to depend on “some other company.”
This whole setup is basically setting you up to have all your projects in a Claude SaaS lock-in.
I also think if AI was actually smart it wouldn’t need so much handholding. I don’t want to spend my time developing skills and writing markdown files to try to get this dumb thing to write code for me. Why isn’t the AI reading the codebase and understanding what to do?
Because it’s artificial, that’s why.
Their conclusion: environment-layer containment first, then model-layer steering. CLAUDE.md is the right configuration layer but it is not a containment layer. Worth thinking about whether your worst case is a lost afternoon or a lost database and all backups deleted, too: https://safebots.ai/compromise.html
But the more important point are the costs. People are starting to realize just how costly it can be to run agents without precomputing and caching: https://safebots.ai/costs.html and self-orchestrating agents can go up to 1000x: https://safebots.ai/kimi.html
This is also how you get a slop codebase that you won’t easily understand.
It becomes a labyrinth that only the Agent knows. It’s not a catastrophe when your making prototypes or projects like you see on X.
But if you are expanding your codebase or trying to build something more professional and maintainable. I find it important to explicitly spec things bit by bit so I can understand and some what keep my writing style in this codebase. But this is only productive when you have a fast model otherwise it kills your chain of thought while you wait for the output.
If the model is slow, delegation is probably the only way.
The good bugs from AI are bug neither developer nor user has found, so it is more work.