“The difference between a mediocre prompt and a good one can mean going from 60% accuracy to 95% accuracy on the same task.”
I've built and deployed over 50 AI agents across different industries—marketing automation, document processing, voice scheduling, SEO audits, customer support. The pattern is always the same: teams get excited about the model's capabilities, ship something that works in a demo, then hit a wall in production.
The wall isn't about the model. Claude, GPT, or whatever you're using is powerful enough. The wall is always about prompts. And not because people are bad at writing them—but because they're trying to solve production problems with techniques designed for one-off queries.
Here's what actually works when you're building AI agents that run thousands of times a day.
The Real Problem With Prompt Engineering for Agents
Most guides on prompt engineering treat it like a writing exercise. "Be clear," "show examples," "specify the format." That's baseline. But when you're building an agent that needs to handle edge cases, recover from failures, and maintain consistency across 10,000 daily executions, something shifts.
The discipline has split cleanly in two: casual prompting (which anyone can do—the models got better at reading intent) and production context engineering (which is a genuine engineering skill).
I'm talking about the second one.
The agents that work aren't the ones with the most elaborate prompts. They're the ones with the most constrained prompts. Every instruction matters. Every word either reduces ambiguity or it doesn't.
Pattern 1: Structure Beats Verbosity
I used to write long prompts. I thought more detail meant better results. I was wrong.
Research suggests that LLM reasoning performance starts degrading once instructions run past roughly 3,000 tokens, well below the technical maximums we all get excited about. In my experience, the practical sweet spot for most task prompts is 150–300 words.
But here's the thing: you don't get better results by cutting words randomly. You get better results by organizing what you keep.
In practice, Claude tends to behave best when you give it a clear structure. If you write the prompt like a contract, it usually sticks to it. That's the insight that changed everything for me.
The template I use across all agents looks like this:
You are: [role - one line]
Goal: [what success looks like]
Constraints:
- [constraint 1]
- [constraint 2]
- [constraint 3]
If unsure: Say so explicitly and ask 1 clarifying question.
That's it. No fluff. Every section serves a function. I've tested variations—more examples, more detail, longer explanations—and this minimal structure consistently outperforms them.
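As a sketch of how I keep this template consistent across agents, here's a small helper that assembles it from parts. `build_prompt` is a hypothetical function of my own, not part of any library:

```python
def build_prompt(role: str, goal: str, constraints: list[str]) -> str:
    """Assemble the minimal role/goal/constraints template as one string."""
    lines = [
        f"You are: {role}",
        f"Goal: {goal}",
        "Constraints:",
        # Each constraint becomes one hyphen bullet, matching the template.
        *[f"- {c}" for c in constraints],
        "If unsure: Say so explicitly and ask 1 clarifying question.",
    ]
    return "\n".join(lines)

prompt = build_prompt(
    role="a document classifier",
    goal="assign each document to exactly one category",
    constraints=["Never guess", "Pick the single best match"],
)
```

Generating the prompt from data like this also makes it trivial to diff and version control, which pays off later.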
For agents specifically, I add one more section:
Tool use guidelines:
- Call tools before making assumptions
- If a tool fails, escalate rather than guess
- Always include reasoning in tool calls
This prevents the hallucination problem where agents make up data instead of using their tools.
Pattern 2: XML Tags for Tool-Using Agents
When you're building agents that need to use tools, XML tags are genuinely the best structuring method for Claude. Not Markdown, not numbered lists—XML tags. Wrap your few-shot examples in <example> tags. It makes a measurable difference.
Here's a real example from a document processing agent:
<instructions>
You are a document classifier. Your job is to read a document and assign it to exactly one category.
Constraints:
- Never guess. If you're under 70% confident, respond with ESCALATE.
- Categories are mutually exclusive. Pick the single best match.
- Explain your reasoning in 1-2 sentences.
Output format:
{
  "category": "string",
  "confidence": 0.0-1.0,
  "reasoning": "string"
}
</instructions>
<context>
Valid categories: Invoice, Receipt, Contract, Email, Other
</context>
<example>
Document: "Dear John, Please find attached the Q4 invoice for services rendered..."
Response: {
  "category": "Invoice",
  "confidence": 0.95,
  "reasoning": "Contains explicit 'invoice' reference and typical invoice language."
}
</example>
<example>
Document: "Thanks for stopping by! Your total today was $47.23. See you next time!"
Response: {
  "category": "Receipt",
  "confidence": 0.88,
  "reasoning": "Informal tone, transaction total, typical retail receipt phrasing."
}
</example>
The structure here does three things: it separates concerns (instructions vs. context vs. examples), it makes the prompt easy to version control, and it forces you to think about what's actually necessary.
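The prompt is only half the contract; the calling code has to enforce the other half. Here's a minimal sketch of the validation layer I'd put around this classifier. The category list comes from the prompt above; `handle_classification` is a hypothetical helper, and the escalation rule mirrors the "under 70% confident" constraint:

```python
import json

VALID_CATEGORIES = {"Invoice", "Receipt", "Contract", "Email", "Other"}

def handle_classification(raw: str, threshold: float = 0.7) -> str:
    """Validate the model's JSON reply and apply the escalation rule."""
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed output goes to a human, not into a silent retry loop.
        return "ESCALATE"
    if result.get("category") not in VALID_CATEGORIES:
        return "ESCALATE"
    if result.get("confidence", 0.0) < threshold:
        return "ESCALATE"
    return result["category"]
```

Enforcing the confidence threshold in code, not just in the prompt, means a model that ignores the instruction still can't push a low-confidence guess downstream.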
Pattern 3: Explicit Constraints Beat Implicit Ones
This is where a lot of teams fail. They write a prompt that says "be accurate" or "be thorough" and then wonder why the agent produces inconsistent results.
Constraints need to be specific and measurable. Not "write good summaries." Instead: "Summarize in exactly 3 sentences, each under 20 words."
From my deployments:
- For classification tasks: Always include a confidence threshold. "If confidence is below 0.7, respond with ESCALATE instead of guessing."
- For generation tasks: Specify length exactly. Not "brief," but "exactly 2 paragraphs, 150 words total."
- For tool use: Define fallback behavior. "If the API returns an error, escalate to human review. Do not retry automatically."
- For multi-step workflows: State the exact sequence. "First retrieve data, then validate, then format. Do not skip validation."
The agents that work in production are the ones where you could hand the prompt to someone else and they'd execute it identically every time. That's the test.
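Measurable constraints have another benefit: you can check them in code. As a sketch, here's a hypothetical checker for the "exactly 3 sentences, each under 20 words" constraint above; the naive sentence splitting is a simplifying assumption:

```python
def meets_summary_constraint(summary: str, sentences: int = 3, max_words: int = 20) -> bool:
    """Check the measurable constraint: exactly N sentences, each under max_words words."""
    # Crude sentence split on terminal punctuation; fine for a sanity check.
    normalized = summary.replace("!", ".").replace("?", ".")
    parts = [s.strip() for s in normalized.split(".") if s.strip()]
    if len(parts) != sentences:
        return False
    return all(len(p.split()) < max_words for p in parts)
```

"Write good summaries" can't be tested. "Exactly 3 sentences, each under 20 words" can, and that's the whole point.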
Pattern 4: Few-Shot Examples Are Your Leverage Point
Few-shot prompting remains one of the highest-ROI techniques available. But the way most people use it is wrong.
They show one or two examples of the happy path. Then the agent encounters an edge case and falls apart.
In production, your examples need to cover:
- The happy path: Normal, straightforward input
- An edge case: Something slightly unusual but valid
- A boundary case: Something that's almost out of scope but should still be handled
- A failure case: Something that should trigger escalation
Here's what that looks like for a customer support agent deciding whether to refund:
<example>
Input: "I bought this 2 days ago and it's broken."
Decision: APPROVE
Reasoning: Recent purchase, clear defect, low-risk refund.
</example>
<example>
Input: "I bought this 6 months ago and it's worn out. I use it every day."
Decision: PARTIAL
Reasoning: Outside normal window, but evidence of heavy use. Offer store credit instead.
</example>
<example>
Input: "I want a refund because I changed my mind about the color."
Decision: DENY
Reasoning: Item works as described. Offer exchange instead.
</example>
<example>
Input: "I lost my receipt and can't remember when I bought it."
Decision: ESCALATE
Reasoning: Cannot verify purchase window. Escalate to human review.
</example>
Each example teaches the agent not just what to do, but why. When you do this right, the agent generalizes to cases you never explicitly covered.
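I also lint my few-shot sets before shipping them. This is a hypothetical check of my own (the dict shape is an assumption), but the idea transfers: fail fast if the examples don't demonstrate every decision class:

```python
REQUIRED_DECISIONS = {"APPROVE", "PARTIAL", "DENY", "ESCALATE"}

examples = [
    {"input": "Bought 2 days ago, it's broken.", "decision": "APPROVE"},
    {"input": "6 months old, heavy daily use.", "decision": "PARTIAL"},
    {"input": "Changed my mind about the color.", "decision": "DENY"},
    {"input": "Lost my receipt, can't remember the date.", "decision": "ESCALATE"},
]

def coverage_gaps(example_set):
    """Return the decision classes the few-shot set fails to demonstrate."""
    return REQUIRED_DECISIONS - {e["decision"] for e in example_set}

gaps = coverage_gaps(examples)  # empty set means all four classes are covered
```

If you only ever show the happy path, the gap check makes that visible before the agent does.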
Pattern 5: Measure Actual Performance, Not Confidence
This one kills me because it's so simple and almost nobody does it.
You can't improve what you don't measure. And most teams measure the wrong thing—they look at the model's confidence score or whether it "sounds right" when they spot-check it.
Real measurement:
- Define success criteria upfront: "This agent succeeds if it correctly classifies documents 95% of the time and escalates ambiguous cases to humans."
- Build evaluation sets: Take 100-200 real examples from production and manually label them. This is the ground truth.
- Test every prompt change: Run the new prompt against your evaluation set and compare the accuracy to the baseline. If it doesn't improve, don't ship it.
- Track failure modes: When the agent gets it wrong, categorize why. "Misclassified due to ambiguous language," "hallucinated data," "didn't use available tool," etc. This tells you what to fix next.
I've seen teams ship prompts that "feel better" and actually reduce accuracy by 5-10 points. The only way you catch that is by measuring.
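The evaluation loop itself is a few lines of code. Here's a minimal sketch; the `stub_agent` stands in for a real model call, and the dict shapes are my own convention, not any library's:

```python
def evaluate(agent, eval_set):
    """Run the agent over a labeled eval set; return accuracy and failure cases."""
    failures = []
    for example in eval_set:
        predicted = agent(example["input"])
        if predicted != example["label"]:
            failures.append({"input": example["input"],
                             "expected": example["label"],
                             "got": predicted})
    accuracy = 1 - len(failures) / len(eval_set)
    return accuracy, failures

# Deterministic stand-in for a real model call, so the harness is testable.
def stub_agent(text):
    return "Invoice" if "invoice" in text.lower() else "Other"

eval_set = [
    {"input": "Q4 invoice attached", "label": "Invoice"},
    {"input": "Your total today was $47.23", "label": "Receipt"},
]
accuracy, failures = evaluate(stub_agent, eval_set)
```

Keeping the failures, not just the score, is what lets you categorize failure modes and decide what to fix next.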
Pattern 6: Context Management Matters More Than You Think
For agents, you also need to think about memory, retrieved documents, and tool definitions. The prompt is just one part of the system.
Here's what I've learned about managing context:
Keep the system prompt minimal. Your system prompt should be your core instructions only. Not examples, not context, not reference material.
Use user messages for dynamic context. When you call the agent, pass the specific data it needs in the user message, not the system prompt. This lets you change context without redeploying.
Cache repeated context. If you're passing the same reference material to every call (like "here are the company's policies"), use prompt caching. Reusing the processed prefix instead of reprocessing it on every call cuts both cost and latency.
Isolate contexts for different agents. If you have multiple agents working on the same task, give each one its own system prompt. They should be able to fail independently.
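The split between static system prompt and dynamic user context looks like this in practice. This is a generic sketch, assuming an API that accepts a separate system string and a messages list; `build_request` is a hypothetical helper:

```python
def build_request(system_prompt: str, policy_context: str, user_request: str) -> dict:
    """System prompt stays static (and cacheable); dynamic context rides in the user turn."""
    return {
        "system": system_prompt,  # identical on every call, so it can be cached
        "messages": [{
            "role": "user",
            # Per-call data lives here; change it without redeploying the agent.
            "content": f"<context>\n{policy_context}\n</context>\n\n{user_request}",
        }],
    }

req = build_request(
    system_prompt="You are a support agent. Follow company policy. If unsure, escalate.",
    policy_context="Refunds allowed within 30 days of purchase.",
    user_request="Customer wants a refund after 45 days.",
)
```

Because the system string never changes between calls, it's exactly the kind of repeated prefix that prompt caching rewards.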
Pattern 7: The Failure Mode You're Not Thinking About
Most teams worry about the agent getting the answer wrong. That's real. But the failure mode that actually breaks production systems is inconsistency.
An agent that's 80% accurate but consistent is manageable. You can build a human review layer around it. An agent that's 85% accurate but inconsistent will drive your users crazy.
Inconsistency usually comes from one of three places:
- Underspecified output format: The agent sometimes returns JSON, sometimes markdown, sometimes plain text. Fix this by being explicit: "Always respond with valid JSON matching this schema exactly."
- Ambiguous constraints: "Be thorough" means different things on different days. Replace with measurable constraints.
- Context leakage: The agent's behavior changes based on what it saw in previous messages. Prevent this by being explicit about what it should and shouldn't remember.
I track inconsistency by running the same input through the agent 5 times and checking if I get the same output. If I don't, the prompt isn't ready for production.
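That repeated-run check is a one-liner worth automating. A minimal sketch, again with a deterministic stub standing in for the real model call:

```python
def consistency_check(agent, text: str, runs: int = 5) -> set:
    """Run the same input N times; return the set of distinct outputs."""
    return {agent(text) for _ in range(runs)}

def stub_agent(text):
    # Deterministic stand-in; a real agent call goes here.
    return "Invoice"

outputs = consistency_check(stub_agent, "Q4 invoice attached")
```

If `outputs` contains more than one element, the prompt (or the output parsing around it) isn't ready for production.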
Building Your Own Playbook
The techniques that work aren't complicated. They're systematic.
Start with Building AI Agents That Actually Work to understand the architecture. Then apply these prompt engineering patterns.
For deeper context on how prompt engineering fits into your broader AI strategy, read Prompt Engineering Is Reshaping How Enterprises Automate Work.
And if you're deciding between prompt engineering and other approaches, Prompt Engineering vs Fine-Tuning: Strategic Decision Framework for AI Implementation breaks down when each makes sense.
The agents that work at scale aren't the ones with the cleverest prompts. They're the ones where every word serves a purpose, every constraint is measurable, and every edge case is accounted for.
Here's the process I use:
- Write a minimal prompt with clear structure
- Build 20-30 test cases covering happy paths, edge cases, and failures
- Measure accuracy against those test cases
- Iterate on constraints, not prose
- Only add complexity if your evaluation shows it helps
- Version control your prompts like code
The people who master prompting in 2026 aren't necessarily the most technical; they're the ones who understand that clarity beats cleverness, and that the effort you put into crafting the prompt determines the value you get from the answer.
If you're building AI agents, this is the skill that separates demos from production systems. And it's learnable.
Want to talk through your specific use case? Get in touch.
