Sponsored Advertisement
Back to Blog
Originally published on Medium.Read Original Article
2026-05-13
large-language-modelseconomicscloud-computingaiscalability

Part 3: The Scaling Problem — Economics, Model Routing, and Prompt Caching

In Parts 1 and 2 , we built an agent that can think and act. But in a production environment, an agent that “loops” is an agent that “spends...

Part 3: The Scaling Problem — Economics, Model Routing, and Prompt Caching

In Parts 1 and 2, we built an agent that can think and act. But in a production environment, an agent that “loops” is an agent that “spends.” If your security agent checks 1,000 servers and uses a high-reasoning model like Claude 3.5 Sonnet for every single check, your API bill will skyrocket before you’ve even mitigated the first threat.

To move from a cool demo to a viable enterprise tool, you must master the Economics of Agentic AI.

1. Understanding TCO (Total Cost of Ownership)

In traditional software, TCO is dominated by server uptime and engineering hours. In Agentic AI, TCO has a new, volatile variable: The Reasoning Tax.

Every time your agent “loops” to verify an action, you are paying for the LLM to re-read the entire conversation history. As the conversation gets longer, the cost per turn increases quadratically. Scaling requires breaking this correlation.

2. Strategic Solution: Model Tiering & Routing

Not every task requires a “PhD-level” model. Determining if a string contains an IP address is a “Grade School” task; determining if that IP poses a sophisticated security risk is a “PhD” task.

Model Routing is the practice of using a small, inexpensive model (a “Router”) to categorize incoming tasks and send them to the appropriate “Worker” model.

Image

3. Practical Code: Implementing a Semantic Router

We can modify our LangGraph workflow to include a Router. The Router inspects the user’s intent and decides whether to trigger the expensive “Security Specialist” agent or a cheaper “General Info” model.

Keywords like if "security" in user_input are brittle. A Semantic Router uses embeddings to understand the "intent" of a query more robustly without the cost of a full LLM call.

from typing import Literal
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage

# Initialize models based on 'Reasoning Tiers'
expensive_model = ChatAnthropic(model="claude-3-5-sonnet-20240620")
cheap_model = ChatAnthropic(model="claude-3-haiku-20240307")

def router_node(state: AgentState) -> Literal["specialist", "generalist"]:
    """
    Analyzes intent to decide the most cost-effective path.
    """
    user_input = state["messages"][-1].content
    
    # Logic: High-stakes security keywords or complex patterns trigger the PhD model
    security_triggers = ["vulnerable", "exploit", "remediate", "audit"]
    if any(word in user_input.lower() for word in security_triggers):
        return "specialist"
    return "generalist"

# In LangGraph, we add this as a conditional edge from the START
workflow.add_conditional_edges(
    START,
    router_node,
    {
        "specialist": "security_agent", # Expensive Specialist
        "generalist": "simple_chat"     # Inexpensive Generalist
    }
)

4. Prompt Caching: The 90% Discount

Most agents use a massive “System Prompt” that defines their tools and persona. In a loop, you send this same prompt over and over.

Prompt Caching (available on Anthropic and DeepSeek) allows the LLM provider to store your system instructions in their memory. You only pay the full price for the first call; every subsequent call in that “loop” uses the cache at a massive discount (often up to 90%).

The Trade-off: Caching usually requires the prompt to be at least 1,024 tokens long to be cost-effective. If your agent is “lightweight,” caching won’t help. If your agent is “heavy” (like an enterprise auditor), caching is mandatory for scaling.

5. Advanced State Management: The Summarization Node

In long sessions, the “historical baggage” of old messages slows down the agent. You can implement a Summarization Node that triggers every 5–10 turns to “compress” memory.

def summarize_conversation(state: AgentState):
    messages = state["messages"]
    if len(messages) < 10:
        return state

    # Use the cheap model to summarize the previous conversation
    summary = cheap_model.invoke(f"Summarize the interaction concisely: {messages[:-2]}")
    
    # Overwrite state: Keep the summary + the last 2 messages for immediate context
    new_messages = [HumanMessage(content=f"Summary of previous chat: {summary.content}")] + messages[-2:]
    
    return {"messages": new_messages}

Tip: This acts like “Garbage Collection” for AI, keeping the context window small and the response time fast

6. Summary: How to Scale Without Breaking the Bank

Image

Final Recommendation

  • Prompt for Logic, Not Data: Don’t feed the LLM raw logs. Use a Python tool to summarize the logs first, then send the summary to the LLM. Fewer tokens = lower cost.

If you are auditing a cloud environment with 500 servers, sending the raw metadata of every server to the LLM is a disaster. It bloats the context window, confuses the model, and wastes money.

  • The Problem: Sending 500 JSON objects to the LLM and asking, “Which ones are vulnerable?”
  • The Scaled Solution: Use a Python tool to iterate through the 500 objects locally. Have the code filter for the specific vulnerability (e.g., Port 22 open) and return only the IDs of the problematic servers.
  • The Result: You send the LLM a tiny, refined list: “I found 3 vulnerable instances (ID-101, ID-102, ID-103). How should I remediate these?” You’ve reduced your token input by 99% while increasing accuracy.

2. Use SLMs (Small Language Models): For internal classification steps, use models like Llama 3 or Mistral hosted on your own infrastructure to eliminate per-token costs.

Not every task requires a high-reasoning model like GPT-4o or a large-scale Gemma model. Many nodes in your graph are simply classifying intent or formatting text.

  • The Problem: Using an expensive “Frontier” model to decide if a user said “Hello” or “Buy this.”
  • The Scaled Solution: Host a Small Language Model (SLM) like Gemma-2–2B or Llama-3–8B locally on LM Studio.
  • The Workflow: Entry Node (SLM): Classifies the request (e.g., “General Chat” vs. “Cloud Action”). This costs $0.00 on your local hardware. Logic Node (Large LLM): Only trigger the more complex, higher-reasoning model when the SLM identifies a task that actually requires “heavy lifting.”

3. Shorten the History: Periodically “summarize” the conversation history in your AgentState. This keeps the context window small and the "Reasoning Tax" low.

In LangGraph, the messages list is your agent's memory. In long shopping sessions or multi-step audits, this list grows indefinitely. Eventually, the "historical baggage" slows down the agent and risks hitting context limits.

  • The Problem: After 20 turns of conversation, the LLM is still re-processing the very first “Hello” and every intermediate calculation you’ve ever done.
  • The Scaled Solution: Implement a Summarization Node that triggers every 5–10 turns.
  • The Logic: This node takes the current history, generates a concise “Snapshot” of the state (e.g., “User has already calculated tiles for Room A and B; currently choosing colors for Room C”), and deletes the old messages, replacing them with the snapshot.
  • The Result: You keep the context window small, the latency low, and the “Reasoning Tax” flat, regardless of how long the session lasts.

4. Deterministic Circuit Breaker

Autonomous agents can occasionally get stuck in “reasoning loops” where they call the same tool repeatedly. Without a guardrail, this is a literal infinite bill.

  • The Scaled Solution: Implement a loop_count or max_iterations variable in your AgentState.
  • The Implementation: In your route function, check the counter. If the agent tries to loop more than 5 or 10 times, the graph should force an exit to a human_intervention node.
  • The Result: You ensure that even if the AI “hallucinates” a reason to keep going, your code (the “Nervous System”) maintains ultimate control over the budget.

Clap, share, comment, and provide your thoughts. Please share your opinions with me! Happy studying! Please follow to receive notifications on new articles.

Image

Sponsored Advertisement