From Models to Impact: Why Agentic Systems Belong in Every ML Engineer's Toolkit

A Personal Take
Recently, while chatting with a fellow MLE friend, I mentioned that I’d been investing time in building agentic systems—those LLM-powered pipelines that handle planning, reasoning, and tool invocation. He paused for a second and asked:
“Do you think this is a worthy skill? I mean… isn’t it mostly just API calling?”
That question stuck with me.
Because from the outside, it can look like that. But once you’re under the hood, you realize agentic systems aren’t just wrappers—they’re architectural layers that introduce reasoning, autonomy, coordination, and memory into how we deliver ML experiences. They’re an essential bridge between intelligent models and the real-world workflows they aim to improve.
What Are Agentic Systems, Really?
Agentic systems are software architectures that embed language models into decision-making loops. These systems don’t just call an endpoint and return a string; they:
- Interpret complex user goals
- Choose appropriate actions (often involving external tools)
- Coordinate across tools and services
- Track memory or past interactions
- Manage uncertainty and fallbacks
- Adapt their behavior based on new information
They represent a paradigm shift from static model-serving to dynamic, dialogue-driven computation.
```mermaid
graph TD
    subgraph "Agentic System"
        direction TB
        A["Agent Core"]
        subgraph "Agent Core Components"
            direction TB
            P["Planning & Reasoning"]
            PE["Prompt Engineering"]
            TI["Tool Interface & Validation"]
            EH["Error Handling & Fallbacks"]
        end
        M["Memory <br/> (Short/Long Term, RAG)"] --> A
        T["Tools/APIs <br/> (e.g., Jira, Search, Email, <br/> **ML Models**)"] --> TI
        E["Environment <br/> (User Input/Output, State)"]
        A --> P
        A --> PE
        A --> TI
        A --> EH
    end
    subgraph "ML Lifecycle (Produces Tools for Agent)"
        direction LR
        DP["Data Preparation"] --> MT["Model Training"] --> MS["Model Serving"] --> MO["Monitoring"]
    end
    MS -- "Deployed Model" --> T
    %% Connections showing interaction flow
    E -- "User Query/Event" --> A
    A -- "Tool Call" --> T
    T -- "Tool Result" --> A
    A -- "Memory Read/Write" --> M
    A -- "Response/Action" --> E
    %% Optional Styling
    style A fill:lightblue,stroke:#333,stroke-width:2px
    style M fill:#ccf,stroke:#333
    style T fill:#fec,stroke:#333
    style E fill:#cfc,stroke:#333
    style MS fill:#eee,stroke:#333
```
Agent Development 101: The Technical Landscape
To build robust agents, you’re engineering a distributed cognitive system. Building these systems often involves leveraging frameworks specifically designed for agentic architectures, such as LangChain, LlamaIndex, AutoGen, or Haystack, which provide abstractions for planning, tool use, and memory management. Here’s a more in-depth look at what this entails technically:
1. Planning and Decision-Making
Agents must translate ambiguous instructions into structured plans. In traditional programming, that’s flow control. In agent systems, it’s modeled through reasoning frameworks like ReAct (reason + act), Tree-of-Thoughts, or custom planners. This process involves maintaining internal state, chaining subtasks, and using the model as a controller rather than a simple answer engine.
This often involves crafting prompts with few-shot examples of desired reasoning patterns or using specific instruction formats like ‘Think step-by-step before acting’ to guide the LLM. The output plan might then need to be parsed from the LLM’s response, often requiring robust parsing logic to handle variations in the generated text. It also involves handling the inherent non-determinism of LLMs; the same prompt might yield slightly different plans, requiring systems that are resilient to variability and can recover from suboptimal choices. It requires careful prompt design, recursive calls, and planning abstractions that are both efficient and interpretable.
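To make the loop concrete, here is a minimal sketch of a ReAct-style controller. `call_llm` and `run_tool` are hypothetical placeholders for your model client and tool registry; the point is the Thought -> Action -> Observation cycle and the defensive parsing around it.

```python
import re

# Minimal ReAct-style control loop (a sketch). call_llm and run_tool are
# hypothetical helpers standing in for your model client and tool registry.
REACT_PROMPT = """Answer the user's request. Think step-by-step before acting.
Use this format:
Thought: <your reasoning>
Action: <tool_name>[<input>]
Observation: <tool result, filled in by the system>
... (repeat Thought/Action/Observation as needed)
Final Answer: <answer to the user>
"""

def react_loop(user_goal: str, max_steps: int = 5) -> str:
    transcript = REACT_PROMPT + f"\nUser request: {user_goal}\n"
    for _ in range(max_steps):
        completion = call_llm(transcript)  # hypothetical LLM call
        transcript += completion
        if "Final Answer:" in completion:
            return completion.split("Final Answer:")[-1].strip()
        # LLM output varies run to run, so action parsing must be defensive.
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", completion)
        if match is None:
            transcript += "\nObservation: could not parse an action, please retry.\n"
            continue
        tool_name, tool_input = match.groups()
        observation = run_tool(tool_name, tool_input)  # hypothetical tool dispatch
        transcript += f"\nObservation: {observation}\n"
    return "Stopped: step budget exhausted."
```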
2. Tool Invocation
Modern LLMs are not used in isolation. They act as orchestrators that call other models, APIs, or services. You expose tools to agents with schemas and function signatures. Exposing tools effectively requires detailed descriptions, often using formats like JSON Schema or OpenAPI specifications, so the LLM clearly understands the tool’s purpose, inputs, and outputs.
```python
# Example Tool Definition (Conceptual)
def create_jira_ticket(project_key: str, summary: str, description: str,
                       issue_type: str = "Task", priority: str = "Medium") -> dict:
    """Creates a new ticket in the specified Jira project."""
    # ... implementation using Jira API client ...
    pass

def send_email(recipient: str, subject: str, body: str) -> bool:
    """Sends an email."""
    # ... implementation using email API/service ...
    pass
```
The agent needs to:
- Parse unstructured intent into structured input for the tool.
- Validate the inputs and retry on error.
- Parse the tool’s output and update internal reasoning state.
The core agent loop (e.g., the ReAct cycle: Thought -> Action -> Observation -> …) relies on reliably parsing the LLM’s intent to call a specific tool and extracting the correct arguments, often using techniques like enforcing JSON output or using dedicated parsing models.

This goes far beyond simple function calls. Engineers must build robust error handling for failed API requests or malformed LLM outputs, manage state consistently across asynchronous tool calls, implement retry logic with backoff strategies, and often deal with rate limits or authentication for external services. Ensuring the agent correctly interprets tool outputs and incorporates them into its ongoing reasoning requires careful validation and potential correction loops. This is where most of the “not just API calling” complexity shows up: input/output grounding, fallbacks, retries, and multi-tool coordination.
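As an illustration, here is a sketch of that parse-validate-retry layer, assuming the LLM has been instructed to emit tool calls as JSON with `tool` and `arguments` keys (a convention, not a standard; `call_llm` is again a hypothetical client, and the tools are the conceptual ones defined above):

```python
import json

# Map tool names to the conceptual implementations defined earlier.
TOOLS = {"create_jira_ticket": create_jira_ticket, "send_email": send_email}

def invoke_tool(llm_output: str, max_retries: int = 2):
    """Parse an LLM-proposed tool call, validate it, execute it, retry on failure."""
    for attempt in range(max_retries + 1):
        try:
            call = json.loads(llm_output)        # malformed JSON raises here
            func = TOOLS[call["tool"]]           # unknown tool name raises KeyError
            return func(**call["arguments"])     # wrong arguments raise TypeError
        except (json.JSONDecodeError, KeyError, TypeError) as err:
            if attempt == max_retries:
                raise
            # Feed the error back so the model can repair its own output.
            llm_output = call_llm(  # hypothetical LLM call
                f"Your tool call failed with: {err}. Re-emit it as JSON "
                f"with keys 'tool' and 'arguments'.\n{llm_output}"
            )
```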
3. Memory Management
Stateless LLMs are limited. Agents need persistent memory to:
- Recall past decisions and interactions.
- Fetch relevant context for the current task (RAG).
- Maintain identity and preferences over sessions.
Memory can be short-term (fitting within the LLM’s context window) or long-term (retrieval-augmented via vector databases or other stores). For RAG, the choice of embedding model is critical, balancing retrieval accuracy, dimensionality, and computational cost. Retrieval might involve not just vector similarity but also keyword filters or recency weighting. For summarizing long histories or large documents to fit context, techniques like map-reduce summarization (summarizing chunks independently, then summarizing the summaries) can be employed. Engineers must design schemas, embedding strategies, and summarization techniques to fit within budget and latency requirements.
```python
# Example using a vector store for RAG
relevant_docs = vectorstore.similarity_search("internal docs about handling critical bugs")
context = "\n".join([doc.page_content for doc in relevant_docs])
```
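The map-reduce summarization mentioned above can be sketched in a few lines. As before, `call_llm` is a hypothetical helper, and chunking by character count is a simplification (real systems usually chunk by tokens):

```python
# Map-reduce summarization (sketch): summarize chunks, then summarize the summaries.
def summarize_long_text(text: str, chunk_size: int = 4000) -> str:
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partials = [call_llm(f"Summarize concisely:\n{chunk}") for chunk in chunks]  # map step
    return call_llm("Combine these partial summaries into one:\n" + "\n".join(partials))  # reduce step
```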
4. Multi-Agent Systems
For complex workflows, agents often collaborate. A planner agent may assign tasks to specialized agents — one for search, another for summarization, another for interacting with a specific complex API.
```mermaid
graph LR
    U["User Input/Event"] --> P["Planner/Manager Agent"]
    P -- "Task 1" --> S["Triage Agent (Email/Text Analysis)"]
    P -- "Task 2" --> J["Jira Agent (API Interaction)"]
    P -- "Task 3" --> E["Email Agent (Compose/Send Reply)"]
    S --> P
    J --> P
    E --> P
    P --> F["Final Status Aggregator"]
    F --> O["Logging/User Notification"]
```
Each agent has a constrained role and controlled access to tools. Designing multi-agent systems requires:
- Shared memory protocols or databases.
- Standardized messaging formats between agents (e.g., using JSON objects with predefined keys for sender, recipient, task, and data; one possible envelope is sketched after this list).
- Communication patterns, such as a hierarchical structure where a manager agent dispatches tasks, or a ‘blackboard’ system where agents read and write to a shared state.
- Failure routing, arbitration logic for disagreements, and mechanisms for coordinating parallel work.
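To make the messaging-format point concrete, here is one possible message envelope. The field names are illustrative, not a standard:

```python
import json
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class AgentMessage:
    """One possible inter-agent message envelope; field names are illustrative."""
    sender: str
    recipient: str
    task: str
    data: dict
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_json(self) -> str:
        return json.dumps(asdict(self))

# The planner dispatches a triage task to a specialized agent.
msg = AgentMessage(sender="planner", recipient="triage_agent",
                   task="classify_email",
                   data={"subject": "Urgent: Login fails on checkout page"})
print(msg.to_json())
```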
5. Evaluation and Supervision
Evaluating agentic systems presents unique challenges, as performance often isn’t a simple pass/fail. Success depends on the quality of reasoning, tool use, and final output. Effective evaluation requires a multi-faceted approach:
- Task Success Metrics: Defining clear goals and measuring if the agent achieved the desired outcome (e.g., successfully created the correct Jira ticket, sent an appropriate email reply).
- Programmatic Scoring (LLM-as-Judge): Using another LLM to assess aspects like reasoning coherence, response quality, or tool usage correctness based on defined criteria (a minimal sketch follows this list). However, setting up reliable prompts and avoiding grader bias is non-trivial.
- Manual Reviews: Human oversight using annotation interfaces to grade nuanced agent behavior, identify failure modes, and create high-quality test cases.
- Traceability and Debugging: Implementing robust logging and replay systems to meticulously track the agent’s internal state, reasoning steps, tool inputs/outputs, and error handling. Fine-grained traceability is crucial, as a single flawed decision early on can derail the entire process. Techniques like comparing agent execution traces against verified “golden” traces can also be employed.
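As an example of the LLM-as-judge approach, here is a minimal sketch. The rubric, prompt, and `call_llm` helper are all placeholders you would adapt:

```python
import json

# LLM-as-judge sketch: rubric and prompt are illustrative, call_llm is hypothetical.
JUDGE_PROMPT = """You are grading an AI agent's run.
Score each criterion from 1 to 5: reasoning_coherence, tool_use_correctness, response_quality.
Return only JSON, e.g. {{"reasoning_coherence": 4, "tool_use_correctness": 5, "response_quality": 3}}.

Task given to the agent:
{task}

Agent trace:
{trace}
"""

def judge_run(task: str, trace: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(task=task, trace=trace))  # hypothetical client
    return json.loads(raw)  # the judge's output itself needs validation in practice
```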
6. Security and Ethical Guardrails
Giving agents autonomy and tool access introduces significant security and ethical considerations. Engineers must proactively address:
- Prompt Injection: Malicious inputs designed to hijack the agent’s planning or force unintended tool use.
- Tool Misuse: Ensuring agents use tools within designated boundaries and don’t perform harmful or unauthorized actions (requires robust permissioning).
- Data Privacy: Protecting sensitive information, especially when agents access or store user data in memory systems or interact with sensitive APIs.
- Unintended Consequences: Mitigating risks of agents generating harmful, biased, or incorrect outputs, especially in complex, multi-step tasks.
- Transparency: Making agent reasoning and decision-making processes as interpretable as possible to identify and rectify issues.
Building robust input validation, permission systems, output content filtering, and continuous monitoring is essential for deploying agents responsibly.
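As one small illustration, a deny-by-default tool allowlist plus basic argument validation might look like the sketch below. The roles, tool names, and domain check are all hypothetical:

```python
# Deny-by-default tool allowlist per agent role (roles and tools are illustrative).
ALLOWED_TOOLS = {
    "support_agent": {"create_jira_ticket", "send_email"},
    "triage_agent": {"classify_email"},
}

def authorize_tool_call(agent_role: str, tool_name: str, args: dict) -> None:
    if tool_name not in ALLOWED_TOOLS.get(agent_role, set()):
        raise PermissionError(f"{agent_role} may not call {tool_name}")
    # Basic input validation: don't let model-controlled text pick arbitrary recipients.
    if tool_name == "send_email" and not args.get("recipient", "").endswith("@email.com"):
        raise ValueError("Recipient outside the approved domain")
```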
Integrating with the MLE Stack
So where does all of this fit for a machine learning engineer?
In traditional ML pipelines:
```mermaid
graph TD
    A[Raw Data] --> B[Feature Engineering]
    B --> C[Model Training]
    C --> D[Model Deployment]
    D --> E[Scoring Service]
```
Agentic systems extend the interface, often wrapping around or orchestrating calls to such scoring services:
```mermaid
graph TD
    U["User Input/Event"] --> R["Agent Reasoning & Planning"]
    R --> T["Tool Use: Models, APIs, DBs"]
    T --> M["Memory Access/Update"]
    M -- "Loop back for next step" --> R
    R --> P["Plan Execution Complete"]
    P --> O["Formatted Output/Action"]
```
Instead of applications calling the model endpoint directly, the agent becomes the abstraction layer that decides when, why, and how to use the model (or other tools). The ML model becomes a powerful tool within the agent’s toolkit.
This changes how you:
- Expose your models: Requires clear contracts, schemas (like OpenAPI specs), and descriptions so agents can reliably use them as tools (an example schema follows this list).
- Evaluate end-to-end usage: Focus shifts from just model accuracy to overall task success rate and the quality of the agent’s decision paths (which paths succeed vs. fail).
- Optimize cost and latency: Agentic systems, with their multiple LLM calls and tool interactions, require careful consideration of token consumption and overall response time. This involves choosing the right model for each sub-task (balancing capability vs. cost/speed), implementing caching strategies, potentially parallelizing tool calls, and monitoring operational expenses closely. Agents allow for dynamic decisions on if and which model or tool to use, offering opportunities for optimization compared to static approaches.
- Think about deployment: Requires managing the agent’s state, orchestration logic, and connections to various tools, in addition to the model itself.
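For the first point, exposing a deployed model as a tool often boils down to writing a clear, schema-backed description. A hypothetical example, where the model, name, and fields are illustrative:

```python
# One way to describe a deployed model so an agent can call it as a tool.
# The model, name, and fields below are illustrative, not a real API.
priority_model_tool = {
    "name": "predict_ticket_priority",
    "description": (
        "Returns a priority label (Low/Medium/High) for a support ticket, "
        "based on its subject and body. Use before creating a Jira ticket."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "subject": {"type": "string", "description": "Email subject line"},
            "body": {"type": "string", "description": "Full email body"},
        },
        "required": ["subject", "body"],
    },
}
```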
Concrete Example: Automated Support Ticket Handling
Let’s make this real with a detailed yet structured walkthrough of how an agentic system handles customer support.
Incoming Support Request
Subject: “Urgent: Login fails on checkout page”
Body: “I keep getting an ‘Authentication Error’ when trying to log in during checkout. This is blocking my purchase! My username is customer@email.com.”
Agent Processing Flow
```mermaid
graph TD
    A["New Email Arrives"] --> B["Parse Email: Extract Intent, Entities, Urgency"]
    B --> C["Plan Actions: Jira + Email"]
    C --> D["Query Docs/User DB for Context"]
    D --> E["Create Jira Ticket"]
    E --> F["Send Confirmation Email"]
    F --> G["Log Results"]
    E -->|"Error?"| H["Fallback: Notify Human"]
```
Step-by-Step Execution
1. Parse and Understand:
   - The LLM analyzes the email content, identifying:
     - Intent: Bug Report
     - Urgency: High (keywords “Urgent”, “blocking”)
     - Key entities: login failure, checkout page, Authentication Error, customer@email.com
   - The model creates a structured representation of the issue from unstructured text (one possible shape is sketched after this list).
2. Planning:
   - The agent constructs a multi-step plan:
     1. Determine the appropriate Jira project for frontend issues
     2. Extract required fields for ticket creation
     3. Set appropriate priority based on urgency analysis
     4. Create ticket via API
     5. Send acknowledgment to customer
     6. Log the interaction for analytics
   - This demonstrates the agent’s ability to break down a high-level task into executable steps.
3. Memory/Context Retrieval:
   - The agent queries internal knowledge bases:
     - RAG system to find the correct Jira project key for frontend bugs (“FRONTEND”)
     - Customer database to verify that customer@email.com is a valid user
     - Historical context of similar issues for reference
   - This showcases the integration of memory systems with planning.
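One possible shape for the structured representation from step 1, using a simple dataclass (the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ParsedTicket:
    """Structured view the agent extracts from the raw email (fields illustrative)."""
    intent: str          # e.g., "bug_report"
    urgency: str         # e.g., "high"
    component: str       # e.g., "checkout_login"
    error_message: str   # e.g., "Authentication Error"
    customer_email: str  # e.g., "customer@email.com"

parsed = ParsedTicket(intent="bug_report", urgency="high",
                      component="checkout_login",
                      error_message="Authentication Error",
                      customer_email="customer@email.com")
```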
Tool Invocation Details
```python
# First tool call: Create Jira ticket
ticket_result = create_jira_ticket(
    project_key="FRONTEND",
    summary="Login fails on checkout page",
    description="User customer@email.com reports 'Authentication Error' during checkout login. Blocking purchase.",
    issue_type="Bug",
    priority="High",
)

# Agent processes result
ticket_id = ticket_result.get("id")  # e.g., "FRONTEND-1234"

# Second tool call: Send confirmation email
email_result = send_email(
    recipient="customer@email.com",
    subject=f"Re: Urgent: Login fails on checkout page [Ticket {ticket_id}]",
    body=f"""Thank you for reporting this issue. We've created a ticket ({ticket_id})
and our team will investigate urgently. We'll update you as soon as
we have more information about the authentication error you're experiencing.""",
)

# Log interaction for analytics and future reference
log_interaction(
    ticket_id=ticket_id,
    customer_email="customer@email.com",
    resolution_stage="initial_response",
    agent_actions=["ticket_creation", "customer_notification"],
)
```
Error Handling and Fallbacks
The agent implements sophisticated error handling (a retry-and-fallback sketch follows these lists):
-
If the Jira API returns an error (e.g., invalid project key, service unavailable):
- First retry with exponential backoff
- If continued failure, try fallback project (e.g., “GENERAL”)
- If still unsuccessful, escalate to human support team with context
- Send appropriate messaging to customer reflecting current status
-
If email sending fails:
- Log failure
- Try alternative notification method
- Alert support team to manual follow-up need
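A minimal sketch of the retry-and-fallback pattern for the Jira path, assuming the `create_jira_ticket` tool from earlier; `escalate_to_human` is a hypothetical escalation hook:

```python
import time

def create_ticket_with_fallback(summary: str, description: str,
                                max_retries: int = 3) -> dict:
    """Retry Jira ticket creation with exponential backoff, then fall back."""
    for project in ("FRONTEND", "GENERAL"):  # primary project, then fallback
        delay = 1.0
        for _ in range(max_retries):
            try:
                return create_jira_ticket(project_key=project, summary=summary,
                                          description=description,
                                          issue_type="Bug", priority="High")
            except Exception:  # in practice, catch your Jira client's API error
                time.sleep(delay)
                delay *= 2     # exponential backoff between attempts
    escalate_to_human(summary, description)  # hypothetical escalation hook
    raise RuntimeError("Ticket creation failed after retries and fallback")
```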
This demonstrates how agents go beyond basic “happy path” execution to handle real-world failures.
What This Demonstrates
This end-to-end example showcases key agent capabilities:
- Natural language understanding: Converting unstructured email into structured data
- Complex reasoning: Planning appropriate actions based on context and urgency
- Tool orchestration: Coordinating multiple API calls across systems
- Memory integration: Using RAG to retrieve relevant knowledge
- Error resilience: Implementing fallback strategies for robust operation
- Context management: Maintaining state across multiple steps and tools
Instead of a single model call, the agent creates an intelligent processing pipeline that connects user input to meaningful business actions, demonstrating why agent development goes far beyond “just API calling”.
Bridging the Gap to Production Reality
It’s crucial to recognize that while the support ticket example illustrates the concepts of agentic systems, it significantly simplifies the challenges faced in a real-world production environment. Moving beyond such a “toy” example requires substantially more engineering rigor:
- Bespoke Model Development: Production systems often demand custom-trained ML models tailored to specific domains (e.g., intent classification optimized for your product’s jargon, specialized NER models) rather than relying solely on general-purpose LLMs for every task. This involves the full ML lifecycle: data collection, annotation, training, evaluation, and versioning.
- Real-time Model Serving: These custom models need to be deployed as robust, low-latency, scalable APIs (often microservices) that the agent can reliably call as tools. This requires expertise in MLOps, infrastructure management, performance optimization, and monitoring.
- Production-Grade Orchestration: The agent framework itself needs to handle concurrency, distributed state management, robust logging/tracing for observability, sophisticated monitoring for cost and performance, and seamless integration with existing CI/CD pipelines.
- Advanced Error Handling & Resilience: Production systems need far more sophisticated error handling, circuit breaking, retry logic, and failover strategies than shown. They must gracefully handle intermittent network issues, API rate limits, downstream service outages, and unexpected model outputs.
- Security and Compliance: Implementing enterprise-grade security measures, managing secrets, ensuring data privacy (like GDPR or CCPA compliance), and undergoing security audits are non-negotiable.
- Scalability and Cost Management: Designing the system to scale efficiently under varying loads while managing the potentially high costs associated with frequent LLM calls and complex tool interactions is a major engineering challenge.
Building production-level agentic systems involves deep work across the entire software and machine learning stack, far exceeding the complexity suggested by simplified examples.
Final Thoughts
Agentic systems aren’t replacing models. They’re unlocking them.
They don’t reduce the need for ML expertise — they amplify it by allowing engineers to:
- Build more flexible, dynamic, and interactive user interfaces.
- Create sophisticated data feedback loops for model improvement based on agent performance.
- Explore complex product requirements and achieve product-market fit faster through iterative development.
The complexity lies significantly in the orchestration, state management, reliability, and evaluation, not just the core model inference. Start experimenting today – perhaps by building a basic RAG (Retrieval-Augmented Generation) agent or trying out different planning techniques like ReAct using a framework. The hands-on experience is invaluable. And mastering it means you’re no longer just training models—you’re delivering intelligence.
If you want to build user-facing, ML-powered products in the era of LLMs, agent development is not optional. It’s how you close the loop between raw model outputs and meaningful, reliable experiences. It’s one of the most valuable skills an ML engineer can have today.