Puneet, Machine Learning Engineer at Zillow, focused on AI/ML Architecture & Innovation

Taming the Dice Roll: Building Deterministic LLM Systems

Published: 21 Apr 2025

Introduction

Large Language Models (LLMs) have revolutionized how we build intelligent systems, powering everything from chatbots to complex data analysis tools. However, they come with an inherent characteristic that can be a major hurdle in many applications: they are non-deterministic by default. Ask the same question twice, and you might get two different answers. While this variability is perfectly acceptable, even desirable, in creative applications like story generation or brainstorming, it becomes a significant problem for structured, repeatable tasks. Think about classification, form filling, data extraction, or workflow automation – consistency and predictability are paramount.

This tutorial-style guide dives deep into the challenge of LLM non-determinism and provides practical strategies to build systems that produce consistent, reliable outputs given the same input. We’ll explore the root causes of randomness, architectural patterns to mitigate it, specific decoding strategies, and essential production tips. By the end, you’ll have a clearer understanding of how to move from the probabilistic nature of LLMs towards robust, deterministic implementations for your specific needs.

Why Are LLMs So Random, Anyway?

At their core, LLMs generate text one token (a word or part of a word) at a time. For each step, the model calculates a probability distribution over all possible next tokens in its vocabulary. The “randomness” creeps in during the decoding process – how the model chooses the next token from this distribution. Even tiny variations in this selection process can cascade, leading to significantly different outputs over longer sequences.

Here are the key sources of this randomness:

  • Sampling During Decoding: Techniques like temperature scaling, top-k, and top-p sampling are designed to introduce variability by randomly selecting from the most likely next tokens, rather than always picking the absolute most probable one. A temperature greater than 0 explicitly introduces randomness.
  • Hardware and Distributed Execution: Running inference across multiple GPUs or in different hardware environments can subtly change the numerics. Floating-point addition is not associative, so differences in batching, kernel selection, or reduction order can shift the logits just enough to flip a token choice, and that divergence compounds over a long generation.
  • Model Updates and Drift: When using LLM APIs (like OpenAI, Anthropic, Google), the underlying models are periodically updated. Even minor updates can change the model’s internal weights and, consequently, its outputs for the exact same prompt, making long-term reproducibility challenging without version pinning.

Deep Dive: Greedy vs. Sampling

Understanding the decoding strategy is crucial:

  • Greedy Decoding: This is the simplest approach. At each step, the model always selects the single token with the highest probability. If all other factors are kept constant, greedy decoding is deterministic.
  • Sampling (Temperature > 0): This introduces controlled randomness.
    • Temperature: Controls the “creativity” or randomness. A temperature of 0 effectively becomes greedy decoding. Higher temperatures flatten the probability distribution, making less likely tokens more probable, increasing randomness.
    • Top-p (Nucleus Sampling): Considers only the smallest set of tokens whose cumulative probability exceeds a threshold p. The next token is sampled only from this set.
    • Top-k Sampling: Considers only the k most likely tokens and samples from that reduced set.

Tip: Your first step towards determinism should always be setting temperature=0 (or as close to zero as the API allows) and top_p=1.0.
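
To see why temperature is the main control knob, here is a small illustrative sketch (plain NumPy, no LLM involved) of how the next token gets picked from a vector of logits under greedy decoding versus temperature sampling:

import numpy as np

def next_token(logits: np.ndarray, temperature: float = 0.0, rng=None) -> int:
    """Pick the next token index from raw logits: greedy at temperature 0, sampled otherwise."""
    if temperature == 0.0:
        return int(np.argmax(logits))              # Greedy: always the single most likely token

    rng = rng or np.random.default_rng()
    scaled = logits / temperature                  # Higher temperature flattens the distribution
    probs = np.exp(scaled - scaled.max())          # Softmax (numerically stable)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))   # Sampling: this is where randomness enters

logits = np.array([2.0, 1.5, 0.3, -1.0])
print(next_token(logits))                          # Deterministic: always index 0
print(next_token(logits, temperature=1.0))         # May differ from run to run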

The Prompt Is the Program: Engineering for Consistency

Think of your prompt not just as a question, but as the source code for the LLM’s task. Just like in traditional programming, small changes in the code (prompt) can lead to vastly different outputs. To minimize variability stemming from the prompt itself:

  • Use Consistent Structure: Employ clear delimiters (like ---, ###, XML tags) to separate instructions, context, examples, and the input data. This helps the model parse the request reliably.
  • Provide Few-Shot Examples: Include 2-3 examples of the exact input/output format you expect within the prompt. This “primes” the model to follow the desired pattern.
  • Be Specific and Avoid Ambiguity: Frame requests clearly. Instead of “Summarize the text,” try “Summarize the following text in exactly three bullet points.”
  • Avoid Open-Ended Questions (Unless Variability is Desired): Questions like “What are your thoughts on AI?” are inherently designed for varied responses. Stick to constrained tasks for deterministic needs.

Example: Structured Prompt for Extraction

You are an expert data extraction assistant. Extract the requested information from the provided text and format it EXACTLY as shown in the examples.

---
EXAMPLE 1
Text:
"My name is John Doe, and I work at Acme Corp. You can email me at john.doe@acme.com."

Output:
{
  "name": "John Doe",
  "email": "john.doe@acme.com",
  "company": "Acme Corp"
}
---
EXAMPLE 2
Text:
"Reach out to Jane Smith from Globex Inc. at j.smith@globex.org."

Output:
{
  "name": "Jane Smith",
  "email": "j.smith@globex.org",
  "company": "Globex Inc"
}
---
ACTUAL TASK
Text:
"Hi, I'm Sarah Connor from Skynet. Reach me at sarah@skynet.ai."

Output:
{
  "name": "Sarah Connor",
  "email": "sarah@skynet.ai",
  "company": "Skynet"
}

👀 Note: Even with strict prompts, subtle variations in whitespace, punctuation, or phrasing in the input text itself can sometimes lead the LLM down slightly different paths if the examples aren’t robust enough. Test thoroughly!
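
One practical mitigation, a minimal sketch rather than a prescribed step, is to normalize input text before it reaches the prompt (and before it is hashed for caching, covered later), so trivially different inputs map to the same request:

import re
import unicodedata

def normalize_input(text: str) -> str:
    """Normalize whitespace and unicode so trivially different inputs produce identical prompts."""
    text = unicodedata.normalize("NFC", text)   # Canonical unicode form
    text = re.sub(r"\s+", " ", text)            # Collapse runs of spaces/newlines/tabs
    return text.strip()

print(normalize_input("Hi,   I'm Sarah  Connor\nfrom Skynet. "))
# -> "Hi, I'm Sarah Connor from Skynet."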

Tuning the Sampler: The Control Knobs for Randomness

Beyond temperature, other parameters influence the sampling process. Understanding them helps lock down behavior:

| Parameter | Effect | Recommended for Determinism |
| --- | --- | --- |
| temperature | Controls randomness; lower values = less random. | 0 (or the lowest value allowed) |
| top_p (nucleus) | Cumulative probability cutoff; limits the sampling pool. | 1.0 (or omit when temperature=0) |
| top_k | Limits the sampling pool to the k most likely tokens. | Omit (irrelevant when temperature=0) |
| frequency_penalty | Penalizes tokens based on their frequency in the text so far. | 0 |
| presence_penalty | Penalizes tokens based on whether they already appear in the text so far. | 0 |
| seed (if available) | Initializes the random number generator used for sampling. | Fixed integer value |
| system_fingerprint (OpenAI) | Identifier for the backend configuration that served the request. | Log it for reproducibility |

# Example using the OpenAI Python SDK (conceptual)
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4", # Pin a specific version if possible, e.g., "gpt-4-0613"
  messages=[...], # Your prompt messages go here
  temperature=0.0,
  top_p=1.0, # Often redundant with temperature=0, but good practice
  frequency_penalty=0.0,
  presence_penalty=0.0,
  seed=42 # Use a fixed seed if the API supports it
)

# Log the fingerprint for debugging reproducibility issues
system_fingerprint = response.system_fingerprint
print(f"System Fingerprint: {system_fingerprint}")

print(response.choices[0].message.content)

Most Deterministic: Use temperature=0. If the API supports it, also set a fixed seed. Log the system_fingerprint provided by APIs like OpenAI to track the backend configuration used for the request, which aids in debugging reproducibility issues.

Constraining the Output: Guiding the LLM to Structure

Even with deterministic sampling, the LLM might still generate text that semantically fits but doesn’t match the format you need (e.g., slightly different phrasing, extra commentary). To enforce structure:

Method 1: JSON Output Mode / Structured Output Prompts

Many models and APIs are being fine-tuned or offer specific modes to generate valid JSON (or XML, YAML).

Prompt Engineering

Explicitly instruct the model to respond only in JSON format, providing the schema in the prompt.

Extract the user's name and city from the text. Respond ONLY with a valid JSON object matching this schema:
{"name": "string", "city": "string"}

Text:
"My name is Alex, and I live in Berlin."

JSON Output:

API Features

Some APIs (like OpenAI’s) offer a dedicated JSON mode that forces the output to be valid JSON.

# Example using OpenAI's JSON mode (conceptual)
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4-turbo", # Check model compatibility for JSON mode
  messages=[...], # The prompt itself must also instruct the model to produce JSON
  response_format={ "type": "json_object" }, # Enable JSON mode
  temperature=0.0
  # ... other deterministic settings
)
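
Even with JSON mode enabled, it is worth parsing and validating the response before trusting it downstream. Here is a minimal sketch; the helper name and required-key check are illustrative assumptions, not part of the API:

import json

def parse_json_response(raw_text: str, required_keys: set[str]) -> dict:
    """Parse an LLM response as JSON and check that the expected keys are present."""
    data = json.loads(raw_text)  # Raises JSONDecodeError on malformed output
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"LLM response is missing keys: {missing}")
    return data

# Usage (assuming `response` comes from the JSON-mode call above):
# extracted = parse_json_response(response.choices[0].message.content, {"name", "city"})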

Method 2: Function Calling / Tool Use

This is often the most robust method for structured data extraction. You define a function signature (schema) that the LLM should populate. The API then forces the output to conform to this schema.

# Example using OpenAI's function calling / tool use (conceptual)
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "extract_contact_information",
        "description": "Extracts name, email, and company from text.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "The person's full name"},
                "email": {"type": "string", "description": "The person's email address"},
                "company": {"type": "string", "description": "The company name"}
            },
            "required": ["name", "email"] # Specify mandatory fields
        }
    }
}]

response = client.chat.completions.create(
  model="gpt-4",
  messages=[{"role": "user", "content": "Contact is Sarah Connor at sarah@skynet.ai from Skynet."}],
  tools=tools,
  tool_choice={"type": "function", "function": {"name": "extract_contact_information"}}, # Force the call
  temperature=0.0
)

# The response contains the structured arguments for the function
# (Error handling omitted for brevity)
arguments = response.choices[0].message.tool_calls[0].function.arguments
print(arguments) # A JSON string like: '{"name": "Sarah Connor", "email": "sarah@skynet.ai", "company": "Skynet"}'

Function calling significantly reduces the chances of formatting errors or the LLM adding extraneous text, making the output much more predictable and machine-readable.

Caching, Hashing, and Idempotency: Don’t Ask Twice If You Don’t Have To

If you need the exact same output for the exact same input every time, caching is your most powerful tool. It also saves costs and reduces latency. The strategy relies on idempotency: ensuring that processing the same input multiple times yields the same result.

Hash the Full Context

Create a unique identifier (hash) for the entire input context that goes into the LLM call. This includes:

  • The core prompt template
  • The specific input data
  • The model name (and version!)
  • Key sampling parameters (temperature, seed, top_p, etc.)

import hashlib
import json

def create_llm_request_hash(model, messages, temperature, top_p, seed=None, **kwargs):
    """Creates a unique hash for an LLM request configuration."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "seed": seed,
        # Include any other parameters that affect the output
        **kwargs
    }
    # Use json.dumps for consistent serialization, sort keys to handle dict order
    serialized_payload = json.dumps(payload, sort_keys=True).encode('utf-8')
    return hashlib.sha256(serialized_payload).hexdigest()

# Example usage:
model = "gpt-4-0613"
messages = [{"role": "user", "content": "Extract name: John Doe"}]
temp = 0.0
tp = 1.0
fixed_seed = 123

prompt_hash = create_llm_request_hash(model, messages, temp, tp, seed=fixed_seed)
print(f"Request Hash: {prompt_hash}")

# Now use this hash as the key in your cache (e.g., Redis, database)
# cache_key = f"llm_cache:{prompt_hash}"
# Check if cache_key exists before calling the LLM

Caching Architecture

Here’s a typical flow:

graph TD
    A[User Query / Input Data] --> B[Construct Full Prompt + Parameters];
    B --> C[Generate Request Hash];
    C --> D{Check Cache using Hash};
    D -- Cache Hit --> E[Return Cached Output];
    D -- Cache Miss --> F["LLM Inference (Deterministic Settings)"];
    F --> G["Store LLM Output in Cache (Key: Hash)"];
    G --> E;

Benefit: This guarantees that if the exact same request configuration is seen again, the cached result is returned instantly, ensuring perfect consistency and saving an LLM call.
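
Putting the hash function and the cache together, a cache-aside wrapper might look roughly like the sketch below. The in-memory dict stands in for Redis or a database table, and call_llm is a hypothetical placeholder for your deterministic LLM call:

# Minimal cache-aside wrapper around a deterministic LLM call (illustrative sketch)
cache = {}  # Stand-in for Redis, a database table, etc.

def cached_llm_call(model, messages, temperature=0.0, top_p=1.0, seed=None, **kwargs):
    request_hash = create_llm_request_hash(model, messages, temperature, top_p, seed=seed, **kwargs)
    cache_key = f"llm_cache:{request_hash}"

    if cache_key in cache:              # Cache hit: return the stored output, no LLM call
        return cache[cache_key]

    output = call_llm(model=model, messages=messages, temperature=temperature,
                      top_p=top_p, seed=seed, **kwargs)  # call_llm is a hypothetical helper
    cache[cache_key] = output           # Cache miss: store the result under the hash
    return output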

Deterministic Agentic Systems: Chains of Thought

When LLMs are part of multi-step workflows or “agents” (where the output of one LLM call informs the input of the next), ensuring determinism at each step is critical. Otherwise, randomness can compound quickly.

Key principles for deterministic agents:

  • Deterministic Steps: Ensure each individual LLM call within the agent uses deterministic settings (temp=0, fixed seed if possible, structured output).
  • Log Everything: Record the inputs, outputs, and parameters for each step. This allows for debugging and replaying sequences.
  • Hashable State: The state passed between steps should be serializable and hashable, enabling caching of intermediate results.
  • Retry Logic: Implement robust retries for transient API errors, but be careful not to introduce randomness during retries (use the same parameters).
  • Planning → Execution → Verification Loop:
    • Planner (LLM): Generates a plan (e.g., sequence of steps, API calls). Use deterministic settings here.
    • Executor: Executes each step (could be another LLM call or a different tool). Again, aim for determinism.
    • Verifier: Checks if the output of a step is valid and meets requirements before proceeding. This can catch unexpected deviations early.

Example: Simple Task Planner Agent

graph TD
    A["Goal: 'Book flight from NYC to LON for tomorrow'"] --> B["Planner LLM (temp=0, structured output: JSON plan)"];
    B --> C["Plan: {'action': 'search_flights', 'params': {'from': 'NYC', 'to': 'LON', 'date': 'YYYY-MM-DD'}}"];
    C --> D["Verifier: Is plan valid? Contains required fields?"];
    D -- Yes --> F["Executor: Call Flight Search API (External Tool)"];
    F --> G["API Result: List of flights"];
    G --> H["Next Step Planner/Selector LLM (temp=0)"];
    H --> I["Plan: {'action': 'select_flight', 'params': {'flight_id': 'XYZ'}}"];
    I --> J["Verifier: Is selection valid?"];
    J -- Yes --> K["Executor: Call Booking API"];
    K --> L[Final Result: Booking Confirmation];
    D -- No --> B(Retry/Refine Plan);
    J -- No --> H(Retry/Refine Selection);

By enforcing determinism and validation at each stage, the overall agent behavior becomes more predictable.
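
To make the plan → verify → execute loop concrete, here is a stripped-down sketch. The plan_step, verify_plan, and execute_action helpers are hypothetical placeholders for your own deterministic LLM call, schema check, and tool invocation:

# Stripped-down plan -> verify -> execute loop (illustrative; helpers are hypothetical)
def run_agent(goal: str, max_steps: int = 5) -> list:
    """Run a simple agent with deterministic settings at every LLM step."""
    history = []
    state = {"goal": goal, "observations": []}

    for _ in range(max_steps):
        plan = plan_step(state)                 # Planner LLM: temperature=0, structured JSON output

        if not verify_plan(plan):               # Verifier: catch malformed or invalid plans early
            state["observations"].append({"error": "invalid plan", "plan": plan})
            continue                            # Re-plan with the failure recorded in the state

        if plan.get("action") == "finish":
            break

        result = execute_action(plan)           # Executor: external tool or another deterministic LLM call
        state["observations"].append({"plan": plan, "result": result})
        history.append({"plan": plan, "result": result})  # Log everything for replay and caching

    return history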

Using LLMs as Deterministic “Rule Engines”

Sometimes, you don’t need the full generative power of an LLM, just its ability to understand language and follow instructions precisely. You can use a deterministically configured LLM as a flexible alternative to complex, hand-coded rule engines or simpler ML models.

Example: Sentiment Classifier

Instead of training a dedicated sentiment model, use an LLM with a clear classification prompt and deterministic settings.

Prompt:
Classify the sentiment of the following text. Respond with ONLY one word: Positive, Negative, or Neutral.

Text: "The user interface is intuitive, but the app crashed twice during setup."

Sentiment:

Settings: temperature=0, max_tokens=5 (to prevent rambling)

By providing clear instructions, few-shot examples (optional but helpful), and zero temperature, the LLM can act as a reliable, albeit potentially slower or more expensive, classifier. This is useful for complex rules that are hard to capture with regex or simple logic.
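
As a rough sketch of what this looks like in code (the client call mirrors the conceptual OpenAI examples above; the prompt template, model choice, and label check are illustrative assumptions):

from openai import OpenAI

client = OpenAI()

CLASSIFY_PROMPT = (
    "Classify the sentiment of the following text. "
    "Respond with ONLY one word: Positive, Negative, or Neutral.\n\n"
    "Text: {text}\n\nSentiment:"
)
VALID_LABELS = {"Positive", "Negative", "Neutral"}

def classify_sentiment(text: str) -> str:
    """Use a deterministically configured LLM as a lightweight sentiment classifier."""
    response = client.chat.completions.create(
        model="gpt-4-0613",                       # Pin the model version
        messages=[{"role": "user", "content": CLASSIFY_PROMPT.format(text=text)}],
        temperature=0.0,
        max_tokens=5,                             # Prevent rambling past the single label
        seed=42,
    )
    label = response.choices[0].message.content.strip()
    if label not in VALID_LABELS:                 # Verifier step: reject anything off-schema
        raise ValueError(f"Unexpected label from LLM: {label!r}")
    return label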

Evaluation and Regression Testing: Catching Drift

Determinism isn’t a one-time setup; it requires ongoing vigilance. Models change, prompts get tweaked, and infrastructure evolves. You need automated checks to ensure your system remains consistent.

Snapshot Testing (“Golden” Tests)

  1. Define a set of representative input prompts (“golden prompts”).
  2. Run these prompts through your deterministic LLM setup and store the exact outputs (“golden responses”).
  3. In your CI/CD pipeline or regular testing, rerun the golden prompts with the current setup.
  4. Compare the new outputs character-by-character against the stored golden responses. Any difference indicates a regression in determinism.

Tools & Frameworks

  • OpenAI Evals: A framework for evaluating models, which can be adapted for deterministic checks.
  • llm-guard / Guardrails AI: Tools focused on validating and securing LLM outputs, which often involves checking for format consistency.
  • Standard Testing Libraries (e.g., pytest in Python): Use fixtures to manage golden prompts/responses and write simple assertion tests.

# Conceptual pytest example
import pytest

# Assume get_deterministic_llm_response(prompt, settings) exists
# Assume load_golden_response(prompt_id) exists

GOLDEN_PROMPTS = {
    "prompt1": "Classify: 'I love this!'",
    "prompt2": "Extract email from 'Contact me at test@example.com'"
}

@pytest.mark.parametrize("prompt_id", GOLDEN_PROMPTS.keys())
def test_llm_determinism(prompt_id):
    prompt_text = GOLDEN_PROMPTS[prompt_id]
    deterministic_settings = {"temperature": 0, "seed": 42, "model": "gpt-4-0613"} # Example

    # Get the current response
    current_response = get_deterministic_llm_response(prompt_text, deterministic_settings)

    # Load the expected golden response
    golden_response = load_golden_response(prompt_id) # Load from file/db

    # Assert exact match
    assert current_response == golden_response, f"Determinism failed for {prompt_id}"
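
The load_golden_response helper above is assumed rather than provided; one minimal way to create and load golden snapshots, with an illustrative file layout, might be:

import json
from pathlib import Path

GOLDEN_DIR = Path("tests/golden")  # Illustrative location for stored golden responses

def save_golden_response(prompt_id: str, response_text: str) -> None:
    """Store the current deterministic output as the golden snapshot for prompt_id."""
    GOLDEN_DIR.mkdir(parents=True, exist_ok=True)
    (GOLDEN_DIR / f"{prompt_id}.json").write_text(
        json.dumps({"prompt_id": prompt_id, "response": response_text}, indent=2)
    )

def load_golden_response(prompt_id: str) -> str:
    """Load the stored golden snapshot for prompt_id."""
    data = json.loads((GOLDEN_DIR / f"{prompt_id}.json").read_text())
    return data["response"]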

Key Practice: Integrate these deterministic checks into your automated testing pipeline to catch regressions before they reach production.

Production Considerations: Scaling Determinism

Running deterministic LLM systems reliably at scale requires attention to infrastructure:

  • Batch Processing: For high throughput, use frameworks like Ray, vLLM, or specialized inference servers. Ensure the batching process itself doesn’t introduce randomness (e.g., consistent ordering, isolated random seeds per request if needed). Caching becomes even more critical here.
  • Pin Model Versions: Always specify the exact model version in your API calls (e.g., gpt-4-0613 instead of just gpt-4). This prevents unexpected output changes when the provider updates the default model alias.
  • Log system_fingerprint and seed: When using APIs like OpenAI, log the system_fingerprint and the seed used for each request. This is invaluable for debugging non-reproducible outputs, as it tells you if the underlying infrastructure configuration changed.
  • Control Hardware Randomness (Self-Hosted Models): If running open-source models, be mindful of GPU non-determinism. Libraries like PyTorch offer determinism flags (e.g., torch.use_deterministic_algorithms(True)), but be aware that they can impact performance and may not cover every operation; see the sketch after this list. Consistent hardware and software environments are key.
  • Leverage Frameworks: Tools like LangChain or LlamaIndex offer components like Output Parsers with built-in retry logic and formatting enforcement, which can help manage minor deviations and enforce structure, contributing to overall system stability.
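
For self-hosted PyTorch models, the usual determinism knobs look roughly like the sketch below. Treat it as a starting point, not a guarantee: some CUDA kernels have no deterministic implementation, and these settings can slow inference:

import os
import torch

# Required by cuBLAS for deterministic behavior on CUDA >= 10.2; must be set before CUDA initializes
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(42)                         # Fix the RNG seed used for sampling
torch.use_deterministic_algorithms(True)      # Error out if a non-deterministic op is used
torch.backends.cudnn.deterministic = True     # Force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False        # Disable autotuning, which can pick different kernels per run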

The Limits and Tradeoffs of Determinism

Striving for absolute determinism comes with tradeoffs:

  • Determinism vs. Creativity/Flexibility: Setting temperature=0 eliminates randomness but also removes the LLM’s ability to generate diverse, creative, or nuanced responses. This is unsuitable for tasks like brainstorming, writing assistance, or conversational AI where variability is often desired.
  • Latency: Techniques like function calling, rigorous output parsing, and retries can add latency compared to a simple, non-constrained generation. Caching helps mitigate this for repeated requests.
  • Cost: Retries due to validation failures consume additional tokens and API calls. Overly complex prompting for structure might also increase token count.
  • Brittleness: Highly constrained systems can sometimes be brittle. If the input data deviates slightly from what the prompt examples cover, a deterministic system might fail predictably, whereas a slightly more flexible system (e.g., low temperature > 0) might still manage to produce a usable, albeit not identical, output.

🎯 The Goal: Choose the right level of determinism based on your specific use case and business requirements. Don’t aim for absolute determinism if some flexibility is acceptable and beneficial. Balance consistency needs with performance, cost, and the desired level of generative capability.

Conclusion

While LLMs possess an inherently stochastic nature, building deterministic systems is achievable and often necessary for reliable, production-grade AI applications. It requires a multi-faceted approach: disciplined prompt engineering, careful control over sampling parameters (especially temperature and seed), leveraging structured output mechanisms like JSON mode or function calling, robust caching strategies based on request hashing, and continuous evaluation through regression testing.

By understanding the sources of randomness and applying these techniques systematically, you can effectively “tame the dice roll.” This allows you to harness the power of LLMs for structured tasks in areas like automation, data processing, and rule-based reasoning, building systems that are not only intelligent but also predictable and dependable. The journey towards deterministic LLMs is one of careful engineering, continuous monitoring, and a clear understanding of the tradeoffs involved.

Stay tuned for potential future deep dives into specific areas like advanced caching techniques, deterministic agent architectures, or evaluation strategies.
