Understanding why deterministic output from LLMs is nearly impossible
The Dream of Perfect Reproducibility
If you’re building products that extract structured data from unstructured documents—like we do at Unstract—you’ve probably had this thought: “Why can’t I get the exact same JSON output every time I process the same invoice through my LLM pipeline?” It’s a fair question, and one that keeps many of us up at night.
Here’s the thing: when you’re processing thousands of documents with virtually unlimited format variations—invoices from different vendors, contracts with unique clauses, reports with creative layouts—you need your extraction pipeline to be both flexible enough to handle the chaos AND consistent enough to produce reliable, standardized JSON outputs. It’s like asking a jazz musician to improvise the exact same solo twice. Theoretically possible? Maybe. Practically achievable? Well, that’s where things get interesting.
The allure of deterministic outputs isn’t just about satisfying our inner perfectionist. In production systems, determinism means:
Debuggability: When something goes wrong, you can reproduce it
Testing: Your test suites actually mean something
Compliance: Auditors love it when you can show them the exact same result twice
Caching: Why process the same document twice if you know you’ll get the same result?
But here’s the kicker: even when you set the temperature to 0 (supposedly forcing the model to always pick the highest probability token), your LLM might still surprise you with slightly different outputs.
So, the million-dollar question: why does this happen? Let’s dive into the reasons and, more importantly, what you can actually do about it.
Sidenote
At Unstract, we handle challenges such as non-determinism and hallucinations with an LLM-as-a-Judge implementation we call LLMChallenge. For the purposes of this article, though, we’ll explore the very question of why LLMs exhibit this often troublesome (but sometimes very useful) non-determinism.
The Auto-Regressive Nature of LLMs: Why Every Token Matters
Before we get into the nitty-gritty of non-determinism, let’s talk about how LLMs actually generate text. Understanding this is crucial because it explains why tiny variations can cascade into completely different outputs.
LLMs are auto-regressive models—they predict the next token based on all the tokens that came before. It’s like building a house of cards where each card’s position depends on every card below it. Here’s what happens when an LLM generates a response:
Initial Context: The model takes your prompt and encodes it
First Token: It predicts the probability distribution over all possible next tokens and selects one
Cascade Effect: That first token becomes part of the context for predicting the second token
Rinse and Repeat: Each new token influences all subsequent predictions
This is why a tiny variation in the third token can lead to a completely different sentence by the tenth token. If your model generates “The invoice total is” vs “The total amount is”, everything that follows might change. In the context of JSON extraction, this could mean the difference between:
{"invoice_total": 1500.00}
and
{"total_amount": 1500.00}
Same information, different structure—and suddenly your downstream systems are throwing errors.
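To see why the cascade matters, here’s a minimal sketch of greedy autoregressive decoding using Hugging Face Transformers (gpt2 and the prompt are illustrative stand-ins, not what production pipelines run): each chosen token is appended to the context before the next prediction, so one different pick early on changes every prediction that follows.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The invoice total is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                   # generate 10 tokens greedily
        logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()  # greedy: highest-probability token
        # The chosen token becomes part of the context for the next step,
        # so any change here cascades into every later prediction.
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))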
Temperature = 0: Not the Determinism Guarantee You Think It Is
The “temperature” parameter on LLM completion APIs determines how “creative” the model’s response will be. In essence, when the model picks the next token of its response, temperature controls how much randomness goes into selecting it from the set of candidate tokens. The “creativity” comes from the randomness of the path the response takes as each new token is added.
You might think you can simply set the temperature to 0 and get perfectly predictable output every single time. Unfortunately, even with temperature set to 0, LLM responses can still be non-deterministic, thanks to several technical factors most of us don’t think about until they bite us in production. It’s important to understand why.
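To make that concrete, here’s a toy sketch (plain NumPy, made-up logits, no real model involved) of how temperature reshapes the next-token distribution: the logits are divided by the temperature before the softmax, so lower temperatures sharpen the distribution, and temperature 0 degenerates into plain argmax.

import numpy as np

def next_token_distribution(logits, temperature):
    """Toy illustration: temperature rescales logits before the softmax."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        # Greedy decoding: all probability mass on the single best token.
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.5, 0.5]                      # raw scores for three candidate tokens
print(next_token_distribution(logits, 1.0))   # ~[0.55, 0.33, 0.12] -- real choice between tokens
print(next_token_distribution(logits, 0.2))   # ~[0.92, 0.08, 0.00] -- much sharper
print(next_token_distribution(logits, 0))     # [1.0, 0.0, 0.0]     -- pure argmax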
Floating-Point Arithmetic: The Silent Culprit
GPUs (and even CPUs) use floating‑point math that isn’t associative: (a+b)+c can differ from a+(b+c) by a few ULPs (units in the last place). In massively parallel kernels, reductions and accumulations happen in different orders across runs (threads finish in different sequences, kernels get fused differently, etc.), so the final logits can shift ever so slightly. That’s the real source of variation—not two tokens having literally identical probabilities and a random tie-breaker.
Most of the time greedy decoding still picks the same top token; the tiny numeric drift only matters when two candidates are extremely close. But when that happens, one run might tip toward “amount” while another picks “total”, and the divergence cascades from there.
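You don’t need a GPU to see the underlying effect; here’s a quick sketch in plain Python/NumPy (the tiny arrays stand in for what parallel reduction kernels do at much larger scale):

import numpy as np

# Floating-point addition is not associative: grouping changes the result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 -- the 1.0 is lost before it can survive the sum

# The same thing happens when a reduction is accumulated in a different
# order, which is exactly what parallel GPU kernels may do from run to run.
vals = np.array([1e8, -1e8, 1.0], dtype=np.float32)
print(vals.sum())        # 1.0
print(vals[::-1].sum())  # 0.0 -- same numbers, summed in a different order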
The Hardware Lottery
Different GPUs, CPU architectures, or even the same hardware under different conditions (temperature, load) can produce slightly different calculations. Modern inference systems often distribute computations across multiple devices, and the order of operations or how results are aggregated can vary.
This isn’t just theoretical. If you’re running inference on a cluster of A100s, the exact same prompt might take a slightly different path through the hardware depending on which GPUs are available and how loaded they are.
Batch Processing Blues
LLMs typically process multiple sequences in batches for efficiency. The batch configuration (size, padding patterns, memory layout) can affect the numerical computation path, introducing slight variations through different floating-point rounding. The actual content of the other sequences in the batch should not influence your output, thanks to proper attention masking, but your innocent JSON extraction request can still take a different low-level rounding path depending on what it happens to be batched with. With greedy decoding this rarely flips the chosen token, but “rarely” is not “never”.
Software Stack Shenanigans
Libraries like PyTorch or TensorFlow may use non-deterministic algorithms by default for performance reasons. Operations like matrix multiplication or reduction operations might execute in different orders, producing mathematically equivalent but numerically slightly different results.
Some specific culprits:
Atomic operations on GPUs
Parallel reduction operations
Dynamic kernel selection based on input sizes
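If you control the stack yourself (say, self-hosting with PyTorch), you can trade speed for reproducibility. Here’s a sketch of the usual knobs, assuming a reasonably recent PyTorch build; expect a throughput hit, and note that none of this helps with a hosted API.

import os
import random

import numpy as np
import torch

# Required by cuBLAS for deterministic matrix multiplications on CUDA;
# must be set before the first CUDA operation runs.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Seed every PRNG in sight (Python, NumPy, and all torch devices).
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

# Fail loudly if an op has no deterministic implementation, and stop cuDNN
# from auto-tuning (benchmarking can pick different kernels on each run).
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True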
The State of Determinism in Today’s LLM Landscape
So, today’s major LLM services don’t prioritize determinism at temp=0. But do any of them actually guarantee it? Let’s look at the current landscape.
OpenAI makes no such guarantee: it exposes a seed parameter and a system_fingerprint to help with reproducibility, but describes the behavior as best effort (more on seeds below). Anthropic takes a similar stance. The Claude API can produce slightly different outputs across calls, even with identical inputs and temp=0. No promises, no guarantees. Their documentation notes: “Note that even with a temperature of 0.0, the results will not be fully deterministic.”
Google (Vertex AI/Gemini) also embraces non-determinism. Their documentation acknowledges that identical requests may produce different results. It reads: “A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.”
AWS Bedrock, since it hosts multiple models, inherits the determinism characteristics (or lack thereof) from each model provider.
Why Nobody’s Fixing This
The reasons are surprisingly pragmatic:
Performance Tradeoffs: Deterministic operations are often 2-5x slower than their optimized non-deterministic counterparts. When you’re serving millions of requests, that matters.
Infrastructure Complexity: Ensuring identical hardware/software states across distributed systems is like herding cats—if the cats were quantum particles that exist in multiple states simultaneously.
Limited Demand: Think about this: most production use cases can tolerate minor variations in output. The folks who really need determinism are a minority, relatively speaking.
Model Updates: Providers regularly update models and infrastructure. Even if they could guarantee determinism today, tomorrow’s model update would break it.
Workarounds for the Desperate
If you really need more deterministic behavior:
Dedicated instances or private deployments give you more control over the environment
Running models locally with fixed seeds and deterministic settings gives you the most control
Log probabilities can help you detect when outputs might vary, i.e. when the top tokens have very similar probabilities (see the sketch below)
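Here’s a minimal sketch of the log-probability idea using the OpenAI Python SDK (the model name and the margin threshold are illustrative choices, not recommendations): request the top candidates per position and flag any spot where the top two are nearly tied, since those are the positions where run-to-run drift is most likely to flip the output.

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",   # illustrative model name
    messages=[{"role": "user", "content": "Extract the invoice total as JSON."}],
    temperature=0,
    logprobs=True,
    top_logprobs=2,   # we only need the top two candidates per position
)

NEAR_TIE_MARGIN = 0.1  # illustrative threshold, in log-probability units
for pos in resp.choices[0].logprobs.content:
    top = pos.top_logprobs
    if len(top) >= 2 and (top[0].logprob - top[1].logprob) < NEAR_TIE_MARGIN:
        # This position could plausibly flip between runs.
        print(f"near tie at {pos.token!r}: {top[0].token!r} vs {top[1].token!r}")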
The industry consensus seems to be that perfect determinism isn’t worth the performance and complexity costs for most applications. If you need reproducible outputs, the best approach is usually to design your application to be robust to minor variations rather than expecting bit-perfect determinism.
Seeds: Helpful, But Far From a Silver Bullet
Fixed seeds are great for controlling randomness in traditional code, but their power during LLM inference is narrow.
What seeds actually influence
At inference time they only matter when you introduce randomness:
Token sampling (temperature > 0, top‑k, top‑p/nucleus, beam search with stochastic tie breaks).
Any other explicit PRNG calls in your decoding pipeline (rare in prod).
With temperature = 0 (greedy decode), there’s no sampling step—so the seed is effectively ignored.
What seeds cannot fix
Seeds do nothing about:
Floating‑point drift from different reduction/accumulation orders.
Batch/padding layout differences that alter numeric paths.
Hence, you can set seed=1234 and still see slight changes run‑to‑run.
# Example: seed helps only if randomness is used
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],          # your prompt goes here
    temperature=0.7,         # <-- sampling, so the seed matters
    top_p=0.9,
    seed=1234,
)

# With temperature=0, the seed won't rescue you from infra or numeric drift:
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],          # your prompt goes here
    temperature=0,
    seed=1234,               # largely irrelevant here
)
The “best effort” reality of hosted APIs
Providers like OpenAI expose a seed to improve reproducibility, but they explicitly call it best effort. Model weight refreshes, kernel upgrades, or request routing mean you still shouldn’t expect bit‑perfect repeats.
Bottom line: Use seeds to tame sampling randomness. For everything else, design your system to tolerate tiny deviations—or run fully controlled, deterministic stacks locally.
Best Practices for JSON Extraction in a Non-Deterministic World
So we’ve established that perfect determinism is a pipe dream. But we still need to extract structured JSON from wildly varying documents. Here’s how to build robust systems that work with, not against, the non-deterministic nature of LLMs.
1. Embrace Structured Output Modes
Most modern LLM APIs now offer structured output features that, in theory at least, guarantee valid JSON, even if the content might vary slightly.
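For instance, here’s a minimal sketch using OpenAI’s Structured Outputs via the response_format parameter (the model name and schema are illustrative; Anthropic’s tool use and Gemini’s response schemas offer equivalents):

from openai import OpenAI

client = OpenAI()

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_total", "currency"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",   # illustrative; needs a model that supports structured outputs
    messages=[{"role": "user", "content": "Extract the total from this invoice: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "strict": True, "schema": invoice_schema},
    },
)
print(resp.choices[0].message.content)   # JSON constrained to the schema above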
This doesn’t solve non-determinism, but it does ensure you won’t get malformed JSON that crashes your parser at 3 AM.
One thing to note, however: OpenAI’s JSON schema mode, Anthropic’s tool use, Gemini’s function calling, and the like greatly raise the odds of syntactically valid JSON, but they don’t provide a hard guarantee: models can still emit invalid output if, for example, they hit the token limit.
2. Schema Design: Be Explicit, Be Comprehensive
Your schema is your contract. Make it bulletproof:
schema = {
    "type": "object",
    "properties": {
        "customer_name": {
            "type": "string",
            "description": "Full name of the customer - extract exactly as written"
        },
        "order_date": {
            "type": "string",
            "format": "date",
            "description": "Date in YYYY-MM-DD format - convert any date format to this"
        },
        "items": {
            "type": "array",
            "description": "All line items in the order",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "quantity": {"type": "integer", "minimum": 1},
                    "price": {"type": "number", "minimum": 0}
                },
                "required": ["name", "quantity", "price"]
            }
        }
    },
    "required": ["customer_name", "order_date", "items"],
    "additionalProperties": False  # Prevent random fields
}
3. Prompt Engineering: Show, Don’t Just Tell
Few-shot examples with edge cases are your friend:
prompt = """
Extract order information into this exact JSON structure.
Important: Always use these exact field names, never create variations.
Examples showing exact format required:
Input: "John bought 2 apples for $3 each on Monday"
Output: {"customer_name": "John", "order_date": "2024-11-18", "items": [{"name": "apples", "quantity": 2, "price": 3.0}]}
Input: "Yesterday Sarah purchased one dozen eggs ($4.99) and milk"
Output: {"customer_name": "Sarah", "order_date": "2024-11-17", "items": [{"name": "eggs", "quantity": 12, "price": 4.99}, {"name": "milk", "quantity": 1, "price": null}]}
Critical rules:
- Convert relative dates to YYYY-MM-DD
- Use null for missing values, never omit fields
- Convert "dozen" to 12, "pair" to 2, etc.
"""
4. Validation and Retry Logic
Accept that you might need multiple attempts:
import json

from jsonschema import ValidationError, validate

# llm_call_with_schema, is_business_logic_valid, and fallback_parser are
# placeholders for your own LLM wrapper, business rules, and rule-based parser.

def extract_json_with_fallbacks(input_text, schema, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = llm_call_with_schema(input_text, schema)
            parsed = json.loads(response)
            validate(parsed, schema)
            # Additional business logic validation
            if not is_business_logic_valid(parsed):
                raise ValueError("Business logic validation failed")
            return parsed
        except (json.JSONDecodeError, ValidationError, ValueError) as e:
            if attempt < max_retries - 1:
                # Provide specific feedback for retry
                input_text = (
                    f"{input_text}\n\nPrevious attempt failed: {e}\n"
                    "Please correct and try again."
                )
            else:
                # Final fallback: use deterministic regex/rule-based extraction
                return fallback_parser(input_text)
5. Build for Variance, Not Against It
Instead of fighting non-determinism, design your system to handle variations gracefully:
# Normalize outputs post-extraction
def normalize_extraction(raw_output):
    # Handle field name variations
    field_mappings = {
        "total_amount": "invoice_total",
        "amount": "invoice_total",
        "sum": "invoice_total",
        "customer": "customer_name",
        "client": "customer_name",
    }
    normalized = {}
    for key, value in raw_output.items():
        normalized_key = field_mappings.get(key, key)
        normalized[normalized_key] = value
    return normalized
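For example, two runs that disagree on field names collapse to the same shape after normalization (the values here are made up):

run_1 = {"invoice_total": 1500.00, "customer": "Acme Corp"}       # hypothetical output A
run_2 = {"total_amount": 1500.00, "customer_name": "Acme Corp"}   # hypothetical output B

# Both runs normalize to the same canonical structure.
assert normalize_extraction(run_1) == normalize_extraction(run_2)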
6. Monitor and Learn
Track variations to understand patterns:
def log_extraction_variance(doc_id, extraction_attempts):
    # Compare multiple extraction attempts. The metric helpers and the
    # alerting hook below are placeholders for your own implementations.
    variance_metrics = {
        "field_consistency": calculate_field_consistency(extraction_attempts),
        "value_variations": find_value_differences(extraction_attempts),
        "structural_changes": detect_structural_variations(extraction_attempts),
    }
    # Use this data to improve prompts and schemas
    if variance_metrics["field_consistency"] < 0.95:
        alert_engineering_team("High variance detected", doc_id, variance_metrics)
The Path Forward: Embracing Controlled Chaos
Here’s the truth: perfect determinism in LLM outputs is like a perfectly spherical cow in physics—a useful theoretical construct that doesn’t exist in reality. But that’s okay. We’ve built incredible systems on top of non-deterministic foundations before. The internet itself runs on protocols that embrace packet loss and out-of-order delivery.
For those of us building products like Unstract that need to extract structured data from the chaos of real-world documents, the key is to:
Accept non-determinism as a fact of life, not a bug to be fixed
Design systems that are robust to variations
Use determinism where it matters (like in your business logic layer)
Leverage the flexibility that non-determinism provides
Remember, the same non-determinism that makes testing frustrating is what allows LLMs to handle the infinite variety of real-world documents. It’s not a bug; it’s a feature—you just need to design around it.
The next time your JSON extraction pipeline produces slightly different output for the same invoice, don’t panic. Take a deep breath, implement proper validation, add some retry logic, and remember: we’re asking these models to do something pretty magical. A little variation in the output is a small price to pay for the ability to turn any document into structured data.
After all, if we wanted perfect determinism, we’d still be writing regex patterns. And nobody wants to go back to that dark timeline.
A taste of the challenges we solve at Unstract
While you can build agents that do many trivial things, truly useful agents are the ones that free up humans. Building agents and workflows that automate work previously only humans could do is never easy, though. We’ve also seen agents come to a screeching halt when complex communication and documents are involved, so human involvement continues. With Unstract, however, we’re able to map complex documents and communication (e.g. email conversations) into a standardized JSON schema, consistently and accurately, even though the variants approach infinity. We leverage some of the techniques mentioned in this article, and more, to fight the non-determinism inherent in Large Language Models.
Shuveb Hussain is the Co-founder and CEO of Unstract. Previously, he served as VP of Engineering at Freshworks, a NASDAQ-listed global SaaS company. With over two decades of experience, Shuveb has co-founded multiple internet startups and worked with companies operating at massive scale—handling petabytes of data and billions of requests per hour.