Abstract
We report a previously unquantified determinism gap between two major large language model families deployed for regulatory compliance assessment. In controlled experiments using identical inputs, identical prompts, and identical configurations, an instruction-following reasoning model (OpenAI GPT-5.2/Codex) produced different results on 16.7% of assessments between runs, while a constitution-informed model (Anthropic Claude Opus 4.6) produced different results on only 3.4% — a 4.9x difference in inter-run flip rate, and hence in run-to-run stability.
This finding has immediate practical consequences. With a 16.7% flip rate, a reported accuracy of 76% on a compliance assessment task represents a single sample from a distribution spanning approximately 73-79%. Accuracy comparisons between configurations at deltas smaller than 3 percentage points are statistically indistinguishable from noise. For practitioners optimising prompt engineering through iterative A/B testing, this means that many perceived improvements — and regressions — may be measurement artefacts rather than real effects.
Most significantly, this non-determinism is now a permanent architectural property. Current-generation OpenAI reasoning models no longer expose temperature control, meaning there is no available configuration that produces deterministic output. The question facing practitioners is no longer "what is the optimal output for this input?" but "what is the distribution of outputs for this input?"
1. Introduction
A foundational assumption in applied LLM engineering is that identical inputs with identical prompts produce identical — or near-identical — outputs. This assumption underpins iterative prompt optimisation, A/B testing between configurations, accuracy benchmarking, and production monitoring. If the assumption holds, a practitioner can make a prompt change, measure the accuracy delta, and attribute the change in performance to the prompt modification.
We discovered, through a controlled variance experiment within a larger research programme, that this assumption fails substantially on at least one major model family — and that the magnitude of failure is large enough to invalidate common prompt engineering practices.
Why This Matters Now
Prior generations of OpenAI models supported temperature=0, which produced deterministic output for a given input. This allowed practitioners to establish a stable baseline, make a change, and confidently attribute any accuracy difference to that change. The prompt engineering workflow that most organisations have developed — iterative refinement measured against a fixed test set — depends on this determinism.
Current-generation OpenAI reasoning models (including GPT-5.2/Codex with reasoning capabilities) no longer expose temperature as a configurable parameter. The model's reasoning process introduces inherent variability that cannot be eliminated through configuration. This is not a bug or a temporary limitation — it reflects a fundamental architectural change in how reasoning-capable models generate output.
The practical question is: how much variability? Without quantifying the magnitude of non-determinism, practitioners cannot know whether their accuracy measurements are signal or noise.
2. Experimental Design
Context
The variance experiment was conducted as part of a broader research programme investigating multi-model prompt architectures for regulatory compliance assessment. The assessment task involves evaluating recorded interactions against approximately 30 compliance criteria organised into 11 assessment chains, producing binary pass/fail judgments that are compared against a human expert benchmark.
Method
We re-ran two model configurations on identical inputs:
Configuration A: OpenAI GPT-5.2/Codex with production-tuned prescriptive prompts (refined across nine iterative versions). Reasoning set to high. Temperature control unavailable.
Configuration B: Anthropic Claude Opus 4.6 with principle-based prompts (designed through model-native self-referential authoring). Temperature set to 0.
Both configurations were run on the same frozen test corpus of 14 representative interactions, producing 96 assessable fields on the instruction-following model and 117 on the constitution-informed model. Results were compared field-by-field against each model's own prior run on the same inputs with the same prompts.
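The field-by-field comparison described above can be sketched as follows. This is a minimal illustration, not the actual experimental harness; the field names and run data are invented for demonstration.

```python
# Sketch of a field-by-field flip-rate comparison between two runs.
# Field identifiers and pass/fail values below are illustrative only.

def flip_rate(run1: dict[str, bool], run2: dict[str, bool]) -> float:
    """Fraction of shared assessable fields whose binary pass/fail
    judgment changed between two runs on identical inputs and prompts."""
    shared = run1.keys() & run2.keys()
    flips = sum(1 for field in shared if run1[field] != run2[field])
    return flips / len(shared)

# Illustrative data: six fields, one flip (field f3)
run1 = {"f1": True, "f2": False, "f3": True, "f4": True, "f5": False, "f6": True}
run2 = {"f1": True, "f2": False, "f3": False, "f4": True, "f5": False, "f6": True}
print(round(flip_rate(run1, run2), 3))  # 0.167
```

The same comparison, applied to 96 and 117 real assessable fields respectively, yields the 16.7% and 3.4% flip rates reported in Section 3.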
Controls
- Identical transcripts (frozen — no re-transcription between runs)
- Identical prompts (no modifications between runs)
- Identical model versions and configuration parameters
- Identical assessment criteria and human benchmark
- The only variable was the inherent non-determinism of each model's inference process
3. Results
Headline
| Model | Assessments | Fields Changed Between Runs | Flip Rate | Error Distribution |
|---|---|---|---|---|
| GPT-5.2/Codex (instruction-following) | 96 | 16 | 16.7% | Shifted between runs |
| Claude Opus 4.6 (constitution-informed) | 117 | 4 | 3.4% | Identical between runs |
| Ratio | — | — | 4.9x | — |
Detail: Instruction-Following Model (GPT-5.2/Codex)
Sixteen of 96 field-level assessments (16.7%) produced a different result on the second run despite identical inputs and prompts. The flips were distributed across multiple assessment chains and field types — not concentrated in a single problematic area. The error distribution (ratio of false positives to false negatives) also shifted between runs, meaning not only individual assessments but the model's overall bias profile was unstable.
On a per-field basis, accuracy measurements varied by up to 27 percentage points between runs on the same field (3 assessment flips on a field with 11 data points). Fields that appeared to regress catastrophically in one configuration were subsequently shown to be within normal variance when re-measured.
Detail: Constitution-Informed Model (Claude Opus 4.6)
Four of 117 field-level assessments (3.4%) produced a different result on the second run. The four flips were offsetting (two fields flipped from incorrect to correct, two from correct to incorrect), producing identical aggregate accuracy across runs. Most notably, the error distribution was perfectly stable: 11 false positives and 7 false negatives in both Run 1 and Run 2. The model's overall bias profile was reproducible.
Variance Impact on Accuracy Measurement
With a 16.7% flip rate on n=96 assessments, we estimate the instruction-following model's accuracy measurements have a variance band of approximately +/-2-3 percentage points. This means:
- A reported accuracy of 76.2% represents one sample from an estimated distribution of roughly 73-79%
- An accuracy improvement of +1.5 percentage points (e.g., from 76.2% to 77.7%) cannot be attributed to a prompt change — it is within the variance floor
- Even a +2.5 point improvement is marginal — it may be real, but a single measurement cannot confirm it
For the constitution-informed model, the 3.4% flip rate produces a much tighter variance band of approximately +/-1 percentage point, making accuracy comparisons at 1.5+ point deltas meaningful.
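A back-of-envelope translation from flip rate to variance band can be sketched as below. It assumes each between-run flip is equally likely to gain or lose one correct field, which is consistent with the roughly stable aggregate accuracy observed above, but is a simplifying assumption rather than a property the experiment established.

```python
# Back-of-envelope variance band from a measured flip rate.
# Assumption: each flip is a +/-1 coin toss on the count of correct fields,
# so the run-to-run accuracy delta has std sqrt(n_flips) / n_fields, and a
# single run's accuracy has std smaller by a factor of sqrt(2).
import math

def single_run_std(n_fields: int, flip_rate: float) -> float:
    n_flips = n_fields * flip_rate
    std_of_delta = math.sqrt(n_flips) / n_fields  # std of run-to-run accuracy difference
    return std_of_delta / math.sqrt(2)            # std of one run's accuracy

print(f"{single_run_std(96, 0.167):.2%}")   # instruction-following model: ~3 points
print(f"{single_run_std(117, 0.034):.2%}")  # constitution-informed model: ~1 point
```

Under this assumption the model with a 16.7% flip rate has a single-run standard deviation of roughly 3 percentage points and the 3.4% model roughly 1 point, matching the variance bands quoted above.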
4. Implications
4.1 Iterative Prompt Engineering on Non-Deterministic Models May Be Largely Ineffective
The standard prompt engineering workflow is: make a change, run the test set, compare accuracy, keep or revert. If the measurement noise exceeds the improvement being measured, this workflow degenerates into random walk optimisation — the practitioner is effectively selecting between noise samples rather than genuine improvements.
With a 3 percentage point variance floor, only substantial prompt changes (those producing >3 point improvements) can be reliably detected in a single run. Incremental refinements, the kind that made up the nine-version prompt iteration cycle described in Section 2, fall below the noise floor; gains accumulated through such refinements may reflect random selection bias rather than genuine improvement.
Mitigation: Multi-run averaging (minimum 3-5 runs per configuration) can reduce the effective variance floor. Alternatively, substantially larger test corpora reduce the per-run variance by increasing the denominator.
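The effect of multi-run averaging follows from the standard error of the mean: averaging k independent runs shrinks the variance floor by a factor of sqrt(k). A minimal sketch, using the ~3 point single-run standard deviation estimated above:

```python
# Averaging k independent runs reduces the effective variance floor
# by sqrt(k) (standard error of the mean).
import math

def averaged_floor(single_run_std: float, n_runs: int) -> float:
    return single_run_std / math.sqrt(n_runs)

for k in (1, 3, 5):
    print(k, f"{averaged_floor(0.03, k):.2%}")
```

With 5 runs the ~3 point floor drops to roughly 1.3 points, bringing incremental prompt changes back within detectable range.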
4.2 Benchmark Comparisons Between Models May Be Misleading
When comparing Model A (76.2% accuracy) against Model B (76.5% accuracy), the natural conclusion is that Model B is slightly better. But if Model A has a +/-3 point variance band while Model B has a +/-1 point variance band, the comparison is between "somewhere in 73-79%" and "approximately 75.5-77.5%". The apparent 0.3 point advantage is meaningless.
More subtly: if Model A was measured once and happened to land at the high end of its distribution (78%), it would appear 1.5 points better than Model B — when in reality Model B may be consistently superior. Single-run benchmarks on non-deterministic models are unreliable for ranking.
Mitigation: Report accuracy with confidence intervals derived from multi-run measurement. At minimum, report the number of runs and whether variance was measured.
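One way to report multi-run accuracy with an interval is a simple normal-approximation confidence interval over the per-run accuracies. The run accuracies below are illustrative, not measured values:

```python
# Report multi-run mean accuracy with a normal-approximation interval.
# The five run accuracies below are illustrative only.
import statistics

def report(run_accuracies: list[float], z: float = 1.96) -> str:
    mean = statistics.mean(run_accuracies)
    half_width = z * statistics.stdev(run_accuracies) / len(run_accuracies) ** 0.5
    return f"{mean:.1%} ± {half_width:.1%} (n={len(run_accuracies)} runs)"

print(report([0.762, 0.781, 0.744, 0.770, 0.755]))
```

Reporting "76.2% ± 1.2% (n=5 runs)" rather than a bare "76.2%" makes the comparison between configurations honest about its own noise floor.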
4.3 Production Monitoring Requires Variance-Aware Alerting
If a production system is expected to maintain 76% accuracy and the model has a 16.7% flip rate, daily accuracy will naturally fluctuate between approximately 73% and 79% without any change in the system. Alert thresholds set at the development-time accuracy level will produce false alarms. Conversely, genuine regressions smaller than the variance floor will go undetected.
Mitigation: Establish the variance floor through multi-run measurement during development. Set alert thresholds at the edge of the measured variance band, not at the point estimate. Monitor trends across multiple runs rather than reacting to individual measurements.
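A minimal sketch of variance-aware alerting, under the assumption that a single excursion below the band is plausibly noise while a sustained run of excursions is not. The class name, thresholds, and window size are illustrative choices, not a prescribed design:

```python
# Sketch of variance-aware alerting: flag single excursions below the
# measured variance band as "watch", and escalate to "alert" only when
# every observation in a sliding window falls below the band.
from collections import deque

class VarianceAwareMonitor:
    def __init__(self, baseline: float, band: float, window: int = 5):
        self.lower_bound = baseline - band   # edge of measured variance band
        self.recent = deque(maxlen=window)

    def observe(self, accuracy: float) -> str:
        self.recent.append(accuracy)
        if len(self.recent) == self.recent.maxlen and all(
            a < self.lower_bound for a in self.recent
        ):
            return "alert"   # sustained drop: likely a genuine regression
        if accuracy < self.lower_bound:
            return "watch"   # single excursion: plausibly inherent variance
        return "ok"

monitor = VarianceAwareMonitor(baseline=0.76, band=0.03)
print(monitor.observe(0.74))  # within the band -> ok
print(monitor.observe(0.72))  # single excursion -> watch
```

The design choice here mirrors the mitigation above: the threshold sits at the edge of the measured band, and escalation is driven by a trend across runs rather than any individual measurement.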
4.4 Determinism Is an Independent Selection Criterion
For applications where auditability and consistency matter — regulatory compliance, quality assurance, legal review, medical assessment — a model that produces the same result on the same input is materially more valuable than one that produces a marginally higher average accuracy but varies between runs.
Consider the audit scenario: a regulator asks why the same interaction was assessed differently on two occasions. "The model is non-deterministic" is a difficult answer to defend. "The model produces consistent results at temperature=0" is substantially more defensible.
The determinism gap we measured (4.9x) is large enough to be a primary selection criterion, not a secondary consideration. For high-stakes assessment tasks, we argue that consistent 75% accuracy is preferable to non-deterministic accuracy that averages 76% but ranges from 73% to 79%.
4.5 Non-Determinism Is Now Permanent on Some Model Families
This is the most consequential implication. Prior to the current generation of reasoning models, non-determinism was a choice — practitioners could set temperature=0 and get deterministic output. The variance was opt-in.
Current-generation OpenAI reasoning models do not expose temperature control. The variance is inherent and cannot be configured away. This is not a temporary limitation that will be resolved in a future update — it reflects the architectural reality of how chain-of-thought reasoning introduces variability into the generation process.
Practitioners must design their systems, their measurement methodologies, and their production monitoring around the assumption that identical inputs will not produce identical outputs from these models. Any workflow that assumes determinism — including most iterative prompt engineering practices — needs to be redesigned for a non-deterministic world.
5. Limitations
Sample size. Our test corpus comprised 14 interactions producing 96-117 field-level assessments per model. While sufficient to identify the magnitude of the variance effect, larger corpora would provide tighter confidence intervals on the flip rate estimates.
Two model families. We compared one OpenAI model against one Anthropic model. Other instruction-following models (Google Gemini, open-source alternatives) and other constitution-informed models may exhibit different variance profiles. The specific 4.9x ratio should not be generalised without further measurement.
Single task type. The assessments were binary compliance judgments. Generative tasks (summarisation, coding, creative writing) may exhibit different variance characteristics. The flip rate on structured assessment tasks may not predict the variance on open-ended generation.
Two runs per model. Our flip rate is measured from a single pair of runs per model. Additional runs would provide more robust estimates and potentially reveal whether the flip rate itself is stable or variable.
Temporal validity. These measurements reflect model versions available in February 2026. Future model updates may change the variance characteristics of either family. The qualitative finding (reasoning models are inherently non-deterministic) is likely durable; the specific magnitudes are not.
6. Conclusion
We have quantified a 4.9x determinism gap between two major LLM families on a structured compliance assessment task. The instruction-following reasoning model (GPT-5.2/Codex) exhibited a 16.7% flip rate on identical inputs, while the constitution-informed model (Claude Opus 4.6) exhibited a 3.4% flip rate.
This gap is large enough to undermine standard prompt engineering practices, invalidate single-run accuracy benchmarks, and introduce unpredictable variance into production systems. Most significantly, the non-determinism on the instruction-following reasoning model is now a permanent architectural property — there is no configuration option to eliminate it.
We urge practitioners deploying LLMs for high-stakes assessment tasks to:
- Measure their variance floor before trusting any accuracy comparison
- Use multi-run averaging for prompt engineering iteration
- Set production alert thresholds above the measured variance band
- Treat determinism as a selection criterion alongside accuracy, particularly for regulated or auditable contexts
The era of assuming that the same prompt on the same input produces the same output is over — at least for reasoning-capable models. Prompt engineering methodology, benchmarking practices, and production monitoring all need to adapt.
This research was conducted by Qanara Research as part of an ongoing R&D programme investigating AI-driven compliance assessment optimisation. For enquiries, contact [email protected].
