
Exploring Temperature as a Diagnostic Mechanism for LLM Prompts

13,000+ API calls across three models reveal that temperature doesn't create instability in LLM prompts — it reveals it, turning a misunderstood parameter into a structural diagnostic.

Damien Healy

In materials science, engineers don't just test whether a structure holds under normal conditions. They heat it. They cool it. They vibrate it. They apply controlled stress to find the weaknesses that don't show up under nominal load. The structure either holds or it doesn't. The point of failure tells you something important about the design.

We took the same idea and applied it to LLM prompts.


Every LLM has a temperature parameter. It's usually described as a creativity knob. Low temperature, deterministic output. High temperature, more variety. Most practitioners set it somewhere between 0 and 1 and don't think about it much beyond that.

But temperature does something more interesting than control creativity. It controls how willing the model is to deviate from its highest-probability response. At temperature 0, the model commits to whatever it considers most likely. At higher temperatures, it samples more broadly across its probability distribution. If the model is confident, temperature barely matters. If the model is uncertain, temperature amplifies that uncertainty into visible output variation.
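A minimal sketch of that mechanism, assuming the standard temperature-scaled softmax (the logit values here are made up for illustration, not taken from any model):

```python
import math

def sample_probs(logits, temperature):
    """Convert raw logits to sampling probabilities at a given temperature.

    Assumes temperature > 0; temperature 0 corresponds to greedy argmax.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# A confident model: one logit clearly dominates.
confident = [5.0, 1.0, 0.5]
# An uncertain model: the top two logits are nearly tied.
uncertain = [2.1, 2.0, 0.5]

for t in (0.2, 1.0, 2.0):
    print(t,
          round(sample_probs(confident, t)[0], 3),
          round(sample_probs(uncertain, t)[0], 3))
```

Raising the temperature barely moves the confident distribution's top token, but it rapidly erodes the uncertain one, which is exactly why output variation under heat is a proxy for latent uncertainty.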

That's the insight. Temperature doesn't create instability. It reveals it.


We built a test harness and ran 13,000+ API calls across three models (Claude Sonnet 4, Gemini 3 Pro, and GPT-5.2) to test a simple hypothesis: if you run the same prompt at increasing temperatures and the output flips, the prompt has a problem.

The experimental setup was straightforward. Sixteen synthetic prompts across four flaw types: contradictory instructions, ambiguous specifications, conflicting context, and instruction-context mismatch. Four severity levels for each, from clean (no flaw) to severe. Nine temperature levels from 0 to 2.0. Thirty runs per temperature per prompt per model. Three output formats tested: binary (YES/NO), binary with chain-of-thought reasoning, and scalar confidence (1-10).
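For a single output format, that grid multiplies out to roughly the reported call count. A sketch of the enumeration (the intermediate temperature values are assumptions, since the text only names some of the nine levels, and the three output formats multiply the total further):

```python
from itertools import product

# Study design as described: 4 flaw types x 4 severity levels = 16 prompts,
# nine temperatures from 0 to 2.0, three models, 30 runs per condition.
flaw_types = ["contradictory", "ambiguous", "conflicting_context", "mismatch"]
severities = ["clean", "L1", "L2", "L3"]
temperatures = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.6, 2.0]  # some values assumed
models = ["claude-sonnet-4", "gemini-3-pro", "gpt-5.2"]
runs_per_condition = 30

conditions = list(product(flaw_types, severities, temperatures, models))
total_calls = len(conditions) * runs_per_condition
print(total_calls)  # 16 * 9 * 3 * 30 = 12,960 for one output format
```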

We measured flip rate: the percentage of runs at a given temperature that produced a different answer from the most common response. A clean prompt should produce 0% flip rate everywhere. A flawed prompt should start flipping as temperature rises.
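The metric as defined is mechanical to compute; a minimal sketch:

```python
from collections import Counter

def flip_rate(answers):
    """Fraction of runs that differ from the modal answer.

    `answers` is the list of outputs from repeated runs at one temperature,
    e.g. ["YES", "NO", "YES", ...].
    """
    if not answers:
        return 0.0
    _, modal_count = Counter(answers).most_common(1)[0]
    return 1.0 - modal_count / len(answers)

print(flip_rate(["YES"] * 30))                        # stable prompt: 0.0
print(round(flip_rate(["YES"] * 21 + ["NO"] * 9), 2))  # 9 of 30 flipped: 0.3
```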


The clean prompts held. Zero false positives. Every control prompt, all three models, all formats, all temperatures. 0% flip rate. That's the first thing you need from a diagnostic: it doesn't flag things that aren't broken.

The flawed prompts told a more interesting story. Each model detected different things.

Claude detected ambiguous specifications. At temperature 0, AS-L3 (a maximally ambiguous prompt) returned a stable NO. By temperature 0.2, 29% of runs flipped. By temperature 1.2, 47% flipped. The prompt looked fine at rest. Under thermal stress, it fell apart.

Gemini detected instruction contradictions. On CI-L1 (a subtly contradictory prompt), Gemini showed 10% instability even at temperature 0, rising to 50% by temperature 0.8. Claude saw 0% on the same prompt at every temperature. The same contradiction, processed by a different architecture, produced completely different diagnostic outcomes.

GPT-5.2 added a third dimension. It was the most broadly sensitive model, detecting instability on four separate prompts. But the striking finding was that GPT-5.2 showed instability before we even raised the temperature. On CI-L3 (severe contradiction), 50% of runs at temperature 0 returned the minority answer. Not temperature 0.4. Not temperature 0.2. Temperature 0. The model was flipping a coin under greedy sampling conditions.

This changes the interpretation. For Claude and Gemini, temperature amplifies latent uncertainty into visible output variation. For GPT-5.2, the uncertainty is already visible at baseline. The flaw in the prompt is severe enough that the model's top-token probabilities are genuinely split, producing inconsistent outputs even without stochastic sampling.

Each model detected flaws the others missed. That's the finding.

The table below shows peak flip rate (the highest observed across all nine temperature levels) for every prompt that produced substantial signal on at least one model. Clean controls and zero-signal prompts are omitted. All controls were 0% everywhere on all three models.

| Prompt | Flaw Type | Claude Sonnet | Gemini 3 Pro | GPT-5.2 |
| --- | --- | --- | --- | --- |
| AS-L3 (severe ambiguity) | Ambiguous Specification | 47% | 0% | 27% |
| CI-L1 (subtle contradiction) | Contradictory Instructions | 0% | 50% | 23% |
| CI-L3 (severe contradiction) | Contradictory Instructions | 23% | 17% | 50% |
| CC-L1 (subtle conflict) | Conflicting Context | 3% | 17% | 37% |

Each model has a distinct sensitivity fingerprint. Claude's strongest signal is on ambiguous specifications. Gemini's is on instruction contradictions. GPT-5.2 detects the broadest range of flaws and shows the most extreme instability on severe contradictions. The only prompt all three models flag strongly is CI-L3, and even there the profiles differ: Claude starts flipping at temperature 0.4, Gemini at temperature 0.2, and GPT-5.2 is already at 50% before temperature is applied.


We tested eight models in total to see whether temperature sensitivity is a general phenomenon. It is not. Five of the eight (DeepSeek V3, Llama 4 Maverick, Gemini 3 Flash, GPT-4o, Mistral Large) produced a 0% flip rate on every prompt at every temperature. Completely deterministic. No diagnostic signal at all.

Only Claude, Gemini 3 Pro, and GPT-5.2 showed the kind of temperature-sensitive uncertainty that makes this diagnostic work. This has a practical implication: model selection matters. Not every model is usable as a diagnostic probe. The models that produce signal appear to be larger, more capable models that maintain richer probability distributions over their outputs rather than collapsing to a single high-confidence answer.


This wasn't what we expected. The hypothesis assumed models would show similar sensitivity patterns, maybe at different magnitudes. Instead, they have completely different blind spots.

Claude's safety training resolves contradictory instructions deterministically. When a prompt says "be generous" and also "reject anyone who doesn't meet criteria," Claude picks the conservative interpretation every time. The contradiction exists but the model has a trained default for handling it. No decisional conflict, no flip.

Gemini doesn't have the same conservative default. The contradiction creates genuine uncertainty, and temperature amplifies it into visible instability. But Gemini handles ambiguous specifications differently. When a prompt is vague enough to be meaningless, Gemini grabs whatever tokens look like an answer and commits with full confidence. Claude recognises the ambiguity and wavers.

GPT-5.2 lacks the strong priors that mask flaws in the other two models. It doesn't default to conservative rejection like Claude. It doesn't default to literal interpretation like Gemini. The result is broader sensitivity but also more baseline instability. Whether this makes it a better or worse diagnostic probe depends on the use case: it catches more flaws, but it also requires careful calibration to distinguish signal from noise.

The practical implication is clear. A single-model diagnostic has blind spots. A three-model panel catches what no single model catches alone. The pattern of which model flips on which prompt is itself diagnostic of the flaw type.
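One way the fingerprint idea could be operationalised is a simple decision rule over per-model peak flip rates. This is a sketch under loud assumptions: the thresholds and the dominance heuristic are illustrative, not calibrated, and a production version would need to be fit against a much larger prompt set.

```python
def classify_flaw(claude, gemini, gpt):
    """Map a (Claude, Gemini, GPT-5.2) peak flip-rate fingerprint to a
    likely flaw type. Illustrative heuristic only, not from the study."""
    if claude == gemini == gpt == 0.0:
        return "clean"
    if claude >= max(gemini, gpt):
        return "ambiguous specification"      # Claude-dominant signal
    if gemini >= max(claude, gpt):
        return "contradictory instructions"   # Gemini-dominant signal
    return "conflicting context or severe contradiction"  # GPT-dominant

# The fingerprints from the table above land where you'd expect:
print(classify_flaw(0.47, 0.00, 0.27))  # AS-L3
print(classify_flaw(0.00, 0.50, 0.23))  # CI-L1
print(classify_flaw(0.23, 0.17, 0.50))  # CI-L3
```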


We also tested what happens when you ask the model to explain its reasoning before answering.

The diagnostic signal disappeared completely. 0% flip rate across all temperatures, all prompts. Chain-of-thought reasoning suppresses the signal entirely.

This is genuinely interesting. The reasoning process acts as an internal stabilisation mechanism. The model thinks through the ambiguity, resolves it in the reasoning step, and then commits to a deterministic answer. The ambiguity is still there in the prompt. The model just works through it before responding. The weakness is still present in the structure. The model has simply braced against it before you can observe the failure.

The materials science analogy holds here too. If you give a structural beam time to redistribute load before you measure deflection, you won't see the weakness. The test needs to be fast and forced. Binary. YES or NO. No time to think.


We tested a third format: scalar confidence ratings, 1 to 10.

This produced something the binary format can't. Three separate signals instead of one. First, the absolute confidence level at temperature 0. Claude rated the ambiguous prompt 5 out of 10. Dead-centre uncertainty. It rated the clean version 10 out of 10. Gemini rated the same ambiguous prompt 10 out of 10 (it doesn't recognise the ambiguity) but gave the contradictory prompt 1 out of 10 (it recognises the conflict). Those gaps alone are diagnostic.

Second, the temperature at which variance first appears. Clean prompts: never. Ambiguous prompts: around temperature 1.0. That's a stress threshold, the point where the structure starts to give.

Third, the magnitude of the variance at high temperature. Clean prompts: zero. Ambiguous prompts: standard deviation of 1.69, range of 1 to 5 at temperature 2.0.
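Extracting those three signals from raw run data is mechanical; a sketch, where the ratings are illustrative values shaped like the ambiguous-prompt result rather than the actual dataset:

```python
import statistics

def scalar_signals(ratings_by_temp):
    """Extract the three scalar-confidence signals from repeated runs.

    `ratings_by_temp` maps temperature -> list of 1-10 confidence ratings.
    Returns (mean rating at the lowest temperature, first temperature at
    which variance appears, standard deviation at the highest temperature).
    """
    temps = sorted(ratings_by_temp)
    baseline = statistics.mean(ratings_by_temp[temps[0]])
    onset = next((t for t in temps if len(set(ratings_by_temp[t])) > 1), None)
    high_sd = statistics.pstdev(ratings_by_temp[temps[-1]])
    return baseline, onset, high_sd

# Illustrative runs, not the study's data.
runs = {
    0.0: [5] * 10,
    1.0: [5, 5, 4, 5, 6, 5, 5, 4, 5, 5],
    2.0: [5, 3, 1, 4, 5, 2, 5, 3, 4, 2],
}
baseline, onset, high_sd = scalar_signals(runs)
```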

The scalar format is less sensitive than binary for raw flip detection (20% versus 47% at the same temperature). But it tells you more about what's wrong.


One finding we didn't predict at all. The diagnostic signal doesn't track the logical severity of the flaw. It tracks how close the decision sits to a tipping point.

Our most severe instruction-contradiction prompt says, in effect, "you must say YES" and "you must say NO." Explicit, maximal contradiction. Claude and Gemini both show modest instability (17-23%). Our subtlest contradiction prompt has a borderline candidate (4.5 years experience versus 5 required, expired certification) combined with conflicting evaluation criteria. Gemini's flip rate: 50%. GPT-5.2: 23%.

But GPT-5.2 flips that pattern. It treats the severe contradiction as a coin flip (50% at temperature 0) while showing a lower 23% on the subtle one. The severe prompt genuinely splits GPT-5.2's decision because it lacks the conservative default that Claude uses to resolve naked contradictions.

This reframes the whole diagnostic. We're not measuring how broken the prompt is. We're measuring how much the brokenness matters to the outcome, given the model's priors. A prompt can be maximally contradictory and still produce deterministic output if the model has a default for handling that type of contradiction. The diagnostic works best where the flaw creates genuine outcome uncertainty, where the model is truly torn.


We also found something revealing about model defaults. The maximally ambiguous prompt (essentially "check this thing, is it yes?") produced different answers from all three models. Claude said NO with 5/10 confidence. Gemini said YES with 10/10 confidence. GPT-5.2 said YES but flipped 17% of the time at temperature 0.

Same prompt. Three answers. Three confidence levels. Three different relationships with the ambiguity.

Claude's position: there isn't enough information to confirm, so I'll default to rejection. I know I'm guessing. Gemini's position: the prompt contains the word "yes," so yes. I'm certain. GPT-5.2's position: probably yes, but I'm not sure enough to commit consistently.

This isn't really about temperature diagnostics. It's about the hidden priors baked into model architectures. But the temperature-gradient methodology is what surfaced it. At temperature 0, all models look like they're giving confident answers. It's only when you apply thermal stress (or in GPT-5.2's case, just repeat the question) that you see the uncertainty underneath.


This is early work. Three models. Synthetic prompts. One prompt per flaw scenario. Thirty runs per condition. The signal is clear and consistent, but the prompt set is small enough that individual findings could be specific to the prompt wording rather than the flaw type.

The obvious next steps: more prompts per scenario, more models, redesigned severity levels that target the decision boundary rather than logical contradiction. Real-world prompts from production deployments, not just synthetic ones.

And one question that the data has already collected the material to answer, but we haven't analysed yet: when chain-of-thought reasoning suppresses the binary flip, do the reasoning traces themselves diverge? The model always says NO. But does it think differently about why? If the reasoning is stable at low temperature and divergent at high temperature, even while the answer stays the same, that's potentially a more sensitive diagnostic than the binary flip rate. The instability might be there. It might just be hidden one layer deeper.
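A crude first pass at that analysis could score lexical divergence between traces collected at the same temperature. This is a sketch only: a token-overlap proxy standing in for whatever semantic measure (embeddings, entailment) a real analysis would use.

```python
def trace_divergence(traces):
    """Mean pairwise Jaccard distance between reasoning traces.

    Traces that reason identically score ~0; traces that reason in
    completely different terms score closer to 1. Lexical proxy only.
    """
    token_sets = [set(t.lower().split()) for t in traces]
    dists = []
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            union = token_sets[i] | token_sets[j]
            inter = token_sets[i] & token_sets[j]
            dists.append(1 - len(inter) / len(union) if union else 0.0)
    return sum(dists) / len(dists) if dists else 0.0
```

If this number is near zero at temperature 0 and climbs with temperature even while the final answer stays NO, the instability really is hiding one layer deeper.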

The raw data for that analysis is sitting in our result files. That's where this goes next.


Temperature is not a creativity dial. It's a stress test. Apply it systematically and it tells you where your prompts are weakest, which models are masking flaws, and where the decision boundary actually sits. Zero false positives on clean prompts, clear signal on flawed ones, complementary detection across three model architectures, and a dataset of 13,000+ observations large enough to trust.

The methodology needs more validation before it's a production tool. But the principle is sound: controlled perturbation reveals structural weakness. Materials scientists have known this for decades. It works on prompts too.
