Non-determinism in LLM responses

Large language models are non-deterministic: running the same prompt twice can produce different responses. For brands tracking AI visibility, this means single-run measurements are unreliable. Robust measurement requires multiple runs and statistical aggregation.

Why LLM responses vary

LLMs generate responses by sampling from a probability distribution at each token. Unless the temperature is set to zero (and sometimes even then), the same prompt can produce subtly or substantially different outputs. Three sources of variation matter for visibility measurement.

Sampling temperature. Most production LLMs run at a temperature greater than zero, which introduces randomness into token selection. Each run produces a slightly different response: different word choices, different brand order, sometimes different brands mentioned altogether. This is the primary source of run-to-run variance.

Live retrieval variance. Search-enabled LLMs, such as Perplexity or ChatGPT in search mode, pull live web results that change between runs. A new article published between two runs can shift which brands appear. This makes live-retrieval platforms inherently more volatile than responses based purely on training data.

Model updates. AI providers update models, sometimes silently, which shifts response patterns. A model update on ChatGPT can change which brands get mentioned for the same query, not because of anything the brand did, but because the model's weights or retrieval logic changed.
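To make the sampling step concrete, here is a minimal sketch of temperature-scaled token selection. The brand names and scores are made up for illustration, and the function is a simplified stand-in rather than any provider's actual decoding code.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick one token from a {token: score} dict at the given temperature."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(logits, key=logits.get)
    # Divide scores by the temperature, then apply a numerically stable softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Hypothetical next-token scores for three invented brand names.
logits = {"BrandA": 2.0, "BrandB": 1.5, "BrandC": 0.5}
rng = random.Random(0)

greedy = {sample_token(logits, 0, rng) for _ in range(20)}     # a single brand
sampled = [sample_token(logits, 1.0, rng) for _ in range(20)]  # may vary per draw
```

At temperature zero the toy sampler always returns the top-scoring brand, while at temperature 1.0 each draw can land on any of the three in proportion to its softmax weight. Real systems may still vary at temperature zero for reasons this sketch does not model.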

What this means for brand visibility measurement

A brand mentioned in 4 out of 5 ChatGPT runs has an estimated mention rate around 80%. A brand mentioned in 1 out of 5 runs has an estimated mention rate around 20%. Run each prompt only once, though, and every brand scores either 100% or 0%, so the two brands can look identical or even reversed. Single-run snapshots produce volatile, misleading measurements. Confidence grows predictably with the number of runs.

Number of runs	Confidence level	When to use
1	Very low — single snapshot, high variance	Quick spot checks, never for reporting
3	Low — directional only	Initial baseline, internal exploration
5-7	Moderate — usable for trend detection	Weekly tracking, gap identification
10+	High — stable for benchmarking	Monthly reporting, competitive analysis
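One way to see why more runs earn more confidence is to put an interval around the observed mention rate. The sketch below uses the Wilson score interval, a standard choice for proportions from small samples. It is an illustration of the statistics, not a description of how Cited computes its confidence indicators.

```python
import math

def wilson_interval(mentions, runs, z=1.96):
    """Approximate 95% Wilson score interval for a mention rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = mentions / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Roughly the same observed rate, measured with increasing numbers of runs.
for mentions, runs in [(1, 1), (4, 5), (8, 10), (40, 50)]:
    low, high = wilson_interval(mentions, runs)
    print(f"{mentions}/{runs} runs: {low:.0%} to {high:.0%}")
```

With 4 mentions in 5 runs the interval spans roughly 38% to 96%, which is consistent with almost any story about the brand. The interval tightens as runs are added, which is the pattern the table summarizes.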

Any AI visibility metric reported from a single run should be treated with skepticism. A CEO who asks ChatGPT one question and concludes “we are not in AI search” has collected one data point — not evidence. Real measurement requires aggregation.

How Cited handles non-determinism

Cited accounts for non-determinism through multiple runs and statistical aggregation:

Each prompt is run multiple times per platform per pipeline cycle
Mention rate, citation rate, and average position are aggregated across runs before reporting
Single-run noise is smoothed out by averaging across the run set
Confidence indicators in the dashboard reflect sample size and variance

For individual brand dashboards, per-prompt aggregation across multiple runs is what produces reliable per-brand metrics, so single-run noise is smoothed out before it ever reaches your dashboard.
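The aggregation described above can be sketched in a few lines. The `RunResult` record and `aggregate` function are hypothetical names chosen for this example, and the article does not document Cited's internal pipeline, so treat this as an illustration of the idea only.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class RunResult:
    """One run of one prompt on one platform (hypothetical record layout)."""
    mentioned: bool
    cited: bool
    position: Optional[int]  # 1-based rank among mentioned brands, None if absent

def aggregate(runs):
    """Collapse a set of runs into per-prompt metrics."""
    if not runs:
        raise ValueError("need at least one run")
    # Average position only over the runs where the brand actually appeared.
    positions = [r.position for r in runs if r.mentioned and r.position is not None]
    return {
        "runs": len(runs),
        "mention_rate": sum(r.mentioned for r in runs) / len(runs),
        "citation_rate": sum(r.cited for r in runs) / len(runs),
        "avg_position": mean(positions) if positions else None,
    }

runs = [
    RunResult(True, True, 1),
    RunResult(True, False, 3),
    RunResult(False, False, None),
    RunResult(True, True, 2),
    RunResult(True, False, 2),
]
result = aggregate(runs)
```

Keeping the run count alongside the rates lets a reader judge how much weight each number deserves.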

Why prompt variations also matter

Beyond sampling variance, the exact wording of a prompt affects results. “Best CRM for small business” and “top CRM software for small companies” return overlapping but not identical brand sets. Prompt variation is itself a form of non-determinism in measurement.

Cited handles this by tracking multiple variations of the same intent type within a brand’s prompt library. The dashboard reports both the union (any variation mentions you) and the intersection (all variations mention you). Reporting both gives a fuller picture: the union shows where you appear at all, and the intersection shows where you appear consistently.
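The union and intersection views map directly onto set operations. In this sketch the brand names are invented and `union_and_intersection` is a hypothetical helper, not part of any Cited API.

```python
def union_and_intersection(mentions_by_variation):
    """Map of prompt wording -> brands mentioned; returns (union, intersection)."""
    brand_sets = [set(brands) for brands in mentions_by_variation.values()]
    if not brand_sets:
        return set(), set()
    return set().union(*brand_sets), set.intersection(*brand_sets)

# Invented results for two wordings of the same intent.
mentions = {
    "best CRM for small business": {"BrandA", "BrandB", "BrandC"},
    "top CRM software for small companies": {"BrandA", "BrandB", "BrandD"},
}
union, intersection = union_and_intersection(mentions)
```

Here BrandC and BrandD each appear for only one wording, so they are in the union but not the intersection, while BrandA and BrandB appear for both.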

Practical implications for brands

Do not overreact to single-day movement. A brand that drops from a 32% to a 20% mention rate in one day is most likely seeing run variance, not a real shift. Wait for 5-7 days of data before treating a trend as real.

Average across multiple runs for any reported metric. Internal stakeholder reports based on single-run snapshots lead to wrong decisions. If a team member asks “why did our mention rate drop,” the first question should be “over what time window and how many runs?”

Trust statistical aggregates over anecdotes. “I asked ChatGPT yesterday and you were not mentioned” is one sample. The dashboard’s aggregated metrics, computed across multiple runs and multiple prompt variations, are the reliable signal.
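Whether a drop from 32% to 20% is noise depends on how many runs sit behind each number. The sketch below applies a pooled two-proportion z-test, a standard textbook check. The run counts (25 and 250 per day) are assumptions made for the example, and the normal approximation is rough when counts are small.

```python
import math

def two_proportion_p_value(hits_a, runs_a, hits_b, runs_b):
    """Two-sided p-value for the difference between two mention rates."""
    pooled = (hits_a + hits_b) / (runs_a + runs_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / runs_a + 1 / runs_b))
    if se == 0:
        return 1.0  # both samples all-or-nothing and equal: no evidence of change
    z = (hits_a / runs_a - hits_b / runs_b) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

small = two_proportion_p_value(8, 25, 5, 25)      # 32% vs 20%, 25 runs per day
large = two_proportion_p_value(80, 250, 50, 250)  # same rates, 250 runs per day
```

With the assumed 25 runs per day the p-value is well above 0.05, so the drop is indistinguishable from run variance. The same rates backed by 250 runs per day would fall below 0.05, which is why the number of runs is the first thing to ask about.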

Related concepts

Mention rate: the primary metric affected by non-determinism
Reading your data with confidence: when to trust a movement
Refresh cadence: how often measurements are repeated

Frequently asked questions

Can I just set the temperature to zero to get deterministic results?

Some platforms allow this, but it does not solve the problem. Even at temperature zero, search-enabled platforms produce variance because the underlying retrieval changes between runs. Beyond that, deterministic responses do not represent real customer experience: customers see responses sampled at a temperature above zero, not the deterministic version. Measuring deterministic responses misrepresents what users actually see.

How many runs is enough for reliable measurement?

It depends on the brand and the query. For category-leading brands with consistently high mention rates, 5-7 runs produce stable measurements. For brands near the visibility threshold (mention rate 10-30%), 10-15 runs are typically needed to distinguish real movement from noise. For specific high-stakes prompts, more runs reduce variance further.
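The run counts above follow from how the spread of a measured rate depends on the underlying rate and the number of runs. This sketch uses the binomial standard error with illustrative rates of 90% and 20%. The specific numbers are assumptions for the example, and the approximation is coarse at very small run counts.

```python
import math

def standard_error(rate, runs):
    """Binomial standard error of a mention rate measured over `runs` runs."""
    if runs <= 0 or not 0 <= rate <= 1:
        raise ValueError("runs must be positive and rate must be in [0, 1]")
    return math.sqrt(rate * (1 - rate) / runs)

leader_6 = standard_error(0.9, 6)       # category leader, 6 runs
threshold_6 = standard_error(0.2, 6)    # brand near the threshold, 6 runs
threshold_12 = standard_error(0.2, 12)  # same brand, 12 runs
```

At 6 runs the 20% brand has a noticeably wider spread than the 90% brand. Doubling to 12 runs brings it down to a similar level, which matches the guidance that brands near the threshold need more runs.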

Does non-determinism mean GEO is fundamentally unmeasurable?

No. It means single-snapshot measurement is unreliable. With sufficient runs and proper aggregation, AI visibility is measurable to a useful degree of confidence. The same statistical rigor that makes survey research and A/B testing trustworthy applies here — sample size, aggregation, and variance estimation produce reliable metrics.

How should I think about week-over-week changes in my dashboard?

Look at trends over 2-3 week windows, not single-week movements. Real shifts in AI visibility happen as training data refreshes, editorial coverage propagates, and content gets indexed — these effects unfold over weeks, not days. Single-week deltas are dominated by run variance and should not drive decisions.
