Large language models are non-deterministic: running the same prompt twice produces different responses. For brands tracking AI visibility, this means single-run measurements are unreliable. Robust measurement requires multiple runs and statistical aggregation.
Why LLM responses vary
LLMs generate responses by sampling from a probability distribution at each token. Unless the temperature is set to zero (and sometimes even then), the same prompt can produce subtly or substantially different outputs. Three sources of variation matter for visibility measurement.
Sampling temperature. Most production LLMs run at a temperature greater than zero, which introduces randomness into token selection. Each run produces a slightly different response: different word choices, a different brand order, sometimes different brands altogether. This is the primary source of run-to-run variance.
Live retrieval variance. Search-enabled LLMs, such as Perplexity or ChatGPT in search mode, pull live web results that change between runs. A new article published between two runs can shift which brands appear. This makes live-retrieval platforms inherently more volatile than responses based purely on training data.
Model updates. AI providers update models, sometimes silently, which shifts response patterns. A model update on ChatGPT can change which brands get mentioned for the same query, not because of anything the brand did, but because the model’s weights or retrieval logic changed.
What this means for brand visibility measurement
A brand mentioned in 4 out of 5 ChatGPT runs has an estimated mention rate of around 80%. A brand mentioned in 1 out of 5 runs has an estimated mention rate of around 20%. Run each prompt only once and either brand could register as 100% or 0%. Single-run snapshots produce volatile, misleading measurements. The relationship between sample size and confidence is predictable.

| Number of runs | Confidence level | When to use |
|---|---|---|
| 1 | Very low — single snapshot, high variance | Quick spot checks, never for reporting |
| 3 | Low — directional only | Initial baseline, internal exploration |
| 5-7 | Moderate — usable for trend detection | Weekly tracking, gap identification |
| 10+ | High — stable for benchmarking | Monthly reporting, competitive analysis |
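One way to make the link between run count and confidence concrete is a confidence interval on the observed mention rate. The sketch below uses the Wilson score interval, a standard choice for small samples. It is an illustration only: the function name is hypothetical, and it assumes runs are independent, which live retrieval and model updates can violate.

```python
import math

def wilson_interval(mentions: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a mention rate (z = 1.96)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p_hat = mentions / runs
    denom = 1 + z**2 / runs
    center = (p_hat + z**2 / (2 * runs)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / runs + z**2 / (4 * runs**2)
    )
    # Clamp to the valid range for a rate
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Same observed rate (80%) at different run counts
for mentions, runs in [(4, 5), (8, 10), (16, 20)]:
    low, high = wilson_interval(mentions, runs)
    print(f"{mentions}/{runs} runs: {low:.0%} to {high:.0%}")
```

With 4 mentions in 5 runs, the interval spans roughly 38% to 96%. It narrows as runs are added, which is the pattern the table summarizes.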
How Cited handles non-determinism
Cited accounts for non-determinism through multiple runs and statistical aggregation:
- Each prompt is run multiple times per platform per pipeline cycle
- Mention rate, citation rate, and average position are aggregated across runs before reporting
- Single-run noise is smoothed out by averaging across the run set
- Confidence indicators in the dashboard reflect sample size and variance
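The aggregation steps listed above could look roughly like the following. This is a minimal sketch, not Cited's actual pipeline: `RunResult` and `aggregate` are hypothetical names, and defining average position as the mean rank over runs where the brand was mentioned is an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class RunResult:
    """One run of one prompt on one platform (hypothetical record shape)."""
    mentioned: bool
    cited: bool
    position: Optional[int] = None  # rank of the brand in the response, if mentioned

def aggregate(runs: list[RunResult]) -> dict:
    """Collapse a set of runs into per-prompt metrics."""
    if not runs:
        raise ValueError("need at least one run")
    # Assumption: position is averaged only over runs that mention the brand
    positions = [r.position for r in runs if r.mentioned and r.position is not None]
    return {
        "runs": len(runs),
        "mention_rate": sum(r.mentioned for r in runs) / len(runs),
        "citation_rate": sum(r.cited for r in runs) / len(runs),
        "avg_position": mean(positions) if positions else None,
    }

runs = [
    RunResult(True, True, 1),
    RunResult(True, False, 3),
    RunResult(False, False),
    RunResult(True, True, 2),
    RunResult(True, False, 2),
]
print(aggregate(runs))
```

Reporting the run count alongside each rate keeps the sample size visible, so a rate from 3 runs is not read with the same weight as one from 15.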
Why prompt variations also matter
Beyond sampling variance, the exact wording of a prompt affects results. “Best CRM for small business” and “top CRM software for small companies” return overlapping but not identical brand sets. Prompt variation is itself a form of non-determinism in measurement.
Cited handles this by tracking multiple variations of the same intent type within a brand’s prompt library. The dashboard reports both the union (any variation mentions you) and the intersection (all variations mention you) for stronger confidence in the signal.
Practical implications for brands
Do not over-react to single-day movement. A brand that drops from 32% to 20% mention rate in one day is most likely seeing run variance, not a real shift. Wait for 5-7 days of data before treating a trend as real.
Average across multiple runs for any reported metric. Internal stakeholder reports based on single-run snapshots lead to wrong decisions. If a team member asks “why did our mention rate drop,” the first question should be “over what time window and how many runs?”
Trust statistical aggregates over anecdotes. “I asked ChatGPT yesterday and you were not mentioned” is one sample. The dashboard’s aggregated metrics, computed across multiple runs and multiple prompt variations, are the reliable signal.
Related concepts
- Mention rate — the primary metric affected by non-determinism
- Data freshness and statistical confidence — how confidence is quantified
- Refresh cadence and pipeline schedule — how often measurements are repeated
- Benchmarks methodology — how the Cited Index handles variance at scale
Frequently asked questions
Can I just set the temperature to zero to get deterministic results?
Some platforms allow this, but it does not solve the problem. Even at temperature zero, search-enabled platforms produce variance because the underlying retrieval changes between runs. Beyond that, deterministic responses do not represent real customer experience: customers see responses sampled at a temperature greater than zero, so measuring the deterministic version misrepresents what users actually see.
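As a toy illustration of what the temperature setting changes, the sketch below samples one "brand" from a set of made-up scores. It is not how any specific provider implements decoding: the brand names, scores, and function name are all hypothetical, and real systems can still vary at temperature zero for the reasons given above.

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float,
                            rng: random.Random) -> str:
    """Pick one option from temperature-scaled softmax probabilities."""
    if temperature <= 0:
        # Greedy choice: always the highest-scoring option
        return max(logits, key=logits.get)
    top = max(logits.values())
    # Subtract the max before exponentiating for numerical stability
    weights = [math.exp((v - top) / temperature) for v in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]

logits = {"BrandA": 2.0, "BrandB": 1.5, "BrandC": 0.5}  # made-up scores
rng = random.Random(0)

greedy = {sample_with_temperature(logits, 0.0, rng) for _ in range(20)}
sampled = [sample_with_temperature(logits, 1.0, rng) for _ in range(20)]
print("temperature 0:", greedy)
print("temperature 1:", sampled)
```

At temperature zero the toy sampler returns the top-scoring option every time. At temperature one the lower-scoring options also appear, which mirrors the run-to-run variation users actually experience.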
How many runs is enough for reliable measurement?
It depends on the brand and query. For category-leading brands with consistently high mention rates, 5-7 runs produce stable measurements. For brands near the visibility threshold (mention rate 10-30%), 10-15 runs are typically needed to distinguish real movement from noise. For specific high-stakes prompts, more runs reduce variance further.
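A rough way to see why brands near the threshold need more runs is to treat each run as an independent yes/no draw with a fixed true mention rate. That binomial model is an assumption, not a description of any platform, and the function names are hypothetical.

```python
import math

def prob_no_mentions(true_rate: float, runs: int) -> float:
    """Chance that a brand with this true mention rate appears in none of the runs."""
    return (1 - true_rate) ** runs

def standard_error(true_rate: float, runs: int) -> float:
    """Standard error of the observed mention rate under a binomial model."""
    return math.sqrt(true_rate * (1 - true_rate) / runs)

for true_rate in (0.2, 0.9):
    for runs in (5, 10, 15):
        print(
            f"true rate {true_rate:.0%}, {runs} runs: "
            f"P(no mentions) = {prob_no_mentions(true_rate, runs):.3f}, "
            f"standard error = {standard_error(true_rate, runs):.3f}"
        )
```

Under these assumptions, a brand with a true 20% mention rate appears in none of 5 runs about a third of the time, and in none of 15 runs less than 4% of the time.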
Does non-determinism mean GEO is fundamentally unmeasurable?
No. It means single-snapshot measurement is unreliable. With sufficient runs and proper aggregation, AI visibility is measurable to a useful degree of confidence. The same statistical rigor that makes survey research and A/B testing trustworthy applies here — sample size, aggregation, and variance estimation produce reliable metrics.
How should I think about week-over-week changes in my dashboard?
Look at trends over 2-3 week windows, not single-week movements. Real shifts in AI visibility happen as training data refreshes, editorial coverage propagates, and content gets indexed — these effects unfold over weeks, not days. Single-week deltas are dominated by run variance and should not drive decisions.
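For readers who export weekly mention rates, a trailing average over a 3-week window is one simple way to apply this advice. The sketch below assumes a plain list of weekly rates. The numbers are made up, and `trailing_mean` is a hypothetical helper, not a dashboard feature.

```python
from statistics import mean

def trailing_mean(values: list[float], window: int = 3) -> list[float]:
    """Mean of each full trailing window; the first window - 1 points are skipped."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]

weekly_rates = [0.30, 0.24, 0.33, 0.27, 0.31]  # made-up weekly mention rates
print(trailing_mean(weekly_rates))
```

The smoothed series moves far less than the raw weekly values, which makes a sustained shift easier to separate from run variance.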