Large language models are non-deterministic: running the same prompt twice can produce different responses. For brands tracking AI visibility, this means single-run measurements are unreliable. Robust measurement requires multiple runs and statistical aggregation.

Why LLM responses vary

LLMs generate responses by sampling from a probability distribution at each token. Unless the temperature is set to zero (and sometimes even then), the same prompt can produce subtly or substantially different outputs. Three sources of variation matter for visibility measurement.

Sampling temperature. Most production LLMs run at a temperature greater than zero, which introduces randomness into token selection. Each run produces a slightly different response: different word choices, a different brand order, sometimes different brands altogether. This is the primary source of run-to-run variance.

Live retrieval variance. Search-enabled LLMs, such as Perplexity or ChatGPT in search mode, pull live web results that change between runs. A new article published between two runs can shift which brands appear. This makes live-retrieval platforms inherently more volatile than responses based purely on training data.

Model updates. AI providers update models, sometimes silently, which shifts response patterns. A model update on ChatGPT can change which brands get mentioned for the same query, not because of anything the brand did, but because the model's weights or retrieval logic changed.
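To make the effect of sampling temperature concrete, here is a minimal sketch of temperature-scaled sampling. It is a toy model, not how any specific provider implements decoding: the brand names and scores are hypothetical, and a real model samples over tokens rather than whole brand names.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy_pick for temperature zero")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_brand(brands, logits, temperature, rng):
    """Draw one brand at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(brands, weights=probs, k=1)[0]

def greedy_pick(brands, logits):
    """Temperature-zero behaviour: always take the highest-scoring option."""
    return brands[logits.index(max(logits))]

# Hypothetical scores for which brand a model names first.
brands = ["BrandA", "BrandB", "BrandC"]
logits = [2.0, 1.5, 0.5]

rng = random.Random(0)  # seeded so the simulation is repeatable
picks = [sample_brand(brands, logits, temperature=1.0, rng=rng) for _ in range(1000)]
counts = {b: picks.count(b) for b in brands}
print(counts)
```

At temperature 1.0 the lower-scoring brands still appear in a meaningful share of runs, which is why one run says little about the underlying rate. Greedy selection removes this source of variance but not the retrieval or model-update sources described above.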

What this means for brand visibility measurement

A brand mentioned in 4 out of 5 ChatGPT runs has an estimated mention rate of around 80%. A brand mentioned in 1 out of 5 runs has an estimated mention rate of around 20%. Run each prompt only once, however, and either brand registers as 100% or 0%, so a single run cannot tell the two apart. Single-run snapshots produce volatile, misleading measurements. Confidence grows predictably with the number of runs:
| Number of runs | Confidence level | When to use |
| --- | --- | --- |
| 1 | Very low: single snapshot, high variance | Quick spot checks, never for reporting |
| 3 | Low: directional only | Initial baseline, internal exploration |
| 5-7 | Moderate: usable for trend detection | Weekly tracking, gap identification |
| 10+ | High: stable for benchmarking | Monthly reporting, competitive analysis |
Any AI visibility metric reported from a single run should be treated with skepticism. A CEO who asks ChatGPT one question and concludes “we are not in AI search” has collected one data point — not evidence. Real measurement requires aggregation.
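One way to quantify how confidence grows with run count is a confidence interval on the mention rate. The sketch below uses the Wilson score interval, a standard choice for proportions from small samples. It is an illustration under the assumption that runs are independent, not a description of how Cited computes its confidence indicators.

```python
import math

def wilson_interval(mentions, runs, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p_hat = mentions / runs
    denom = 1 + z**2 / runs
    center = (p_hat + z**2 / (2 * runs)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / runs + z**2 / (4 * runs**2))
    return max(0.0, center - half), min(1.0, center + half)

# Same observed direction, increasing number of runs.
for mentions, runs in [(1, 1), (4, 5), (8, 10), (40, 50)]:
    low, high = wilson_interval(mentions, runs)
    print(f"{mentions}/{runs}: observed {mentions / runs:.0%}, "
          f"95% interval {low:.0%} to {high:.0%}")
```

Under this approximation, 4 mentions in 5 runs is consistent with a true rate anywhere from roughly 38% to 96%, and the interval narrows as runs are added.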

How Cited handles non-determinism

Cited accounts for non-determinism through multiple runs and statistical aggregation:
  • Each prompt is run multiple times per platform per pipeline cycle
  • Mention rate, citation rate, and average position are aggregated across runs before reporting
  • Single-run noise is smoothed out by averaging across the run set
  • Confidence indicators in the dashboard reflect sample size and variance
The Cited Index uses a single run per query but aggregates across 253 brands per category. The per-category percentile distribution is stable even if individual brand-run measurements are noisy, because the sample is large enough to absorb run-level variance.

Individual brand dashboards have far fewer prompts per brand, so per-prompt aggregation across multiple runs matters more there. Cited runs each prompt multiple times to produce reliable per-brand metrics.
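The per-prompt aggregation described above can be sketched as follows. The `RunResult` record and its field names are hypothetical, invented for illustration; they are not Cited's data model or API.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class RunResult:
    """One response to one prompt on one platform (hypothetical record shape)."""
    mentioned: bool
    cited: bool
    position: Optional[int]  # 1-based rank of the brand in the response, None if absent

def aggregate_runs(runs):
    """Collapse a set of runs for one prompt into summary metrics."""
    if not runs:
        raise ValueError("need at least one run")
    # Average position is only defined over runs where the brand appeared.
    positions = [r.position for r in runs if r.mentioned and r.position is not None]
    return {
        "runs": len(runs),
        "mention_rate": sum(r.mentioned for r in runs) / len(runs),
        "citation_rate": sum(r.cited for r in runs) / len(runs),
        "avg_position": mean(positions) if positions else None,
    }

# Five hypothetical runs of the same prompt.
runs = [
    RunResult(mentioned=True, cited=True, position=2),
    RunResult(mentioned=True, cited=False, position=3),
    RunResult(mentioned=False, cited=False, position=None),
    RunResult(mentioned=True, cited=True, position=1),
    RunResult(mentioned=True, cited=False, position=2),
]
summary = aggregate_runs(runs)
print(summary)
```

Reporting the run count alongside each rate lets a reader judge how much weight the number can bear.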

Why prompt variations also matter

Beyond sampling variance, the exact wording of a prompt affects results. “Best CRM for small business” and “top CRM software for small companies” return overlapping but not identical brand sets. Prompt variation is itself a form of non-determinism in measurement. Cited handles this by tracking multiple variations of the same intent type within a brand’s prompt library. The dashboard reports both the union (any variation mentions you) and the intersection (all variations mention you) for stronger confidence in the signal.
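The union and intersection signals can be illustrated with a short sketch. It assumes each prompt variation has already been reduced to the set of brands it mentioned; the prompts echo the example above, while the brand names and the `variation_coverage` helper are hypothetical.

```python
def variation_coverage(brand, mentions_by_variation):
    """Summarise how consistently a brand appears across prompt variations."""
    if not mentions_by_variation:
        raise ValueError("need at least one prompt variation")
    hits = {prompt: brand in brands for prompt, brands in mentions_by_variation.items()}
    return {
        "union": any(hits.values()),         # at least one variation mentions the brand
        "intersection": all(hits.values()),  # every variation mentions the brand
        "share_of_variations": sum(hits.values()) / len(hits),
    }

# Hypothetical brand sets returned for two wordings of the same intent.
mentions_by_variation = {
    "best CRM for small business": {"BrandA", "BrandB", "BrandC"},
    "top CRM software for small companies": {"BrandA", "BrandC", "BrandD"},
}

for brand in ["BrandA", "BrandB", "BrandE"]:
    print(brand, variation_coverage(brand, mentions_by_variation))
```

A brand that passes the intersection test is visible regardless of wording, while one that only passes the union test depends on how the user phrases the question.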

Practical implications for brands

Do not over-react to single-day movement. A brand that drops from 32% to 20% mention rate in one day is most likely seeing run variance, not a real shift. Wait for 5-7 days of data before treating a trend as real.

Average across multiple runs for any reported metric. Internal stakeholder reports based on single-run snapshots lead to wrong decisions. If a team member asks “why did our mention rate drop,” the first question should be “over what time window and how many runs?”

Trust statistical aggregates over anecdotes. “I asked ChatGPT yesterday and you were not mentioned” is one sample. The dashboard’s aggregated metrics, computed across multiple runs and multiple prompt variations, are the reliable signal.
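Whether a one-day drop such as 32% to 20% is distinguishable from run variance depends on how many runs sit behind each figure. The sketch below applies a standard two-proportion z-test; the run counts (25 per day) are an assumption chosen for illustration, and the normal approximation is rough at small counts.

```python
import math

def two_proportion_z_test(hits_a, runs_a, hits_b, runs_b):
    """Two-sided z-test for a difference between two mention rates."""
    p_a = hits_a / runs_a
    p_b = hits_b / runs_b
    pooled = (hits_a + hits_b) / (runs_a + runs_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / runs_a + 1 / runs_b))
    if se == 0:
        return 0.0, 1.0  # both rates are identical at 0% or 100%
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

# Assumed: 25 runs per day, 8 mentions (32%) yesterday and 5 mentions (20%) today.
z, p_value = two_proportion_z_test(8, 25, 5, 25)
print(f"25 runs per day:  z = {z:.2f}, p = {p_value:.2f}")

# The same rates backed by ten times as many runs.
z_big, p_big = two_proportion_z_test(80, 250, 50, 250)
print(f"250 runs per day: z = {z_big:.2f}, p = {p_big:.4f}")
```

With 25 runs per day the drop is well within what chance alone produces, while the same percentages over 250 runs per day would be hard to attribute to noise.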

Frequently asked questions

Can I set the temperature to zero to get deterministic results?

Some platforms allow this, but it does not solve the problem. Even at temperature zero, search-enabled platforms produce variance because the underlying retrieval changes between runs. Beyond that, deterministic responses do not represent real customer experience: customers get responses generated at a temperature greater than zero, not the deterministic version. Measuring deterministic responses misrepresents what users actually see.

How many runs are needed for a reliable measurement?

It depends on the brand and query. For category-leading brands with consistently high mention rates, 5-7 runs produce stable measurements. For brands near the visibility threshold (mention rate 10-30%), 10-15 runs are typically needed to distinguish real movement from noise. For specific high-stakes prompts, more runs reduce variance further.

Does non-determinism mean AI visibility cannot be measured?

No. It means single-snapshot measurement is unreliable. With sufficient runs and proper aggregation, AI visibility is measurable to a useful degree of confidence. The same statistical rigor that makes survey research and A/B testing trustworthy applies here: sample size, aggregation, and variance estimation produce reliable metrics.

How do I tell a real shift from noise?

Look at trends over 2-3 week windows, not single-week movements. Real shifts in AI visibility happen as training data refreshes, editorial coverage propagates, and content gets indexed. These effects unfold over weeks, not days. Single-week deltas are dominated by run variance and should not drive decisions.