What sources LLMs cite

Large language models do not cite all sources equally. They consistently prefer certain types of content: established editorial publications, structured reference sites, and brand-owned content with clear schema markup. Understanding this preference hierarchy is the foundation of citation-focused answer engine optimization (AEO) and generative engine optimization (GEO) work.

The source preference hierarchy

LLMs draw from a broad pool of sources but show consistent preferences. Higher tiers earn citations more frequently and across more query types.

Tier	Source type	Examples	Why LLMs prefer them
1	Established editorial publications	The Verge, TechCrunch, Forbes, Mint, Economic Times	Long history of human-edited, fact-based content
2	Structured reference sites	Wikipedia, government domains, university .edu sites	Encyclopedic format, clear authority signals
3	Industry-specific publications	Stratechery for tech, Business of Fashion for retail	Topical depth, regular publishing cadence
4	Brand-owned content with schema	Product pages, company blogs, documentation	First-party authority IF technically AI-readable
5	Aggregators and review sites	G2, Capterra, Trustpilot, Amazon reviews	Aggregated signal, mixed quality

Sources consistently NOT preferred include forum posts (Reddit is partially indexed but not a preferred citation source), social media posts (rarely cited), pages with heavy JavaScript rendering that crawlers cannot parse, and paywalled content the LLM cannot fully access.
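One rough way to spot the JavaScript-rendering problem is to look at how much readable text a page's raw HTML contains before any script runs. The sketch below is only a heuristic under stated assumptions: the helper names (`visible_text`, `looks_client_rendered`) and the 200-character threshold are illustrative choices, not values taken from any AI platform, and real crawlers may behave differently.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text that is present in the raw HTML, ignoring script and style bodies."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(raw_html):
    """Return the text a non-JavaScript client could read from raw_html."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


def looks_client_rendered(raw_html, min_chars=200):
    """Heuristic: very little text in the raw HTML suggests client-side rendering.

    min_chars is an arbitrary illustrative threshold, not a documented cutoff.
    """
    return len(visible_text(raw_html)) < min_chars


# Hypothetical sample pages for illustration.
SPA_SHELL = (
    '<html><head><title>App</title></head>'
    '<body><div id="root"></div><script>render()</script></body></html>'
)
STATIC_PAGE = (
    "<html><body><h2>How payroll software works</h2><p>"
    + "Payroll software calculates wages. " * 10
    + "</p></body></html>"
)
```

In practice you would pass in the HTML returned by a plain HTTP request for the page (no headless browser). A page flagged here is worth checking by hand before drawing conclusions.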

What makes a source citable

Five characteristics make any source more likely to be cited by AI platforms, regardless of tier.

Structured content. Pages with clear headings, semantic HTML, and defined-term blocks are easier for LLMs to parse and quote. A well-structured page with H2 sections, tables, and bulleted lists outperforms a long, unstructured essay on the same topic.

Author authority. Bylined articles from named experts, published on established domains, carry more citation weight than anonymous or generic content. This is why editorial coverage in known publications (Tier 1–3) outperforms most brand-owned content.

Recency signals. Dated content, regular updates, and explicit “last updated” markers signal freshness. Perplexity in particular weights content under one year old more heavily. Pages with stale dates from 2–3 years ago are systematically cited less than recently updated pages on the same topic.

Topical focus. Pages that answer a specific question tend to outperform broad, multi-topic pages. A dedicated page titled “Best HR Software for Indian Startups” is more citable for that query than a general “HR Software Guide” that covers 20 sub-topics.

AI-readable structure. Schema markup, an llms.txt file, and the absence of anti-bot protections are table stakes. If an AI crawler cannot access and parse your content cleanly, it cannot cite it, regardless of how good the content is.
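Several of these characteristics (a named author, explicit publication and update dates, schema markup) can be expressed together in a schema.org Article object embedded as JSON-LD. The following is a minimal sketch, not a complete markup recipe: the function names, the author name, and the example.com URL are hypothetical, and only a handful of Article properties are shown.

```python
import json
from datetime import date


def article_jsonld(headline, author_name, published, modified, url):
    """Build a schema.org Article description as a plain dict.

    published and modified are datetime.date objects; they are written
    as ISO 8601 strings, which is the format schema.org expects for dates.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
        "mainEntityOfPage": url,
    }


def to_script_tag(data):
    """Wrap a JSON-LD dict in the script element used to embed it in a page."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical values for illustration only.
EXAMPLE = article_jsonld(
    headline="Best HR Software for Indian Startups",
    author_name="Jane Doe",
    published=date(2024, 1, 15),
    modified=date(2025, 3, 1),
    url="https://example.com/best-hr-software-indian-startups",
)
```

Calling `to_script_tag(EXAMPLE)` yields a block that can be placed in the page's HTML. Keeping `dateModified` accurate whenever the page is refreshed is what turns the recency signal into something machine-readable.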

Why brand-owned content underperforms

Most brands have substantial owned content (product pages, blog posts, documentation) but get cited far less often than they expect. Three reasons explain the gap.

Self-promotion penalty. LLMs implicitly down-weight content that reads as marketing rather than reference. Pages that lead with “Why we are the best” get cited less than pages that lead with “Here is how this category works.” The writing voice matters: encyclopedic content earns citations, promotional content does not.

Technical AI readability gaps. Many brand sites have JavaScript-heavy rendering that crawlers cannot fully parse. Even good content is invisible if it cannot be cleanly extracted. Single-page applications, client-side-rendered React apps, and sites behind aggressive bot protection are common offenders.

No schema discipline. Branded sites often lack the structured data (Product, Article, FAQPage, and DefinedTerm schemas) that helps LLMs understand and cite content. A product page without Product schema is harder for an LLM to parse into a citable snippet than one with clean structured data.

Brand-owned content can compete with editorial sources, but it requires deliberate work: write like a reference rather than a brochure, ensure clean HTML and schema markup, and use llms.txt to surface key pages explicitly.
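Crawler access is the easiest of these gaps to check mechanically. The sketch below uses Python's standard `urllib.robotparser` to report which AI crawlers a robots.txt file blocks for a given page. The user-agent tokens listed are the ones these vendors are generally understood to publish, but treat the list as an assumption and verify it against each vendor's current documentation; `blocked_ai_crawlers` and the sample file are hypothetical. Note that robots.txt is only one layer: firewall or CDN bot protection can block crawlers without appearing here.

```python
from urllib.robotparser import RobotFileParser

# Assumed crawler tokens; confirm against each vendor's documentation.
AI_USER_AGENTS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended"]


def blocked_ai_crawlers(robots_txt, page_url, user_agents=AI_USER_AGENTS):
    """Return the user agents that robots_txt forbids from fetching page_url."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines, so no network call is needed.
    parser.parse(robots_txt.splitlines())
    return [ua for ua in user_agents if not parser.can_fetch(ua, page_url)]


# Hypothetical robots.txt that blocks one AI crawler and allows everything else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    print(blocked_ai_crawlers(SAMPLE_ROBOTS, "https://example.com/pricing"))
```

For a live site you would download `/robots.txt` first and pass its text in. An empty list means the file itself does not stand in the way of the listed crawlers.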

Per-platform variations

Source preferences vary by platform, which matters for optimization prioritization.

Perplexity strongly favors editorial publications and structured reference sites. Brand sites are cited, but less frequently than on other platforms. Earning Perplexity citations typically requires third-party editorial coverage in Tier 1–3 sources.

ChatGPT is broader in source selection: it cites a wider mix including aggregators, forums, and brand sites, depending on prompt type. Brand-owned content has a better chance of being cited on ChatGPT than on Perplexity.

Gemini and Google’s AI surfaces (AI Overviews and AI Mode) favor sources Google trusts, with a strong preference for high-domain-rating sites. The overlap with Google Search authority signals is higher here than on other platforms.

Claude rarely surfaces source URLs in standard responses but draws from the same source hierarchy in its training data. Claude’s mentions (without citations) still reflect the underlying source authority: brands covered by Tier 1–3 publications get mentioned more.

Grok layers its X (Twitter) integration on top of the standard source hierarchy, so active social conversation about a brand can surface alongside editorial coverage in ways no other platform replicates.

How to identify which sources cite your category

To know which sources matter for your category, follow this workflow:

1. Run a baseline scan with mention and citation tracking across AI platforms
2. Identify the publications and sites that appear most frequently as cited sources for your category prompts
3. Pursue editorial coverage on those specific publications, since they are the ones AI platforms trust for your topic
4. Measure impact via mention rate change over the next 4–12 weeks as training data refreshes
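The counting parts of this workflow (finding the most-cited domains and comparing mention rate between scans) can be sketched in a few lines. This is an illustration under assumptions: the record shape with `answer` and `cited_urls` keys is hypothetical and not the export format of any particular tracking tool, the brand names are made up, and mention detection here is a simple case-insensitive substring match. It requires Python 3.9 or later for `str.removeprefix`.

```python
from collections import Counter
from urllib.parse import urlparse


def top_cited_domains(scan_results, n=10):
    """Rank domains by the number of AI responses that cite them.

    Each domain is counted once per response, so one answer that cites
    the same site several times does not dominate the ranking.
    """
    counts = Counter()
    for result in scan_results:
        domains = {
            urlparse(url).netloc.lower().removeprefix("www.")
            for url in result["cited_urls"]
        }
        counts.update(domains)
    return counts.most_common(n)


def mention_rate(scan_results, brand):
    """Share of responses whose answer text mentions brand (case-insensitive)."""
    if not scan_results:
        return 0.0
    hits = sum(brand.lower() in result["answer"].lower() for result in scan_results)
    return hits / len(scan_results)


# Hypothetical baseline scan: one record per prompt and platform response.
BASELINE = [
    {
        "answer": "Acme and Globex are popular choices.",
        "cited_urls": [
            "https://www.example.com/a",
            "https://example.com/b",
            "https://en.wikipedia.org/wiki/X",
        ],
    },
    {
        "answer": "Globex leads the category.",
        "cited_urls": ["https://example.com/c"],
    },
]
```

With this sample, `top_cited_domains(BASELINE)` ranks example.com first (cited in both responses) and `mention_rate(BASELINE, "Acme")` is 0.5. Running `mention_rate` on a later scan of the same prompts and subtracting the baseline value gives the change referred to in step 4.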

This is the workflow described in the Win editorial coverage playbook.

Related concepts

Citation rate: measuring how often your domain is cited
Citations vs mentions: the vocabulary distinction
How Perplexity ranks sources: the most citation-heavy platform
Win editorial coverage that LLMs cite: the tactical playbook
Fix your llms.txt and robots.txt: the technical prerequisites

Frequently asked questions

Does Wikipedia really matter that much for AI citations?

Yes, disproportionately. Wikipedia is in the training data of every major LLM and is among the most-cited sources across categories. A Wikipedia page about your brand or category is high-leverage — but Wikipedia has strict notability standards and cannot be created promotionally. The path is to earn enough independent editorial coverage that the brand becomes Wikipedia-worthy.

Will paying for press coverage help my AI citations?

Sponsored content rarely earns AI citations. LLMs are trained on editorial content that has gone through human editorial review. Press releases and sponsored posts on the same publication are typically not cited at the same rate as earned editorial. The work is real PR — not paid placement.

How important is content freshness for citation?

Important and growing. LLM training cycles favor recent content, especially for prompts about evolving topics like pricing, features, and product comparisons. Stale pages with last-updated dates from 2–3 years ago are systematically cited less than recently-updated pages on the same topic. Regular content refreshes pay measurable dividends.

If my brand is in the training data, will it be cited?

Being in training data is necessary but not sufficient. The model also has to choose to cite you over other sources for a given prompt. This depends on relevance, authority, recency, and source type — the same factors that determine whether the model mentions you in the first place. Inclusion in training data sets the floor, not the ceiling.

Should I publish on third-party platforms like Substack or Medium for citation visibility?

Mixed value. Established personal Substacks with editorial credibility can be cited frequently. New or low-traffic Substacks are typically not cited at meaningful rates. Medium has lost citation share over the past two years as LLMs have de-prioritized it. Owning your own domain with strong technical AI readability tends to be more durable than third-party platforms.
