Large language models do not cite all sources equally. They consistently prefer certain types of content: established editorial publications, structured reference sites, and brand-owned content with clear schema markup. Understanding this preference hierarchy is the foundation of citation-focused GEO work.
The source preference hierarchy
LLMs draw from a broad pool of sources but show consistent preferences. Higher tiers earn citations more frequently and across more query types.

| Tier | Source type | Examples | Why LLMs prefer them |
|---|---|---|---|
| 1 | Established editorial publications | The Verge, TechCrunch, Forbes, Mint, Economic Times | Long history of human-edited, fact-based content |
| 2 | Structured reference sites | Wikipedia, government domains, university .edu sites | Encyclopedic format, clear authority signals |
| 3 | Industry-specific publications | Stratechery for tech, Business of Fashion for retail | Topical depth, regular publishing cadence |
| 4 | Brand-owned content with schema | Product pages, company blogs, documentation | First-party authority, but only if the page is technically AI-readable |
| 5 | Aggregators and review sites | G2, Capterra, Trustpilot, Amazon reviews | Aggregated signal, mixed quality |
What makes a source citable
Five characteristics make any source more likely to be cited by AI platforms, regardless of tier.

Structured content. Pages with clear headings, semantic HTML, and defined-term blocks are easier for LLMs to parse and quote. A well-structured page with H2 sections, tables, and bulleted lists outperforms a long, unstructured essay on the same topic.

Author authority. Bylined articles from named experts, published on established domains, carry more citation weight than anonymous or generic content. This is why editorial coverage in known publications (Tier 1-3) outperforms most brand-owned content.

Recency signals. Dated content, regular updates, and explicit “last updated” markers signal freshness. Perplexity in particular weights content under one year old more heavily. Pages with stale dates from 2-3 years ago are systematically cited less than recently-updated pages on the same topic.

Topical focus. Pages that answer a specific question tend to outperform broad, multi-topic pages. A dedicated page titled “Best HR Software for Indian Startups” is more citable for that query than a general “HR Software Guide” that covers 20 sub-topics.

AI-readable structure. Schema markup, llms.txt, and no anti-bot protections are table stakes. If an AI crawler cannot access and parse your content cleanly, it cannot cite it, regardless of how good the content is.
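One part of the access requirement can be checked mechanically: whether a site's robots.txt lets AI crawlers fetch a page at all. The following is a minimal sketch using Python's standard `urllib.robotparser`. The crawler tokens in `AI_CRAWLERS`, the sample robots.txt, and the example URL are assumptions for illustration; confirm the current tokens against each vendor's documentation. It does not detect other anti-bot measures such as challenge pages or rate limiting.

```python
from urllib.robotparser import RobotFileParser

# Assumed crawler tokens; verify each against the vendor's current documentation.
AI_CRAWLERS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended"]


def crawler_access(robots_txt: str, page_url: str) -> dict[str, bool]:
    """Return whether each AI crawler token may fetch page_url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, page_url) for agent in AI_CRAWLERS}


# Hypothetical robots.txt that blocks one AI crawler and allows everything else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    report = crawler_access(SAMPLE_ROBOTS, "https://example.com/docs/pricing")
    for agent, allowed in report.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

To run the same check against a live site, fetch its robots.txt separately and pass the text to `crawler_access`; the sketch parses a string so that it works without network access.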
Why brand-owned content underperforms
Most brands have substantial owned content (product pages, blog posts, documentation) but get cited far less often than they expect. Three reasons explain the gap.

Self-promotion penalty. LLMs implicitly down-weight content that reads as marketing rather than reference. Pages that lead with “Why we are the best” get cited less than pages that lead with “Here is how this category works.” The writing voice matters: encyclopedic content earns citations, promotional content does not.

Technical AI readability gaps. Many brand sites have JavaScript-heavy rendering that crawlers cannot fully parse. Even good content is invisible if it cannot be cleanly extracted. Single-page applications, client-side-rendered React apps, and sites behind aggressive bot protection are common offenders.

No schema discipline. Branded sites often lack the structured data (Product, Article, FAQPage, and DefinedTerm schemas) that helps LLMs understand and cite content. A product page without Product schema is harder for an LLM to parse into a citable snippet than one with clean structured data.

Brand-owned content can compete with editorial sources, but it requires deliberate work: write like a reference, not a brochure; ensure clean HTML and schema markup; and use llms.txt to surface key pages explicitly.

Per-platform variations
Source preferences vary by platform, which matters for optimization prioritization.

Perplexity strongly favors editorial publications and structured reference sites. Brand sites are cited but less frequently than on other platforms. Earning Perplexity citations typically requires third-party editorial coverage in Tier 1-3 sources.

ChatGPT is broader in source selection: it cites a wider mix including aggregators, forums, and brand sites depending on query type. Brand-owned content has a better chance of being cited on ChatGPT than on Perplexity.

Gemini favors sources Google trusts, with strong preference for high-domain-rating sites. The overlap with Google Search authority signals is higher here than on other platforms.

Claude rarely surfaces source URLs in standard responses but draws from the same source hierarchy in its training data. Claude’s mentions (without citations) still reflect the underlying source authority: brands covered by Tier 1-3 publications get mentioned more.

How to identify which sources cite your category
To know which sources matter for your category, follow this workflow:
- Run a baseline scan with mention and citation tracking across AI platforms
- Identify the publications and sites that appear most frequently as cited sources for your category prompts
- Pursue editorial coverage on those specific publications — they are the ones AI platforms trust for your topic
- Measure impact via mention rate change over the next 4-12 weeks as training data refreshes
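The second step of this workflow, finding the sources that appear most often, amounts to a frequency count over cited URLs. The sketch below assumes a hypothetical scan export in which each record holds one AI response and its list of citation URLs; the record shape, the `cited_domains` helper, and the `.example` domains are all illustrative and do not come from any particular tracking tool.

```python
from collections import Counter
from urllib.parse import urlparse


def cited_domains(scan_results: list[dict]) -> Counter:
    """Count how many responses cite each domain at least once."""
    counts = Counter()
    for result in scan_results:
        # Count a domain once per response so that one answer linking the
        # same site several times does not dominate the ranking.
        domains = set()
        for url in result["citations"]:
            host = urlparse(url).netloc.lower()
            domains.add(host.removeprefix("www."))
        counts.update(domains)
    return counts


# Hypothetical scan output: one record per (platform, prompt) response.
SCAN = [
    {"platform": "perplexity", "prompt": "best hr software for indian startups",
     "citations": ["https://www.news.example/hr-tools", "https://reviews.example/hr"]},
    {"platform": "chatgpt", "prompt": "best hr software for indian startups",
     "citations": ["https://news.example/startup-hr", "https://news.example/payroll"]},
    {"platform": "gemini", "prompt": "hr software pricing comparison",
     "citations": ["https://reviews.example/pricing", "https://brand.example/pricing"]},
]

if __name__ == "__main__":
    for domain, responses in cited_domains(SCAN).most_common(3):
        print(domain, responses)
```

Counting once per response rather than once per link is a design choice: it measures how broadly a source is trusted across prompts, which is closer to what the workflow is trying to find than raw link volume.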
Related concepts
- Citation rate — measuring how often your domain is cited
- Citations vs mentions — the vocabulary distinction
- How Perplexity ranks sources — the most citation-heavy platform
- Win editorial coverage that LLMs cite — the tactical playbook
- Fix your llms.txt and robots.txt — the technical prerequisites
Frequently asked questions
Does Wikipedia really matter that much for AI citations?
Yes, disproportionately. Wikipedia is in the training data of every major LLM and is among the most-cited sources across categories. A Wikipedia page about your brand or category is high-leverage, but Wikipedia enforces strict notability standards and a page cannot be created for promotional purposes. The path is to earn enough independent editorial coverage that the brand becomes Wikipedia-worthy.
Will paying for press coverage help my AI citations?
Sponsored content rarely earns AI citations. LLMs are trained on editorial content that has gone through human editorial review. Press releases and sponsored posts on the same publication are typically not cited at the same rate as earned editorial. The work is real PR — not paid placement.
How important is content freshness for citation?
Important and growing. LLM training cycles favor recent content, especially for queries about evolving topics like pricing, features, and product comparisons. Stale pages with last-updated dates from 2-3 years ago are systematically cited less than recently-updated pages on the same topic. Regular content refreshes pay measurable dividends.
If my brand is in the training data, will it be cited?
Being in training data is necessary but not sufficient. The model also has to choose to cite you over other sources for a given query. This depends on relevance, authority, recency, and source type — the same factors that determine whether the model mentions you in the first place. Inclusion in training data sets the floor, not the ceiling.
Should I publish on third-party platforms like Substack or Medium for citation visibility?
Mixed value. Established personal Substacks with editorial credibility can be cited frequently. New or low-traffic Substacks are typically not cited at meaningful rates. Medium has lost citation share over the past two years as LLMs have de-prioritized it. Owning your own domain with strong technical AI readability tends to be more durable than third-party platforms.
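The FAQPage schema named earlier applies directly to a question-and-answer block like this one. As a rough sketch, assuming the questions and answers are available as plain strings, the structured data could be generated as follows. The `faq_page_jsonld` helper is hypothetical, and the answers are shortened to their first sentence for brevity.

```python
import json


def faq_page_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Serialize question/answer pairs as a schema.org FAQPage JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


# Two entries taken from the FAQ above, answers truncated to one sentence.
FAQ = [
    ("Will paying for press coverage help my AI citations?",
     "Sponsored content rarely earns AI citations."),
    ("If my brand is in the training data, will it be cited?",
     "Being in training data is necessary but not sufficient."),
]

if __name__ == "__main__":
    print(faq_page_jsonld(FAQ))
```

The resulting string is meant to be embedded in the page inside a `<script type="application/ld+json">` element, and the answer text should match what is visible on the page.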