§ VOLUME I · ARTICLE II · FRAMEWORK · MMXXVI
How LLMs Choose What to Cite: A Framework.
A large language model does not cite sources the way a journalist or a search engine does. It composes under constraints — length, entity coherence, authority, freshness — and selects passages that satisfy all of them simultaneously. Citation is not a function of who ranks highest. It is a function of whom the model can quote without contradicting itself.
BY JONATHAN LANDMAN · 13 MIN · 17.IV.MMXXVI
§ EXECUTIVE ANSWER
A citation is the output of a four-layer pipeline — training corpus, retrieval, ranking, synthesis — operating under five simultaneous constraints in the final composer. Understanding the layers and the constraints is the difference between a brand that is mentioned once and a brand that becomes a reference.
I
Why this question matters
Every GEO discussion eventually reaches the same question: what makes a language model choose one source over another? The intuitive answers — more content, more backlinks, higher domain authority — are the SEO answers, and they are partially but insufficiently right. The real answer requires understanding the generative pipeline as an engineering system, not as an opaque box.
There is a reason this matters commercially. In the ten-blue-links era, citation was a derivative of ranking — win the rank and the citation followed. In the generative era, citation is the product itself. The link is gone. The click is gone. The only thing that reaches the buyer is the passage the model chose to quote, and the entity the model chose to name. Every brand now competes not for a position on a page, but for inclusion inside a paragraph composed in four hundred milliseconds by a system the brand cannot see.
The good news: the system is not random. It is constrained, and the constraints are legible. Once the constraints are understood, they can be engineered against.
II
The four-layer architecture
Every citation-producing generative engine — ChatGPT Search, Perplexity, Gemini with Grounding, Claude with Web, Copilot — runs a variant of the same four-layer pipeline. The prevailing discourse focuses almost entirely on the third layer while the first, second, and fourth do most of the work.
Layer 1 — The training corpus (the prior)
What the model believes about the world before it reads a single word of retrieved text. If a brand is mentioned in the training corpus — especially in high-authority sources — the model starts with a favorable prior. It is more likely to recognize the entity, more likely to name it in a shortlist even without retrieval, and more likely to select it for quotation when candidate passages are near-equivalent.
A brand absent from the training corpus must compensate through the retrieval layer on every single query. That is exhausting and fragile. A brand present in the corpus has embedded gravity.
Layer 2 — Retrieval (the candidate set)
At inference time, the engine queries a search index — sometimes Google's, sometimes Bing's, sometimes a proprietary crawl — and pulls a candidate set of passages. The retrieval is fast, keyword-and-vector hybrid, and constrained to a few dozen candidates maximum.
Three things determine whether your content enters the candidate set: the URL must be crawlable and allowed for AI agents; the content must contain the vocabulary and entities the query is about; the content must have structural signals the retriever recognizes as authoritative. If any of these fail, you are not a candidate, and no subsequent layer can save you.
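To make the mechanics concrete, here is a minimal sketch of how a hybrid retriever could assemble that candidate set, assuming a precomputed lexical (BM25-style) score per passage and an embedding function. Every name, weight, and threshold below is illustrative; no engine's actual retriever is public.

```python
# Hypothetical sketch of Layer 2: blend lexical and vector similarity,
# then keep only a few dozen candidates. All names are illustrative.
import numpy as np

def hybrid_candidates(query, passages, lexical_scores, embed, k=30, alpha=0.5):
    """Return the top-k passages under a blended lexical + vector score."""
    q = embed(query)                               # query embedding (1-D array)
    vecs = np.stack([embed(p) for p in passages])  # passage embeddings
    # Cosine similarity between the query and each passage.
    cos = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    # Normalize lexical scores to [0, 1] so the two signals are comparable.
    lex = np.asarray(lexical_scores, dtype=float)
    lex = (lex - lex.min()) / (lex.max() - lex.min() + 1e-9)
    blended = alpha * lex + (1 - alpha) * cos
    top = np.argsort(blended)[::-1][:k]            # candidate set stays small
    return [(passages[i], float(blended[i])) for i in top]
```

Note what the sketch makes obvious: if a page is blocked, or lacks the query's vocabulary, it never reaches this function at all. The three preconditions are binary gates, not ranking factors.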
Layer 3 — Ranking (the scoring)
The candidate set is scored. The scoring function is proprietary and varies across engines, but a consistent set of features reappears across every published disclosure and external probe: source authority, query-content semantic match, recency, structural clarity, entity density, and cross-reference consistency with other high-authority sources.
The ranking produces an ordered list of passages. But — and this is widely misunderstood — the ranking is not the final selection. It is the input to the fourth layer.
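As a mental model, and not any vendor's disclosed formula, the recurring features can be read as a weighted blend. The weights below are invented for illustration:

```python
# Hypothetical Layer 3 scorer: a weighted blend of the recurring features.
# The weights are invented; real scoring functions are proprietary.
RANKING_WEIGHTS = {
    "authority": 0.30,          # source authority
    "semantic_match": 0.25,     # query-content semantic match
    "recency": 0.15,            # freshness
    "structure": 0.10,          # structural clarity
    "entity_density": 0.10,     # density of resolvable named entities
    "cross_consistency": 0.10,  # agreement with high-authority sources
}

def rank_candidates(candidates):
    """Each candidate is a dict holding a per-feature score in [0, 1]."""
    def score(c):
        return sum(w * c.get(feature, 0.0) for feature, w in RANKING_WEIGHTS.items())
    return sorted(candidates, key=score, reverse=True)
```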
Layer 4 — Synthesis (the composition)
The model composes an answer. This is the layer where the citation is actually chosen. The composer operates under five simultaneous constraints: a length budget that typically allows only two to four sources to survive; entity coherence that prevents quoting conflicting sources in the same paragraph; a diversity preference that resists two passages from the same domain; hedge rules that refuse content contradicting alignment guardrails; and quotability — some passages are more extractable than others.
The final answer is the output of the composer satisfying all five simultaneously. A source that ranked third but is more quotable than the source that ranked first will often be selected. A source in the training corpus is sometimes quoted without retrieval, because the model has memorized it.
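A greedy pass over the ranked list is one plausible way to enforce all five constraints at once. The field names, the re-weighting, and the greedy strategy itself are assumptions for illustration, not a description of any production composer:

```python
# Hypothetical Layer 4 selection under the five constraints.
def compose_citations(ranked, budget=3):
    # Quotability re-weights the ranking first: this is how a third-ranked
    # but highly extractable passage can displace the first-ranked one.
    ranked = sorted(ranked,
                    key=lambda c: c["rank_score"] * (1 + c["quotability"]),
                    reverse=True)
    chosen, domains, claims = [], set(), {}
    for c in ranked:
        if len(chosen) >= budget:            # length budget: 2-4 sources survive
            break
        if c["domain"] in domains:           # diversity: resist same-domain pairs
            continue
        if claims.get(c["entity"], c["stance"]) != c["stance"]:
            continue                         # entity coherence: no contradictions
        if c["violates_guardrails"]:         # hedge rules: refuse flagged content
            continue
        chosen.append(c)
        domains.add(c["domain"])
        claims[c["entity"]] = c["stance"]
    return chosen
```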
III
The Wiele Citation Selection Hierarchy
From these four layers, we distill a named hierarchy. The hierarchy is rank-ordered by how durably each factor compounds. Factors at the top, once earned, are difficult for competitors to displace.
Tier I — Corpus presence
The brand is in the training data at a quality that produces correct recall without retrieval. Compounds across every query, every model, every refresh cycle. Slow to build. Nearly impossible to lose once earned.
Tier II — Canonical entity
The brand resolves cleanly in knowledge graphs — Wikidata, Google Knowledge Graph — and the model can attach inferences to it. Earned through schema, authority databases, press, Wikipedia-grade references. Compounds. Moderate defensibility.
Tier III — Structural quotability
The brand's content is formatted to be lifted verbatim. Earned through editorial discipline — H2 questions, first-sentence answers, named frameworks, tables, FAQs. Compounds per asset. Replicable by competitors, but hard to execute at scale and with consistency.
Tier IV — Retrieval accessibility
The brand's site is crawlable, schema-clean, and robots-open to AI agents. Prerequisite. Non-differentiating once met.
Tier V — Recency
The brand's content was published recently enough to fall inside the current freshness window. Temporary. Must be renewed continuously.
Brands that invest only at Tier IV and Tier V are playing the AI-visibility equivalent of paid search — they pay continuously for presence and stop earning the moment they stop spending. Brands that build at Tier I and Tier II accumulate a durable gravity that compounds across every model the future ships. The Wiele methodology engineers from the top of the hierarchy down.
§ PULL
Brands that build at Tier I and Tier II accumulate a durable gravity that compounds across every model the future ships.
IV
The Five Citation Attractors
Beyond the hierarchy, there are five specific content structures that the composer layer repeatedly prefers. We observe them at scale across citations from ChatGPT, Gemini, Perplexity, and Claude. We call them the Five Attractors.
Attractor I · The named framework
Any passage that contains a proper-noun framework — the Wiele GEO Doctrine, the Compound Citation Law, the Wiele Citation Selection Hierarchy — is disproportionately cited. The model treats the name as a reference handle, citing it with attribution when a user asks about the concept.
Attractor II · The definitional sentence
A sentence of the form "X is Y" — where X is a buyer-intent term and Y is a precise, non-tautological definition — is extraction bait. Models are routinely tasked with "what is X," and the first well-formed definition they retrieve becomes the answer.
Attractor III · The comparison table
A table comparing two or more items across a set of dimensions is quoted more often than the prose around it. Tables are structurally compressed, which the composer prefers when the answer budget is tight.
Attractor IV · The numbered list of principles
Five to seven items, each headed by a short noun phrase followed by a one-to-three-sentence explanation. Roman numerals work better than Arabic numerals for authority tone. The composer can quote a list partially — the first three items, for instance — without semantic loss.
Attractor V · The signed, dated publication
Content with a clear authorial byline, a date, and a versioned publication structure — Volume II, Article IV — is cited with attribution more often than anonymous content, because the model has a resolvable entity to credit.
A single article containing all five attractors is roughly three to five times more likely to be cited on a buyer-intent query than a comparable-length article without them. This is not a content hack. It is an alignment of the prose's shape with the composer's selection function.
V
What gets you excluded
Knowing what the composer prefers is useful. Knowing what disqualifies a source is more useful, because competitors who fail these checks are effectively invisible — and that is the opening.
Exclusion I — Contradicting higher-authority sources
If your content disagrees with Wikipedia, a major publication, or established academic consensus, the composer will prefer the high-authority source and either omit yours or caveat it. Contrarian positioning has a place — but not in the claims the composer has to arbitrate.
Exclusion II — Uncited quantitative claims
Numbers without attribution are a hedge trigger. The model will refuse to repeat a statistic it cannot ground, even if your content is the highest-ranked source in the candidate set. Always attribute numbers.
Exclusion III — Promotional framing
Content that reads like sales copy — superlatives, emotional appeals, exclusive-sounding claims — ranks lower in the composer's trust filter. The composer prefers sources that sound like reference material, not marketing.
Exclusion IV — Thin entity
A source from a brand the model cannot resolve to a canonical entity is cited with less attribution and less frequency. "A consultancy" is quoted less than "Wiele Group, a Dubai-headquartered consultancy founded in 2020."
Exclusion V — Inaccessible to AI crawlers
If GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and their peers are disallowed in robots.txt, the content cannot enter the candidate set, regardless of how well-written it is. Allow them explicitly.
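Allowing them is a one-file fix. A minimal robots.txt stanza covering the agents named above (verify each vendor's current user-agent token before shipping):

```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /
```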
Any one exclusion meaningfully reduces citation rate. Three or more exclusions effectively eliminate the source from generative visibility.
VI
Engineering the pipeline, layer by layer
A complete GEO practice operates on all four layers at once. Most brands that claim to "do GEO" operate only at Tier IV and Tier V of the hierarchy. That produces temporary visibility. The full stack:
Training corpus engineering
Publish into the surfaces the next training snapshot will ingest. Wikipedia references where eligible, major-publication bylines, academic citations, widely linked open content. Commission primary research. Convert it into structured data that is easy to ingest. Do this continuously — because the next model is always training.
Retrieval engineering
Ship clean technical SEO: schema, canonical URLs, sitemaps, robots-allow for AI bots, fast render, no client-side content dependency for first paint. This is table stakes, and it must be perfect.
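As one concrete piece of that stack, a minimal JSON-LD Article block, populated with this article's own metadata; extend it with whatever properties your entity graph can support:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How LLMs Choose What to Cite: A Framework",
  "author": { "@type": "Person", "name": "Jonathan Landman" },
  "publisher": { "@type": "Organization", "name": "Wiele Group" },
  "datePublished": "2026-04-17"
}
```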
Ranking engineering
Produce content with the structural attractors listed above. Publish versioned, signed, dated. Refresh on a cadence. Cross-link across the site's entity graph so the ranking signals compound.
Synthesis engineering
Write the way the composer reads. Lead with definitions. Name your frameworks. Tabulate comparisons. Number your principles. This is not about gaming the model — the prose genuinely becomes more useful to the human reader at the same time. The two optimizations are aligned.
A brand that invests at all four layers simultaneously for twelve quarters produces a citation-rate curve that compounds. A brand that invests at one or two layers produces a curve that plateaus.
VII
Measurement
The citation-selection pipeline is observable from the outside. We measure it as follows. A fixed query panel of one hundred to two hundred fifty buyer-intent queries for the category, run monthly across ChatGPT, Gemini, Perplexity, and Claude, with every cited source recorded.
An attribution analysis that classifies each citation by tier: Tier I where the brand is named in the body of the answer without retrieval — i.e., memorized; Tier II where the brand is named and linked via retrieval; Tier III where the brand is linked but not named; Tier IV where the brand is mentioned only, no link. Tier I is the north star.
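In code, the attribution step reduces to a small classifier over recorded answers. A sketch, assuming each panel run records the answer text, the cited URLs, and whether retrieval fired; the field names are hypothetical:

```python
# Hypothetical attribution classifier for the monthly query panel.
def classify_citation(answer, brand_name, brand_domain):
    named = brand_name in answer["text"]
    linked = any(brand_domain in url for url in answer["cited_urls"])
    if named and not answer["retrieval_triggered"]:
        return "I"      # memorized: named without retrieval
    if named and linked:
        return "II"     # named and linked via retrieval
    if linked:
        return "III"    # linked but not named
    if named:
        return "IV"     # mentioned only, no link
    return None         # not cited: run the exclusion diagnosis
```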
An exclusion diagnosis for every query where the brand was not cited — which layer failed: retrieval, ranking, or synthesis? The diagnosis determines the next action. And a competitor delta: the gap between the brand's citation rate and the two nearest competitors. A closing gap is winning. A widening gap is losing, even if absolute citation rate is rising.
These four measurements compose the operational dashboard. They are the difference between guessing and governing. They are the mechanics behind the Wiele GEO Ledger introduced in Article I.
VIII
What to do this quarter
Three actions, in order of leverage. First, run the query panel — today, fifty queries minimum, four engines, answers recorded verbatim. This is the baseline that everything subsequent is measured against.
Second, diagnose the worst miss. Find the query where you expected to be cited and were not. Diagnose the layer that failed. Fix that layer.
Third, publish one attractor-dense asset per month. Every Volume ships one piece that contains a named framework, definitional sentences, a comparison table, a numbered list of principles, and a signed date. Every asset compounds. Miss no month.
A brand that does this for twelve consecutive months builds a citation base that displaces category incumbents. A brand that does it for twelve consecutive quarters becomes the default answer.
§ FREQUENTLY ASKED
How does an LLM decide which source to cite?
Citation selection runs through four layers: training corpus (what the model was trained on), retrieval (what it pulls at inference time), ranking (how the candidates are scored), and synthesis (what gets composed into the final answer). The synthesis layer is where the actual selection happens, under length, entity, and authority constraints.
Can I get cited if I'm not in the training data?
Yes, through the retrieval layer — but only for queries where retrieval is triggered and only if your content meets structural quality bars. Brands present in the training corpus compound faster because they are cited even when retrieval fails.
Why do some brands get named in AI answers without a source link?
Because the model memorized them during training. Tier I citation — recall without retrieval — is the most durable form of generative visibility and the hardest to engineer against.
What disqualifies a source from citation?
Contradicting higher-authority sources, uncited quantitative claims, promotional tone, weak entity resolution, or blocking AI crawlers in robots.txt. Any one significantly reduces citation rate.
How often should content be refreshed for freshness?
Anchor assets every six months at minimum. High-velocity categories — AI, finance, medical — benefit from quarterly refreshes. Content older than eighteen months begins to fall out of the active training window.
Which engines should I optimize for first?
ChatGPT and Perplexity produce the largest share of consumer-facing citations today. Gemini for Google-integrated surfaces. Claude for professional and technical queries. Copilot for enterprise. The answer varies by category.
§ SIGNED
Signed — Jonathan Landman · Founder · Wiele Group
VOLUME I · ARTICLE II · MMXXVI
§ COMMISSION
Engineer the citation. Wiele Group builds the corpus, entity, structure, and synthesis layers together. Engagements are by introduction.
BEGIN →