
What actually drives whether AI assistants cite your content?

When I started modeling this on my own site, I expected the winning features to be the ones content people usually argue about. Topic adjacency to proven winners. Keyword density. Structural polish. Stats and citations. The things a good writer can deliberately engineer into a draft.

What the data actually said, at least at the scale I can measure so far, is that one feature dominates. How recently the piece was published. The other features I tried either tied that result or made predictions worse.

I want to walk through exactly what I tested, what the numbers looked like, and what it implies for where to spend time if you are writing for AI citation.

The Setup

I publish a running set of insights on my site, and I track which of them get fetched by AI assistants like ChatGPT, Claude, and Perplexity when a real user asks those assistants a question. Every fetch shows up in my server logs with a user-agent string I can identify. Over five weeks, that produced twenty-three real user citations spread across about twenty-five eligible pieces.
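Spotting those fetches takes nothing more than a pass over the access logs. The sketch below assumes combined-format logs; the user-agent substrings are illustrative (each vendor documents its own strings, and they change over time), and the function name is mine, not from any library.

```python
import re
from collections import Counter

# Substrings that mark on-demand fetches by AI assistants answering a live
# user query (as opposed to bulk indexing crawlers). Illustrative list;
# check each vendor's documentation for the current strings.
ASSISTANT_AGENTS = ["ChatGPT-User", "Claude-User", "Perplexity-User"]

def count_assistant_fetches(log_lines):
    """Count assistant fetches per URL path in combined-format access logs."""
    counts = Counter()
    for line in log_lines:
        m = re.search(r'"GET (\S+) HTTP[^"]*" \d+ \d+ "[^"]*" "([^"]*)"', line)
        if not m:
            continue
        path, agent = m.groups()
        if any(a in agent for a in ASSISTANT_AGENTS):
            counts[path] += 1
    return counts
```

Matching on substrings rather than exact agents is deliberate: the version suffix after the product name varies, and the product name itself is the stable part.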

The project goal was simple. Can I build a model that, given a candidate draft, predicts whether it will get cited in the next seven days? If I can, I have a tool that helps me prioritize what to work on.
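The prediction target falls straight out of the logged events. A minimal sketch of the labeling step, assuming citation events have already been extracted as (url, timestamp) pairs; the function name and argument shapes are my own illustration:

```python
from datetime import timedelta

def label_next_week(piece_urls, citation_events, as_of, horizon_days=7):
    """Label each piece 1 if any assistant citation of it lands within
    horizon_days after as_of, else 0. citation_events is a list of
    (url, timestamp) pairs pulled from the logs."""
    window_end = as_of + timedelta(days=horizon_days)
    return {
        url: int(any(u == url and as_of <= ts < window_end
                     for u, ts in citation_events))
        for url in piece_urls
    }
```

Running the same labeling at several as_of dates is what turns a handful of pieces into enough piece-week rows to evaluate a model at all.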

Before building anything fancy, I ran three baselines. A baseline is a deliberately dumb model used to set a floor. If your sophisticated model cannot beat the dumb one, the sophistication is not earning its keep.

The three baselines.

Baseline one, mean at base rate. Always predict the average. If about ten percent of pieces get cited in any given week, predict ten percent for everything.

Baseline two, recency only. The only feature the model gets to see is days since published. Pieces published yesterday score higher than pieces from last month.

Baseline three, similarity only. The only features the model gets to see are how topically similar each piece is to pieces that were already cited. Closer to a proven winner equals higher score. This is the "adjacent content" thesis you hear in a lot of AI search optimization advice.
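For concreteness, here is roughly what the three baselines look like as scoring functions. This is a sketch rather than my exact code; the inverse-age decay and the max-cosine-similarity choice are illustrative, and only the rankings they induce matter.

```python
import numpy as np

def baseline_mean(n_pieces, base_rate=0.1):
    # Baseline one: every piece gets the historical base rate.
    return np.full(n_pieces, base_rate)

def baseline_recency(days_since_published):
    # Baseline two: newer pieces score higher. Any monotone decay
    # produces the same ranking, which is all rank metrics see.
    return 1.0 / (1.0 + np.asarray(days_since_published, dtype=float))

def baseline_similarity(embeddings, cited_embeddings):
    # Baseline three: score each piece by its maximum cosine similarity
    # to any already-cited piece.
    def unit(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    return (unit(embeddings) @ unit(cited_embeddings).T).max(axis=1)
```

Each function maps the library to one score per piece, so all three can be ranked and evaluated identically.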

What the Data Showed

I scored each baseline two ways. Spearman correlation measures whether the model's ranking of pieces lines up with the real ranking, from negative one for exactly wrong through zero for no relationship to positive one for exactly right. Precision at five takes the five pieces the model ranked highest and asks how many of those were actually cited.
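Both metrics are a few lines with numpy and scipy. A sketch, with my own function names:

```python
import numpy as np
from scipy.stats import spearmanr

def precision_at_k(scores, cited, k=5):
    # Of the k highest-scored pieces, what fraction were actually cited?
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.asarray(cited)[top_k].mean())

def evaluate(scores, cited):
    # Spearman rank correlation plus precision at five.
    rho, _ = spearmanr(scores, cited)
    return rho, precision_at_k(scores, cited)
```

Spearman uses every piece's position in the ranking; precision at five only looks at the top of the list, which is closer to how I actually use the model when deciding what to work on.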

For crawler attention, meaning traffic from indexing bots preparing training or search data, recency was dominant. Spearman of 0.64 and precision at five of 1.0. The five most recently published pieces were all in fact visited by crawlers in the following week.

For real user citation, the effect was weaker but still pointed the same way. Recency got a Spearman of 0.22 and a precision at five of 0.4. Two of the five most recent pieces were actually cited by an AI assistant during a real user query, versus one out of five expected if you picked at random.

Similarity features did nothing. For both targets, the similarity-only baseline scored at or below the random floor. Negative Spearman on one of them, which means the ranking was slightly inverted. Pieces that looked most similar to previous winners were not more likely to become winners themselves, at least not in any way my setup could detect.

The full model I built on top of these baselines, with all the structural features I could think of, did not beat recency on the user citation target. The honest read is that at my current content volume, recency explains most of what is explainable.

What Surprised Me

Two things.

The first is that the gap between crawler behavior and user citation is large. Crawlers grab new content aggressively. AI assistants answering real user questions are more selective. Optimizing for what crawlers pick up is not the same as optimizing for what gets cited in an actual conversation. A lot of AI content advice conflates the two.

The second is that topic adjacency did not show up in the numbers. I had written several pieces as deliberate bets adjacent to proven winners, and I had anecdotal evidence the bets were paying off. The data does not refute the anecdotes, but it says the effect is not strong enough to beat a recency baseline on twenty three events. Either the effect is smaller than I thought, or I need more data to see it, or my way of measuring similarity misses the thing that actually matters.

What This Implies for Where to Spend Time

If you are writing for AI citation and your content library is in the dozens rather than the thousands, the most defensible use of your hour is publishing another piece, not polishing the current one past a reasonable threshold.

That is not the advice I wanted to give. I wanted to find a feature that would tell me which drafts to prioritize. What I got instead is closer to "keep the bar high, keep shipping, and measure the library rather than the piece."

Polish still matters for the reader you are trying to convert once they arrive. A thin page will not turn a citation into a lead. The point is narrower. The act of publishing the next piece produces more expected citation value than making the last one ten percent better.

What I Am Not Claiming

I want to be careful about what this data can and cannot support.

Twenty-three positive events is directional, not definitive. The picture could change meaningfully once I have a few more months of data. AI assistant behavior drifts monthly, which means a model trained on April data may not describe July.

I also only tested the similarity features I could build cheaply with off-the-shelf embeddings. It is possible that a better measure of "does this piece engage the same argument as a proven winner" would show a real effect. That is on my list.

What I feel confident saying is this. If someone sells you a content engineering framework that promises to move AI citation rates meaningfully without increasing publishing cadence, ask them to show their work on a sample the size of yours. My sample is small and my answer is already recency. It is unlikely their sample is both larger than mine and says something very different.

By the Numbers

Recent research on generative engine optimization finds that content freshness and structural clarity are stronger predictors of AI assistant citation than topical clustering alone.

Aggarwal et al., 'GEO: Generative Engine Optimization,' arXiv:2311.09735, 2023

Across studies of search and recommender systems, simple baselines using recency or popularity frequently match or beat more sophisticated models on small datasets, a result known as the 'strong baselines' problem.

Ferrari Dacrema et al., 'Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches,' RecSys, 2019

Have a Question About Your Business?

Book a free 30-minute call and we'll work through it together.
