In 2023, a wave of advice told website owners to "protect your content from AI" by blocking the bots in robots.txt. A lot of people did it. The problem is that most of them blocked the wrong crawler — and some of them, trying to keep AI out, accidentally locked themselves out of being recommended. This is the most consequential robots.txt mistake of the AI era, and it hinges on a distinction almost nobody explains.
Two bots, two completely different jobs
Every major AI provider runs more than one crawler, and they do different things. Lump them together and the advice falls apart.
- One crawler exists to train the model. Whether you allow or block it changes whether your content feeds future training. It has nothing to do with whether you appear in answers today.
- A different crawler exists to power live search and citations. This is the one that decides whether the engine can read your page and name you in a response.
For OpenAI, the training crawler is GPTBot and the citation crawler is OAI-SearchBot. Here's the part that catches people: blocking GPTBot does not remove you from ChatGPT's search results. ChatGPT can still cite you. The only thing that removes you from ChatGPT search is blocking OAI-SearchBot. OpenAI says this plainly: the settings are independent, and you can allow OAI-SearchBot to appear in search while disallowing GPTBot to opt out of training.
So all those sites that "blocked AI" by disallowing GPTBot gave up exactly nothing on the citation side — they only opted out of training. And the sites that went further and blocked OAI-SearchBot, or the regular Googlebot, made themselves invisible in AI answers while believing they'd done something clever.
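To see that independence concretely, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt content and the example.com URL are made up for illustration, and Python's parser is only a stand-in: each real crawler runs its own parser, which may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of OpenAI training, stay open to ChatGPT search.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot"))                # False
print("OAI-SearchBot allowed:", is_allowed(ROBOTS_TXT, "OAI-SearchBot"))  # True
```

The two groups are evaluated separately, so the training opt-out leaves the search crawler untouched.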
The same pattern repeats across providers
Once you see the split, every engine reads the same way:
- Anthropic: ClaudeBot trains the model; Claude-SearchBot powers Claude's search citations. Block the first, keep the second, and Claude can still cite you.
- Google: Google-Extended only governs whether your content is used to train Gemini/Vertex. It does not control AI Overviews. Those run on ordinary Googlebot, so the only way to disappear from Google's AI answers is to block Googlebot, which also tanks your normal search. Blocking Google-Extended costs you nothing in AI Overviews.
- Microsoft: one crawler, Bingbot, feeds the Bing index, which powers both Copilot and ChatGPT's web search. Block Bingbot and you lose multiple AI surfaces at once.
The takeaway: the bots that matter for being recommended are the search/citation crawlers and the classic search crawlers — OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, Bingbot. Those are the ones to keep open.
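As a rough illustration of that takeaway, the sketch below audits a robots.txt against the five crawlers named above and reports which ones it shuts out. The sample file is hypothetical (an over-eager "block AI" list), and the result reflects how Python's standard-library parser reads the file, which is an approximation of how the real crawlers behave.

```python
from urllib.robotparser import RobotFileParser

# The search/citation crawlers the article says to keep open.
CITATION_CRAWLERS = [
    "OAI-SearchBot",
    "Claude-SearchBot",
    "PerplexityBot",
    "Googlebot",
    "Bingbot",
]

# Hypothetical file: meant to block training only, but it also lists PerplexityBot.
SAMPLE = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: PerplexityBot
Disallow: /
"""


def blocked_citation_crawlers(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """List the citation crawlers that this robots.txt text disallows for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in CITATION_CRAWLERS if not parser.can_fetch(agent, url)]


print(blocked_citation_crawlers(SAMPLE))  # ['PerplexityBot']
```

An empty list is the result you want; anything in it is a door you probably meant to leave open.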
The Perplexity wrinkle
Worth knowing if you're deciding what to block: blocking a crawler doesn't always equal total disappearance. Perplexity has said that even when you disallow PerplexityBot, it may still surface a blocked page's domain, headline, and a brief factual summary. And in August 2025, Cloudflare reported that after some sites disallowed Perplexity's declared bots and added firewall blocks, it observed fetches from an undeclared user agent that looked like an ordinary browser. (Cloudflare noted that ChatGPT's crawler, by contrast, fetched robots.txt and stopped when disallowed.) The honest summary: robots.txt is a request that well-behaved crawlers honour, not a wall.
The goal for a local business is almost never to block AI. It's the opposite: you want to be read and recommended. The job is to make sure you haven't accidentally slammed a door you meant to leave open.
Where the accidental blocks come from
If you've never touched your robots.txt, you can still be blocked, because the file often isn't yours to control directly:
- SEO plugins generate it. On WordPress, Yoast, Rank Math, and All in One SEO usually generate robots.txt dynamically — so a hand-edited file can be silently overwritten by the plugin's settings.
- Security plugins block bots. Wordfence, Solid Security, and Sucuri ship "bad bot" rules that, in early 2024, categorised AI crawlers as scrapers and rate-limited or 403-blocked them. That block may not even appear in your robots.txt.
- Managed hosts inject blocks. Some managed-WordPress platforms add AI-bot blocks at the server level, invisible from your site's own files.
This is why a site can look perfectly open in its robots.txt and still be unreachable to AI crawlers — the block lives a layer up.
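One rough way to tell the two layers apart is to compare what robots.txt says with the HTTP status a request actually gets when it carries a crawler's user-agent string. The sketch below is an assumption-heavy approximation: the function names are hypothetical, the status codes treated as "blocked" are a guess at common firewall behaviour, and a request from your own machine is not the real crawler (a firewall may also filter by IP address), so read the result as a hint only.

```python
import urllib.error
import urllib.request


def status_for_agent(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Fetch `url` with the given User-Agent header and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses arrive as exceptions; the code is what we want.
        return error.code


def locate_block(robots_allows: bool, http_status: int) -> str:
    """Guess which layer is blocking, given the robots.txt verdict and an HTTP status."""
    if not robots_allows:
        return "robots.txt"
    if http_status in (401, 403, 429):  # assumed signs of a firewall or rate limiter
        return "above robots.txt"
    return "none detected"


# Example verdicts for a file that looks clean:
print(locate_block(True, 403))  # above robots.txt
print(locate_block(True, 200))  # none detected
```

To try it against a live site, pass the result of `status_for_agent("https://your-site.example/", "OAI-SearchBot")` as the second argument; the URL here is a placeholder.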
What to actually do
- Check which crawlers can reach you. Run your domain through the Robots Check. It reports the citation-grade crawlers specifically and shows the exact line behind any block.
- Make sure the citation bots are open: OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, Bingbot. If you want to opt out of training while staying citable, disallow only GPTBot, ClaudeBot, and Google-Extended.
- If the file looks clean but you're still blocked, check your security plugin and ask your host whether it blocks AI bots at the platform level.
- Re-test, then verify in the wild. Give crawlers a couple of days, then ask ChatGPT and Perplexity about your brand and category.
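For the "opt out of training while staying citable" option, here is a minimal sketch that generates the rule group and then checks it with Python's standard-library parser. The helper name is hypothetical, and the output is meant to be merged into whatever robots.txt your site already serves (or entered through your SEO plugin's settings), not pasted over it blindly.

```python
from urllib.robotparser import RobotFileParser

TRAINING_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended"]
CITATION_CRAWLERS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Googlebot", "Bingbot"]


def training_opt_out_group(agents: list[str]) -> str:
    """Build one robots.txt group that disallows every listed agent site-wide."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


robots_txt = training_opt_out_group(TRAINING_CRAWLERS)
print(robots_txt)

# Sanity check: training crawlers are blocked, citation crawlers are not.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
url = "https://example.com/"  # placeholder URL
training_blocked = all(not parser.can_fetch(agent, url) for agent in TRAINING_CRAWLERS)
citation_open = all(parser.can_fetch(agent, url) for agent in CITATION_CRAWLERS)
print(training_blocked, citation_open)  # True True
```

Crawlers with no matching group are allowed by default, which is why the citation bots need no explicit Allow line here.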
If you want the deeper picture of how each engine actually selects who to cite, the per-engine breakdowns — like how ChatGPT cites local businesses — walk through it. But start with the file. It's the cheapest fix in AI visibility, and the one most likely to be quietly costing you. hello@rankinglocal.ai reaches me directly.