14 AI crawlers controlling your visibility in 2026
Complete list of the AI crawlers you must know in 2026 — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and 10 others — with what each one does and how to allow or block them in robots.txt.
Why the 14-bot list matters
Most websites still treat robots.txt as a 2010-era SEO file: allow Googlebot, block AhrefsBot, done. That mental model misses the most important shift of the last two years. As of 2026, at least 14 distinct AI crawlers and crawler tokens from 9 different companies are in routine use, each with its own purpose and its own consequence for your visibility inside AI answer engines. Blocking the wrong one cuts you out of ChatGPT search results. Leaving another disallowed costs you mentions inside Claude’s responses. Blocking everything to “save server load” (a popular 2023 panic move) now means your brand is silently invisible in AI answer engines, a fast-growing discovery channel. The first step toward AI Visibility is knowing the cast of characters.
The 14 crawlers, grouped by purpose
AI crawlers fall into four functional categories. AI Search crawlers fetch pages in real time to power conversational search results: OAI-SearchBot (OpenAI), PerplexityBot (Perplexity). Training crawlers harvest content for model pretraining and fine-tuning: GPTBot (OpenAI), ClaudeBot and anthropic-ai (Anthropic), Google-Extended (Google Gemini), Applebot-Extended (Apple Intelligence), Bytespider (ByteDance/Doubao), Amazonbot (Amazon Nova), Meta-ExternalAgent (Meta AI). User-triggered agents fetch pages on behalf of a specific user prompt: ChatGPT-User, Claude-Web, Perplexity-User. Open archives make up the fourth category: CCBot (Common Crawl) belongs to a non-profit archive that feeds dozens of downstream LLMs, including many open-source models. Each category has different SEO consequences and should be governed by different rules.
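The grouping above can be sketched as data. The following Python sketch is illustrative only: the category keys (`ai_search`, `training`, `user_triggered`, `archive`) and the `render_robots_txt` helper are names invented here, not part of any tool. It emits one robots.txt group per category and disallows whichever categories you pass in.

```python
# Crawler names come from the article; the category keys are hypothetical labels.
AI_CRAWLERS = {
    "ai_search": ["OAI-SearchBot", "PerplexityBot"],
    "training": [
        "GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
        "Applebot-Extended", "Bytespider", "Amazonbot", "Meta-ExternalAgent",
    ],
    "user_triggered": ["ChatGPT-User", "Claude-Web", "Perplexity-User"],
    "archive": ["CCBot"],
}


def render_robots_txt(blocked_categories):
    """Return robots.txt text with one group per category.

    Categories listed in `blocked_categories` get "Disallow: /";
    every other category gets an explicit "Allow: /".
    """
    lines = []
    for category, bots in AI_CRAWLERS.items():
        rule = "Disallow: /" if category in blocked_categories else "Allow: /"
        lines.append(f"# {category}")
        # Several User-agent lines may share the rules that follow them.
        lines.extend(f"User-agent: {bot}" for bot in bots)
        lines.append(rule)
        lines.append("")  # a blank line separates groups
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_robots_txt({"training"}))
```

Calling `render_robots_txt({"training"})` produces a file that disallows the eight training crawlers and explicitly allows the other six; passing an empty set allows all 14.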
The blocking decision is not binary
A common mistake is treating each crawler as “all-or-nothing.” In reality, you almost always want to allow AI Search and user-triggered agents (these are how your brand shows up in ChatGPT answers and Perplexity citations), while you may legitimately want to block training crawlers if you sell content as a product, license your IP, or operate in regulated industries. The asymmetric pattern looks like this in robots.txt: a group made of the two lines `User-agent: GPTBot` and `Disallow: /` blocks training while still allowing OAI-SearchBot and ChatGPT-User to fetch your pages on user demand, provided no wildcard `User-agent: *` group disallows them. This nuance is what separates a sophisticated AI strategy from accidentally torching your visibility. Most sites should default to allowing all 14 unless they have a clear business reason not to.
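As a quick check of the asymmetric pattern, the sketch below feeds the GPTBot rule to Python's standard-library `urllib.robotparser` and asks whether each OpenAI agent may fetch a page. The `example.com` URL and the `is_allowed` helper are placeholders chosen for illustration. Real crawlers apply their own matching logic, so treat this as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The asymmetric rule from the text: block the training crawler only.
ROBOTS_TXT = """\
# Block OpenAI's training crawler; search and user agents stay allowed.
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url="https://example.com/pricing"):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, bot) else "blocked"
        print(f"{bot}: {verdict}")
```

With this file, only GPTBot is reported as blocked. The other two agents are allowed because no group names them and there is no wildcard group.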
The 14 crawlers at a glance
| Crawler | Vendor | Type | Purpose |
|---|---|---|---|
| GPTBot | OpenAI | Training | Pretrains GPT models |
| ChatGPT-User | OpenAI | User agent | Fetches when a user asks ChatGPT |
| OAI-SearchBot | OpenAI | AI search | Powers ChatGPT search results |
| ClaudeBot | Anthropic | Training | Pretrains Claude |
| Claude-Web | Anthropic | User agent | Real-time fetches inside Claude |
| anthropic-ai | Anthropic | Training | Legacy training crawler |
| PerplexityBot | Perplexity | AI search | Indexes for Perplexity answers |
| Perplexity-User | Perplexity | User agent | User-triggered fetches |
| Google-Extended | Google | Training | Opt-out for Gemini (formerly Bard) training |
| Applebot-Extended | Apple | Training | Opt-out for Apple Intelligence |
| Bytespider | ByteDance | Training | Powers Doubao + TikTok recommendations |
| Amazonbot | Amazon | Training | Feeds Amazon Nova and Alexa+ |
| CCBot | Common Crawl | Open archive | Feeds dozens of LLMs downstream |
| Meta-ExternalAgent | Meta | Training | Powers Meta AI and Llama |
How to audit your current configuration
The fastest way to see your current state across all 14 crawlers is to run an AI Visibility audit: the report renders a matrix showing exactly which bots are Allowed, Blocked, or Unspecified by your current robots.txt. A site with no robots.txt falls back to “Allowed” by default for every bot, which is usually the right baseline for marketing sites. If you do choose to block training crawlers, document the decision in a comment block at the top of robots.txt so future team members understand the intent. Finally, re-check the list every quarter. New AI crawlers ship roughly every 90 days, and old ones are occasionally renamed (Claude-Web replaced an earlier identifier in 2025), so confirm each user-agent token against the vendor’s current documentation.
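A rough version of such a matrix can be sketched in Python with the standard-library `urllib.robotparser`. This is not how any particular audit tool works. It assumes that "Unspecified" means no group names the bot and no wildcard group exists, it probes only the site root, and the `SAMPLE` file, `named_agents`, and `audit` are hypothetical names used for illustration.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-Web",
    "anthropic-ai", "PerplexityBot", "Perplexity-User", "Google-Extended",
    "Applebot-Extended", "Bytespider", "Amazonbot", "CCBot",
    "Meta-ExternalAgent",
]

# Hypothetical robots.txt with the decision documented in a comment block.
SAMPLE = """\
# Policy: GPTBot and CCBot are blocked because our content is licensed.
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def named_agents(robots_txt):
    """Collect the User-agent tokens the file names, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def audit(robots_txt, probe_url="https://example.com/"):
    """Map each bot to 'Allowed', 'Blocked', or 'Unspecified'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    named = named_agents(robots_txt)
    report = {}
    for bot in AI_BOTS:
        if bot.lower() not in named and "*" not in named:
            report[bot] = "Unspecified"  # assumption: no rule covers this bot
        elif parser.can_fetch(bot, probe_url):
            report[bot] = "Allowed"
        else:
            report[bot] = "Blocked"
    return report


if __name__ == "__main__":
    for bot, status in audit(SAMPLE).items():
        print(f"{bot:<20}{status}")
```

Because only the root URL is probed, a file that blocks a single directory would still show the bot as Allowed. A fuller audit would test a representative set of paths.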
Frequently asked questions
Will blocking GPTBot hurt my Google search rankings? No. GPTBot only affects OpenAI’s training data. Google’s crawlers are separate (Googlebot for search, Google-Extended for Gemini training). Blocking GPTBot does not change anything in Google Search.
What is the difference between Google-Extended and Googlebot? Googlebot powers traditional Google Search and is essential for SEO. Google-Extended is an opt-out signal specifically for Gemini training. Blocking Google-Extended does not affect Search.
Should I block CCBot? Probably not. CCBot feeds the Common Crawl dataset, which is consumed by dozens of LLMs — blocking it cuts you out of many open-source models at once. The decision depends on whether you mind being in training data at all.