May 18, 2026 · 4 min read

14 AI crawlers controlling your visibility in 2026

Complete list of the AI crawlers you must know in 2026 — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and 10 others — with what each one does and how to allow or block them in robots.txt.


Why the 14-bot list matters

Most websites still treat robots.txt as a 2010-era SEO file: allow Googlebot, block AhrefsBot, done. That mental model misses the most important shift of the last two years. As of 2026, at least 14 distinct AI crawlers operated by 9 different companies routinely fetch web pages, each with its own purpose and its own consequence for your visibility inside AI answer engines. Blocking the wrong one cuts you out of ChatGPT search results. Leaving another caught by a blanket Disallow rule costs you mentions inside Claude’s responses. Blocking everything to “save server load” (a popular 2023 panic move) now makes your brand silently invisible across one of the fastest-growing discovery channels since organic search. The first step toward AI Visibility is knowing the cast of characters.

The 14 crawlers, grouped by purpose

AI crawlers fall into four functional categories. Each category has different SEO consequences and should be governed by different rules.

AI search crawlers fetch pages to power conversational search results: OAI-SearchBot (OpenAI), PerplexityBot (Perplexity).

Training crawlers harvest content for model pretraining and fine-tuning: GPTBot (OpenAI), ClaudeBot and anthropic-ai (Anthropic), Google-Extended (Google Gemini), Applebot-Extended (Apple Intelligence), Bytespider (ByteDance/Doubao), Amazonbot (Amazon Nova), Meta-ExternalAgent (Meta AI). Google-Extended and Applebot-Extended are robots.txt control tokens rather than separate fetchers: Googlebot and Applebot do the crawling, and the -Extended rule decides whether the fetched content may be used for training.

User-triggered agents fetch pages on behalf of a specific user prompt: ChatGPT-User, Claude-Web, Perplexity-User.

Common Crawl (CCBot) is a non-profit archive that feeds dozens of downstream LLMs, including many open-source models.
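As a rough illustration of governing each category with its own rule, the grouping above can be written down as data and turned into robots.txt groups. This is a minimal Python sketch, not a tool referenced in this article: the `CRAWLERS_BY_CATEGORY` name, the category keys, and the `build_robots_txt` helper are all assumptions made for the example.

```python
# Crawler tokens grouped as in the article. The dict name and the
# category keys are hypothetical labels chosen for this sketch.
CRAWLERS_BY_CATEGORY = {
    "ai_search": ["OAI-SearchBot", "PerplexityBot"],
    "training": [
        "GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
        "Applebot-Extended", "Bytespider", "Amazonbot", "Meta-ExternalAgent",
    ],
    "user_triggered": ["ChatGPT-User", "Claude-Web", "Perplexity-User"],
    "archive": ["CCBot"],
}


def build_robots_txt(policy):
    """Return robots.txt text with one group per category.

    `policy` maps a category key to True (allow) or False (block).
    Categories missing from `policy` default to allowed.
    """
    lines = []
    for category, bots in CRAWLERS_BY_CATEGORY.items():
        allowed = policy.get(category, True)
        lines.append(f"# {category}: {'allowed' if allowed else 'blocked'}")
        # Consecutive User-agent lines share the rule that follows them.
        lines.extend(f"User-agent: {bot}" for bot in bots)
        lines.append("Allow: /" if allowed else "Disallow: /")
        lines.append("")  # a blank line separates groups
    return "\n".join(lines)


if __name__ == "__main__":
    # Example policy: block training crawlers, allow everything else.
    print(build_robots_txt({"training": False}))
```

Keeping the list as data makes the quarterly review a one-line change per crawler, and the generated comment lines record which categories were blocked on purpose.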

The blocking decision is not binary

A common mistake is treating each crawler as “all-or-nothing.” In reality, you almost always want to allow AI Search and user-triggered agents (these are how your brand shows up in ChatGPT answers and Perplexity citations), while you may legitimately want to block training crawlers if you sell content as a product, license your IP, or operate in regulated industries. The asymmetric pattern looks like this in robots.txt:

User-agent: GPTBot
Disallow: /

This blocks OpenAI’s training crawler while still allowing OAI-SearchBot to index your pages for ChatGPT search and ChatGPT-User to fetch them on user demand. This nuance is what separates a sophisticated AI strategy from accidentally torching your visibility. Most sites should default to allowing all 14 unless they have a clear business reason not to.
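The asymmetric rule can be sanity-checked locally before deploying it. The sketch below uses Python's standard-library `urllib.robotparser`. The `can_fetch` wrapper and the `https://example.com/pricing` URL are placeholders invented for this example. Each vendor's own matching logic may differ in detail from Python's parser, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# The asymmetric pattern from the text: block only OpenAI's training crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def can_fetch(robots_txt, user_agent, url="https://example.com/pricing"):
    """Return True if `user_agent` may fetch `url` under `robots_txt`.

    The default URL is a placeholder; any path on the site works.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
        status = "allowed" if can_fetch(ROBOTS_TXT, bot) else "blocked"
        print(f"{bot}: {status}")
```

Run as written, the sketch reports GPTBot as blocked and the other two agents as allowed, because no group names them and the file has no `User-agent: *` fallback.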

The 14 crawlers at a glance

Crawler	Vendor	Type	Purpose
GPTBot	OpenAI	Training	Collects training data for GPT models
ChatGPT-User	OpenAI	User-triggered	Fetches when a user asks ChatGPT
OAI-SearchBot	OpenAI	AI search	Powers ChatGPT search results
ClaudeBot	Anthropic	Training	Collects training data for Claude
Claude-Web	Anthropic	User-triggered	Real-time fetches inside Claude
anthropic-ai	Anthropic	Training	Legacy training crawler
PerplexityBot	Perplexity	AI search	Indexes for Perplexity answers
Perplexity-User	Perplexity	User-triggered	User-triggered fetches
Google-Extended	Google	Training	Opt-out token for Gemini training
Applebot-Extended	Apple	Training	Opt-out token for Apple Intelligence
Bytespider	ByteDance	Training	Powers Doubao + TikTok recommendations
Amazonbot	Amazon	Training	Feeds Amazon Nova and Alexa+
CCBot	Common Crawl	Archive	Feeds dozens of LLMs downstream
Meta-ExternalAgent	Meta	Training	Powers Meta AI and Llama

How to audit your current configuration

The fastest way to see your current state across all 14 crawlers is to run an AI Visibility audit: the report renders a matrix showing exactly which bots are Allowed, Blocked, or Unspecified by your current robots.txt. Unspecified means no group names the bot, so it follows your User-agent: * rules if you have them and is allowed otherwise. A site with no robots.txt falls back to “Allowed” by default for every bot, which is usually the right baseline for most marketing sites. If you do choose to block training crawlers, document the decision in a comment block at the top of robots.txt so future team members understand the intent. Finally, re-check the list every quarter against each vendor’s published crawler documentation. New AI crawlers ship roughly every 90 days, and old ones occasionally rename (Claude-Web replaced an earlier identifier in 2025).
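For a quick local approximation of such a matrix, the three states can be derived from a robots.txt file with the Python standard library alone. This is a simplified sketch and not the audit tool mentioned above: `AI_BOTS`, `audit`, and the `SAMPLE` file are hypothetical names and data. It also only tests the site root, so a bot blocked from a subdirectory still shows as Allowed.

```python
from urllib.robotparser import RobotFileParser

# The 14 tokens listed in this article.
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-Web",
    "anthropic-ai", "PerplexityBot", "Perplexity-User", "Google-Extended",
    "Applebot-Extended", "Bytespider", "Amazonbot", "CCBot",
    "Meta-ExternalAgent",
]


def audit(robots_txt):
    """Map each bot to 'Allowed', 'Blocked', or 'Unspecified'."""
    lines = robots_txt.splitlines()

    # Collect every token that appears on a User-agent line.
    named = set()
    for line in lines:
        content = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = content.partition(":")
        if sep and key.strip().lower() == "user-agent":
            named.add(value.strip().lower())

    parser = RobotFileParser()
    parser.parse(lines)

    report = {}
    for bot in AI_BOTS:
        if bot.lower() not in named:
            report[bot] = "Unspecified"  # falls back to * rules or default
        elif parser.can_fetch(bot, "/"):
            report[bot] = "Allowed"
        else:
            report[bot] = "Blocked"
    return report


# Hypothetical robots.txt with the decision documented in a comment.
SAMPLE = """\
# Training crawlers blocked: our content is licensed.
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /admin/
"""

if __name__ == "__main__":
    for bot, state in audit(SAMPLE).items():
        print(f"{bot:20} {state}")
```

Passing an empty string to `audit` models a site with no robots.txt, in which case every bot comes back as Unspecified and is allowed by default.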

Frequently asked questions

Will blocking GPTBot hurt my Google search rankings? No. GPTBot only affects OpenAI’s training data. Google’s crawlers are separate (Googlebot for search, Google-Extended for Gemini training). Blocking GPTBot does not change anything in Google Search.

What is the difference between Google-Extended and Googlebot? Googlebot powers traditional Google Search and is essential for SEO. Google-Extended is an opt-out signal specifically for Gemini training. Blocking Google-Extended does not affect Search.

Should I block CCBot? Probably not. CCBot feeds the Common Crawl dataset, which is consumed by dozens of LLMs — blocking it cuts you out of many open-source models at once. The decision depends on whether you mind being in training data at all.