llms.txt and robots.txt for AI Crawlers
Updated: April 18, 2026
llms.txt is a root-level Markdown file that lists your site's key pages for AI crawlers. robots.txt should explicitly allow GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, and Google-Extended, so that no broader rule blocks them by accident. Without these files, AI engines may miss or deprioritize your content.
These two files together form the access and discovery layer of GEO (generative engine optimization), and they are the first thing to configure before any other optimization.
robots.txt: Allow All AI Crawlers
A critical and common problem: many sites have a wildcard group in robots.txt (User-agent: * followed by Disallow: /) that unintentionally blocks every crawler without its own group, AI crawlers included. These sites are effectively invisible to ChatGPT, Perplexity, Claude, and Google AI Overviews.
The correct robots.txt for GEO explicitly allows each AI crawler by name:
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: BingBot
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
# llms.txt: https://yoursite.com/llms.txt (comment only; the Sitemap directive is for sitemap files)
Note: ClaudeBot handles Anthropic training data crawling, Claude-User handles retrieval for Claude.ai browsing, and Claude-SearchBot handles search-specific crawling. All three should be explicitly allowed.
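One way to catch the wildcard problem described above is to run a robots.txt file through a parser and ask, crawler by crawler, whether a given URL may be fetched. The sketch below uses Python's standard-library urllib.robotparser. The function name blocked_crawlers and the sample robots.txt are illustrative assumptions, not part of any tool mentioned here, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens taken from the robots.txt example above.
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Google-Extended",
]

def blocked_crawlers(robots_txt: str, url: str) -> list[str]:
    """Return the AI crawler tokens that this robots.txt blocks for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, url)]

# Hypothetical robots.txt: a wildcard block with a single named exception.
sample = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

# Every crawler except GPTBot falls through to the wildcard group.
print(blocked_crawlers(sample, "https://yoursite.com/geo-guide"))
```

An empty list means none of the listed crawlers is blocked for that URL, at least according to this parser.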
robots.txt: Disallow Specific Paths (Optional)
If you want AI crawlers to access most of your site but skip specific paths:
User-agent: GPTBot
Allow: /
Disallow: /private/
Disallow: /admin/
Disallow: /drafts/
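This pattern gives GPTBot access to all public content while keeping it out of private areas. The group applies only to the named user agent, so repeat it for each crawler you want to restrict, or list several User-agent lines above the same rules. robots.txt is advisory rather than enforced, so genuinely private content still needs authentication.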
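Since a group only covers the user agents named at its top, one approach is to generate a single group shared by several crawlers. The sketch below is a minimal illustration; build_group is a hypothetical helper. It deliberately writes the Disallow lines before Allow: /. Under RFC 9309 the longest matching rule wins regardless of order, but Python's standard-library parser applies the first rule that matches, so this ordering keeps both interpretations in agreement.

```python
from urllib.robotparser import RobotFileParser

def build_group(crawlers: list[str], disallowed: list[str]) -> str:
    """Build one robots.txt group shared by several crawlers."""
    lines = [f"User-agent: {name}" for name in crawlers]
    # Disallow lines come first so first-match parsers agree with
    # longest-match (RFC 9309) parsers.
    lines += [f"Disallow: {path}" for path in disallowed]
    lines.append("Allow: /")
    return "\n".join(lines) + "\n"

group = build_group(
    ["GPTBot", "ClaudeBot", "PerplexityBot"],
    ["/private/", "/admin/", "/drafts/"],
)
print(group)

# Check the generated rules with the standard-library parser.
parser = RobotFileParser()
parser.parse(group.splitlines())
public_ok = parser.can_fetch("ClaudeBot", "https://yoursite.com/geo-guide")
private_ok = parser.can_fetch("ClaudeBot", "https://yoursite.com/private/notes")
print(public_ok, private_ok)
```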
llms.txt: The AI-Readable Site Map
llms.txt is placed at the root of your site (/llms.txt) and provides AI crawlers with a structured, human-readable overview of your site’s content — similar to robots.txt but focused on semantic comprehension rather than access control.
The format is Markdown:
# My Site Name
> One-line description of what the site is and who it serves.
## Main Content
- [Complete GEO Guide](https://yoursite.com/geo-guide): What GEO is and how it works
- [GEO vs SEO](https://yoursite.com/geo-vs-seo): Key differences between GEO, SEO, and AEO
- [Technical Implementation](https://yoursite.com/technical): Meta tags, schema, and robots.txt setup
## Tools and Resources
- [GEO Checklist](https://yoursite.com/checklist): 22-item implementation checklist
- [Schema Generator](https://yoursite.com/tools): Free JSON-LD generator
## About
- [About Us](https://yoursite.com/about): Team credentials and expertise
## Policies
- [Terms of Use](https://yoursite.com/terms)
- [Privacy Policy](https://yoursite.com/privacy)
The key elements of an effective llms.txt:
- Site name as H1
- One-line description (the > blockquote): this is the text AI engines are most likely to quote when describing your site
- Content sections organized by topic
- Descriptive link text — each link should explain what the page is about, not just its title
- No duplicates — don’t list the same page twice
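The elements listed above can also be generated rather than written by hand, which makes the no-duplicates rule easy to enforce. The following is a minimal sketch under that assumption; build_llms_txt and its input shape (section title mapped to a list of link text, URL, and note) are hypothetical choices, not a standard API.

```python
def build_llms_txt(name, summary, sections):
    """Render llms.txt Markdown: H1, blockquote summary, then H2 link sections.

    `sections` maps a section title to a list of (text, url, note) tuples.
    """
    seen = set()
    lines = [f"# {name}", "", f"> {summary}"]
    for title, links in sections.items():
        lines += ["", f"## {title}", ""]
        for text, url, note in links:
            if url in seen:
                # Enforce the "no duplicates" rule from the list above.
                raise ValueError(f"duplicate URL: {url}")
            seen.add(url)
            suffix = f": {note}" if note else ""
            lines.append(f"- [{text}]({url}){suffix}")
    return "\n".join(lines) + "\n"

text = build_llms_txt(
    "My Site Name",
    "One-line description of what the site is and who it serves.",
    {
        "Main Content": [
            ("Complete GEO Guide", "https://yoursite.com/geo-guide",
             "What GEO is and how it works"),
        ],
        "Policies": [
            ("Privacy Policy", "https://yoursite.com/privacy", ""),
        ],
    },
)
print(text)
```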
XML Sitemap with lastmod
The XML sitemap works alongside llms.txt for discovery. The <lastmod> tag is the sitemap field crawlers are most likely to use for prioritization; Google, for example, documents that it ignores <changefreq> and <priority>.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yoursite.com/geo-guide</loc>
    <lastmod>2026-04-18</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://yoursite.com/technical</loc>
    <lastmod>2026-04-18</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
Update <lastmod> whenever you make substantial content changes, and only then. It is a key signal crawlers use to decide whether to re-crawl a page, and it is only useful if it stays accurate.
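Keeping <lastmod> accurate is easier when the sitemap is generated from the dates you already track. The sketch below assumes you can supply a mapping from page URL to last-modified date (for example from a CMS or file timestamps); build_sitemap is a hypothetical helper built on the standard-library xml.etree.ElementTree, and it omits the optional <changefreq> and <priority> fields.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages: dict[str, date]) -> str:
    """Render a <urlset> sitemap with a <lastmod> entry for every URL."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, modified in pages.items():
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # W3C date format (YYYY-MM-DD), as in the example above.
        ET.SubElement(url, "lastmod").text = modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

sitemap_xml = build_sitemap({
    "https://yoursite.com/geo-guide": date(2026, 4, 18),
    "https://yoursite.com/technical": date(2026, 4, 18),
})
print(sitemap_xml)
```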
Segmenting Sitemaps by Content Type
For larger sites, segment sitemaps by content type to help AI crawlers focus on the most relevant sections:
/sitemap.xml (index sitemap)
/sitemap-guides.xml (articles and guides)
/sitemap-tools.xml (tool pages)
/sitemap-blog.xml (blog posts)
Reference all sitemaps in robots.txt:
Sitemap: https://yoursite.com/sitemap.xml
Sitemap: https://yoursite.com/sitemap-guides.xml
Sitemap: https://yoursite.com/sitemap-tools.xml
Sitemap: https://yoursite.com/sitemap-blog.xml
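The index sitemap at /sitemap.xml is itself a small XML file that points to each segment. A minimal sketch of generating it follows, using the <sitemapindex> format from the sitemaps.org protocol; build_sitemap_index is a hypothetical helper, and the segment URLs simply mirror the file names listed above.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap_index(sitemap_urls: list[str]) -> str:
    """Render a <sitemapindex> document listing each segmented sitemap."""
    index = ET.Element("sitemapindex", xmlns=NS)
    for loc in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = loc
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

segments = [
    f"https://yoursite.com/sitemap-{kind}.xml"
    for kind in ("guides", "tools", "blog")
]
index_xml = build_sitemap_index(segments)
print(index_xml)
```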
Verifying AI Crawler Access
To verify AI crawlers can access your site:
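- Use Google Search Console to check for crawl errors from Googlebot (Google-Extended is a robots.txt control token rather than a separate crawler, so it has no user-agent string or crawl report of its own)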
- Check server logs for GPTBot, ClaudeBot, and PerplexityBot visits
- Use Cloudflare Analytics (if applicable) to monitor AI bot traffic
- Test with Perplexity by searching for content from your site and checking if it’s cited
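The server-log check above can be scripted by counting requests whose user-agent string contains a known crawler token. This is a rough sketch: count_ai_bot_hits is a hypothetical helper, the log lines are made up for illustration, and substring matching does not prove a request really came from the named operator, since user-agent strings can be spoofed. Google-Extended is left out because it does not appear as a user agent in requests.

```python
from collections import Counter

AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot",
]

def count_ai_bot_hits(log_lines):
    """Count requests per AI crawler by case-insensitive substring match."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # count each request once
    return hits

# Made-up access-log lines in a combined-log-like format.
sample_log = [
    '203.0.113.5 - - [18/Apr/2026:10:00:00 +0000] "GET /geo-guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [18/Apr/2026:10:02:11 +0000] "GET /llms.txt HTTP/1.1" 200 812 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [18/Apr/2026:10:03:40 +0000] "GET /technical HTTP/1.1" 200 4300 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [18/Apr/2026:10:04:02 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
hits = count_ai_bot_hits(sample_log)
print(dict(hits))
```

In practice you would pass the lines of your real access log, for example by iterating over an open file handle.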
Implementation Checklist
- robots.txt allows: GPTBot, OAI-SearchBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Google-Extended, BingBot
- No wildcard Disallow: / that blocks AI crawlers without explicit Allow exceptions
- llms.txt at site root with site description and all major pages listed
- XML sitemap with <lastmod> on every URL
- Sitemap URL referenced in robots.txt
- llms.txt URL also referenced in robots.txt
- <lastmod> updated when content changes