llms.txt and robots.txt for AI Crawlers

llms.txt is a root-level Markdown file that summarizes your site and links to its key pages for AI crawlers. robots.txt must not block GPTBot, OAI-SearchBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, or Google-Extended; naming each one explicitly guards against an overly broad wildcard rule. Without these files configured correctly, AI engines may miss or deprioritize your content.

These two files together form the access and discovery layer of GEO — the first thing to configure before any other optimization.

robots.txt: Allow All AI Crawlers

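A critical and common problem: many sites pair a wildcard User-agent: * group with Disallow: / in their robots.txt, which blocks every crawler that has no group of its own, AI crawlers included. Crawlers that honor robots.txt will not fetch such a site, so its pages are unlikely to be read or cited by ChatGPT, Perplexity, Claude, or Google AI Overviews.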

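A robots.txt set up for GEO allows each AI crawler by name. A crawler follows the group that names it and ignores the wildcard group, so these entries take effect even when User-agent: * is more restrictive: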

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: BingBot
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
# Non-standard: llms.txt is Markdown, not a sitemap file, so crawlers may ignore this line
Sitemap: https://yoursite.com/llms.txt

Note: ClaudeBot crawls pages for Anthropic's training data, Claude-User fetches pages on behalf of a Claude user whose question needs them, and Claude-SearchBot crawls to support search results in Claude. Each token is matched separately in robots.txt, so allow all three by name.
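The wildcard problem described above can be checked offline before deploying a robots.txt. The following is a minimal sketch using Python's standard urllib.robotparser; the function name blocked_crawlers and the two sample rule sets are illustrative assumptions, and the crawler list simply mirrors the example above. Note that this parser applies rules in file order rather than by longest match, which does not matter for the whole-site Allow: / and Disallow: / rules tested here.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens taken from the robots.txt example in this article.
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Google-Extended", "BingBot",
]


def blocked_crawlers(robots_txt: str, url: str = "https://yoursite.com/") -> list[str]:
    """Return the crawler tokens that this robots.txt text bars from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, url)]


# The accidental pattern: a wildcard group that blocks everything.
WILDCARD_BLOCK = """\
User-agent: *
Disallow: /
"""

# The same wildcard group plus a named exception for a single crawler.
WITH_EXCEPTION = WILDCARD_BLOCK + """
User-agent: GPTBot
Allow: /
"""

if __name__ == "__main__":
    print(blocked_crawlers(WILDCARD_BLOCK))   # every crawler in AI_CRAWLERS
    print(blocked_crawlers(WITH_EXCEPTION))   # every crawler except GPTBot
```

Passing the text of your own robots.txt to blocked_crawlers should return an empty list when every crawler on the list is allowed.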

robots.txt: Disallow Specific Paths (Optional)

If you want AI crawlers to access most of your site but skip specific paths:

User-agent: GPTBot
Allow: /
Disallow: /private/
Disallow: /admin/
Disallow: /drafts/

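This pattern gives GPTBot access to all public content while asking it to skip the listed paths; repeat the Disallow lines in the group for every other crawler you allow. robots.txt is advisory rather than access control, so protect genuinely private areas with authentication as well.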
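The reason Allow: / and Disallow: /private/ can coexist is the precedence rule in the Robots Exclusion Protocol (RFC 9309): the longest matching path wins, and Allow wins a tie. The sketch below illustrates that rule only. It is a simplified assumption-laden model (plain path prefixes, no * or $ wildcards), and is_allowed and GPTBOT_RULES are hypothetical names rather than part of any library.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence to (directive, path_prefix) rules.

    Simplified sketch: plain prefixes only, no `*` or `$` wildcards.
    """
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific and wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The GPTBot group from the example above, as (directive, prefix) pairs.
GPTBOT_RULES = [
    ("Allow", "/"),
    ("Disallow", "/private/"),
    ("Disallow", "/admin/"),
    ("Disallow", "/drafts/"),
]

if __name__ == "__main__":
    for path in ["/geo-guide", "/private/report", "/admin/", "/privateer"]:
        print(path, is_allowed(GPTBOT_RULES, path))
```

Because /private/ is longer than /, it overrides the blanket Allow for anything under that directory, while a path such as /privateer does not match the trailing-slash prefix and stays allowed.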

llms.txt: The AI-Readable Site Map

llms.txt is placed at the root of your site (/llms.txt) and gives AI crawlers a structured, human-readable overview of your site's content. It sits next to robots.txt at the root, but it describes what the content is about rather than who may access it. It is a proposed convention rather than a formal standard.

The format is Markdown:

# My Site Name
> One-line description of what the site is and who it serves.

## Main Content
- [Complete GEO Guide](https://yoursite.com/geo-guide): What GEO is and how it works
- [GEO vs SEO](https://yoursite.com/geo-vs-seo): Key differences between GEO, SEO, and AEO
- [Technical Implementation](https://yoursite.com/technical): Meta tags, schema, and robots.txt setup

## Tools and Resources
- [GEO Checklist](https://yoursite.com/checklist): 22-item implementation checklist
- [Schema Generator](https://yoursite.com/tools): Free JSON-LD generator

## About
- [About Us](https://yoursite.com/about): Team credentials and expertise

## Policies
- [Terms of Use](https://yoursite.com/terms)
- [Privacy Policy](https://yoursite.com/privacy)

The key elements of an effective llms.txt:

Site name as H1
One-line description (the > blockquote): AI engines may draw on this when describing your site
Content sections organized by topic
Descriptive link notes: follow each link with a short description of what the page covers, not just its title
No duplicates: don't list the same page twice
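The elements listed above can be checked mechanically. This is a rough sketch of such a check, not an official validator: lint_llms_txt, the regular expression, and the problem messages are all assumptions made for illustration, and the rules are only the ones named in this section.

```python
import re

# Matches "- [Title](url)" with an optional ": description" after it.
LINK_RE = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.+))?$"
)


def lint_llms_txt(text: str) -> list[str]:
    """Return a list of problems found in llms.txt content (empty if none)."""
    problems = []
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]

    if not lines or not lines[0].startswith("# "):
        problems.append("first line should be an H1 with the site name")
    if len(lines) < 2 or not lines[1].startswith("> "):
        problems.append("second line should be a '> ' one-line description")
    if not any(ln.startswith("## ") for ln in lines):
        problems.append("no '## ' content sections found")

    seen = set()
    for ln in lines:
        match = LINK_RE.match(ln)
        if not match:
            continue
        url = match.group("url")
        if url in seen:
            problems.append(f"duplicate link: {url}")
        seen.add(url)
        if not match.group("note"):
            problems.append(f"link has no description: {url}")
    return problems


GOOD = """\
# My Site Name
> One-line description of what the site is and who it serves.

## Main Content
- [Complete GEO Guide](https://yoursite.com/geo-guide): What GEO is and how it works
"""

if __name__ == "__main__":
    print(lint_llms_txt(GOOD))  # no problems expected for this sample
```

The check is deliberately strict about descriptions, so bare links such as the policy pages in the example above would be reported; relax that rule if you treat descriptions as optional.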

XML Sitemap with lastmod

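The XML sitemap works alongside llms.txt for discovery. Of the optional sitemap fields, <lastmod> is the one most likely to influence crawl prioritization: Google documents that it ignores <changefreq> and <priority> and relies on <lastmod> only when it is consistently accurate, while AI crawler operators generally do not document how they treat these fields.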

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yoursite.com/geo-guide</loc>
    <lastmod>2026-04-18</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://yoursite.com/technical</loc>
    <lastmod>2026-04-18</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Update <lastmod> every time you make substantial content changes, and only then. An accurate date is the clearest freshness signal a sitemap can give a crawler that is deciding whether to re-crawl a page.
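Keeping <lastmod> accurate is easier when the sitemap is generated from the dates your content actually changed. The sketch below shows one way to do that with Python's standard xml.etree.ElementTree. The build_sitemap function and the PAGES mapping are hypothetical; in practice the dates would come from your CMS or version control. It emits only <loc> and <lastmod>, leaving out the optional <changefreq> and <priority> fields.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: dict[str, date]) -> bytes:
    """Build a sitemap from a mapping of page URL to last content change."""
    # Serialize the sitemap namespace as the default xmlns, without a prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, last_changed in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        # W3C date format (YYYY-MM-DD), as in the example above.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = last_changed.isoformat()
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Illustrative data: URLs and dates reused from the example above.
PAGES = {
    "https://yoursite.com/geo-guide": date(2026, 4, 18),
    "https://yoursite.com/technical": date(2026, 4, 18),
}

if __name__ == "__main__":
    print(build_sitemap(PAGES).decode("utf-8"))
```

Because the date is passed in per page, a page's <lastmod> changes only when its entry in the mapping changes, not every time the sitemap is rebuilt.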

Segmenting Sitemaps by Content Type

For larger sites, segment sitemaps by content type to help AI crawlers focus on the most relevant sections:

/sitemap.xml          (index sitemap)
/sitemap-guides.xml   (articles and guides)
/sitemap-tools.xml    (tool pages)
/sitemap-blog.xml     (blog posts)

Reference all sitemaps in robots.txt:

Sitemap: https://yoursite.com/sitemap.xml
Sitemap: https://yoursite.com/sitemap-guides.xml
Sitemap: https://yoursite.com/sitemap-tools.xml
Sitemap: https://yoursite.com/sitemap-blog.xml
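The index sitemap at /sitemap.xml is itself a small XML file that points to the segmented sitemaps, using a <sitemapindex> root with one <sitemap> entry per child file. A minimal sketch of generating it follows; build_sitemap_index and CHILD_SITEMAPS are hypothetical names, and the child paths are the ones listed in the segmentation example above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url: str, child_paths: list[str]) -> bytes:
    """Build a sitemap index that lists each segmented sitemap file."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for path in child_paths:
        entry = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        # Index entries must be absolute URLs.
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = base_url.rstrip("/") + path
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


# Child sitemaps from the segmentation example in this article.
CHILD_SITEMAPS = ["/sitemap-guides.xml", "/sitemap-tools.xml", "/sitemap-blog.xml"]

if __name__ == "__main__":
    print(build_sitemap_index("https://yoursite.com", CHILD_SITEMAPS).decode("utf-8"))
```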

Verifying AI Crawler Access

To verify AI crawlers can access your site:

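Use Google Search Console to check for Googlebot crawl errors; Google-Extended is a robots.txt control token rather than a separate crawler, so it has no user agent or crawl report of its own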
Check server logs for GPTBot, ClaudeBot, and PerplexityBot visits
Use Cloudflare Analytics (if applicable) to monitor AI bot traffic
Test with Perplexity by searching for content from your site and checking if it’s cited
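The server-log check above can be scripted. The following sketch counts requests per AI crawler by looking for each crawler's token in the log line. It assumes the user-agent string appears somewhere in each line, as it does in common access-log formats; count_ai_bot_hits and the sample lines are made up for illustration and are not real traffic. Google-Extended is left out because it never appears as a user agent.

```python
from collections import Counter

# Tokens to look for in the User-Agent portion of each log line.
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot",
    "Claude-User", "Claude-SearchBot", "PerplexityBot",
]


def count_ai_bot_hits(log_lines) -> Counter:
    """Count log lines per AI crawler token (case-insensitive substring match)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # attribute each request to one crawler only
    return hits


# Hypothetical log lines with simplified user-agent strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [18/Apr/2026:10:00:00 +0000] "GET /geo-guide HTTP/1.1" 200 5120 "-" "compatible; GPTBot/1.0"',
    '203.0.113.9 - - [18/Apr/2026:10:01:00 +0000] "GET /llms.txt HTTP/1.1" 200 812 "-" "compatible; ClaudeBot/1.0"',
    '203.0.113.5 - - [18/Apr/2026:10:02:00 +0000] "GET /technical HTTP/1.1" 200 4096 "-" "compatible; GPTBot/1.0"',
    '198.51.100.7 - - [18/Apr/2026:10:03:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for bot, count in count_ai_bot_hits(SAMPLE_LOG).most_common():
        print(bot, count)
```

To run it against real traffic, pass an open log file instead of SAMPLE_LOG; the file's location depends on your server setup. A user-agent string can be spoofed, so treat the counts as an indication rather than proof of crawler visits.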

Implementation Checklist

robots.txt allows: GPTBot, OAI-SearchBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Google-Extended, BingBot
No User-agent: * group with Disallow: / unless every AI crawler has its own group with Allow: /
llms.txt at site root with site description and all major pages listed
XML sitemap with <lastmod> on every URL
Sitemap URL referenced in robots.txt
llms.txt URL also referenced in robots.txt (optional and non-standard, since the Sitemap directive is defined for sitemap files)
<lastmod> updated when content changes