# How to Optimize Static HTML for AI Crawlers (2026 Guide)

> A reference for making static HTML pages discoverable and citable by AI
> crawlers like GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and
> Google-Extended. Covers page-level signals, robots.txt, llms.txt, and the
> gotchas that block citations.

By Michael McGrory, Solutions Engineer (Partnerships) at Cloudflare.
Last updated: 2026-05-02.

## Key takeaways

- Most AI crawlers do not execute JavaScript. Static HTML or server-side
  rendering is the baseline requirement.
- The pages most likely to be cited are original research, definitions,
  comparisons, and how-tos, in that order.
- Raw requests from GPTBot grew 305% year over year between May 2024 and
  May 2025; ChatGPT-User grew 2,825% and PerplexityBot grew 157,490%
  (Cloudflare, 2025).
- Add `llms.txt`, a markdown mirror per page, JSON-LD, and Content Signals
  in `robots.txt`.
- Recency matters: LLM-based rerankers favor newer content, shifting the
  publication dates of the results they surface forward by up to 4.78
  years, per a 2025 study.

## The three kinds of AI crawlers

| Category        | Examples                                                                       | Honors robots.txt | What they want                                  |
| --------------- | ------------------------------------------------------------------------------ | ----------------- | ----------------------------------------------- |
| Training        | GPTBot, ClaudeBot, CCBot, Bytespider, Meta-ExternalAgent, Google-Extended      | Yes               | Breadth, novelty, original data                 |
| Index / search  | OAI-SearchBot, PerplexityBot, Claude-SearchBot, Googlebot                      | Yes               | Stable URLs, structured pages, fresh content    |
| User-triggered  | ChatGPT-User, Claude-User, Perplexity-User, Meta-ExternalFetcher               | Largely no        | Authoritative pages on demand, for citations    |
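
When reading server logs, it can help to bucket requests by the categories in the table above. The following is a minimal sketch, assuming each bot includes its published token somewhere in the `User-Agent` header; the function name and the sample header string are illustrative, not taken from any vendor's documentation. User-agent strings can be spoofed, so treat the result as a hint rather than verification.

```python
from typing import Optional

# Tokens copied from the table above, grouped by category.
# Note: Google-Extended is a robots.txt control token rather than a request
# user agent, so it is unlikely to appear in access logs.
CRAWLER_CATEGORIES = {
    "training": [
        "GPTBot", "ClaudeBot", "CCBot", "Bytespider",
        "Meta-ExternalAgent", "Google-Extended",
    ],
    "search": ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Googlebot"],
    "user-triggered": [
        "ChatGPT-User", "Claude-User", "Perplexity-User", "Meta-ExternalFetcher",
    ],
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the crawler category for a User-Agent header, or None."""
    ua = user_agent.lower()
    for category, tokens in CRAWLER_CATEGORIES.items():
        # Case-insensitive substring match on the bot token.
        if any(token.lower() in ua for token in tokens):
            return category
    return None


if __name__ == "__main__":
    # Illustrative header value; real strings vary by vendor and version.
    sample = "Mozilla/5.0 (compatible; GPTBot/1.0)"
    print(classify_user_agent(sample))
```

Substring matching is enough here because none of the listed tokens is contained in another. A production setup would pair this with the IP-range verification mentioned under Gotchas.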

## Page types most likely to be crawled and cited

1. Original research and proprietary data
2. Definitional / glossary pages ("What is X")
3. Comparison pages ("X vs Y")
4. How-to and step-by-step guides
5. Pricing or cost pages with concrete numbers in plain text
6. FAQ and Q&A pages
7. Reference and API documentation
8. Programmatic pages with consistent schemas
9. Recent news and time-stamped analysis (especially YMYL)
10. Free tools and calculators

## Page-level signals to add

- Server-side rendered HTML
- One `<h1>` phrased as the user's question
- Direct answer in the lead paragraph plus a Key takeaways block
- Short paragraphs (2–4 sentences) and lists
- Stats and quotes in plain text with units, dates, and an inline source
- Visible "Last updated" date plus `<time datetime="...">` and JSON-LD
  `dateModified`
- Author byline with credentials
- JSON-LD structured data (Article, FAQPage, HowTo, Product, Dataset)
- Stable canonical URL
- Markdown mirror at `page.html.md`
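
Several of the signals above (the visible date, `<time>`, JSON-LD `dateModified`, the byline, and the canonical URL) should agree with each other, so generating them from one source helps. Here is a rough sketch of that idea for a static-site build step. The function name is hypothetical, the URL is a placeholder, and the schema.org properties used are only a small subset of what an `Article` can carry.

```python
import json
from datetime import date
from html import escape


def article_head_snippets(headline: str, author: str, job_title: str,
                          url: str, modified: date) -> dict:
    """Build matching HTML snippets for one page from a single set of values."""
    iso = modified.isoformat()
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author, "jobTitle": job_title},
        "dateModified": iso,
        "mainEntityOfPage": url,
    }
    # Escape "</" so a value can never close the <script> element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return {
        "json_ld": f'<script type="application/ld+json">\n{payload}\n</script>',
        "canonical": f'<link rel="canonical" href="{escape(url, quote=True)}">',
        "visible_date": f'<p>Last updated: <time datetime="{iso}">{iso}</time></p>',
    }


if __name__ == "__main__":
    snippets = article_head_snippets(
        headline="How to Optimize Static HTML for AI Crawlers (2026 Guide)",
        author="Michael McGrory",
        job_title="Solutions Engineer (Partnerships)",
        url="https://example.com/page",  # placeholder URL
        modified=date(2026, 5, 2),
    )
    for html in snippets.values():
        print(html)
```

Because all three snippets come from the same `modified` value, the visible date and the machine-readable date cannot drift apart.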

## Site-level signals to add

- `robots.txt` with explicit allow rules and Content Signals directives
- `llms.txt` at the root, listing your most useful pages in markdown
- `sitemap.xml` with accurate `<lastmod>` per URL
- Allowlist verified AI bots in your WAF or Bot Management
- Backlinks from authoritative sources
- Brand mentions across UGC platforms (Reddit, Wikipedia, YouTube, Stack
  Overflow)
- Fast TTFB and 200-status responses to bots
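
The first two items can be generated at build time. The sketch below writes a `robots.txt` with a Content Signals line and an `llms.txt` in the layout described at llmstxt.org (an H1, a blockquote summary, then link lists under H2 headings). The function names, the `Pages` heading, and the example.com URLs are placeholders, and the `Content-Signal` syntax reflects my reading of contentsignals.org, so check it against the current spec before deploying.

```python
def build_robots_txt(sitemap_url: str, allowed_bots: list, signals: dict) -> str:
    """Render robots.txt with a Content-Signal line and explicit allow rules."""
    # Assumed syntax: comma-separated key=yes|no pairs,
    # e.g. "search=yes, ai-train=no".
    signal_line = ", ".join(
        f"{name}={'yes' if allowed else 'no'}" for name, allowed in signals.items()
    )
    lines = ["User-agent: *", f"Content-Signal: {signal_line}", "Allow: /", ""]
    for bot in allowed_bots:
        # One explicit group per bot you want to allow.
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


def build_llms_txt(site_name: str, summary: str, pages: list) -> str:
    """Render llms.txt: H1 title, blockquote summary, then a markdown link list."""
    lines = [f"# {site_name}", "", f"> {summary}", "", "## Pages", ""]
    for title, url, description in pages:
        lines.append(f"- [{title}]({url}): {description}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(
        sitemap_url="https://example.com/sitemap.xml",
        allowed_bots=["GPTBot", "OAI-SearchBot", "PerplexityBot"],
        signals={"search": True, "ai-input": True, "ai-train": False},
    ))
    print(build_llms_txt(
        site_name="Example Site",
        summary="Guides on making static HTML citable by AI crawlers.",
        pages=[(
            "How to Optimize Static HTML for AI Crawlers",
            "https://example.com/page.html.md",  # markdown mirror of the page
            "Page-level and site-level signals, plus common gotchas.",
        )],
    ))
```

Linking `llms.txt` entries to the markdown mirrors rather than the HTML pages keeps the two signals consistent with each other.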

## Gotchas

- JS-rendered SPAs are mostly invisible. Test with
  `curl -A "GPTBot" https://example.com/page` and check that the main
  content appears in the raw response body.
- Aggressive bot management blocks legit AI bots. Allowlist verified bots
  by user-agent and published IP ranges.
- `robots.txt` is voluntary; user-triggered fetchers often ignore it
  because the request is made on behalf of a user.
- Stale URLs kill citations. Keep canonicals stable.
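
The JavaScript check in the first gotcha can be automated. This is a minimal sketch, assuming that a crawler which does not run JavaScript sees roughly the text left in the raw HTML once `<script>`, `<style>`, and `<template>` contents are ignored. The helper names are hypothetical, and `fetch_as_bot` only sets the user-agent string, so a site that verifies bots by IP may respond differently to it than to the real crawler.

```python
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping content a non-JS client never renders."""
    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(raw_html: str) -> str:
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Collapse whitespace so phrases match across line breaks.
    return " ".join(" ".join(parser.chunks).split())


def missing_phrases(raw_html: str, phrases: list) -> list:
    """Return the phrases that do not appear in the un-rendered HTML text."""
    text = visible_text(raw_html).lower()
    return [p for p in phrases if p.lower() not in text]


def fetch_as_bot(url: str, user_agent: str = "GPTBot") -> str:
    """Fetch a page with a bot-like User-Agent, similar to `curl -A`."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    ssr = "<html><body><h1>What is X?</h1><p>X is a thing.</p></body></html>"
    spa = '<html><body><div id="root"></div><script>render("What is X?")</script></body></html>'
    print(missing_phrases(ssr, ["What is X?"]))  # server-rendered page
    print(missing_phrases(spa, ["What is X?"]))  # client-rendered shell
```

To check a live page, pass `fetch_as_bot("https://example.com/page")` as the first argument and list a few phrases that must be present, such as the `<h1>` text and a sentence from the lead paragraph.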

## Sources

- Cloudflare blog, "From Googlebot to GPTBot: who's crawling your site in
  2025": https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
- Cloudflare AI Crawl Control: https://developers.cloudflare.com/ai-crawl-control/
- OpenAI bot reference: https://platform.openai.com/docs/bots
- Perplexity bot reference: https://docs.perplexity.ai/guides/bots
- llms.txt specification: https://llmstxt.org/
- Content Signals: https://contentsignals.org/
- Ahrefs, "How to Earn LLM Citations": https://ahrefs.com/blog/llm-citations/
- Semrush GEO guide: https://www.semrush.com/blog/generative-engine-optimization/
