# static-html-for-ai — full content

> A reference site demonstrating how to optimize static HTML pages for AI
> crawlers. The full content of the site's primary pages is concatenated
> below for direct ingestion into an LLM context window.

---

## Page: How to Optimize Static HTML for AI Crawlers (2026 Guide)

URL: https://ai.michaelmcgrory.org/

Author: Michael McGrory, Solutions Engineer (Partnerships) at Cloudflare.
Last updated: 2026-05-02.

### Key takeaways

- Most AI crawlers do not execute JavaScript. Static HTML or server-side
  rendering is the baseline requirement.
- The pages most likely to be cited are original research, definitions,
  comparisons, and how-tos, in that order.
- GPTBot raw requests grew 305% year over year (May 2024 to May 2025);
  over the same period ChatGPT-User requests grew 2,825% and PerplexityBot
  requests grew 157,490% (Cloudflare, 2025).
- Add `llms.txt`, a markdown mirror per page, JSON-LD, and Content Signals
  in `robots.txt`.
- Recency matters: per a 2025 study, when AI assistants rerank sources
  they shift the publication dates of cited results forward by up to
  4.78 years, favoring newer pages.

### Summary

If you want a static HTML page to be crawled and cited by AI assistants
like ChatGPT, Claude, Perplexity, and Google's AI Overviews, you need
three things in place:

1. The page content has to live in the initial HTML response, not behind
   JavaScript.
2. The page has to be structured so that a language model can extract a
   complete answer to a specific question from a single chunk.
3. The site has to declare its preferences and signals to crawlers via
   `robots.txt`, `sitemap.xml`, and (increasingly) `llms.txt`.
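
The first requirement can be checked offline. The following is a minimal sketch, assuming you already have the raw HTML response body as a string (for example from `curl`). It strips `<script>` and `<style>` content and tests whether a key phrase is still present. The function names and sample pages are illustrative, not part of any crawler's actual pipeline.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text that sits outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text a non-JavaScript client could read from raw_html."""
    parser = VisibleTextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


def answer_in_initial_html(raw_html: str, phrase: str) -> bool:
    """True if phrase appears in the markup without running any JavaScript."""
    return phrase.lower() in visible_text(raw_html).lower()


# Hypothetical sample pages: one static, one client-rendered shell.
STATIC_PAGE = (
    "<html><body><h1>What is llms.txt?</h1>"
    "<p>llms.txt is a markdown index.</p></body></html>"
)
SPA_SHELL = (
    '<html><body><div id="root"></div>'
    '<script>render("llms.txt is a markdown index.")</script></body></html>'
)

print(answer_in_initial_html(STATIC_PAGE, "llms.txt is a markdown index"))
print(answer_in_initial_html(SPA_SHELL, "llms.txt is a markdown index"))
```

The static page passes and the SPA shell fails, because in the shell the phrase exists only inside a script that a non-rendering crawler never executes.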

### Crawler taxonomy

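Training crawlers (GPTBot, ClaudeBot, CCBot, Bytespider,
Meta-ExternalAgent) bulk-crawl for model corpora and honor `robots.txt`.
Google-Extended belongs in this group as a `robots.txt` control token
rather than a separate crawler: Google's existing crawlers do the
fetching, and the token governs whether the content is used for training.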

Index/search crawlers (OAI-SearchBot, PerplexityBot, Claude-SearchBot,
Googlebot) build retrievable indexes for RAG and honor `robots.txt`.

User-triggered fetchers (ChatGPT-User, Claude-User, Perplexity-User,
Meta-ExternalFetcher) fire on demand when a user asks a question and
largely ignore `robots.txt` by design.
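
The taxonomy above can be applied to server logs with a simple lookup. This is a sketch only: it assumes the bot token appears as a substring of the `User-Agent` header, and the sample header strings are illustrative rather than copied from vendor documentation. The category names and `classify_user_agent` are hypothetical.

```python
# Bot tokens grouped as in the taxonomy above.
# Google-Extended is kept for completeness, but it is a robots.txt token
# and is not expected to appear in a request's User-Agent header.
CRAWLER_CATEGORIES = {
    "training": [
        "GPTBot", "ClaudeBot", "CCBot", "Bytespider",
        "Meta-ExternalAgent", "Google-Extended",
    ],
    "search": [
        "OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Googlebot",
    ],
    "user_triggered": [
        "ChatGPT-User", "Claude-User", "Perplexity-User",
        "Meta-ExternalFetcher",
    ],
}


def classify_user_agent(user_agent: str) -> str:
    """Return the taxonomy category for a User-Agent string, or 'unknown'."""
    ua = user_agent.lower()
    for category, tokens in CRAWLER_CATEGORIES.items():
        for token in tokens:
            if token.lower() in ua:
                return category
    return "unknown"


# Illustrative header values (assumed shapes, not authoritative strings).
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.1)"))
print(classify_user_agent("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
print(classify_user_agent("Mozilla/5.0 (compatible; PerplexityBot/1.0)"))
```

User-Agent headers are trivially spoofed, so a lookup like this is useful for log analysis but not for access control. Verifying a bot requires checking the operator's published IP ranges or a bot-verification service.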

### Page types most likely to be cited

1. Original research and proprietary data
2. Definitional / glossary pages
3. Comparison pages
4. How-to and step-by-step guides
5. Pricing or cost pages with concrete numbers in plain text
6. FAQ and Q&A pages
7. Reference and API documentation
8. Programmatic pages with consistent schemas
9. Recent news and time-stamped analysis (especially YMYL)
10. Free tools and calculators

### Page-level signals

- Server-side rendered HTML
- One `<h1>` phrased as the user's question
- Direct answer in the lead paragraph plus a Key takeaways block
- Short paragraphs (2–4 sentences) and lists
- Stats and quotes in plain text with units, dates, and an inline source
- Visible "Last updated" date and `<time datetime="...">`
- Author byline with credentials
- JSON-LD structured data
- Stable canonical URL
- Markdown mirror at `page.html.md`
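
Two of these signals, the JSON-LD block and the machine-readable date, are easy to generate at build time. The sketch below assumes a schema.org `Article` type and a small set of properties; which type and properties suit a given page is a judgment call, and the function names are hypothetical. The sample values come from this page's own header.

```python
import json
from datetime import date


def article_json_ld(headline: str, url: str, author: str,
                    job_title: str, modified: date) -> str:
    """Return a <script> element carrying schema.org Article metadata."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "mainEntityOfPage": url,
        "author": {"@type": "Person", "name": author, "jobTitle": job_title},
        "dateModified": modified.isoformat(),
    }
    # Escape "</" so the JSON cannot close the surrounding <script> early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


def last_updated_html(modified: date) -> str:
    """Return a visible date that is also machine-readable via <time>."""
    iso = modified.isoformat()
    return f'<p>Last updated: <time datetime="{iso}">{iso}</time></p>'


updated = date(2026, 5, 2)
print(article_json_ld(
    headline="How to Optimize Static HTML for AI Crawlers (2026 Guide)",
    url="https://ai.michaelmcgrory.org/",
    author="Michael McGrory",
    job_title="Solutions Engineer (Partnerships)",
    modified=updated,
))
print(last_updated_html(updated))
```

Generating both from the same `date` value keeps the visible date, the `<time>` attribute, and `dateModified` from drifting apart.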

### Site-level signals

- `robots.txt` with explicit allow rules and Content Signals directives
- `llms.txt` at the root, listing your most useful pages in markdown
- `sitemap.xml` with accurate `<lastmod>` per URL
- Allowlist verified AI bots in your WAF or Bot Management
- Backlinks from authoritative sources
- Brand mentions across UGC platforms
- Fast TTFB and 200-status responses to bots
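
The first two items can be produced by a build script. This is a minimal sketch, assuming the `Content-Signal` line format described at contentsignals.org (`search`, `ai-input`, and `ai-train` with `yes`/`no` values) and the llms.txt layout from llmstxt.org (an H1 title, a blockquote summary, then H2 sections of links). The `example.com` URLs, the "Pages" section name, and the function names are placeholders.

```python
def build_robots_txt(sitemap_url: str, signals: dict[str, bool]) -> str:
    """Return robots.txt text that allows all bots and declares Content Signals."""
    signal_line = ", ".join(
        f"{name}={'yes' if allowed else 'no'}" for name, allowed in signals.items()
    )
    return "\n".join([
        "User-agent: *",
        f"Content-Signal: {signal_line}",
        "Allow: /",
        "",
        f"Sitemap: {sitemap_url}",
        "",
    ])


def build_llms_txt(title: str, summary: str,
                   pages: list[tuple[str, str, str]]) -> str:
    """Return llms.txt text; pages holds (name, url, description) tuples."""
    lines = [f"# {title}", "", f"> {summary}", "", "## Pages", ""]
    for name, url, description in pages:
        lines.append(f"- [{name}]({url}): {description}")
    return "\n".join(lines) + "\n"


# Placeholder preferences: allow search and AI answers, decline training.
print(build_robots_txt(
    "https://example.com/sitemap.xml",
    {"search": True, "ai-input": True, "ai-train": False},
))
print(build_llms_txt(
    "Example Site",
    "A reference site about static HTML for AI crawlers.",
    [("Guide", "https://example.com/index.html.md", "Main how-to guide")],
))
```

Content Signals express a preference and do not enforce it, so they complement WAF rules rather than replacing them.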

### Gotchas

- JS-rendered SPAs are mostly invisible to AI crawlers.
- Aggressive bot management can block legitimate AI bots.
- `robots.txt` is voluntary; user-triggered fetchers largely ignore it.
- Stale or moved URLs lose citations; keep canonical URLs stable.
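
One part of the blocking problem can be audited locally: whether your own `robots.txt` rules shut out a bot you meant to allow. The sketch below uses Python's standard `urllib.robotparser`; the sample rules, the `example.com` URL, and the `blocked_bots` helper are hypothetical. It covers `robots.txt` only and says nothing about WAF or bot-management rules, which have to be checked in those tools.

```python
from urllib.robotparser import RobotFileParser


def blocked_bots(robots_txt: str, url: str, bots: list[str]) -> list[str]:
    """Return the bots from the list that robots_txt disallows for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


# Hypothetical rules: GPTBot is blocked site-wide, everyone else is allowed.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_bots(
    SAMPLE_ROBOTS,
    "https://example.com/guide.html",
    ["GPTBot", "OAI-SearchBot", "PerplexityBot"],
))
```

Running a check like this in CI catches an accidental `Disallow` before it reaches production.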

### Sources

- https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
- https://developers.cloudflare.com/ai-crawl-control/
- https://platform.openai.com/docs/bots
- https://docs.perplexity.ai/guides/bots
- https://llmstxt.org/
- https://contentsignals.org/
- https://ahrefs.com/blog/llm-citations/
- https://www.semrush.com/blog/generative-engine-optimization/