Core Directives
spec_version
Required. Declares the specification version being used.
# REQUIRED: Protocol version declaration
spec_version: 3.0
All llmtag.txt files must include this directive. Files without it are considered invalid and will be ignored by compliant AI agents.
Content Usage Directives
ai_training_data
Controls whether content can be used as training data for machine learning models.
# AI Training Policy: Controls whether content can be used for AI model training
# Values: allow (permit training) | disallow (block training)
ai_training_data: disallow
This is the most critical directive for most publishers, as it directly controls whether their content can be used to train AI models.
Examples:
# Block all AI training
ai_training_data: disallow
# Allow AI training
ai_training_data: allow
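An agent-side check for this directive can be as small as one function. The sketch below assumes the default of "allow" described later under Default Values; the name may_train is hypothetical.

```python
def may_train(directives: dict) -> bool:
    """Return True if content may be used for AI model training.

    Per the spec's defaults, an absent ai_training_data directive
    means "allow"; only an explicit "disallow" blocks training.
    """
    return directives.get("ai_training_data", "allow") != "disallow"
```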
ai_use
Controls specific AI applications and use cases beyond training.
# AI Use Policy: Defines how AI agents can use your content
# Values: search_indexing (for search engines) | generative_synthesis (for AI responses) | commercial_products (for commercial AI products) | research (for academic research) | personal_assistance (for personal AI assistants)
ai_use: search_indexing, generative_synthesis
Examples:
# Allow only search indexing
ai_use: search_indexing
# Allow multiple use cases
ai_use: search_indexing, generative_synthesis, research
# Allow all use cases
ai_use: search_indexing, generative_synthesis, commercial_products, research, personal_assistance
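Since ai_use takes a comma-separated list, an agent needs to split it before checking permissions. A minimal sketch, assuming the search_indexing-only default described later under Default Values (the helper names are hypothetical):

```python
def parse_ai_use(value: str) -> set[str]:
    """Split a comma-separated ai_use value into a set of use cases."""
    return {item.strip() for item in value.split(",") if item.strip()}

def use_permitted(directives: dict, use_case: str) -> bool:
    """Check whether a use case is allowed.

    When ai_use is absent, the spec's default permits search_indexing only.
    """
    allowed = parse_ai_use(directives.get("ai_use", "search_indexing"))
    return use_case in allowed
```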
Advanced Directives
verification_challenge
Establishes a cryptographic handshake to verify that an AI agent has actually read and understood the rules.
verification_challenge: sha256:abc123def456...
This is an advanced feature for publishers who want to implement verification mechanisms. Most implementations can ignore this directive.
Example:
# Require SHA-256 verification
verification_challenge: sha256:a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456
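The exact handshake protocol is not defined in this section, so the following is only one plausible reading: the directive publishes a SHA-256 digest, and an agent demonstrates compliance by producing a token whose hash matches it. Both the token format and the function check_challenge are assumptions for illustration.

```python
import hashlib

def check_challenge(directive_value: str, response_token: bytes) -> bool:
    """Compare an agent's response token against the published digest.

    Assumes directive_value has the form "sha256:<hex digest>"; any other
    algorithm prefix is rejected. The actual handshake protocol — what the
    token contains and how it is exchanged — is left to the specification.
    """
    algo, _, expected = directive_value.partition(":")
    if algo != "sha256":
        return False
    return hashlib.sha256(response_token).hexdigest() == expected
```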
Scope Directives
User-agent
Defines a scope block for specific AI agents or crawlers.
User-agent: GPTBot
ai_training_data: allow
ai_use: search_indexing, generative_synthesis
Common AI Agent User-Agents:
# OpenAI's GPTBot
User-agent: GPTBot
# ChatGPT-User (when users browse with ChatGPT)
User-agent: ChatGPT-User
# Google's AI crawlers
User-agent: Google-Extended
# Anthropic's Claude
User-agent: Claude-Web
# Perplexity AI
User-agent: PerplexityBot
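Selecting the right scope block for a crawler can be sketched as a case-insensitive substring match of the User-agent token against the full UA string, modeled on common robots.txt practice — the exact matching semantics here are an assumption, not something this section specifies.

```python
def match_agent_block(blocks: dict[str, dict], user_agent: str) -> dict:
    """Pick the scope block whose User-agent token appears in the UA string.

    Case-insensitive substring match, so the token "GPTBot" matches a
    full string like "Mozilla/5.0 ... GPTBot/1.0". Returns an empty
    override dict when no block matches (global policy then applies).
    """
    for token, directives in blocks.items():
        if token.lower() in user_agent.lower():
            return directives
    return {}
```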
Path
Defines a scope block for specific URL paths or patterns.
Path: /premium/
ai_training_data: disallow
ai_use: search_indexing
Examples:
# Protect premium content
Path: /premium/
ai_training_data: disallow
# Allow AI training for blog content
Path: /blog/
ai_training_data: allow
ai_use: search_indexing, generative_synthesis
# Protect user-generated content
Path: /user-content/
ai_training_data: disallow
ai_use: search_indexing
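Path blocks are naturally resolved by prefix matching, with the longest matching prefix winning so that a rule for /premium/reports/ can override a broader rule for /premium/. That longest-prefix tie-breaking is an assumption modeled on robots.txt practice, not something this section states explicitly.

```python
def match_path_block(blocks: dict[str, dict], url_path: str) -> dict:
    """Return the directives of the longest Path prefix matching url_path.

    Longest prefix wins, so a more specific block overrides a broader
    one. Returns an empty dict when no Path block matches.
    """
    best: dict = {}
    best_len = -1
    for prefix, directives in blocks.items():
        if url_path.startswith(prefix) and len(prefix) > best_len:
            best, best_len = directives, len(prefix)
    return best
```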
Directive Processing
Precedence Order
Directives are processed in the following order of precedence (highest to lowest):
- Path-specific directives - Most specific
- User-agent specific directives - Agent-specific
- Global directives - Least specific
Inheritance
When a directive is not specified at a more specific level, it inherits from the global level:
# Global policy
spec_version: 3.0
ai_training_data: disallow
ai_use: search_indexing
# This agent inherits global policy but overrides ai_use
User-agent: ResearchBot
ai_use: research
# ai_training_data remains "disallow" from global
# This path inherits from global but overrides ai_training_data
Path: /public-research/
ai_training_data: allow
# ai_use remains "search_indexing" from global
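The precedence, inheritance, and default rules above combine into a single resolution step: start from the defaults, then layer global, agent-specific, and path-specific directives in increasing precedence, each level overriding only the keys it actually sets. A minimal sketch (the function name resolve is hypothetical):

```python
DEFAULTS = {"ai_training_data": "allow", "ai_use": "search_indexing"}

def resolve(global_d: dict, agent_d: dict, path_d: dict) -> dict:
    """Compute effective directives: Path > User-agent > Global > defaults.

    Each level overrides only the keys it specifies; unspecified keys
    inherit from the less specific level below it.
    """
    effective = dict(DEFAULTS)
    for level in (global_d, agent_d, path_d):
        effective.update(level)
    return effective
```

Applied to the inheritance example above, a ResearchBot request outside any Path block would resolve to ai_training_data: disallow (inherited from global) with ai_use: research (overridden by the agent block).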
Default Values
If no directive is specified at any level, these defaults apply:
ai_training_data: allow
ai_use: search_indexing
Comments
Use # to add comments to your llmtag.txt file:
# Global policy: No AI training by default
spec_version: 3.0
ai_training_data: disallow
ai_use: search_indexing
# Allow research use for specific agents
User-agent: AcademicBot
ai_training_data: allow
ai_use: research
# Protect premium content from all AI training
Path: /premium/
ai_training_data: disallow
ai_use: search_indexing
Comments are ignored by AI agents and are purely for human readability and documentation purposes.
Best Practices
1. Start Simple
Begin with a basic global policy and add complexity as needed:
spec_version: 3.0
ai_training_data: disallow
ai_use: search_indexing
2. Document Your Policies
Document your policies for future reference:
# Block AI training but allow search indexing
spec_version: 3.0
ai_training_data: disallow
ai_use: search_indexing
# Exception: Allow research bots to use content for training
User-agent: ResearchBot
ai_training_data: allow
ai_use: research
3. Test Your Implementation
Always verify your llmtag.txt file is accessible and properly formatted:
curl https://yourdomain.com/llmtag.txt
4. Keep It Maintainable
Use consistent formatting and logical grouping:
spec_version: 3.0
# Global policy
ai_training_data: disallow
ai_use: search_indexing
# Agent-specific policies
User-agent: GPTBot
ai_training_data: allow
ai_use: search_indexing, generative_synthesis
User-agent: ResearchBot
ai_training_data: allow
ai_use: research
# Path-specific policies
Path: /premium/
ai_training_data: disallow
ai_use: search_indexing
Path: /blog/
ai_training_data: allow
ai_use: search_indexing, generative_synthesis