
Core Directives

spec_version

Required. Declares the specification version being used.
# REQUIRED: Protocol version declaration
spec_version: 3.0
All llmtag.txt files must include this directive. Files without it are considered invalid and will be ignored by compliant AI agents.
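Since files without this directive are treated as invalid, a parser's first step is to confirm the declaration exists. A minimal sketch of that check (the helper name is hypothetical, not part of the specification):

```python
def has_spec_version(text: str) -> bool:
    """Return True if llmtag.txt content declares a spec_version directive."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments before inspecting directives.
        if not line or line.startswith("#"):
            continue
        if line.split(":", 1)[0].strip() == "spec_version":
            return True
    return False
```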

Content Usage Directives

ai_training_data

Controls whether content can be used as training data for machine learning models.
# AI Training Policy: Controls whether content can be used for AI model training
# Values: allow (permit training) | disallow (block training)
ai_training_data: disallow
This is the most critical directive for most publishers, as it directly controls whether their content can be used to train AI models.
Examples:
# Block all AI training
ai_training_data: disallow

# Allow AI training
ai_training_data: allow

ai_use

Controls specific AI applications and use cases beyond training.
# AI Use Policy: Defines how AI agents can use your content
# Values: search_indexing (for search engines) | generative_synthesis (for AI responses) | commercial_products (for commercial AI products) | research (for academic research) | personal_assistance (for personal AI assistants)
ai_use: search_indexing, generative_synthesis
Examples:
# Allow only search indexing
ai_use: search_indexing

# Allow multiple use cases
ai_use: search_indexing, generative_synthesis, research

# Allow all use cases
ai_use: search_indexing, generative_synthesis, commercial_products, research, personal_assistance
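Because ai_use accepts a comma-separated list, a consumer needs to split the value into individual use cases. A minimal sketch (the function name is an illustration, not a spec-defined API):

```python
def parse_ai_use(value: str) -> list[str]:
    """Split a comma-separated ai_use value into individual use cases,
    trimming surrounding whitespace and dropping empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]
```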

Advanced Directives

verification_challenge

Establishes a cryptographic challenge that an AI agent can answer to demonstrate it has actually retrieved and parsed the rules.
verification_challenge: sha256:abc123def456...
This is an advanced feature for publishers who want to implement verification mechanisms. Most implementations can ignore this directive.
Example:
# Require SHA-256 verification
verification_challenge: sha256:a1b2c3d4e5f6789012345678901234567890abcdef1234567890abcdef123456
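The directive value pairs an algorithm label with a hex digest. The excerpt does not define exactly what input is hashed, so the sketch below assumes the raw file body; treat that, and the function name, as assumptions for illustration:

```python
import hashlib

def matches_challenge(content: bytes, directive_value: str) -> bool:
    """Check whether content hashes to the digest in a verification_challenge
    value of the form "sha256:<hex digest>".

    Assumes the challenge covers the raw file body; the real handshake
    semantics may differ.
    """
    algo, _, expected = directive_value.partition(":")
    if algo != "sha256":
        return False  # only sha256 appears in the spec examples
    return hashlib.sha256(content).hexdigest() == expected
```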

Scope Directives

User-agent

Defines a scope block for specific AI agents or crawlers.
User-agent: GPTBot
ai_training_data: allow
ai_use: search_indexing, generative_synthesis
Common AI Agent User-Agents:
# OpenAI's GPTBot
User-agent: GPTBot

# ChatGPT-User (when users browse with ChatGPT)
User-agent: ChatGPT-User

# Google's AI crawlers
User-agent: Google-Extended

# Anthropic's Claude
User-agent: Claude-Web

# Perplexity AI
User-agent: PerplexityBot
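Once the file is parsed into per-agent blocks, an agent needs to locate the block that applies to it. A minimal sketch, assuming exact case-insensitive name matching (the spec excerpt does not define wildcard or substring matching, and the helper name is hypothetical):

```python
def scope_for_agent(blocks: dict[str, dict], agent: str) -> dict:
    """Return the directive block for an agent, assuming case-insensitive
    exact match on the User-agent name; return {} if no block matches."""
    for name, directives in blocks.items():
        if name.lower() == agent.lower():
            return directives
    return {}
```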

Path

Defines a scope block for specific URL paths or patterns.
Path: /premium/
ai_training_data: disallow
ai_use: search_indexing
Examples:
# Protect premium content
Path: /premium/
ai_training_data: disallow

# Allow AI training for blog content
Path: /blog/
ai_training_data: allow
ai_use: search_indexing, generative_synthesis

# Protect user-generated content
Path: /user-content/
ai_training_data: disallow
ai_use: search_indexing
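Deciding whether a Path block covers a given URL comes down to path matching. The excerpt does not define wildcard handling, so the sketch below assumes plain prefix matching in the style of robots.txt; the function name is an illustration:

```python
def path_applies(scope_path: str, url_path: str) -> bool:
    """Check whether a Path scope covers a URL path, assuming simple
    prefix matching; wildcard semantics are not defined in this excerpt."""
    return url_path.startswith(scope_path)
```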

Directive Processing

Precedence Order

Directives are processed in the following order of precedence (highest to lowest):
  1. Path-specific directives - Most specific
  2. User-agent specific directives - Agent-specific
  3. Global directives - Least specific

Inheritance

When a directive is not specified at a more specific level, it inherits from the global level:
# Global policy
spec_version: 3.0
ai_training_data: disallow
ai_use: search_indexing

# This agent inherits global policy but overrides ai_use
User-agent: ResearchBot
ai_use: research
# ai_training_data remains "disallow" from global

# This path inherits from global but overrides ai_training_data
Path: /public-research/
ai_training_data: allow
# ai_use remains "search_indexing" from global

Default Values

If no directive is specified at any level, these defaults apply:
  • ai_training_data: allow
  • ai_use: search_indexing
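The precedence, inheritance, and default rules above can be sketched as a single lookup: walk from the most specific scope to the least specific, and fall back to the protocol defaults when no level sets the directive. The function name is hypothetical:

```python
# Protocol defaults applied when no level specifies a directive.
DEFAULTS = {"ai_training_data": "allow", "ai_use": "search_indexing"}

def resolve(directive: str, global_d: dict, agent_d: dict, path_d: dict) -> str:
    """Resolve a directive using the stated precedence:
    path-specific > user-agent-specific > global > protocol default."""
    for level in (path_d, agent_d, global_d):
        if directive in level:
            return level[directive]
    return DEFAULTS[directive]
```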

Comments

Use # to add comments to your llmtag.txt file:
# Global policy: No AI training by default
spec_version: 3.0
ai_training_data: disallow
ai_use: search_indexing

# Allow research use for specific agents
User-agent: AcademicBot
ai_training_data: allow
ai_use: research

# Protect premium content from all AI training
Path: /premium/
ai_training_data: disallow
ai_use: search_indexing
Comments are ignored by AI agents and are purely for human readability and documentation purposes.

Best Practices

1. Start Simple

Begin with a basic global policy and add complexity as needed:
spec_version: 3.0
ai_training_data: disallow
ai_use: search_indexing

2. Use Clear Comments

Document your policies for future reference:
# Block AI training but allow search indexing
spec_version: 3.0
ai_training_data: disallow
ai_use: search_indexing

# Exception: Allow research bots to use content for training
User-agent: ResearchBot
ai_training_data: allow
ai_use: research

3. Test Your Implementation

Always verify your llmtag.txt file is accessible and properly formatted:
curl https://yourdomain.com/llmtag.txt
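Beyond checking reachability with curl, you can script the same check. A minimal sketch, assuming the file lives at the well-known /llmtag.txt path over HTTPS; both helper names are illustrations:

```python
import urllib.request

def fetch_llmtag(domain: str) -> str:
    """Fetch /llmtag.txt over HTTPS and return its body; raises on HTTP errors."""
    with urllib.request.urlopen(f"https://{domain}/llmtag.txt") as resp:
        return resp.read().decode("utf-8")

def looks_valid(text: str) -> bool:
    """Minimal sanity check: after dropping blanks and comments, the file
    must declare a spec_version directive."""
    lines = [line.strip() for line in text.splitlines()
             if line.strip() and not line.strip().startswith("#")]
    return any(line.startswith("spec_version:") for line in lines)
```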

4. Keep It Maintainable

Use consistent formatting and logical grouping:
spec_version: 3.0

# Global policy
ai_training_data: disallow
ai_use: search_indexing

# Agent-specific policies
User-agent: GPTBot
ai_training_data: allow
ai_use: search_indexing, generative_synthesis

User-agent: ResearchBot
ai_training_data: allow
ai_use: research

# Path-specific policies
Path: /premium/
ai_training_data: disallow
ai_use: search_indexing

Path: /blog/
ai_training_data: allow
ai_use: search_indexing, generative_synthesis