How to Create a robots.txt File (With Examples and Common Mistakes)

Learn what a robots.txt file is, how to write one correctly, and which rules help crawlers understand your site.

robots.txt is a plain-text crawler-instruction file served from your site root. It tells compliant bots which paths they may crawl and which to skip.

If you need a working starting point, this version is safe for most sites:

User-agent: *
Disallow: /admin/
Disallow: /checkout/

Sitemap: https://example.com/sitemap.xml

Put it at: https://example.com/robots.txt

Quick Answer

  • robots.txt is a text file at the domain root.
  • It controls crawling, not security or true access control.
  • Use User-agent, Disallow, Allow, and Sitemap directives.
  • Keep public content crawlable, including CSS and JavaScript assets.
  • Test changes after each deploy to avoid accidental de-indexing.

What a robots.txt File Does (and Does Not Do)

This file answers one question: "May this crawler fetch this path?"

It does:

  • Provide crawl guidance to compliant crawlers.
  • Reduce crawl budget waste on low-value URLs.
  • Help discovery with a sitemap URL.

It does not:

  • Hide sensitive content.
  • Guarantee that a URL will never appear in search.
  • Stop malicious scraping bots.

If a page must stay private, use authentication and server-side access controls.

robots.txt Syntax

1. User-agent

Defines which crawler a rule block targets.

User-agent: *         # All compliant crawlers
User-agent: Googlebot # A specific crawler

2. Disallow

Blocks crawler access to matching paths.

Disallow: /private/   # Block a directory
Disallow: /file.html  # Block a file
Disallow: /           # Block everything
Disallow:             # Block nothing

3. Allow

Creates a path exception inside a broader disallow rule.

Disallow: /images/
Allow: /images/public/
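
When Allow and Disallow both match a URL, modern crawlers apply the most specific rule: the longest matching pattern wins, and ties go to Allow. A minimal Python sketch of that precedence using simple prefix matching (is_allowed is a hypothetical helper for illustration, not any crawler's real API):

```python
def is_allowed(path: str, rules) -> bool:
    # rules: (directive, pattern) pairs evaluated by prefix matching.
    # RFC 9309 precedence: the longest matching pattern wins,
    # and ties break in favor of Allow.
    # A path that matches no rule is allowed by default.
    best = None  # (pattern length, allowed?)
    for directive, pattern in rules:
        if path.startswith(pattern):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

rules = [("disallow", "/images/"), ("allow", "/images/public/")]
print(is_allowed("/images/public/logo.png", rules))  # True
print(is_allowed("/images/private/x.png", rules))    # False
```

The tuple comparison does the work: a longer match always beats a shorter one, and at equal length the Allow entry (True) sorts above the Disallow entry (False).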

4. Sitemap

Gives crawlers the absolute URL of your sitemap.

Sitemap: https://example.com/sitemap.xml
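
Python's standard library includes a reference parser you can use to sanity-check how these directives read. A small sketch, assuming Python 3.8+ for site_maps(); note the stdlib parser uses simple prefix, first-match semantics, so it will not reproduce every crawler's wildcard or longest-match behavior:

```python
from urllib import robotparser

text = """\
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(text.splitlines())

# Blocked path vs. unmatched path (unmatched defaults to allowed).
print(rp.can_fetch("ExampleBot", "https://example.com/private/notes"))  # False
print(rp.can_fetch("ExampleBot", "https://example.com/about"))          # True
print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```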

Common robots.txt Examples (Copy/Paste)

Allow Everything

User-agent: *
Disallow:

Block One Directory

User-agent: *
Disallow: /admin/

Block Multiple Paths

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /checkout/
Disallow: /cart/

Block Specific File Types

User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$

Let One Crawler In, Block Others

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
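
You can confirm which group a given bot falls into using the stdlib parser. A sketch; real crawlers pick the single most specific matching User-agent group, which this file's structure makes unambiguous:

```python
from urllib import robotparser

text = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(text.splitlines())

# Googlebot matches its named group; every other bot
# falls through to the catch-all group and is blocked.
print(rp.can_fetch("Googlebot", "https://example.com/page"))  # True
print(rp.can_fetch("OtherBot", "https://example.com/page"))   # False
```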

Block Everything During Staging

User-agent: *
Disallow: /

How to Create robots.txt Step by Step

Step 1: Decide what should be crawled

Usually crawl:

  • Main pages
  • Product or content detail pages
  • CSS and JavaScript files required for rendering

Usually disallow:

  • Admin areas
  • Internal search result paths
  • Session/parameter duplicates if they create crawl bloat

Step 2: Write a minimal first version

Start simple and expand only when needed:

User-agent: *
Disallow: /admin/
Disallow: /search?
Sitemap: https://example.com/sitemap.xml

Step 3: Upload to root

It must resolve at /robots.txt on each host you control:

  • https://example.com/robots.txt
  • https://blog.example.com/robots.txt (separate file)

Step 4: Test and monitor

After publishing:

  1. Confirm the file is reachable in the browser.
  2. Validate syntax in crawler tooling.
  3. Watch crawl logs after changes to ensure key pages are still fetched.
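
The offline part of this checklist can be scripted with the stdlib parser. A sketch, where the rule text and URL lists are placeholders to replace with your own:

```python
from urllib import robotparser

# Rules about to be deployed (placeholder content).
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# URLs that must stay crawlable vs. URLs that should be blocked.
must_crawl = ["https://example.com/", "https://example.com/products/widget"]
must_block = ["https://example.com/admin/", "https://example.com/checkout/cart"]

assert all(rp.can_fetch("Googlebot", u) for u in must_crawl)
assert not any(rp.can_fetch("Googlebot", u) for u in must_block)
print("robots.txt spot-check passed")
```

Running a check like this in CI after each deploy catches the classic accident of pushing a staging Disallow: / to production.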

What to Disallow vs What to Keep Crawlable

URL Type                Typical Decision   Why
/admin/, /account/      Disallow           Low-value for search and user-specific
/cart/, /checkout/      Disallow           Transactional and private state
Internal search URLs    Disallow           Often thin or duplicative
CSS and JS assets       Allow              Needed for rendering and quality evaluation
Primary content pages   Allow              These should be discovered and indexed

Common Mistakes That Hurt SEO

Blocking all crawlers by accident

This rule can remove discovery for your whole site:

User-agent: *
Disallow: /

Use it only for private staging environments, not production.

Trying to use robots.txt as security

Anyone can read the file. Do not list sensitive endpoints expecting secrecy.

Using wrong path patterns

Rules are prefix matches, so slashes matter: Disallow: /admin also blocks /admin-tools and /administrator, while Disallow: /admin/ blocks only that directory. Test intended URLs and edge cases before and after deploying.

Forgetting subdomains need separate files

example.com and blog.example.com are distinct hosts for robots rules.

Confusing crawling with indexing

Disallow stops fetches, but a blocked URL can still appear in results as a URL-only entry if other pages link to it. To keep a page out of the index, let it be crawled and apply a noindex directive instead.

FAQ

What is a robots.txt file in plain language?

It is a public instruction file for web crawlers. It tells compliant bots which paths they may crawl.

Where exactly do I put robots.txt?

At the root of each host: https://yourdomain.com/robots.txt.

Can robots.txt remove a page from search results?

Not reliably. It controls crawling, and a disallowed URL can still be indexed from external links. To remove a page, use a noindex directive (which requires the page to stay crawlable) or your search engine's removal tools.

Should I disallow /api/?

If API URLs are not intended for search and add crawl noise, disallowing can be reasonable. Keep public docs pages crawlable.

Should I block CSS and JS files?

Usually no. Rendering assets should remain crawlable for accurate content evaluation.

Is robots.txt case-sensitive?

Yes, for paths: Disallow: /Admin/ does not block /admin/. Directive names such as user-agent are case-insensitive.

Can I block bad bots with robots.txt?

Not reliably. Malicious bots can ignore it.

Should I include my sitemap URL?

Yes. Include an absolute sitemap URL to improve discovery.

How often should I review robots.txt?

Review after URL structure changes, migration projects, or major SEO audits.

What is the safest starter robots.txt for a small site?

A minimal block for admin/private paths plus a sitemap line is often enough. Add more rules only when you have a clear crawl-control reason.

Generate Your robots.txt

Create a correctly formatted robots.txt file and copy it to your site root.