Robots.txt Protocol Guide

What is Robots.txt?

  • Definition: robots.txt is a plain-text file stored in the root directory of a website. Well-behaved search engine spiders request it before crawling any other URL on the site.
  • Role: It acts as a "gentleman's agreement" between the website and the crawler. It tells search engines which paths may be crawled and which are off limits, which keeps crawlers out of areas not meant for search and saves server bandwidth. It is not an access control: a disallowed URL is still reachable by anyone who requests it directly.

Core Syntax Rules

  • User-agent: Defines which search engine spider the rule applies to. * represents all spiders.
  • Disallow: Tells the crawler not to crawl the specified directory or file. The value is matched as a path prefix. For example, Disallow: /admin/ forbids crawling every URL whose path begins with /admin/.
  • Allow: Tells the crawler which paths it may crawl. Usually used together with Disallow to "make an exception" and permit crawling of a specific subdirectory inside an otherwise disallowed directory.
  • Crawl-delay: Requests a minimum interval (in seconds) between successive requests, so that the spider does not crawl fast enough to overload the server. (Note: this directive is not part of the robots.txt standard and support varies between search engines. Google, for example, ignores it and manages its crawl rate through its own mechanisms instead.)
  • Sitemap: Tells the crawler the full (absolute) URL of the website's Sitemap XML file, helping search engines discover the site's pages more efficiently.
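
The directives above can be tried out without touching a live site. The following is a minimal sketch in Python using the standard library's urllib.robotparser module. The rule set, the crawler name "ExampleBot", and the paths are made up for illustration. Note that this parser applies the first matching rule in file order, so the Allow exception is listed before the broader Disallow; crawlers that pick the most specific (longest) matching rule reach the same result here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Crawl-delay: 5

Sitemap: https://www.yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network request

base = "https://www.yourdomain.com"

# "ExampleBot" is a made-up crawler name; it falls under the "*" group.
print(parser.can_fetch("ExampleBot", base + "/admin/settings"))    # False
print(parser.can_fetch("ExampleBot", base + "/admin/public/faq"))  # True
print(parser.can_fetch("ExampleBot", base + "/blog/post-1"))       # True

print(parser.crawl_delay("ExampleBot"))  # 5
print(parser.site_maps())  # ['https://www.yourdomain.com/sitemap.xml'] (Python 3.8+)
```

A check like this only shows how one particular parser reads the file. Individual search engines may interpret edge cases differently, so their own testing tools remain the authority for their crawlers.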

Important Notes

  • The Robots protocol is purely advisory: it "guards against gentlemen, not villains." Malicious crawlers can ignore it entirely. For truly sensitive or confidential data, you must therefore enforce permission checks on the server side.
  • This file must be placed in the root directory of your website, for example: https://www.yourdomain.com/robots.txt.
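
Because the file always lives at the root, its address can be derived from any page URL on the same host. Below is a small sketch in Python; the helper name robots_url and the example URLs are assumptions made for illustration. It keeps the scheme and host (including any port) and replaces everything else with /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host[:port]; drop the original path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.yourdomain.com/blog/post-1?page=2"))
# https://www.yourdomain.com/robots.txt
print(robots_url("https://shop.yourdomain.com/cart"))
# https://shop.yourdomain.com/robots.txt
```

As the second call shows, a subdomain resolves to its own robots.txt, so a file placed on www.yourdomain.com does not cover shop.yourdomain.com.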