Robots.txt Protocol Guide
What is Robots.txt?
- Definition: robots.txt is a plain text file stored in the root directory of a website. It is the first file a search engine spider checks when visiting a site.
- Role: It acts as a "gentleman's agreement" between the website and the crawler. It tells search engines which directories may be crawled and which are off limits, keeping well-behaved crawlers out of private areas and saving server bandwidth.
Core Syntax Rules
- User-agent: Defines which search engine spider the rule applies to. * represents all spiders.
- Disallow: Tells the crawler not to crawl the specified directory or file. For example, Disallow: /admin/ forbids crawling all content under the admin directory.
- Allow: Tells the crawler which directories it may crawl. Usually combined with Disallow to make an exception, allowing a specific subdirectory inside a larger restricted directory.
- Crawl-delay: Sets the minimum interval (in seconds) between requests, so the spider does not crawl too fast and overload the server. (Note: some search engines, such as Google, do not honor this directive and manage crawl rate through their own webmaster tools instead.)
- Sitemap: Gives the crawler the URL of the website's Sitemap XML file, helping search engines discover all links on the site more efficiently.
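To see how these directives interact, the following minimal sketch feeds a small rule set to Python's standard-library urllib.robotparser and queries it. The file content, the domain, and the crawler name ExampleBot are all hypothetical placeholders. Note that Python's parser applies the first matching rule in file order, so the Allow exception is listed before the broader Disallow; other crawlers may resolve conflicts differently (for example, by the longest matching path).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the domain and paths are placeholders.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Crawl-delay: 10

Sitemap: https://www.yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access

BASE = "https://www.yourdomain.com"
AGENT = "ExampleBot"  # hypothetical crawler name, matched by "User-agent: *"

# Disallow: /admin/ blocks everything under the admin directory...
blocked = parser.can_fetch(AGENT, BASE + "/admin/secret.html")
# ...except the subdirectory carved out by Allow: /admin/public/
exception = parser.can_fetch(AGENT, BASE + "/admin/public/help.html")
# Paths that match no rule are allowed by default.
open_page = parser.can_fetch(AGENT, BASE + "/index.html")

delay = parser.crawl_delay(AGENT)   # seconds from Crawl-delay
sitemaps = parser.site_maps()       # list of Sitemap URLs (Python 3.8+)

print(blocked, exception, open_page)  # False True True
print(delay)                          # 10
print(sitemaps)                       # ['https://www.yourdomain.com/sitemap.xml']
```

A polite crawler would run a check like can_fetch before every request and sleep for the reported delay between requests.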
Important Notes
- The Robots protocol is purely advisory: it "guards against gentlemen, not villains." Well-behaved crawlers follow it, but malicious crawlers can ignore it entirely. For truly sensitive or confidential data, you must therefore enforce permission checks on the server side.
- This file must be placed in the root directory of your website, for example: https://www.yourdomain.com/robots.txt.
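Because the file always lives at the root, a crawler can derive its location from any page URL on the site. The sketch below shows one way to do that with Python's urllib.parse; the helper name robots_url and the sample page URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.yourdomain.com/blog/post?id=1"))
# https://www.yourdomain.com/robots.txt
```

A robots.txt placed in a subdirectory, such as /blog/robots.txt, would never be found this way, which is why the root location matters.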
