Robots.txt: What is the proper format for a crawl delay for multiple user agents?

Published: 2020-12-14 18:46:56 | Category: Resources | Source: compiled from the web

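Below is a sample robots.txt file that allows multiple user agents, each with its own crawl delay. The crawl-delay values are for illustration only and will be different in a real robots.txt file.

I have searched all over the web for a proper answer but could not find one. There are too many mixed suggestions, and I do not know which is the correct method.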

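Questions:

(1) Can each user agent have its own crawl-delay? (I assume yes.)

(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?

(3) Does there have to be a blank line between each user agent group?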

References:

http://www.seopt.com/2013/01/robots-text-file/

http://help.yandex.com/webmaster/?id=1113851#1113858

Essentially, I want to know how the final robots.txt file should look using the values in the sample below.

Thanks in advance.

# Allow only major search spiders    
User-agent: Mediapartners-Google
Disallow:
Crawl-delay: 11

User-agent: Googlebot
Disallow:
Crawl-delay: 12

User-agent: Adsbot-Google
Disallow:
Crawl-delay: 13

User-agent: Googlebot-Image
Disallow:
Crawl-delay: 14

User-agent: Googlebot-Mobile
Disallow:
Crawl-delay: 15

User-agent: MSNBot
Disallow:
Crawl-delay: 16

User-agent: bingbot
Disallow:
Crawl-delay: 17

User-agent: Slurp
Disallow:
Crawl-delay: 18

User-agent: Yahoo! Slurp
Disallow:
Crawl-delay: 19

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?

# Allow only major search spiders
User-agent: *
Crawl-delay: 10

User-agent: Mediapartners-Google
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Adsbot-Google
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: Googlebot-Mobile
Disallow:

User-agent: MSNBot
Disallow:

User-agent: bingbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Yahoo! Slurp
Disallow:

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

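Solution

(1) Can each user agent have its own crawl-delay?

Yes. Each record, started by one or more User-agent lines, can have a Crawl-delay line. Note that Crawl-delay is not part of the original robots.txt specification. Including it is not a problem, though: parsers that understand it will use it, and for the rest the spec defines: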

Unrecognised headers are ignored.

So older robots.txt parsers will simply ignore your Crawl-delay lines.

(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?

It doesn't matter.
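As a rough check of this, the sketch below feeds two records to Python's standard-library `urllib.robotparser`, one with Crawl-delay before Disallow and one with it after. The agent names `ExampleBotA` and `ExampleBotB` and the path `/private/` are made up for illustration, and the result only shows how this one parser behaves, not what every crawler does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical records: same directives, different order within the record.
ROBOTS_TXT = """\
User-agent: ExampleBotA
Crawl-delay: 5
Disallow: /private/

User-agent: ExampleBotB
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Both records should yield the same delay and the same blocking rule.
delay_a = parser.crawl_delay("ExampleBotA")
delay_b = parser.crawl_delay("ExampleBotB")
blocked_a = not parser.can_fetch("ExampleBotA", "/private/page.html")
blocked_b = not parser.can_fetch("ExampleBotB", "/private/page.html")

print(delay_a, delay_b, blocked_a, blocked_b)
```

With this parser, both agents end up with the same delay and the same blocked path regardless of line order.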

(3) Does there have to be a blank line between each user agent group?

Yes. Records have to be separated by one or more blank lines. See the original spec:

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL).

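(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?

No. Bots look for a record that matches their user agent. Only if they don't find one do they fall back to the User-agent: * record. So in your example, all of the listed bots (Googlebot, MSNBot, Yahoo! Slurp and so on) will have no Crawl-delay.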
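The fallback behaviour can be illustrated with Python's standard-library `urllib.robotparser`. This is a minimal sketch using a trimmed version of the second sample from the question; `SomeOtherBot` is a made-up agent name standing in for any bot without its own record, and other parsers may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the second sample: the delay is only in the * record.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own record, which carries no Crawl-delay line.
googlebot_delay = parser.crawl_delay("Googlebot")
# An unlisted bot falls back to the User-agent: * record.
other_delay = parser.crawl_delay("SomeOtherBot")

print(googlebot_delay, other_delay)
```

To give the listed bots a delay as well, the Crawl-delay line would have to be repeated inside each of their records.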

Also note that you can't have several records with User-agent: *:

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

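So parsers might use (if no other record matched) only the first record with User-agent: * and ignore the following ones. For your first example, that would mean that URLs beginning with /ads/, /cgi-bin/ and /scripts/ are not blocked.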
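Python's standard-library `urllib.robotparser` is one parser that appears to behave this way. The sketch below uses a deliberately small, hypothetical file with two `User-agent: *` records (not the file from the question) so that the effect is visible; `SomeBot` is an invented agent name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with two default records, which the spec disallows.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: *
Disallow: /ads/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Only the first * record is kept by this parser.
cgi_blocked = not parser.can_fetch("SomeBot", "/cgi-bin/test.cgi")
ads_blocked = not parser.can_fetch("SomeBot", "/ads/banner.html")

print(cgi_blocked, ads_blocked)
```

Here the rule from the second record is silently dropped, so `/ads/` stays fetchable. Other parsers may merge the records or pick a different one, which is why relying on duplicates is risky.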

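And even if you had only one record with User-agent: *, those Disallow lines would apply only to bots that match no other record! As your comment # Block Directories for all spiders suggests, you want these URL paths to be blocked for all spiders, so you would have to repeat the Disallow lines in every record.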
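One possible shape for the corrected file, reduced to a single named bot for brevity, is sketched below and checked with Python's standard-library `urllib.robotparser`. The delay value comes from the question's sample, `SomeOtherBot` is a made-up agent name, and each of the other listed bots would need the same three Disallow lines in its own record.

```python
from urllib.robotparser import RobotFileParser

# Sketch of a corrected layout: shared Disallow lines repeated per record,
# and exactly one User-agent: * record for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/
Crawl-delay: 12

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_ads = parser.can_fetch("Googlebot", "/ads/banner.html")
googlebot_home = parser.can_fetch("Googlebot", "/index.html")
googlebot_delay = parser.crawl_delay("Googlebot")
other_home = parser.can_fetch("SomeOtherBot", "/index.html")

print(googlebot_ads, googlebot_home, googlebot_delay, other_home)
```

In this layout Googlebot is kept out of the three directories but can crawl the rest of the site with its own delay, while unlisted bots are blocked entirely.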

(Editor: 李大同)
