Blocking Malicious Spiders

Main list:

# Return 403 for known unwanted crawlers and for requests with an empty User-Agent (^$).
if ($http_user_agent ~ "hubspot|CCBot|VelenPublicWebCrawler|Konturbot|my-tiny-bot|eiki|webmeup|ExtLinksBot|Go-http-client|Python|ZoominfoBot|MegaIndex\.ru|MauiBot|Amazonbot|ds-robot|intelx\.io|coccocbot|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Barkrowler|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|DuckDuckGo|ClaudeBot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|MJ12bot|DotBot|heritrix|Bytespider|BLEXBot|serpstatbot|Ezooms|JikeSpider|InfoTigerBot|SemrushBot|DuckDuckGo-Favicons-Bot|ImagesiftBot|GPTBot|^$") {
    return 403;
}
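
The same check can also be written with a map block, which keeps the list in one place and avoids a long regex inside if. The following is only a minimal sketch under some assumptions: the variable name $block_ua is made up for illustration, only a few entries from the list are shown, and ~* makes the match case-insensitive (the rule above uses the case-sensitive ~). The map block belongs in the http context and the if in the server context.

```nginx
# http context: compute a flag from the User-Agent header.
# $block_ua is a hypothetical variable name chosen for this sketch.
map $http_user_agent $block_ua {
    default 0;
    # An empty User-Agent, equivalent to ^$ in the if-based rule.
    "" 1;
    # Case-insensitive match; extend with the remaining names as needed.
    ~*(AhrefsBot|SemrushBot|MJ12bot|Bytespider|GPTBot) 1;
}

server {
    # ... existing listen / server_name / location settings ...

    # Reject flagged requests.
    if ($block_ua) {
        return 403;
    }
}
```

After any change, nginx -t checks the configuration for syntax errors before reloading.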

# Temporary block; can be removed later.
if ($http_user_agent ~ "hubspot|rwth-aachen\.de|^$") {
    return 403;
}

HttpClient is sometimes malicious, but blocking it sometimes causes side effects.

Minor spiders:

if ($http_user_agent ~ "Phpzhanqun|HostHarvest|python-requests|^$") {
    return 403;
}

Amazonbot: In Amazon's own words, "Amazonbot is Amazon's web crawler used to improve our services, such as enabling Alexa to answer even more questions for customers. Amazonbot respects standard robots.txt rules." Safe to block.

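Go-http-client: This may be the spider that Alibaba Cloud (or Tencent Cloud) full-site acceleration uses to pick the best route. It may also be an HTTP client written in Go that some other program uses for scraping (https://www.cnblogs.com/rxbook/p/15167301.html). Either way it is not a normal browser, so it is blocked for now.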

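Bytespider: ByteDance's spider. Its crawl rate is too high, possibly because it is trying to build its database quickly. Its market share overseas is low, so it is blocked for now and should be unblocked later.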
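
When Bytespider is eventually unblocked, one option is to rate-limit it rather than return 403. The following is only a rough sketch using nginx's limit_req module; the variable name $bot_limit_key, the zone name bots, and the rate and burst values are all assumptions chosen for illustration, not measured settings. Bytespider would also have to be removed from the 403 list for this to take effect.

```nginx
# http context: only requests whose User-Agent matches get a non-empty key.
# Requests with an empty key are not counted by limit_req.
map $http_user_agent $bot_limit_key {
    default "";
    ~*Bytespider $binary_remote_addr;
}

# Hypothetical zone: 10 MB of shared state, 30 requests per minute per client IP.
limit_req_zone $bot_limit_key zone=bots:10m rate=30r/m;

server {
    location / {
        # Allow short bursts of up to 5 requests; excess requests are rejected
        # (503 by default, adjustable with limit_req_status).
        limit_req zone=bots burst=5 nodelay;
    }
}
```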

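Pro Sitemaps Generator: pro-sitemaps.com, a tool that generates sitemaps. It puts extra load on the site. There is no need to block it everywhere; add it only where it actually shows up.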

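Added 2024.5.25:
ImagesiftBot: a spider that crawls images for use by AI.
researchscan.comsys.rwth-aachen.de: a web security research scanner run by a German university (temporarily blocked; can be removed later).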

GPTBot: ChatGPT's spider. Blocked!