Baidu's search crawler acts like something is seriously wrong with it and has crawled my site to death; sharing rules for restricting search engine and AI crawlers

飞鱼 (UID: 4816)

Starting around noon yesterday, Baidu's crawler began hammering the site so hard that I couldn't open it myself: it would sit on the loading screen and then white-screen.

At first I thought the site was finally getting indexed, but it still wouldn't load today, so I had no choice but to rate-limit access. Going through the logs, I also found a huge number of AI crawler hits. There is nothing worth scraping on my shabby little site, so I just banned the lot. Below are the technical measures, for anyone who needs them. (Not interested? Feel free to close the page.)

 

Approach 1

First I filtered the AI crawlers by User-Agent with the free Nginx firewall plugin of the 宝塔 (BT) panel. Reference: https://www.52txr.cn/2025/banaicurl.html

Entries I appended on top of the reference list:

(Scrapy|AwarioBot|AI2Bot|Ai2Bot-Dolma|aiHitBot|anthropic-ai|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|GPTBot|img2dataset|OAI-SearchBot|Perplexity-User|PerplexityBot|PetalBot|SemrushBot-OCOB|TikTokSpider|VelenPublicWebCrawler|YouBot)
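
If you're not on the 宝塔 panel, the same list can be applied with a plain Nginx rule. A minimal sketch (my addition, not from the reference post; goes inside the server { } block):

    # Sketch: block the AI-crawler UA list without the firewall plugin,
    # using the same (deduplicated) regex as above
    if ($http_user_agent ~* "(Scrapy|AwarioBot|AI2Bot|Ai2Bot-Dolma|aiHitBot|anthropic-ai|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|GPTBot|img2dataset|OAI-SearchBot|Perplexity-User|PerplexityBot|PetalBot|SemrushBot-OCOB|TikTokSpider|VelenPublicWebCrawler|YouBot)") {
        return 403;
    }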

Then I used robots.txt to block the AI crawlers outright and to throttle Baidu's crawl rate:

# Baiduspider: access allowed, but throttle the crawl interval
User-Agent: Baiduspider
Crawl-delay: 5

# AI crawlers and special tools: deny the entire site
User-Agent: Scrapy
Disallow: /
User-Agent: AwarioBot
Disallow: /
User-Agent: SemrushBot-BA
Disallow: /
User-Agent: SemrushBot-SI
Disallow: /
User-Agent: SemrushBot-SWA
Disallow: /
User-Agent: SplitSignalBot
Disallow: /
User-Agent: SemrushBot-OCOB
Disallow: /
User-Agent: SemrushBot-FT
Disallow: /
User-Agent: AI2Bot
Disallow: /
User-Agent: Ai2Bot-Dolma
Disallow: /
User-Agent: aiHitBot
Disallow: /
User-Agent: Amazonbot
Disallow: /
User-Agent: anthropic-ai
Disallow: /
User-Agent: Applebot
Disallow: /
User-Agent: Applebot-Extended
Disallow: /
User-Agent: Brightbot 1.0
Disallow: /
User-Agent: Bytespider
Disallow: /
User-Agent: CCBot
Disallow: /
User-Agent: ChatGPT-User
Disallow: /
User-Agent: Claude-Web
Disallow: /
User-Agent: ClaudeBot
Disallow: /
User-Agent: cohere-ai
Disallow: /
User-Agent: cohere-training-data-crawler
Disallow: /
User-Agent: Cotoyogi
Disallow: /
User-Agent: Crawlspace
Disallow: /
User-Agent: Diffbot
Disallow: /
User-Agent: DuckAssistBot
Disallow: /
User-Agent: FacebookBot
Disallow: /
User-Agent: Factset_spyderbot
Disallow: /
User-Agent: FirecrawlAgent
Disallow: /
User-Agent: FriendlyCrawler
Disallow: /
User-Agent: Google-Extended
Disallow: /
User-Agent: GoogleOther
Disallow: /
User-Agent: GoogleOther-Image
Disallow: /
User-Agent: GoogleOther-Video
Disallow: /
User-Agent: GPTBot
Disallow: /
User-Agent: iaskspider/2.0
Disallow: /
User-Agent: ICC-Crawler
Disallow: /
User-Agent: ImagesiftBot
Disallow: /
User-Agent: img2dataset
Disallow: /
User-Agent: imgproxy
Disallow: /
User-Agent: ISSCyberRiskCrawler
Disallow: /
User-Agent: Kangaroo Bot
Disallow: /
User-Agent: Meta-ExternalAgent
Disallow: /
User-Agent: Meta-ExternalFetcher
Disallow: /
User-Agent: NovaAct
Disallow: /
User-Agent: OAI-SearchBot
Disallow: /
User-Agent: omgili
Disallow: /
User-Agent: omgilibot
Disallow: /
User-Agent: Operator
Disallow: /
User-Agent: PanguBot
Disallow: /
User-Agent: Perplexity-User
Disallow: /
User-Agent: PerplexityBot
Disallow: /
User-Agent: PetalBot
Disallow: /
User-Agent: Sidetrade indexer bot
Disallow: /
User-Agent: TikTokSpider
Disallow: /
User-Agent: Timpibot
Disallow: /
User-Agent: VelenPublicWebCrawler
Disallow: /
User-Agent: Webzio-Extended
Disallow: /
User-Agent: YouBot
Disallow: /
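
Side note: the robots.txt format lets several User-Agent lines share one rule group, so the wall of entries above could be collapsed. For example:

User-Agent: GPTBot
User-Agent: ClaudeBot
User-Agent: PerplexityBot
User-Agent: Bytespider
Disallow: /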

The result: absolutely no effect... which isn't surprising in hindsight, since robots.txt is purely advisory; crawlers that want to ignore it simply do, and Baiduspider reportedly doesn't support Crawl-delay at all.

 

Approach 2 (use together with Approach 1)

The plan Doubao (豆包) gave me.

Add the following to the global Nginx configuration file (inside the http { } block):

    # 1. Map Baiduspider's User-Agent to a rate-limit key (must be inside the http block).
    #    NOTE: the key must be EMPTY for everyone else, because limit_req skips
    #    requests whose key is empty; keying on $binary_remote_addr plus a 0/1
    #    flag would rate-limit every visitor, not just the spider.
    map $http_user_agent $baidu_spider_key {
        default "";
        "~*Baiduspider" $binary_remote_addr;  # Baiduspider: key on the client IP
    }


    # 2. Define the rate-limit zone (caps Baiduspider's request frequency)
    limit_req_zone $baidu_spider_key zone=baidu_spider:10m rate=100r/m;
    # rate=100r/m: at most 100 requests per minute per spider IP (tune to your server)
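
To check whether the throttle is actually biting: limit_req writes a "limiting requests, excess: ..." line to error.log whenever it rejects or delays a request, and the log level is tunable. Optional, my addition:

    # Optional: log rate-limit rejections at warn level (delays are logged
    # one level lower); watch error.log for "limiting requests" entries
    limit_req_log_level warn;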

 

Then add the following to the site's own config (inside the server { } block):

    # ------------------------ Thumbnail-specific optimization (full-path match) ------------------------
    # Match every image file under /_data/i/upload/, including dated
    # subdirectories such as /2024/08/08/
    location ~* ^/_data/i/upload/.*\.(jpg|jpeg|png|webp|avif|heic|heif)$ {
        # Strong 1-year cache (usable by both CDN and browser)
        add_header Cache-Control "public, max-age=31536000";
        # Fallback for older browsers (30-day Expires)
        expires 30d;
        # Disable access logging for thumbnails (less disk IO)
        access_log /dev/null;
        # Hotlink protection is inherited from the global rules (bad Referers
        # are blocked there already, no need to re-check here)
    }
    
    
    # ------------------------ AI crawler and original-image protection ------------------------
    # Flag the User-Agents to block (AI crawlers + malicious tools)
    set $block_ua 0;
    if ($http_user_agent ~* "(HTTrack|Apache-HttpClient|harvest|audit|dirbuster|pangolin|nmap|sqln|hydra|Parser|libwww|BBBike|sqlmap|w3af|owasp|Nikto|fimap|havij|zmeu|BabyKrokodil|netsparker|httperf|SF|AI2Bot|Ai2Bot-Dolma|aiHitBot|ChatGPT-User|ClaudeBot|cohere-ai|cohere-training-data-crawler|Diffbot|DuckAssistBot|GPTBot|img2dataset|OAI-SearchBot|Perplexity-User|PerplexityBot|Scrapy|TikTokSpider|VelenPublicWebCrawler)") {
        set $block_ua 1;
    }

    # Let legitimate search engines through (Baidu, Google, etc.); this runs
    # after the block above, so it wins for any UA that matches both lists
    if ($http_user_agent ~* "(Baiduspider|Googlebot|bingbot|YandexBot|Sogou web spider|Bytespider)") {
        set $block_ua 0;
    }

    # Hardened blocking for the original-image directory (/upload/):
    # only flagged UAs are rejected, normal visitors are unaffected
    location ~* ^/upload/ {
        if ($block_ua = 1) {
            return 403;
        }
        try_files $uri $uri/ =404;
    }
    
    
    # ------------------------ Rate-limit dynamic pages (only Baiduspider is affected) ------------------------
    location ~* ^/(picture.php|index.php) {
        # Apply the limit; $baidu_spider_key is empty for everyone else, so
        # only Baiduspider is throttled. burst=20 absorbs short spikes.
        limit_req zone=baidu_spider burst=20 nodelay;

        # Existing PHP handling (e.g. include enable-php-84.conf)
        include enable-php-84.conf;
    }
    

    # ------------------------ Other configuration ------------------------

The thumbnail paths and so on are specific to my site; adjust them to your actual setup.
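
One more optional tweak of my own, not part of the Doubao plan: limit_req answers rejected requests with 503 by default, which a crawler may read as the site being down; nginx's limit_req_status directive can switch that to 429 (Too Many Requests):

    # Optional: reply 429 instead of the default 503 when the baidu_spider
    # zone rejects a request (valid in http, server, or location context)
    limit_req_status 429;

Whatever you change, run nginx -t before reloading.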

 

Approach 3

Blacklist the IP ranges of the search engine and AI spiders (this will keep your content from being indexed).

I'll try lifting the blocks once the site has recovered.
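
For reference, the plain-Nginx version of this is just deny rules. A minimal sketch; the CIDRs below are RFC 5737 documentation placeholders, not real crawler ranges, so substitute the ones you see in your own access logs:

    # Sketch: blanket-deny crawler IP ranges (inside server { } or http { }).
    # Placeholder CIDRs only; use the real ranges from your logs.
    deny 192.0.2.0/24;        # placeholder (TEST-NET-1)
    deny 198.51.100.0/24;     # placeholder (TEST-NET-2)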

 
