Nginx Rate Limiting and Anti-Crawler Configuration Solution – Operations Engineer Practical Guide

leo 01/11/2025

Introduction
In today’s rapidly developing internet business landscape, websites face various traffic surges and malicious crawler threats. As operations engineers, we need to effectively protect against malicious traffic and crawler attacks while ensuring normal user access. This article deeply explores Nginx-based rate limiting and anti-crawler solutions, providing a complete protection system from principles to practice.

I. Why Rate Limiting and Anti-Crawler Protection?
Business Pain Points Analysis
In actual operations work, we frequently encounter these problems:

  1. Traffic spikes causing excessive server pressure: Sudden legitimate business traffic surges or CC attacks
  2. Resource consumption by malicious crawlers: Frequent crawler requests wasting bandwidth and increasing server load
  3. Data leakage risks: Sensitive information being maliciously collected in bulk
  4. Degraded user experience: Slow or inaccessible normal user visits

Technical Selection Advantages
Choosing Nginx as the core component for rate limiting and anti-crawler protection offers these advantages:
• High performance: Event-driven model handling tens of thousands of concurrent connections per machine
• Low memory usage: Lower resource consumption compared to traditional servers like Apache
• Modular design: Rich third-party modules supporting various functional extensions
• Flexible configuration: Supports complex rule configurations and dynamic updates

II. Nginx Rate Limiting Core Principles
Token Bucket Algorithm
A common flow-control model, often used to describe burst-tolerant limiting. Core concepts:

  1. The system adds tokens to the bucket at a constant rate
  2. Each request must take a token from the bucket before proceeding
  3. When the bucket is full, newly generated tokens overflow and are discarded
  4. When the bucket is empty, requests are rejected or delayed

Leaky Bucket Algorithm
The algorithm that ngx_http_limit_req_module actually implements (per the official nginx documentation), smoothing traffic to a constant output rate:
• Requests enter the bucket as a queue
• Requests leak out and are processed at a fixed rate
• New requests are discarded when the bucket is full
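These algorithms map directly onto limit_req parameters: rate is the steady drain rate, burst is the bucket depth, and delay/nodelay control whether queued requests are forwarded immediately or paced. A minimal sketch (the zone name and the two-tier delay value are illustrative only; the delay= parameter requires nginx 1.15.7+):

```nginx
http {
    # 1 MB of shared memory stores roughly 16,000 IP states
    limit_req_zone $binary_remote_addr zone=demo:1m rate=10r/s;

    server {
        listen 80;

        location / {
            # burst=20: up to 20 requests may queue above the steady rate.
            # delay=10 serves the first 10 excess requests immediately and
            # paces the rest; use nodelay to serve the whole burst at once,
            # or omit both to pace every queued request.
            limit_req zone=demo burst=20 delay=10;
            return 200 "ok\n";
        }
    }
}
```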

III. Basic Rate Limiting Configuration
3.1 IP-based Request Frequency Limiting

nginx

http {
    limit_req_zone $binary_remote_addr zone=ip_limit:10m rate=10r/s;
    limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
    
    server {
        listen 80;
        server_name example.com;
        
        location / {
            limit_req zone=ip_limit burst=5 nodelay;
            limit_conn conn_limit 10;
            limit_req_status 429;
            limit_conn_status 429;
            proxy_pass http://backend;
        }
        
        error_page 429 /429.html;
        location = /429.html {
            root /var/www/html;
            internal;
        }
    }
}
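A common companion to the block above is exempting trusted networks. The pattern documented by nginx gives trusted clients an empty zone key, and requests with an empty key are never limited; the address ranges here are placeholders for your own internal networks:

```nginx
http {
    # 0 = trusted (not limited), 1 = everyone else
    geo $limit {
        default        1;
        10.0.0.0/8     0;
        192.168.0.0/16 0;
    }

    # Trusted clients get an empty key, so limit_req skips them
    map $limit $limit_key {
        0 "";
        1 $binary_remote_addr;
    }

    limit_req_zone $limit_key zone=ip_limit:10m rate=10r/s;
}
```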

3.2 URI-based Differentiated Limiting

nginx

http {
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=5r/s;
    limit_req_zone $binary_remote_addr zone=static_limit:10m rate=50r/s;
    limit_req_zone $binary_remote_addr zone=login_limit:10m rate=1r/s;
    
    server {
        listen 80;
        server_name api.example.com;
        
        location /api/ {
            limit_req zone=api_limit burst=2 nodelay;
            proxy_pass http://api_backend;
        }
        
        location ~* \.(jpg|jpeg|png|gif|css|js)$ {
            limit_req zone=static_limit burst=20;
            expires 1d;
        }
        
        location /api/login {
            limit_req zone=login_limit burst=1;
            access_log /var/log/nginx/login_limit.log combined;
            proxy_pass http://auth_backend;
        }
    }
}
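Before enforcing new limits in production, limit_req_dry_run (nginx 1.17.1+) evaluates and logs would-be rejections without actually blocking anything, which pairs well with a gradual rollout. The /api/v2/ location below is a hypothetical path reusing the api_limit zone from the example above:

```nginx
location /api/v2/ {
    limit_req zone=api_limit burst=2 nodelay;
    limit_req_dry_run on;        # log, but do not reject (nginx 1.17.1+)
    limit_req_log_level warn;    # severity of "limiting requests" log lines
    proxy_pass http://api_backend;
}
```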

3.3 Geographic-based Limiting with GeoIP2

nginx

# Requires the third-party ngx_http_geoip2_module and a MaxMind GeoLite2
# database. Note: a variable in the rate parameter of limit_req_zone
# requires nginx 1.23.0 or later.
http {
    geoip2 /usr/share/GeoIP/GeoLite2-Country.mmdb {
        $geoip2_data_country_code country iso_code;
        $geoip2_data_country_name country names en;
    }
    
    map $geoip2_data_country_code $country_limit_rate {
        default 10r/s;
        CN 20r/s;
        US 15r/s;
        ~^(RU|UA)$ 5r/s;
    }
    
    limit_req_zone $binary_remote_addr zone=country_limit:10m rate=$country_limit_rate;
}

IV. Advanced Anti-Crawler Strategies
4.1 User-Agent Detection and Filtering

nginx

http {
    map $http_user_agent $is_crawler {
        default 0;
        ~*bot 1;
        ~*spider 1;
        ~*crawler 1;
        ~*scraper 1;
        ~*python-requests 1;
        ~*curl 1;
        "" 1;
        ~^.{0,10}$ 1;
    }
    
    map $http_user_agent $allowed_crawler {
        default 0;
        ~*googlebot 1;
        ~*bingbot 1;
        ~*baiduspider 1;
    }
    
    server {
        location / {
            if ($is_crawler) {
                set $block_crawler 1;
            }
            if ($allowed_crawler) {
                set $block_crawler 0;
            }
            if ($block_crawler) {
                return 403;
            }
            proxy_pass http://backend;
        }
    }
}
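The chain of if blocks above works, but nginx's rewrite-phase if has well-known pitfalls. The same allow-then-block logic can be collapsed into a single map, since exact matches win first and then the first matching regex in order of appearance wins:

```nginx
map $http_user_agent $block_crawler {
    default 0;
    # Allow-listed bots are matched first and never blocked. In production,
    # verify them via reverse DNS, since a User-Agent is trivially forged.
    ~*(googlebot|bingbot|baiduspider) 0;
    ~*(bot|spider|crawler|scraper|python-requests|curl) 1;
    ""         1;    # empty User-Agent
    ~^.{1,10}$ 1;    # implausibly short User-Agent
}

server {
    location / {
        if ($block_crawler) {
            return 403;
        }
        proxy_pass http://backend;
    }
}
```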

4.2 Request Pattern Analysis

nginx

http {
    map $http_referer $suspicious_referer {
        default 0;
        "" 1;
        "-" 1;
    }
    
    # limit_req is not valid inside an "if" block; the idiomatic alternative
    # is a conditional zone key: clients whose key is empty are not limited.
    map $suspicious_referer $suspicious_key {
        default "";
        1 $binary_remote_addr;
    }
    
    limit_req_zone $suspicious_key zone=freq_check:10m rate=1r/s;
    
    server {
        location / {
            limit_req zone=freq_check burst=1 nodelay;
            
            if ($suspicious_referer) {
                access_log /var/log/nginx/suspicious.log combined;
            }
            
            proxy_pass http://backend;
        }
    }
}

V. Dynamic Protection and Monitoring
5.1 Real-time Monitoring and Alerting

nginx

http {
    log_format security_log '$remote_addr - $remote_user [$time_local] '
                          '"$request" $status $body_bytes_sent '
                          '"$http_referer" "$http_user_agent" '
                          '$request_time $upstream_response_time '
                          '$geoip2_data_country_code';
    
    # $limit_req_status (nginx 1.17.6+) holds PASSED/DELAYED/REJECTED, not an
    # HTTP status code. Log variables are evaluated at log-writing time, so an
    # "if=" condition on a map works where a rewrite-phase "if" block would not.
    map $limit_req_status $was_limited {
        default  0;
        REJECTED 1;
    }
    
    server {
        location / {
            access_log /var/log/nginx/security.log security_log;
            access_log /var/log/nginx/rate_limit.log security_log if=$was_limited;
            
            proxy_pass http://backend;
        }
    }
}
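With rejected requests isolated in their own log file, a quick tally shows which clients are being limited most. The helper below is a sketch that reads log lines on stdin; sample lines are inlined here, and in practice you would pipe in /var/log/nginx/rate_limit.log instead:

```shell
# Count hits per client IP (field 1 of each log line), busiest first
top_limited() {
    awk '{print $1}' | sort | uniq -c | sort -rn
}

# Inlined sample lines stand in for real log entries
RESULT=$(printf '1.2.3.4 a\n1.2.3.4 b\n5.6.7.8 c\n' | top_limited | head -n 1)
echo "$RESULT"
```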

5.2 Automated Blacklist Management

bash

#!/bin/bash
# auto_blacklist.sh - Automated blacklist script
# Expects the security_log format above; field 9 is the status code

LOG_FILE="/var/log/nginx/security.log"
BLACKLIST_FILE="/etc/nginx/conf.d/blacklist.conf"

awk -v date="$(date '+%d/%b/%Y:%H')" '
$0 ~ date {
    ip = $1
    if ($9 == "429" || $9 == "403") suspicious[ip]++
    total[ip]++
}
END {
    for (ip in suspicious) {
        if (suspicious[ip] > 100) {
            print "deny " ip ";"
        }
    }
}
' "$LOG_FILE" > "$BLACKLIST_FILE"

# Apply only if the resulting configuration is valid
nginx -t && nginx -s reload
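The awk filter above can be sanity-checked against a synthetic log before pointing it at production; the IPs, hit counts, and paths below are test fixtures only:

```shell
# Build a synthetic log: 102 blocked hits from one IP, one normal hit from another
LOG=$(mktemp)
HOUR=$(date '+%d/%b/%Y:%H')
i=0
while [ "$i" -le 101 ]; do
    echo "1.2.3.4 - - [$HOUR:00:00 +0000] \"GET /api/x HTTP/1.1\" 429 512" >> "$LOG"
    i=$((i + 1))
done
echo "5.6.7.8 - - [$HOUR:00:00 +0000] \"GET / HTTP/1.1\" 200 128" >> "$LOG"

# Same extraction logic as the blacklist script
RESULT=$(awk -v date="$HOUR" '
$0 ~ date {
    if ($9 == "429" || $9 == "403") suspicious[$1]++
}
END {
    for (ip in suspicious)
        if (suspicious[ip] > 100) print "deny " ip ";"
}' "$LOG")
rm -f "$LOG"
echo "$RESULT"
```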

VI. Performance Optimization Best Practices
6.1 Memory Usage Optimization

nginx

http {
    limit_req_zone $binary_remote_addr zone=main_limit:50m rate=10r/s;
    
    map $request_uri $normalized_uri {
        ~^/api/v1/([^/]+) /api/v1/$1;
        ~^/static/ /static;
        default $request_uri;
    }
    
    limit_req_zone "$binary_remote_addr:$normalized_uri"
                   zone=uri_limit:30m rate=20r/s;
}

6.2 Modular Configuration
Split configurations into reusable modules:

  • /etc/nginx/conf.d/rate_limits.conf
  • /etc/nginx/conf.d/security_maps.conf
  • /etc/nginx/conf.d/security_headers.conf
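These files are then pulled in from the main configuration. Note that map and limit_req_zone definitions must be included before the servers that reference them; a default nginx.conf usually already includes conf.d/*.conf at the http level, so this sketch just makes the ordering explicit:

```nginx
# nginx.conf (http context)
http {
    # Order matters: map and limit_req_zone definitions must precede their use
    include /etc/nginx/conf.d/security_maps.conf;
    include /etc/nginx/conf.d/rate_limits.conf;
    include /etc/nginx/conf.d/security_headers.conf;
}
```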

VII. Troubleshooting and Debugging
7.1 Common Issue Diagnosis

bash

# Test rate limiting
for i in {1..20}; do 
    curl -s -o /dev/null -w "%{http_code}\n" http://example.com/api/test
done

# Check configuration
nginx -T | grep -A 10 limit_req_zone

7.2 Performance Monitoring Script

bash

#!/bin/bash
check_nginx_performance() {
    echo "=== Nginx Performance Report ==="
    echo "Established connections on port 80:"
    # "ss -tln" would count listening sockets only; filter established ones
    ss -Htn state established '( sport = :80 )' | wc -l
    echo "Status code distribution (last 100 requests):"
    tail -100 /var/log/nginx/access.log | \
    awk '{print $9}' | sort | uniq -c | sort -nr
}

VIII. Summary and Outlook
This solution provides a complete Nginx-based rate limiting and anti-crawler system with these characteristics:

Core Advantages

  1. Multi-layer protection: Progressive layers from basic rate limiting to advanced anti-crawler defenses
  2. Intelligent identification: Comprehensive judgment combining multiple characteristics
  3. Performance optimization: Considers high concurrency scenario performance
  4. Operations-friendly: Complete monitoring and automated management

Implementation Recommendations

  1. Progressive implementation: Start with basic rate limiting, gradually add advanced features
  2. Gradual rollout: Test new strategies in small scope before full deployment
  3. Continuous monitoring: Establish complete monitoring system for timely issue detection
  4. Regular optimization: Adjust parameters and strategies based on actual results

Technology Trends
With AI and machine learning development, future protection solutions will become more intelligent:
• Behavior analysis: Intelligent identification based on user behavior patterns
• Real-time learning: Adaptive protection strategy adjustments
• Collaborative defense: Threat intelligence sharing between multiple nodes