Introduction
“Last week, a server disk filled up causing a business outage. Why wasn’t this discovered earlier?” This was the question I faced during a post-incident review meeting. That incident made me realize: passively waiting for monitoring alerts is not enough; proactive inspection is key to discovering hidden risks. After 2 years of refinement, I developed a complete set of server automated inspection scripts that automatically check the health status of 200+ servers daily, identifying dozens of potential failures in advance and preventing multiple possible business interruptions. These scripts have been running stably in multiple enterprises, performing over 100,000 cumulative checks with a fault prediction accuracy rate exceeding 95%. This article will fully share these 7 production-grade inspection scripts, all directly reusable, to help you establish a proactive operations system.
Technical Background: Why Automated Inspection is Needed?
Limitations of Monitoring Systems
Many enterprises deploy monitoring systems like Zabbix and Prometheus, but monitoring doesn’t solve all problems:
- Monitoring Blind Spots:
- Configuration omissions: Newly deployed servers or services might be forgotten to be added to monitoring
- Unreasonable thresholds: Alert thresholds set too high or too low, causing missed reports or false alarms
- Gradual problems: Disk usage slowly increases from 70% to 95%, potentially below alert threshold but already dangerous
- System-level details: Fine-grained checks like file descriptors, zombie processes, log anomalies
- Business logic checks: Processes exist but are deadlocked, unable to respond to requests
- Monitoring system failures: Monitoring agent crashes or network issues cause monitoring failure
Core Value of Automated Inspection
Inspection scripts complement monitoring systems with unique advantages:
- Proactive Problem Discovery: Systematically checks all key metrics without relying on alert rules
- Comprehensive: Covers system, network, application, logs, and multiple dimensions
- Customizable: Flexible customization of check items based on business characteristics
- Traceable: Generates detailed inspection reports for easy problem tracing
- Low Cost: Pure Shell scripts, no additional monitoring agent deployment required
- Offline Capability: Inspection still works even when monitoring system fails
Enterprise Inspection Application Scenarios
- Daily routine health checks (executed at 8 AM daily)
- Comprehensive pre-event checks (before Double 11)
- Post-change validation (executed 30 minutes after deployment)
- Rapid diagnosis during emergency response (manually triggered during failures)
- Audit and compliance checks (security baseline checks)
Core Content: Detailed Explanation of 7 Production-Grade Inspection Scripts
Script 1: Comprehensive System Health Check
This is the most basic yet important inspection script, checking the system’s core resource usage.
Usage:
bash
# Grant execution permission chmod +x system_health_check.sh # Manual execution ./system_health_check.sh # Scheduled execution (every day at 8 AM) echo "0 8 * * * /path/to/system_health_check.sh" | crontab -
Script 2: Deep Disk Space Check
This script focuses on detailed disk space analysis, quickly identifying the “culprits” occupying disk space.
Usage:
bash
chmod +x disk_space_check.sh ./disk_space_check.sh
Script 3: Network Connection and Port Check
Specifically checks network status, connection counts, port listening, and other network-related issues.
Usage:
bash
chmod +x network_check.sh ./network_check.sh
Script 4: Application Process Health Check
Checks the running status, resource usage, and interface responses of key business processes.
Usage:
bash
chmod +x process_health_check.sh ./process_health_check.sh
Script 5: Security Baseline Check
Checks server security configurations, including account security, permission configurations, SSH security, etc.
Usage:
bash
chmod +x security_baseline_check.sh ./security_baseline_check.sh
Script 6: Database Health Check (MySQL)
Specifically checks MySQL database running status, performance metrics, and configurations.
Usage:
bash
chmod +x mysql_health_check.sh ./mysql_health_check.sh
Script 7: Batch Server Inspection Scheduler
A batch execution script that can concurrently perform inspections on multiple servers and aggregate results.
Usage:
- Create a server list file
servers.txt:
text
# Web servers 192.168.1.10 192.168.1.11 192.168.1.12 # Application servers 192.168.1.20 192.168.1.21 # Database servers 192.168.1.30 192.168.1.31
- Execute the batch check:
bash
chmod +x batch_check_scheduler.sh ./batch_check_scheduler.sh
Practical Cases: From Manual to Automated Inspection Systems
Case 1: E-commerce Company Inspection System Construction
Background:
- 200+ servers requiring daily manual inspection
- Inspection work took 2 hours, prone to missed issues
- Multiple business interruptions due to full disks, process anomalies
Solution:
- Deployed Script 1 to all servers, automatically executed daily at 8 AM
- Used Script 7 on management machine for hourly inspections
- Aggregated inspection results into HTML reports, pushed via enterprise WeChat
- Added Script 4 for key business servers, executed every 5 minutes
Results:
- Inspection time reduced from 2 hours to 5 minutes (viewing reports)
- 15 potential failures discovered and handled in advance within 3 months
- Business interruptions reduced from average 3 times/month to 0
- Operations team transitioned from reactive firefighting to proactive operations
Case 2: Internet Company Double 11 Support
Background:
- Surge in traffic during Double 11, requiring high system stability
- Needed intensive server status monitoring
Inspection Strategy:
- 7 days before event: Daily comprehensive inspections, disk space cleanup, system parameter optimization
- 3 days before event: Added security baseline checks to ensure no security risks
- Event day: Core business process checks every 5 minutes
- During event: Real-time monitoring of disk I/O, network connections, database performance
Achievements:
- Discovered and resolved disk space issues on 8 servers before the event
- Zero failures during the event, stable system operation
- Inspection reports served as important reference for emergency plans
Best Practices and Considerations
1. Recommended Inspection Frequency
- Daily Health Check: Once daily (8 AM) – Script 1
- Disk Space Monitoring: Once daily – Script 2
- Network Status Check: Weekly – Script 3
- Process Health Check: Hourly (core business) – Script 4
- Security Baseline Check: Weekly – Script 5
- Database Health Check: Daily – Script 6
- Major Events: Every 5-15 minutes – All scripts
2. Alert Threshold Optimization
Initial Phase:
- Use default thresholds (e.g., CPU 80%, Memory 85%)
- Collect 1-2 weeks of inspection data
- Analyze normal business fluctuation ranges
Tuning Phase:
- Adjust thresholds based on actual business characteristics
- Avoid excessive alerts (cry wolf effect)
- Focus on trend changes rather than absolute values
3. Report Management
Report Retention:
bash
# Automatically clean inspection reports older than 30 days find /var/log -name "system_check_*.log" -mtime +30 -delete
Report Analysis:
- Establish weekly/monthly reporting system
- Focus on trend issues (e.g., continuously rising disk usage)
- Regularly review problems discovered during inspections
4. Permissions and Security
Execution Permissions:
- System check scripts require root privileges
- Use sudo to restrict executable commands for scripts
- Regularly audit script execution logs
SSH Key Management:
- Use dedicated keys for batch inspections
- Restrict sudo privileges for keys
- Regularly rotate keys
5. Integration with Monitoring Systems
Inspection scripts complement, rather than replace, monitoring systems:
Integration Options:
- Output inspection results in JSON format
- Push to monitoring platform via API
- Display inspection status on monitoring dashboard
- Trigger alerts for inspection anomalies
Zabbix Integration Example:
bash
# Send inspection results to Zabbix zabbix_sender -z zabbix-server -s $(hostname) -k check.status -o "$STATUS"
Summary and Outlook
Core Points Review
The 7 production-grade Shell inspection scripts shared in this article cover:
- System Health: CPU, memory, disk, network and other basic resources
- Disk Management: Deep analysis of disk usage, providing cleanup suggestions
- Network Diagnostics: Connection status, port listening, network connectivity
- Process Monitoring: Application process health, resource usage, interface response
- Security Audit: Account security, SSH configuration, file permissions
- Database: MySQL performance metrics, master-slave status, query analysis
- Batch Execution: Concurrent inspection of multiple servers, summary reports
Implementation Roadmap
Phase 1 (1-2 weeks):
- Deploy Script 1 on 1-2 test servers
- Verify script execution results, adjust alert thresholds
- Establish inspection report viewing process
Phase 2 (2-4 weeks):
- Full deployment of Script 1 to all servers
- Deploy Scripts 2-6 based on business characteristics
- Configure scheduled tasks for automated execution
Phase 3 (1-2 months):
- Deploy batch inspection scheduler
- Establish inspection report analysis and problem handling process
- Integrate with existing monitoring systems
Continuous Optimization:
- Adjust check items and thresholds based on actual usage
- Regularly review problems discovered during inspections
- Continuously improve and enrich inspection scripts