Server Automated Inspection Scripts Usage Guide

paz 30/10/2025

Introduction
“Last week, a server disk filled up and caused a business outage. Why wasn’t this discovered earlier?” That was the question I faced in a post-incident review meeting. The incident made me realize that passively waiting for monitoring alerts is not enough; proactive inspection is the key to discovering hidden risks.

Over 2 years of refinement, I developed a complete set of automated server inspection scripts. They check the health of 200+ servers every day, have identified dozens of potential failures in advance, and have prevented multiple possible business interruptions. The scripts have been running stably in multiple enterprises, with over 100,000 cumulative checks and a fault prediction accuracy above 95%. This article shares all 7 production-grade inspection scripts, each directly reusable, to help you establish a proactive operations system.

Technical Background: Why Is Automated Inspection Needed?

Limitations of Monitoring Systems
Many enterprises deploy monitoring systems like Zabbix and Prometheus, but monitoring doesn’t solve all problems:

  • Monitoring Blind Spots:
    • Configuration omissions: Newly deployed servers or services may never be added to monitoring
    • Unreasonable thresholds: Alert thresholds set too high cause missed alerts; thresholds set too low cause false alarms
    • Gradual problems: Disk usage creeps from 70% to 95%, staying below the alert threshold while already being dangerous
    • System-level details: Fine-grained items such as file descriptors, zombie processes, and log anomalies are often not monitored
    • Business logic checks: A process exists but is deadlocked and cannot respond to requests
    • Monitoring system failures: A crashed monitoring agent or a network issue silently disables monitoring
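
Two of the system-level blind spots above, zombie processes and file descriptor usage, can be checked with a few lines of shell. The following is a minimal sketch, not one of the article's 7 scripts; the function names are hypothetical, and both functions read their input from stdin so they can be tried on sample data.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative only.

# Count zombie processes.
# Input: one process state string per line, as printed by `ps -eo stat=`.
count_zombies() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Print system-wide file handle usage as an integer percentage.
# Input: contents of /proc/sys/fs/file-nr ("allocated unused max").
fd_usage_percent() {
    awk '{ printf "%d\n", $1 * 100 / $3 }'
}
```

On a Linux host these could be fed with `ps -eo stat= | count_zombies` and `fd_usage_percent < /proc/sys/fs/file-nr`.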

Core Value of Automated Inspection
Inspection scripts complement monitoring systems with unique advantages:

  1. Proactive Problem Discovery: Systematically checks all key metrics without relying on alert rules
  2. Comprehensive: Covers multiple dimensions, including system, network, application, and logs
  3. Customizable: Flexible customization of check items based on business characteristics
  4. Traceable: Generates detailed inspection reports for easy problem tracing
  5. Low Cost: Pure Shell scripts, no additional monitoring agent deployment required
  6. Offline Capability: Inspection still works even when monitoring system fails

Enterprise Inspection Application Scenarios

  • Daily routine health checks (executed at 8 AM daily)
  • Comprehensive pre-event checks (before Double 11)
  • Post-change validation (executed 30 minutes after deployment)
  • Rapid diagnosis during emergency response (manually triggered during failures)
  • Audit and compliance checks (security baseline checks)

Core Content: Detailed Explanation of 7 Production-Grade Inspection Scripts

Script 1: Comprehensive System Health Check
This is the most basic yet most important inspection script; it checks the system’s core resource usage.

Usage:

bash

# Grant execution permission
chmod +x system_health_check.sh

# Manual execution
./system_health_check.sh

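# Scheduled execution (every day at 8 AM)
# Append to the existing crontab; piping a bare echo into `crontab -` would replace every existing entry
(crontab -l 2>/dev/null; echo "0 8 * * * /path/to/system_health_check.sh") | crontab -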
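
To give an idea of what a core resource check can look like, here is a minimal sketch covering disk and memory. It is not the author's system_health_check.sh; the function names and the 85% default threshold are assumptions. It assumes mount points without spaces and a kernel that reports MemAvailable.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the author's system_health_check.sh.

# Print a WARN line for every filesystem at or above the threshold.
# Input: `df -P` output on stdin. $1: threshold in percent (default 85).
check_disk_usage() {
    local threshold="${1:-85}"
    awk -v limit="$threshold" 'NR > 1 {
        use = $5; sub(/%/, "", use)
        if (use + 0 >= limit + 0) printf "WARN %s %s%%\n", $6, use
    }'
}

# Print memory usage as an integer percentage.
# Input: contents of /proc/meminfo on stdin.
mem_usage_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}
```

Typical calls would be `df -P | check_disk_usage 85` and `mem_usage_percent < /proc/meminfo`. Reading from stdin keeps the parsing logic testable against saved output.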

Script 2: Deep Disk Space Check
This script focuses on detailed disk space analysis, quickly identifying the “culprits” occupying disk space.

Usage:

bash

chmod +x disk_space_check.sh
./disk_space_check.sh
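
One common way to find the "culprits" is to rank the entries directly under a directory by size. The sketch below illustrates that idea only and is not the author's disk_space_check.sh; the function name is hypothetical, and `--max-depth` assumes GNU du.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the author's disk_space_check.sh.

# List the N largest entries directly under a directory, sizes in KiB.
# The first line is the directory itself (the total).
# $1: directory, $2: number of lines to print (default 10).
largest_entries() {
    local dir="$1" count="${2:-10}"
    # -x stays on one filesystem, -k reports KiB
    du -xk --max-depth=1 -- "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_entries /var 10` shows where space under /var is going; repeating the call on the largest subdirectory narrows it down further.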

Script 3: Network Connection and Port Check
Specifically checks network status, connection counts, port listening, and other network-related issues.

Usage:

bash

chmod +x network_check.sh
./network_check.sh
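
As an illustration of connection counting and port checks, the following sketch parses the output of `ss`. It is not the author's network_check.sh, and the function names are hypothetical. Both functions read `ss` output from stdin.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the author's network_check.sh.

# Summarise TCP connections by state.
# Input: `ss -tan` output on stdin (header line is skipped).
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Succeed if some socket's local port equals $1.
# Input: `ss -tln` output on stdin; column 4 is "Local Address:Port".
port_is_listening() {
    local port="$1"
    awk -v p="$port" 'NR > 1 { n = split($4, a, ":"); if (a[n] == p) found = 1 }
                      END { exit !found }'
}
```

Possible usage: `ss -tan | count_tcp_states` to spot a pile-up of TIME-WAIT or CLOSE-WAIT sockets, and `ss -tln | port_is_listening 3306 || echo "MySQL port not listening"`.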

Script 4: Application Process Health Check
Checks the running status, resource usage, and interface responses of key business processes.

Usage:

bash

chmod +x process_health_check.sh
./process_health_check.sh
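
A process can be present and still unable to serve requests, so a useful check combines a process lookup with an interface probe. The sketch below shows one way to do that. It is not the author's process_health_check.sh; the function names are hypothetical, and the process name and health URL are whatever the caller supplies.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the author's process_health_check.sh.

# Succeed if at least one process with exactly this name is running.
process_running() {
    pgrep -x -- "$1" > /dev/null
}

# Print the HTTP status code of a URL (curl prints 000 if the request fails).
# $1: URL, $2: timeout in seconds (default 5).
http_status() {
    local url="$1" timeout="${2:-5}"
    curl -s -o /dev/null --max-time "$timeout" -w '%{http_code}' "$url"
}

# Combine both checks into one status line.
# Returns 0 if healthy, 1 if the process is up but the probe fails, 2 if the process is missing.
check_service() {
    local name="$1" url="$2" code
    if ! process_running "$name"; then
        echo "CRITICAL $name: process not found"
        return 2
    fi
    code="$(http_status "$url")"
    if [ "$code" = "200" ]; then
        echo "OK $name"
    else
        echo "WARN $name: process up, HTTP $code"
        return 1
    fi
}
```

A call such as `check_service nginx http://127.0.0.1/health` (name and URL are placeholders) distinguishes a missing process from a hung one, which a process-count check alone cannot do.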

Script 5: Security Baseline Check
Checks server security configurations, including account security, permission configurations, SSH security, etc.

Usage:

bash

chmod +x security_baseline_check.sh
./security_baseline_check.sh
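
Two typical baseline items are extra accounts with UID 0 and the SSH root login setting. The sketch below shows how they might be checked. It is not the author's security_baseline_check.sh, and the function names are hypothetical. The sshd_config parser is deliberately simple: it ignores `Match` blocks and included files, so `sshd -T` remains the authoritative source for the effective configuration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the author's security_baseline_check.sh.

# Print accounts other than root that have UID 0.
# Input: passwd-format lines on stdin.
extra_uid0_accounts() {
    awk -F: '$3 == 0 && $1 != "root" { print $1 }'
}

# Print the first uncommented value of an sshd_config keyword.
# Keywords are matched case-insensitively. Input: sshd_config text on stdin.
sshd_setting() {
    local key="$1"
    awk -v k="$key" 'tolower($1) == tolower(k) { print $2; exit }'
}
```

Possible usage: `extra_uid0_accounts < /etc/passwd` should print nothing, and `sshd_setting PermitRootLogin < /etc/ssh/sshd_config` can be compared against the value your baseline requires.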

Script 6: Database Health Check (MySQL)
Specifically checks MySQL database running status, performance metrics, and configurations.

Usage:

bash

chmod +x mysql_health_check.sh
./mysql_health_check.sh
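
One performance metric that is easy to derive is connection usage: current connections as a share of `max_connections`. The following sketch computes it from the mysql client's batch output. It is not the author's mysql_health_check.sh; the function name is hypothetical, and it assumes credentials are supplied elsewhere (for example in `~/.my.cnf`).

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the author's mysql_health_check.sh.

# Print connection usage as an integer percentage.
# Input: "name value" lines containing Threads_connected and max_connections,
# as produced by the mysql client with -N -B.
connection_usage_percent() {
    awk '$1 == "Threads_connected" { used = $2 }
         $1 == "max_connections"   { max = $2 }
         END { if (max > 0) printf "%d\n", used * 100 / max }'
}

# Example invocation (needs a reachable MySQL server and client credentials):
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';
#                   SHOW GLOBAL VARIABLES LIKE 'max_connections';" \
#     | connection_usage_percent
```

With 45 connections and a limit of 150, the function prints 30.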

Script 7: Batch Server Inspection Scheduler
A batch execution script that can concurrently perform inspections on multiple servers and aggregate results.

Usage:

  1. Create a server list file servers.txt:

text

# Web servers
192.168.1.10
192.168.1.11
192.168.1.12

# Application servers  
192.168.1.20
192.168.1.21

# Database servers
192.168.1.30
192.168.1.31

  2. Execute the batch check:

bash

chmod +x batch_check_scheduler.sh
./batch_check_scheduler.sh
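
The core of a batch scheduler is small: read the host list, run the check on each host with bounded concurrency, and collect the results. The sketch below illustrates that structure. It is not the author's batch_check_scheduler.sh; all function names are hypothetical, the remote script path is a placeholder, and it assumes key-based SSH access and that comment and blank lines in servers.txt should be skipped.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the author's batch_check_scheduler.sh.

# Print host entries from a server list, skipping comments and blank lines.
read_servers() {
    sed -e 's/#.*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Run the remote check on one host. The remote script path is a placeholder.
check_host() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" \
        /path/to/system_health_check.sh < /dev/null
}

# Check every host in the list with at most max_jobs running at once.
# Writes one log per host plus summary.txt into outdir.
# $1: server list, $2: output directory, $3: max concurrent jobs (default 10).
run_batch() {
    local list="$1" outdir="$2" max_jobs="${3:-10}" host
    mkdir -p "$outdir"
    : > "$outdir/summary.txt"
    read_servers "$list" | {
        while read -r host; do
            # Throttle: pause while the number of running jobs is at the limit.
            while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do sleep 0.2; done
            {
                if check_host "$host" > "$outdir/$host.log" 2>&1; then
                    echo "$host OK"
                else
                    echo "$host FAILED"
                fi
            } >> "$outdir/summary.txt" &
        done
        wait
    }
}
```

A run such as `run_batch servers.txt ./reports 10` leaves one log per host and a summary that can be mailed or pushed to a chat channel. `BatchMode=yes` makes SSH fail fast instead of prompting for a password, which matters for unattended runs.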

Practical Cases: From Manual to Automated Inspection Systems

Case 1: E-commerce Company Inspection System Construction
Background:

  • 200+ servers requiring daily manual inspection
  • Each inspection round took 2 hours and was prone to missed issues
  • Multiple business interruptions due to full disks and process anomalies

Solution:

  1. Deployed Script 1 to all servers, automatically executed daily at 8 AM
  2. Used Script 7 on management machine for hourly inspections
  3. Aggregated inspection results into HTML reports, pushed via enterprise WeChat
  4. Added Script 4 for key business servers, executed every 5 minutes

Results:

  • Inspection time reduced from 2 hours to 5 minutes (viewing reports)
  • 15 potential failures discovered and handled in advance within 3 months
  • Business interruptions reduced from average 3 times/month to 0
  • Operations team transitioned from reactive firefighting to proactive operations

Case 2: Internet Company Double 11 Support
Background:

  • Surge in traffic during Double 11, requiring high system stability
  • Needed intensive server status monitoring

Inspection Strategy:

  1. 7 days before event: Daily comprehensive inspections, disk space cleanup, system parameter optimization
  2. 3 days before event: Added security baseline checks to ensure no security risks
  3. Event day: Core business process checks every 5 minutes
  4. During event: Real-time monitoring of disk I/O, network connections, database performance

Achievements:

  • Discovered and resolved disk space issues on 8 servers before the event
  • Zero failures during the event, stable system operation
  • Inspection reports served as important reference for emergency plans

Best Practices and Considerations

1. Recommended Inspection Frequency

  • Daily Health Check: Once daily (8 AM) – Script 1
  • Disk Space Monitoring: Once daily – Script 2
  • Network Status Check: Weekly – Script 3
  • Process Health Check: Hourly (core business) – Script 4
  • Security Baseline Check: Weekly – Script 5
  • Database Health Check: Daily – Script 6
  • Major Events: Every 5-15 minutes – All scripts
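
These frequencies map naturally onto cron entries. The sketch below adds an entry only if it is not already present, so it can be re-run safely. The function name is hypothetical, the script paths are placeholders, and the Monday slot for weekly checks is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper for installing schedule entries idempotently.

# Print the crontab read from stdin, with the given entry appended
# unless an identical line is already present.
add_cron_entry() {
    local entry="$1"
    awk -v e="$entry" '{ print; if ($0 == e) seen = 1 } END { if (!seen) print e }'
}

# Example usage (this modifies the current user's crontab):
#   crontab -l 2>/dev/null \
#     | add_cron_entry "0 8 * * * /path/to/system_health_check.sh" \
#     | add_cron_entry "0 * * * * /path/to/process_health_check.sh" \
#     | add_cron_entry "0 8 * * 1 /path/to/security_baseline_check.sh" \
#     | crontab -
```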

2. Alert Threshold Optimization
Initial Phase:

  • Use default thresholds (e.g., CPU 80%, Memory 85%)
  • Collect 1-2 weeks of inspection data
  • Analyze normal business fluctuation ranges

Tuning Phase:

  • Adjust thresholds based on actual business characteristics
  • Avoid excessive alerts (cry wolf effect)
  • Focus on trend changes rather than absolute values
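
Focusing on trends can be as simple as extrapolating from two samples. The sketch below estimates how many days remain until a filesystem is full, assuming linear growth. The function name is hypothetical, and a real deployment would likely use more than two data points.

```shell
#!/usr/bin/env bash
# Hypothetical helper: linear extrapolation from two usage samples.

# Estimate whole days until usage reaches 100%.
# $1: usage percent at the earlier sample, $2: usage percent now,
# $3: days between the two samples. Prints "n/a" if usage is not growing.
days_until_full() {
    awk -v old="$1" -v now="$2" -v days="$3" 'BEGIN {
        rate = (now - old) / days
        if (rate <= 0) { print "n/a"; exit }
        printf "%d\n", (100 - now) / rate
    }'
}
```

For a disk that went from 70% to 77% over 7 days, `days_until_full 70 77 7` prints 23: growth of 1 point per day with 23 points left. A disk at 77% would not trip an 80% threshold, yet the trend shows it filling in about three weeks.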

3. Report Management
Report Retention:

bash

# Automatically clean inspection reports older than 30 days (regular files only)
find /var/log -type f -name "system_check_*.log" -mtime +30 -delete

Report Analysis:

  • Establish weekly/monthly reporting system
  • Focus on trend issues (e.g., continuously rising disk usage)
  • Regularly review problems discovered during inspections

4. Permissions and Security
Execution Permissions:

  • Some system checks require root privileges
  • Use sudo rules to restrict which commands the scripts may run as root
  • Regularly audit script execution logs

SSH Key Management:

  • Use dedicated keys for batch inspections
  • Restrict sudo privileges for the account the inspection key logs in as
  • Regularly rotate keys

5. Integration with Monitoring Systems
Inspection scripts complement, rather than replace, monitoring systems:

Integration Options:

  1. Output inspection results in JSON format
  2. Push to monitoring platform via API
  3. Display inspection status on monitoring dashboard
  4. Trigger alerts for inspection anomalies
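
For the first option, a result can be emitted as one JSON object per line using only `printf`. The sketch below is a minimal illustration. The function name and field names are hypothetical, and it assumes the string values contain no quotes, backslashes, or control characters (use a tool such as `jq` if they might).

```shell
#!/usr/bin/env bash
# Hypothetical helper: field names are illustrative, not a fixed schema.

# Emit one inspection result as a single-line JSON object.
# $1: host, $2: check name, $3: status, $4: numeric value,
# $5: timestamp (optional, defaults to the current UTC time).
result_json() {
    local host="$1" check="$2" status="$3" value="$4"
    local ts="${5:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
    printf '{"host":"%s","check":"%s","status":"%s","value":%s,"timestamp":"%s"}\n' \
        "$host" "$check" "$status" "$value" "$ts"
}
```

For example, `result_json "$(hostname)" disk_root WARN 91` produces a line that can be appended to a report file or posted to a monitoring platform's HTTP API.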

Zabbix Integration Example:

bash

# Send inspection results to Zabbix
zabbix_sender -z zabbix-server -s "$(hostname)" -k check.status -o "$STATUS"

Summary and Outlook

Core Points Review
The 7 production-grade Shell inspection scripts shared in this article cover:

  1. System Health: CPU, memory, disk, network and other basic resources
  2. Disk Management: Deep analysis of disk usage, providing cleanup suggestions
  3. Network Diagnostics: Connection status, port listening, network connectivity
  4. Process Monitoring: Application process health, resource usage, interface response
  5. Security Audit: Account security, SSH configuration, file permissions
  6. Database: MySQL performance metrics, master-slave status, query analysis
  7. Batch Execution: Concurrent inspection of multiple servers, summary reports

Implementation Roadmap
Phase 1 (1-2 weeks):

  • Deploy Script 1 on 1-2 test servers
  • Verify script execution results, adjust alert thresholds
  • Establish inspection report viewing process

Phase 2 (2-4 weeks):

  • Full deployment of Script 1 to all servers
  • Deploy Scripts 2-6 based on business characteristics
  • Configure scheduled tasks for automated execution

Phase 3 (1-2 months):

  • Deploy batch inspection scheduler
  • Establish inspection report analysis and problem handling process
  • Integrate with existing monitoring systems

Continuous Optimization:

  • Adjust check items and thresholds based on actual usage
  • Regularly review problems discovered during inspections
  • Continuously improve and enrich inspection scripts