Server Automated Inspection Scripts Usage Guide

Introduction
“Last week, a server disk filled up causing a business outage. Why wasn’t this discovered earlier?” This was the question I faced during a post-incident review meeting. That incident made me realize: passively waiting for monitoring alerts is not enough; proactive inspection is key to discovering hidden risks. After 2 years of refinement, I developed a complete set of server automated inspection scripts that automatically check the health status of 200+ servers daily, identifying dozens of potential failures in advance and preventing multiple possible business interruptions. These scripts have been running stably in multiple enterprises, performing over 100,000 cumulative checks with a fault prediction accuracy rate exceeding 95%. This article will fully share these 7 production-grade inspection scripts, all directly reusable, to help you establish a proactive operations system.

Technical Background: Why Automated Inspection is Needed?

Limitations of Monitoring Systems
Many enterprises deploy monitoring systems like Zabbix and Prometheus, but monitoring doesn’t solve all problems:

Monitoring Blind Spots:
- Configuration omissions: Newly deployed servers or services might be forgotten to be added to monitoring
- Unreasonable thresholds: Alert thresholds set too high or too low, causing missed reports or false alarms
- Gradual problems: Disk usage slowly increases from 70% to 95%, potentially below alert threshold but already dangerous
- System-level details: Fine-grained checks like file descriptors, zombie processes, log anomalies
- Business logic checks: Processes exist but are deadlocked, unable to respond to requests
- Monitoring system failures: Monitoring agent crashes or network issues cause monitoring failure

Core Value of Automated Inspection
Inspection scripts complement monitoring systems with unique advantages:

Proactive Problem Discovery: Systematically checks all key metrics without relying on alert rules
Comprehensive: Covers system, network, application, logs, and multiple dimensions
Customizable: Flexible customization of check items based on business characteristics
Traceable: Generates detailed inspection reports for easy problem tracing
Low Cost: Pure Shell scripts, no additional monitoring agent deployment required
Offline Capability: Inspection still works even when monitoring system fails

Enterprise Inspection Application Scenarios

Daily routine health checks (executed at 8 AM daily)
Comprehensive pre-event checks (before Double 11)
Post-change validation (executed 30 minutes after deployment)
Rapid diagnosis during emergency response (manually triggered during failures)
Audit and compliance checks (security baseline checks)

Core Content: Detailed Explanation of 7 Production-Grade Inspection Scripts

Script 1: Comprehensive System Health Check
This is the most basic yet important inspection script, checking the system’s core resource usage.

Usage:

bash

# Grant execution permission
chmod +x system_health_check.sh

# Manual execution
./system_health_check.sh

# Scheduled execution (every day at 8 AM)
echo "0 8 * * * /path/to/system_health_check.sh" | crontab -

Script 2: Deep Disk Space Check
This script focuses on detailed disk space analysis, quickly identifying the “culprits” occupying disk space.

Usage:

bash

chmod +x disk_space_check.sh
./disk_space_check.sh

Script 3: Network Connection and Port Check
Specifically checks network status, connection counts, port listening, and other network-related issues.

Usage:

bash

chmod +x network_check.sh
./network_check.sh

Script 4: Application Process Health Check
Checks the running status, resource usage, and interface responses of key business processes.

Usage:

bash

chmod +x process_health_check.sh
./process_health_check.sh

Script 5: Security Baseline Check
Checks server security configurations, including account security, permission configurations, SSH security, etc.

Usage:

bash

chmod +x security_baseline_check.sh
./security_baseline_check.sh

Script 6: Database Health Check (MySQL)
Specifically checks MySQL database running status, performance metrics, and configurations.

Usage:

bash

chmod +x mysql_health_check.sh
./mysql_health_check.sh

Script 7: Batch Server Inspection Scheduler
A batch execution script that can concurrently perform inspections on multiple servers and aggregate results.

Usage:

Create a server list file servers.txt:

text

# Web servers
192.168.1.10
192.168.1.11
192.168.1.12

# Application servers  
192.168.1.20
192.168.1.21

# Database servers
192.168.1.30
192.168.1.31

Execute the batch check:

bash

chmod +x batch_check_scheduler.sh
./batch_check_scheduler.sh

Practical Cases: From Manual to Automated Inspection Systems

Case 1: E-commerce Company Inspection System Construction
Background:

200+ servers requiring daily manual inspection
Inspection work took 2 hours, prone to missed issues
Multiple business interruptions due to full disks, process anomalies

Solution:

Deployed Script 1 to all servers, automatically executed daily at 8 AM
Used Script 7 on management machine for hourly inspections
Aggregated inspection results into HTML reports, pushed via enterprise WeChat
Added Script 4 for key business servers, executed every 5 minutes

Results:

Inspection time reduced from 2 hours to 5 minutes (viewing reports)
15 potential failures discovered and handled in advance within 3 months
Business interruptions reduced from average 3 times/month to 0
Operations team transitioned from reactive firefighting to proactive operations

Case 2: Internet Company Double 11 Support
Background:

Surge in traffic during Double 11, requiring high system stability
Needed intensive server status monitoring

Inspection Strategy:

7 days before event: Daily comprehensive inspections, disk space cleanup, system parameter optimization
3 days before event: Added security baseline checks to ensure no security risks
Event day: Core business process checks every 5 minutes
During event: Real-time monitoring of disk I/O, network connections, database performance

Achievements:

Discovered and resolved disk space issues on 8 servers before the event
Zero failures during the event, stable system operation
Inspection reports served as important reference for emergency plans

Best Practices and Considerations

1. Recommended Inspection Frequency

Daily Health Check: Once daily (8 AM) – Script 1
Disk Space Monitoring: Once daily – Script 2
Network Status Check: Weekly – Script 3
Process Health Check: Hourly (core business) – Script 4
Security Baseline Check: Weekly – Script 5
Database Health Check: Daily – Script 6
Major Events: Every 5-15 minutes – All scripts

2. Alert Threshold Optimization
Initial Phase:

Use default thresholds (e.g., CPU 80%, Memory 85%)
Collect 1-2 weeks of inspection data
Analyze normal business fluctuation ranges

Tuning Phase:

Adjust thresholds based on actual business characteristics
Avoid excessive alerts (cry wolf effect)
Focus on trend changes rather than absolute values

3. Report Management
Report Retention:

bash

# Automatically clean inspection reports older than 30 days
find /var/log -name "system_check_*.log" -mtime +30 -delete

Report Analysis:

Establish weekly/monthly reporting system
Focus on trend issues (e.g., continuously rising disk usage)
Regularly review problems discovered during inspections

4. Permissions and Security
Execution Permissions:

System check scripts require root privileges
Use sudo to restrict executable commands for scripts
Regularly audit script execution logs

SSH Key Management:

Use dedicated keys for batch inspections
Restrict sudo privileges for keys
Regularly rotate keys

5. Integration with Monitoring Systems
Inspection scripts complement, rather than replace, monitoring systems:

Integration Options:

Output inspection results in JSON format
Push to monitoring platform via API
Display inspection status on monitoring dashboard
Trigger alerts for inspection anomalies

Zabbix Integration Example:

bash

# Send inspection results to Zabbix
zabbix_sender -z zabbix-server -s $(hostname) -k check.status -o "$STATUS"

Summary and Outlook

Core Points Review
The 7 production-grade Shell inspection scripts shared in this article cover:

System Health: CPU, memory, disk, network and other basic resources
Disk Management: Deep analysis of disk usage, providing cleanup suggestions
Network Diagnostics: Connection status, port listening, network connectivity
Process Monitoring: Application process health, resource usage, interface response
Security Audit: Account security, SSH configuration, file permissions
Database: MySQL performance metrics, master-slave status, query analysis
Batch Execution: Concurrent inspection of multiple servers, summary reports

Implementation Roadmap
Phase 1 (1-2 weeks):

Deploy Script 1 on 1-2 test servers
Verify script execution results, adjust alert thresholds
Establish inspection report viewing process

Phase 2 (2-4 weeks):

Full deployment of Script 1 to all servers
Deploy Scripts 2-6 based on business characteristics
Configure scheduled tasks for automated execution

Phase 3 (1-2 months):

Deploy batch inspection scheduler
Establish inspection report analysis and problem handling process
Integrate with existing monitoring systems

Continuous Optimization:

Adjust check items and thresholds based on actual usage
Regularly review problems discovered during inspections
Continuously improve and enrich inspection scripts

Easy Python

Server Automated Inspection Scripts Usage Guide

New Article

Related articles