It's 3:00 AM, and an alert fires: production-api-01 response times are spiking. You SSH into the box. Where do you look first? How do you tell whether the issue is a CPU bottleneck, a memory leak, swap thrashing, or disk I/O saturation?
In this guide, we'll build a structured troubleshooting framework to diagnose performance problems under pressure. We will explore core Linux metrics and learn to identify bottlenecks using standard system utilities.
Rule of Thumb: Never guess. Measure, isolate, and verify using tools that query the kernel directly via the /proc filesystem.
The First Line of Defense: Load Averages
Your entry point is the load average metric. Run the uptime command or inspect the top panel of top/htop.
    dinesh@prod-srv ~ ❯ uptime
     15:32:04 up 42 days,  3:12,  2 users,  load average: 8.42, 4.10, 2.15
Load averages represent the average number of processes in a runnable or uninterruptible state over 1, 5, and 15 minutes:
- Runnable (CPU): Processes using or waiting for a CPU core.
- Uninterruptible (Disk/IO): Processes blocked in uninterruptible sleep, typically waiting for disk I/O (including network filesystems such as NFS) to complete.
If the load average is 8.42 on a 4-core machine, that is about 2.1 tasks per core, roughly 110% more work than the cores can service at once. However, the number alone does not tell you whether those tasks are waiting for CPU or blocked waiting for disk. Let's isolate the root cause.
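The per-core arithmetic above can be sketched as a tiny shell helper. The function name `load_per_core` is hypothetical and the one-decimal rounding is an arbitrary choice. It only divides a load figure by a core count.

```shell
#!/bin/sh
# Hypothetical helper: load_per_core LOAD CORES
# Prints the load average divided by the core count, to one decimal place.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", load / cores }'
}

# Example with the figures from the text (8.42 on 4 cores):
#   load_per_core 8.42 4
#
# Live usage on a Linux host, taking the 1-minute load from /proc/loadavg
# and the core count from nproc:
#   load_per_core "$(cut -d " " -f1 /proc/loadavg)" "$(nproc)"
```

A result well above 1.0 means more tasks are runnable or blocked than there are cores, which is the cue to move on to the checks in the following sections.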
1. CPU Saturation Diagnostic
To inspect CPU distribution, run vmstat 1 5, which prints five samples at one-second intervals. The first row reports averages since boot, so base your reading on the rows that follow it.
    dinesh@prod-srv ~ ❯ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     6  0      0 824300  92040 1832040    0    0     4    20 1205 2400 85 10  3  2  0
     7  0      0 812900  92040 1832040    0    0     0     0 1420 3102 90  8  0  2  0
Look at the CPU section columns on the far right:
- us (user): Time spent running non-kernel code (app servers, databases). If high, look at application logic first.
- sy (system): Time spent running kernel code. High system CPU can point to heavy syscall activity, excessive context switching, or driver issues.
- id (idle): Percentage of time the CPU is idle.
- wa (iowait): Time the CPU sat idle while disk I/O was outstanding. If high, the bottleneck is likely storage, not CPU capacity.
- st (steal): Time taken by the hypervisor for other virtual machines. Relevant only on virtualized hosts.
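As a rough sketch of how these columns can be read mechanically, the helper below classifies the last sample of vmstat output. The name `cpu_hint` and both thresholds (wa of 20 or more, us plus sy of 80 or more) are assumptions chosen for illustration, not established limits. Columns are located by header name because the layout can differ between vmstat versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `vmstat` output on stdin and classifies the last sample.
# Thresholds (wa >= 20, us + sy >= 80) are illustrative assumptions.
cpu_hint() {
    awk '
        # Header row: remember the position of each column by name.
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows start with a number; keep the values from the most recent one.
        $1 ~ /^[0-9]+$/ && ("us" in col) {
            us = $col["us"]; sy = $col["sy"]; wa = $col["wa"]; seen = 1
        }
        END {
            if (!seen)         { print "no-data";   exit }
            if (wa >= 20)      { print "io-wait";   exit }
            if (us + sy >= 80) { print "cpu-bound"; exit }
            print "ok"
        }
    '
}

# Usage on a live host:
#   vmstat 1 5 | cpu_hint
```

Fed the sample output shown earlier, the last row has us=90 and sy=8, so the helper reports cpu-bound.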
2. Memory & Swap Bottlenecks
A common misconception in Linux is that "low free memory" is bad. Linux utilizes unused memory for file caches and buffers to speed up operations. The real indicator of memory pressure is thrashing (swapping active memory pages to disk).
    dinesh@prod-srv ~ ❯ free -m
                  total        used        free      shared  buff/cache   available
    Mem:           7980        3520         420         120        4040        4100
    Swap:          2048         450        1598
Key memory diagnostics:
- Compare available memory (not free memory) to the total memory. Available memory is the kernel's estimate of how much memory new workloads can claim without swapping, counting cache that can be reclaimed.
- Inspect the si (swap in) and so (swap out) columns in vmstat. If the swap-out rate (so) is consistently greater than zero, the kernel is short on physical RAM and is writing pages to disk, which causes latency spikes.
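Both checks can be sketched in shell. The function names `mem_available_pct` and `swap_out_samples` are hypothetical. The first reads /proc/meminfo-style text and assumes a kernel that exposes the MemAvailable field. The second counts how many vmstat samples show a non-zero so value.

```shell
#!/bin/sh
# Hypothetical helper: reads /proc/meminfo-style text on stdin and prints
# MemAvailable as a whole-number percentage of MemTotal (truncated, not rounded).
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0 && avail != "") { printf "%d\n", avail * 100 / total }
            else                          { print "unknown" }
        }
    '
}

# Hypothetical helper: reads `vmstat` output on stdin and prints the number
# of samples whose swap-out column (so) is greater than zero.
swap_out_samples() {
    awk '
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $1 ~ /^[0-9]+$/ && ("so" in col) && $col["so"] > 0 { n++ }
        END { print n + 0 }
    '
}

# Usage on a live host:
#   mem_available_pct < /proc/meminfo
#   vmstat 1 5 | swap_out_samples
```

Keep in mind that the first vmstat row is a since-boot average, so a count of 1 from a five-sample run may reflect history rather than current swapping.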
3. Disk I/O Saturation
When wa is high, storage is the likely bottleneck. To find out which disk is carrying the traffic, run iostat (part of the sysstat package).
    dinesh@prod-srv ~ ❯ iostat -xz 1 2
    Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s  %util
    sda        0.00    4.20   1.20  245.00   0.02   8.40  88.50
    sdb        0.00    0.00   0.00    0.00   0.00   0.00   0.00
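Focus on the %util column: the percentage of elapsed time during which the device had I/O requests in flight. A disk close to 100% is busy nearly all the time. For a single spinning disk that means saturation, though SSDs and RAID arrays that serve requests in parallel can still have headroom at high %util. Use iotop -o (requires root permissions) to see which processes are actually performing I/O.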
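To pick out busy devices without reading the table by eye, something like the following could work. The name `busy_disks` and the default threshold of 80 are assumptions. The %util column is found by header name, since the column order differs across sysstat versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `iostat -x` output on stdin and prints each device
# whose %util meets or exceeds a threshold (default 80, an illustrative value).
busy_disks() {
    awk -v limit="${1:-80}" '
        # Header row: locate the %util column by name.
        $1 ~ /^Device/ {
            util = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") util = i
            next
        }
        # A blank line ends the device table (CPU stats follow in real output).
        NF == 0 { util = 0; next }
        util && $util + 0 >= limit { print $1, $util }
    '
}

# Usage on a live host (optional argument overrides the threshold):
#   iostat -xz 1 2 | busy_disks
#   iostat -xz 1 2 | busy_disks 95
```

With the sample table above, the default threshold flags sda at 88.50 and leaves sdb out. On a live run the first report covers the period since boot, so the same device may be listed once per report.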
The Quick Diagnostic Flowchart
When you jump on a server, run these four commands in sequence to narrow down most performance incidents in about a minute:
    # 1. Check general load averages
    uptime
    # 2. Check CPU utilization profiles and swap activity in real-time
    vmstat 1 5
    # 3. Check actual memory consumption vs available caches
    free -m
    # 4. Check detailed disk device utilization rates
    iostat -xz 1 5
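If you want the sequence as a single script, one possible sketch is below. The wrapper name `triage_step` is hypothetical. It labels each step and skips any tool that is not installed, which matters because iostat comes from the sysstat package and may be absent on a minimal server.

```shell
#!/bin/sh
# Hypothetical wrapper: print a label, then run the command only if it is installed.
triage_step() {
    label=$1
    shift
    printf '== %s ==\n' "$label"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        printf 'skipped: %s not found\n' "$1"
    fi
}

# Usage: the four checks in order.
#   triage_step "load averages" uptime
#   triage_step "cpu and swap"  vmstat 1 5
#   triage_step "memory"        free -m
#   triage_step "disk"          iostat -xz 1 5
```

Labeling the output makes it easier to paste the whole run into an incident channel and still see which command produced which lines.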
Conclusion
Linux performance tuning begins with data collection. By inspecting system metrics systematically, you can quickly find bottlenecks. Instead of throwing CPU cores or memory at a server issue, you can determine whether a database needs caching, whether write-heavy tasks should be scheduled off-peak, or whether your application code is leaking threads or burning CPU.
Keep these commands handy in your terminal history for the next time your alert system starts ringing!