Finding the bottlenecks in your application can be tricky. Here's a story of how I used iotop and iostat to build the evidence for choosing an EBS-optimized disk to solve a saturated-disk problem.

problem description

A redis box was acting up. Here's what I'd experience:

  • slow login
  • failing redis backups (ERR Background save already in progress)
  • general sluggishness
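Before blaming the disk, you can ask redis itself about the stuck backups. A quick check, as a sketch (it assumes redis-server is reachable on the default port):

```shell
# Ask redis about its background-save state. rdb_bgsave_in_progress stuck
# at 1 matches the "Background save already in progress" error above.
redis-cli INFO persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_status'
```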

I suspected the disk, and iotop confirmed it. But I needed more evidence, so I recorded the numbers over time using iostat, which reports similar information in a tabular format. To make the output more readable, grep for just the lines containing iowait, plus the line after each match:

  iostat 1 | grep iowait -A 1

The iowait spikes came in waves:

  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             2.27    0.00    0.73   17.32    0.01   79.66
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    0.50   49.25    0.00   49.75
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.00    0.00    0.50   49.25    0.00   50.25
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    1.00   48.76    0.00   49.75
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.00    0.00    0.50   49.25    0.00   50.25
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             1.00    0.00    1.00   49.00    0.00   49.00
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.00    0.00    0.50   49.25    0.00   50.25
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    1.00   48.76    0.00   49.75
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.00    0.00    0.50   49.50    0.00   50.00
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    0.50   68.50    0.00   30.50
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    0.50   98.50    0.50    0.00
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.00    0.00    1.00   99.00    0.00    0.00

Notice the iowait column would spike to 99% and stay there for a while. With another terminal open, I tried typing during those stretches and got no response. So I started another EC2 instance, this time EBS-optimized with a 500-IOPS provisioned disk. This solved the problem. Notice the iowait column below stays at zero while idle stays high.

  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    1.50    0.00    0.00   98.00
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.00    0.00    0.51    0.00    0.00   99.49
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             1.00    0.00    1.00    0.00    0.00   98.01
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    1.51    0.00    0.00   97.99
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    1.00    0.00    0.00   98.50
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    0.50    0.00    0.50   98.50
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    1.49    0.00    0.00   98.01
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    1.00    0.00    0.00   98.50
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.00    0.00    0.51    0.00    0.00   99.49
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    1.00    0.00    0.00   98.50
  --
  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
             0.50    0.00    0.50    0.00    0.00   99.00
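Average iowait says the CPU is waiting, but not on which device. For per-device evidence, iostat's extended mode (-x) reports %util and await for each disk. A sketch along these lines (the device name xvda is an assumption; substitute your EBS volume):

```shell
# Extended per-device stats every second; keep the header row plus the
# suspect device. %util pinned near 100 means the device is saturated.
iostat -x 1 | grep -E 'Device|xvda'
```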

New Relic told us the same story.

Before EBS-optimized disks:

After EBS-optimized disks:

moral of the story

Often you'll have a hunch that something's going wrong on one of your servers, but without the appropriate tools, you're just making guesses. Using iotop and iostat builds the evidence you need when it comes to disk usage.
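If you'd rather not eyeball the waves, a small awk filter over the same iostat stream can print only the bad samples. A sketch, assuming the avg-cpu column layout shown above (%iowait is the fourth field of the line that follows each avg-cpu header) and a 40% threshold picked arbitrarily:

```shell
# Flag any sample where %iowait exceeds 40. The line after each
# "avg-cpu:" header holds the numbers; $4 is %iowait.
iostat 1 | awk '/avg-cpu/ { hdr = 1; next }
                hdr      { hdr = 0; if ($4 + 0 > 40) print "high iowait:", $4 }'
```

Redirect that to a file and you have a running record of the spikes to attach to your evidence.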