Monitoring Disk I/O Metrics with Prometheus and Grafana

Disk IOPS (input/output operations per second) statistics are metrics that give you insight into how your system interacts with its storage devices. These metrics measure read and write operations.

Fundamental Disk IOPS metrics

IOPS (Input/Output Operations Per Second)

Grafana disk I/O visualisation

IOPS measures how many read and write operations your storage devices complete per second.

The underlying counters are exported by Node Exporter.

irate(node_disk_reads_completed_total[$__rate_interval])
irate(node_disk_writes_completed_total[$__rate_interval])
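
To make the irate() queries above more concrete, here is a minimal Python sketch of the arithmetic involved: the per-second rate derived from the two most recent samples of a cumulative counter. The Sample type, the per_second_rate function, and the sample values are hypothetical and only illustrate the idea. Prometheus itself handles more cases (for example missing scrapes).

```python
from typing import NamedTuple


class Sample(NamedTuple):
    """One scrape of a cumulative counter (hypothetical helper type)."""
    timestamp: float  # seconds since epoch
    value: float      # total operations completed so far


def per_second_rate(previous: Sample, current: Sample) -> float:
    """Per-second increase between two consecutive counter samples."""
    elapsed = current.timestamp - previous.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = current.value - previous.value
    if delta < 0:
        # The counter went backwards, e.g. after a reboot.
        # Simplified reset handling: count from zero again.
        delta = current.value
    return delta / elapsed


# Illustrative values: 3,000 reads completed between two scrapes 15 s apart.
reads_before = Sample(timestamp=1000.0, value=50_000)
reads_after = Sample(timestamp=1015.0, value=53_000)
read_iops = per_second_rate(reads_before, reads_after)
print(read_iops)  # 200.0
```

Because only the last two samples are used, the result reacts quickly to bursts, which is why irate() suits fast-moving counters such as disk operations.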

The IOPS you can reach depends on your disk, and a higher number generally means better performance. Context matters, though: an IOPS figure is meaningless without a latency figure:

  • High IOPS + low latency = healthy
  • Low IOPS + high latency = disk contention

Latency (Average time per operation)

Grafana disk latency visualisation

Latency represents the time it takes for the I/O request to be completed.

Latency = queue time + processing time of the read or write operation.

Latency is the most important metric to consider for storage performance. Lower latency means better performance.

irate(node_disk_read_time_seconds_total[$__rate_interval]) / irate(node_disk_reads_completed_total[$__rate_interval]) # Read Latency
irate(node_disk_write_time_seconds_total[$__rate_interval]) / irate(node_disk_writes_completed_total[$__rate_interval]) # Write Latency
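
The division in the latency queries can be sketched in a few lines of Python. Both rates cover the same interval, so the elapsed time cancels out and what remains is "seconds spent on I/O" divided by "operations completed". The function name and the input values below are hypothetical and serve only as an illustration.

```python
import math


def average_latency_seconds(io_time_delta: float, ops_delta: float) -> float:
    """Average seconds per operation over one interval.

    io_time_delta: increase of node_disk_read_time_seconds_total (or the
                   write equivalent) over the interval, in seconds.
    ops_delta:     increase of node_disk_reads_completed_total (or the
                   write equivalent) over the same interval.
    """
    if ops_delta == 0:
        # No operation completed, so there is no latency to report.
        # In PromQL, 0 / 0 likewise yields NaN.
        return math.nan
    return io_time_delta / ops_delta


# Illustrative values: 1.5 s of cumulative read time spread over 300 reads.
latency = average_latency_seconds(1.5, 300)
print(f"{latency * 1000:.1f} ms per read")  # 5.0 ms per read
```

An idle disk therefore shows gaps (NaN) in the latency panel rather than zero, which is expected.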

Utilisation

Grafana disk Utilisation visualisation

Disk utilisation is the percentage of time the disk is busy processing I/O operations.

High disk I/O time utilisation (e.g. > 80%) means the disk is busy handling I/O requests almost all the time. Common causes include:

  • Slow or failing hardware
  • Concurrent access bottlenecks
  • High I/O workload
  • Inefficient access patterns

If latency is also high, the disk can’t keep up with the requests it receives, which leads to performance degradation.

rate(node_disk_io_time_seconds_total[1m]) * 100

  • High utilisation + high latency = saturated disk
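
As a rough illustration of what the utilisation query computes, the Python sketch below divides the seconds the disk spent busy by the seconds that elapsed and scales the result to a percentage. The function name and the numbers are hypothetical; the 80% threshold is the example value used above.

```python
def utilisation_percent(io_time_delta: float, elapsed_seconds: float) -> float:
    """Share of the interval the disk spent doing I/O, as a percentage.

    io_time_delta:   increase of node_disk_io_time_seconds_total over the
                     interval, in seconds.
    elapsed_seconds: length of the interval, in seconds.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return io_time_delta / elapsed_seconds * 100


# Illustrative values: the disk was busy for 48 s of a 60 s window.
busy = utilisation_percent(io_time_delta=48.0, elapsed_seconds=60.0)
print(f"{busy:.0f}%")  # 80%
```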

Throughput or Bandwidth

Grafana disk bandwidth visualisation

Throughput, or bandwidth, is the amount of data transferred to or from the storage device per second. It is usually displayed in megabytes per second (MB/s); the queries below return bytes per second.

irate(node_disk_written_bytes_total[$__rate_interval])
irate(node_disk_read_bytes_total[$__rate_interval])

Here is how to interpret the bandwidth metrics:

  • If your read/write MB/s is close to your disk’s rated throughput, the disk is performing as specified.
  • If you see very low bandwidth but high I/O wait or high disk utilisation, the disk might be overloaded by small random reads (an IOPS problem, not a bandwidth problem).
  • If both bandwidth and IOPS are high, the disk is at full load.
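
The rules of thumb from the four sections can be combined into one small decision function. The Python sketch below is only one possible reading of them: the DiskSnapshot type, the diagnose function, the latency threshold, and the "near" and "low" fractions are all assumptions chosen for illustration, and the rated figures must come from your own disk's specifications.

```python
from dataclasses import dataclass


@dataclass
class DiskSnapshot:
    """Point-in-time values, as read off the four panels (hypothetical type)."""
    iops: float             # reads + writes per second
    latency_ms: float       # average time per operation
    utilisation_pct: float  # share of time the disk was busy
    bandwidth_mb_s: float   # read + write throughput


def diagnose(
    snap: DiskSnapshot,
    rated_iops: float,
    rated_bandwidth_mb_s: float,
    high_latency_ms: float = 20.0,       # assumption: tune for your hardware
    high_utilisation_pct: float = 80.0,  # example threshold from the text
    near_fraction: float = 0.8,          # assumption: "close to spec"
    low_fraction: float = 0.2,           # assumption: "very low"
) -> str:
    """Map one snapshot to the rule of thumb that matches it first."""
    iops_high = snap.iops >= near_fraction * rated_iops
    iops_low = snap.iops < low_fraction * rated_iops
    bandwidth_high = snap.bandwidth_mb_s >= near_fraction * rated_bandwidth_mb_s
    bandwidth_low = snap.bandwidth_mb_s < low_fraction * rated_bandwidth_mb_s
    latency_high = snap.latency_ms >= high_latency_ms
    busy = snap.utilisation_pct >= high_utilisation_pct

    if busy and latency_high:
        return "saturated disk"
    if iops_high and bandwidth_high:
        return "disk at full load"
    if busy and bandwidth_low:
        return "likely IOPS-bound (small random I/O), not bandwidth-bound"
    if latency_high and iops_low:
        return "possible disk contention"
    return "healthy"


# Illustrative snapshot for a disk rated at 5,000 IOPS and 500 MB/s.
snapshot = DiskSnapshot(iops=150, latency_ms=45, utilisation_pct=95, bandwidth_mb_s=12)
print(diagnose(snapshot, rated_iops=5000, rated_bandwidth_mb_s=500))  # saturated disk
```

The order of the checks is a design choice: saturation is tested first because it is the condition that most directly hurts applications.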