EBS gp3 vs Local NVMe: When Network Storage Wins

A practical guide to evaluating EBS gp3 against local NVMe SSDs. Covers I/O profiling, fio benchmarking with realistic block sizes, latency analysis, right-sizing provisioned IOPS and throughput, and the cost model that makes over-provisioning expensive.

The common assumption is that local NVMe SSDs are always faster than EBS. That assumption does not always hold: the outcome depends on your I/O profile and the EC2 instance generation you are running.

This post walks through a real evaluation of EBS gp3 versus local NVMe for a write-heavy database workload. The results were surprising: on a newer instance type, EBS gp3 outperformed local NVMe on every throughput and IOPS metric, with comparable latency. More importantly, the post explains why block size matters for benchmarking, how to determine your own I/O profile, and how to right-size gp3 provisioning without wasting money.

This evaluation applies to general-purpose, memory-optimized, and compute-optimized instance families (m, r, and c series). Storage-optimized instances (i3, i4i, is1, im4gn, i8, etc.) are purpose-built around local NVMe, and EBS is not a substitute for those.

Update March 2026: AWS has significantly increased gp3 volume limits — IOPS from 16,000 to 80,000, throughput from 1,000 MiB/s to 2,000 MiB/s, and volume size from 16 TiB to 64 TiB. These new limits do not invalidate the test results below but they significantly expand the range of workloads where EBS gp3 is viable — including read-heavy and low-residency workloads that were previously only feasible on either local NVMe or the more expensive io1 and io2 EBS volumes.

The new limits make it very easy to over-provision and incur significant costs. Maxing IOPS to 80,000 alone adds $385/month per volume — nearly 5x the storage cost of a 1 TB volume. Across a cluster, this can add thousands per month. Always right-size based on your I/O profile, not by defaulting to maximum values. See the cost breakdown below.


When does EBS gp3 make sense?

Not every workload benefits from EBS. Whether gp3 is viable depends on your application’s I/O profile:

| Workload characteristic | EBS gp3 viable? | Recommendation |
| --- | --- | --- |
| High data residency, write-dominant | Yes | EBS gp3 (this post's tested scenario) |
| High data residency, read-dominant | Yes | Reads served from RAM anyway |
| Low residency, write-dominant | Likely yes | Benchmark; writes are async and latency tolerant |
| Low residency, read-dominant | Benchmark first | Test at realistic block sizes, compare clat p99 vs local NVMe |
| Dataset far exceeds RAM, high read IOPS (>50K) | Caution | Local NVMe may still be necessary for latency and is most likely cheaper overall |

The fundamental weakness of EBS compared to local NVMe is small-block random read latency. EBS reads traverse the network to reach the storage backend — local NVMe does not. This difference is masked when data is served from RAM but becomes critical when it is not.

If your application keeps its working set in memory (high cache hit ratio / high data residency), disk reads are limited to background operations like replication, compaction, and index maintenance. These are not latency-sensitive. In this scenario, EBS is not just viable — it can be faster than local NVMe on newer instance types, as the benchmarks below demonstrate.

If your application regularly fetches data from disk on the client request path (low residency), every read becomes latency-sensitive and local NVMe has the advantage. With the new gp3 limits (80,000 IOPS), EBS can now match the IOPS demand of many read-heavy workloads, but latency and cost remain concerns. Benchmark with realistic block sizes and compare clat percentiles before assuming the new limits make EBS viable for your read-heavy profile.


Why block size matters for benchmarking

Applications don’t expose I/O block size configuration — the storage engine determines it internally. Running fio with default 4K blocks may produce misleading results because the actual application I/O pattern is almost always different.

The relationship between IOPS and throughput depends entirely on how large each I/O operation is. The same application can look IOPS-bound or throughput-bound depending on its block size:

Throughput = IOPS x Block Size

Example:
  10,000 IOPS x 4 KB  = ~40 MiB/s    (IOPS-bound: throughput stays low)
  10,000 IOPS x 64 KB = ~640 MiB/s   (throughput-bound: the same IOPS moves 16x the data)
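
The relationship above can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and it uses strict binary units (1 MiB = 1,024 KiB), so its results come out slightly below the rounded figures in the example.

```python
KIB_PER_MIB = 1024


def throughput_mib_s(iops: float, block_size_kib: float) -> float:
    """Throughput in MiB/s for a given IOPS rate and block size in KiB."""
    return iops * block_size_kib / KIB_PER_MIB


# The same 10,000 IOPS at different block sizes.
for block_kib in (4, 16, 64, 256):
    mib_s = throughput_mib_s(10_000, block_kib)
    print(f"10,000 IOPS x {block_kib:>3} KiB = {mib_s:,.1f} MiB/s")
```

At 4 KiB the volume is nowhere near a throughput limit, while at 256 KiB the same IOPS figure would exceed the 2,000 MiB/s gp3 maximum.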

Deriving your I/O profile from production metrics

Look at your monitoring dashboards (Grafana, CloudWatch, Datadog, etc.) for two disk metrics on the same server, over the same time period:

  • Disk Write IOPS (or Read IOPS)
  • Disk Write Bytes/sec (or Read Bytes/sec)

Then calculate:

Average Block Size = Bytes per second / IOPS

For example, if you see 8,000 write IOPS and 500 MiB/s write throughput:

500 MiB / 8,000 = ~64 KB per write operation

Do this for both reads and writes — they often differ. Reads may be small (4-16K) while writes may be large (64-256K), or vice versa, depending on the application.
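
As a rough sketch, the same calculation in Python looks like this. The function name is hypothetical, and the inputs are whatever your dashboard reports for the same host and time window.

```python
def average_block_size_kib(throughput_mib_s: float, iops: float) -> float:
    """Average I/O size in KiB, derived from throughput (MiB/s) and IOPS."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return throughput_mib_s * 1024 / iops


# 8,000 write IOPS at 500 MiB/s, as in the example above.
print(average_block_size_kib(500, 8_000))    # 64.0
# 10,000 write IOPS at the same throughput gives a smaller average block.
print(average_block_size_kib(500, 10_000))   # 51.2
```

Run it separately for the read and write metrics, since the two averages often differ.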

The workload tested in this post (an in-memory NoSQL server) showed the following profile over a production period of more than 90 days:

  • Writes: ~10,000 IOPS at ~500 MiB/s = ~50-64 KB per write operation
  • Reads: Low baseline IOPS with large byte spikes — large sequential reads (replication, compaction), not small random reads
  • Read source: Nearly all application reads served from RAM (~100% resident), disk reads were background operations

Test methodology

Based on the production I/O profile, fio was configured with mixed block sizes — small random reads (4-16K) and fixed 64K writes:

fio --directory=<mount> --name fio_rnd_rwmix --randrepeat=1 \
    --ioengine=libaio --direct=1 --bsrange=4k-16k,64k-64k \
    --iodepth=64 --size=2G --readwrite=randrw --rwmixread=33 \
    --numjobs=16 --time_based --runtime=900 --group_reporting

| Parameter | Value | Reason |
| --- | --- | --- |
| ioengine=libaio | Linux native async I/O | Realistic for database workloads |
| direct=1 | Bypass OS page cache | Most databases manage their own caching |
| bsrange=4k-16k,64k-64k | Mixed read/write sizes | Matches observed I/O pattern |
| iodepth=64 | Deep queue | Realistic for busy database under peak load |
| numjobs=16 | 16 parallel workers | Simulates concurrent application threads |
| rwmixread=33 | 33% read, 67% write | Reflects write-heavy production workload |
| runtime=900 | 15 minutes | Sustained test to eliminate burst/cache effects |
| group_reporting | Aggregate all jobs | Single summary output |

All EBS volumes provisioned at maximum gp3 settings at the time of testing: 16,000 IOPS, 1,000 MiB/s throughput.

All tests ran on identically configured instances, repeated multiple times for consistency and reproducibility.
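
When a test is repeated several times, it helps to collect the results programmatically. The sketch below assumes fio was additionally run with --output-format=json (not part of the command above) and that the report follows the JSON layout of recent fio 3.x releases, with bw_bytes, iops, and clat_ns.percentile under each direction. The sample report is hand-written with made-up numbers purely to show the shape; it is not a measurement from this post.

```python
import json

# Hypothetical, hand-written sample in the assumed fio 3.x JSON layout.
SAMPLE_JSON = """
{"jobs": [{
  "jobname": "fio_rnd_rwmix",
  "read":  {"iops": 5000.0,  "bw_bytes": 314572800,
            "clat_ns": {"percentile": {"50.000000": 60000000, "99.000000": 130000000}}},
  "write": {"iops": 10000.0, "bw_bytes": 629145600,
            "clat_ns": {"percentile": {"50.000000": 62000000, "99.000000": 140000000}}}
}]}
"""


def summarize_fio(report: dict) -> dict:
    """Extract IOPS, bandwidth (MiB/s), and clat p50/p99 (ms) per direction."""
    job = report["jobs"][0]  # with --group_reporting there is one aggregated entry
    summary = {}
    for direction in ("read", "write"):
        stats = job[direction]
        percentiles = stats["clat_ns"]["percentile"]
        summary[direction] = {
            "iops": round(stats["iops"]),
            "bw_mib_s": round(stats["bw_bytes"] / 2**20, 1),
            "clat_p50_ms": percentiles["50.000000"] / 1e6,
            "clat_p99_ms": percentiles["99.000000"] / 1e6,
        }
    return summary


summary = summarize_fio(json.loads(SAMPLE_JSON))
for direction, values in summary.items():
    print(direction, values)
```

Check the key names against the JSON your fio version actually emits before relying on this.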


Results

Impact of block size on benchmarks

This is why using the correct block size matters. Both tests ran on an r5d.4xlarge (EBS bandwidth: 4,750 Mbps):

4K block size (fio default):

| Metric | Local NVMe | EBS gp3 |
| --- | --- | --- |
| Read BW | 221 MiB/s | 20.7 MiB/s |
| Write BW | 449 MiB/s | 42.1 MiB/s |
| Read IOPS | 56,000 | 5,302 |
| Write IOPS | 114,000 | 10,700 |

With 4K blocks, NVMe dominates because EBS is capped at 16K IOPS and each operation is tiny.

Production-like block sizes (mixed: 4k-16k read, 64k write):

| Metric | Local NVMe | EBS gp3 |
| --- | --- | --- |
| Read BW | 249 MiB/s | 188 MiB/s |
| Write BW | 505 MiB/s | 381 MiB/s |
| Read IOPS | 4,056 | 3,043 |
| Write IOPS | 8,044 | 6,018 |

With realistic block sizes, the gap narrows significantly. EBS is still slower here, but the bottleneck is the instance-level EBS bandwidth limit of 4,750 Mbps, not EBS itself.
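
A quick way to confirm that the instance limit, rather than the volume, is the ceiling is to convert the instance's EBS bandwidth to MiB/s and compare it with the combined read and write bandwidth measured on EBS. The helper name below is made up for illustration; the measured figures are the EBS gp3 values from the table above.

```python
def mbps_to_mib_s(mbps: float) -> float:
    """Convert megabits per second (decimal) to MiB per second (binary)."""
    return mbps * 1_000_000 / 8 / 2**20


instance_limit = mbps_to_mib_s(4_750)  # r5d.4xlarge EBS bandwidth
measured = 188 + 381                   # EBS read BW + write BW, in MiB/s

print(f"instance limit : {instance_limit:.0f} MiB/s")
print(f"measured total : {measured} MiB/s")
print(f"utilisation    : {measured / instance_limit:.0%}")
```

The measured total sits right at the instance limit and well below the 1,000 MiB/s provisioned on the volume, which is consistent with the instance being the bottleneck.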

Newer instance generation removes the bottleneck

Testing on r6id.4xlarge (EBS bandwidth: up to 10 Gbps) with production-like block sizes:

| Metric | Local NVMe | EBS gp3 | Delta |
| --- | --- | --- | --- |
| Read BW | 286 MiB/s | 333 MiB/s | +16% EBS |
| Write BW | 581 MiB/s | 675 MiB/s | +16% EBS |
| Read IOPS | 4,601 | 5,477 | +19% EBS |
| Write IOPS | 9,233 | 10,800 | +17% EBS |

On the newer instance, EBS gp3 outperforms local NVMe on all four throughput and IOPS metrics.

Latency comparison (15-minute sustained test)

Latency measured on r6id.4xlarge under full saturation with iodepth=64 and 16 concurrent jobs:

Read completion latency (clat):

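| Percentile | Local NVMe | EBS gp3 | Delta |
| --- | --- | --- | --- |
| p50 | 61 ms | 57 ms | -4 ms |
| p90 | 92 ms | 97 ms | +5 ms |
| p95 | 104 ms | 110 ms | +6 ms |
| p99 | 120 ms | 133 ms | +13 ms |
| p99.5 | 125 ms | 142 ms | +17 ms |
| p99.9 | 134 ms | 159 ms | +25 ms |
| p99.99 | 180 ms | 180 ms | 0 ms |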

Write completion latency (clat):

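| Percentile | Local NVMe | EBS gp3 | Delta |
| --- | --- | --- | --- |
| p50 | 74 ms | 60 ms | -14 ms |
| p90 | 106 ms | 101 ms | -5 ms |
| p95 | 118 ms | 113 ms | -5 ms |
| p99 | 133 ms | 136 ms | +3 ms |
| p99.5 | 138 ms | 146 ms | +8 ms |
| p99.9 | 148 ms | 163 ms | +15 ms |
| p99.99 | 197 ms | 186 ms | -11 ms |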

Key observations:

  • Median (p50) latency is actually better on EBS for both reads and writes
  • Tail latency (p99.9) is ~15-25 ms higher on EBS — negligible for applications with async disk persistence
  • p99.99 is comparable or better on EBS
  • For workloads where data is served from RAM, application read latency is unaffected — disk latency only affects background operations

Key takeaways

  1. Block size is critical for benchmarking: testing with default 4K blocks makes EBS look 10x worse than NVMe, while testing with realistic block sizes shows EBS is actually faster on the newer instance generation
  2. Instance-level EBS bandwidth is often the real bottleneck: older instance generations (r5, m5, c5) cap at 4,750 Mbps (~593 MB/s) at the 4xlarge size tested here, so upgrading the instance generation can be more impactful than changing storage type
  3. EBS gp3 can outperform local NVMe on throughput on r6i/r7i class instances with 10+ Gbps EBS bandwidth
  4. Latency is comparable under saturation: tail latency differences of ~15-25 ms at p99.9 are irrelevant for async persistence with high-residency workloads
  5. Operational benefits matter: EBS volumes survive instance stop/start, whereas local NVMe SSDs require full data recovery after instance maintenance events

Guide to right-sizing EBS gp3 provisioning

This section is a general guide for any workload. It describes how to determine the right gp3 IOPS and throughput settings based on your application’s I/O profile.

While the examples use AWS technology, the techniques described here are generic and should work on any cloud vendor with network-attached block storage.

The problem with over-provisioning

gp3 charges separately for IOPS and throughput above the baseline (3,000 IOPS / 125 MiB/s). Setting every volume to maximum “just in case” wastes money. Most applications don’t need anywhere near the limits. The key is figuring out what your application actually requires.

Step 1: Determine your application’s I/O block size

The relationship between IOPS and throughput depends entirely on how large each I/O operation is:

Throughput = IOPS x Block Size

Use your monitoring dashboards to calculate the average block size as described in Why block size matters above.

Step 2: Identify peak usage, not average

Look at your metrics during peak hours, not averages. Provision for the peak with some headroom (~20-30%) so the disk is never the bottleneck. An undersized disk during peak causes latency spikes and queue buildup.

From your dashboards, note:

  • Peak write IOPS and peak write throughput
  • Peak read IOPS and peak read throughput
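
One way to turn a peak reading into a provisioning value is sketched below. The function, the 25% default headroom, and the rounding steps are illustrative choices rather than AWS rules; the baselines are the gp3 included amounts mentioned earlier, and the peak figures are placeholders for your own measurements.

```python
import math

GP3_BASELINE_IOPS = 3_000
GP3_BASELINE_MIB_S = 125


def provision_with_headroom(peak: float, baseline: int,
                            headroom: float = 0.25, step: int = 1) -> int:
    """Peak plus headroom, rounded up to a multiple of `step`, never below baseline."""
    target = math.ceil(peak * (1 + headroom) / step) * step
    return max(baseline, target)


# Placeholder peaks: 10,000 IOPS and 500 MiB/s.
print(provision_with_headroom(10_000, GP3_BASELINE_IOPS, step=500))  # 12500
print(provision_with_headroom(500, GP3_BASELINE_MIB_S, step=25))     # 625
# A quiet volume stays at the free baseline.
print(provision_with_headroom(1_800, GP3_BASELINE_IOPS, step=500))   # 3000
```

The point of the `max` is that anything at or below the baseline costs nothing extra, so there is no reason to provision under it.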

Step 3: Determine which dimension is your bottleneck

gp3 has two independent limits — IOPS and throughput. Depending on your block size, one will be the constraint:

| Block size | Bottleneck | Example |
| --- | --- | --- |
| Small (4-16 KB) | IOPS | Databases with small random lookups |
| Medium (32-64 KB) | Either | Depends on the workload |
| Large (128-256 KB) | Throughput | Streaming, logging, sequential writes |

Small block sizes (4-16 KB): You’ll hit the IOPS limit before the throughput limit. Provision IOPS to match your peak. Throughput can stay at baseline.

Example: 12,000 IOPS x 8 KB = 96 MiB/s
  Provision: 12,000 IOPS / 125 MiB/s (baseline throughput is enough)

Large block sizes (128-256 KB): You’ll hit the throughput limit before IOPS. Provision throughput to match your peak. IOPS can stay lower.

Example: 5,000 IOPS x 128 KB = 640 MiB/s
  Provision: 5,000 IOPS / 640 MiB/s (IOPS baseline of 3,000 is almost enough)

Step 4: Account for instance-level EBS bandwidth

Even if you provision gp3 at high values, the EC2 instance type has its own EBS bandwidth cap. Check the AWS documentation for your instance type’s EBS bandwidth.

| Instance generation | Typical EBS bandwidth (4xlarge size) |
| --- | --- |
| r5 / m5 / c5 | 4,750 Mbps (~593 MB/s) |
| r6i / m6i / c6i | 10,000 Mbps (~1,250 MB/s) |
| r7i / m7i / c7i | 10,000+ Mbps |

There is no point provisioning beyond what the instance can deliver. If your instance caps at about 593 MB/s (roughly 566 MiB/s), provisioning 1,000 MiB/s on gp3 wastes money.
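
The cap can be made explicit with a small calculation. The function below is a hypothetical helper that treats the instance bandwidth as a hard ceiling and ignores any other overheads.

```python
def usable_throughput_mib_s(provisioned_mib_s: float,
                            instance_ebs_mbps: float) -> float:
    """Throughput the instance can actually reach from a provisioned gp3 volume."""
    instance_cap_mib_s = instance_ebs_mbps * 1_000_000 / 8 / 2**20
    return min(provisioned_mib_s, instance_cap_mib_s)


for mbps in (4_750, 10_000):
    usable = usable_throughput_mib_s(1_000, mbps)
    print(f"{mbps:>6} Mbps instance: {usable:.0f} of 1,000 MiB/s usable, "
          f"{1_000 - usable:.0f} MiB/s paid for but unreachable")
```

On the 4,750 Mbps instance, more than 40% of a 1,000 MiB/s provision can never be used.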

Step 5: Validate with fio

Before deploying, validate your provisioning with fio using block sizes that match your application (see Test methodology above). Key points:

  • Never benchmark with default 4K blocks unless your application actually does 4K I/O
  • Use --direct=1 to bypass OS cache
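  • Run tests long enough to outlast any burst allowance (at least 20-30 minutes); gp3 itself has no burst credits, but some instance sizes can burst their EBS bandwidth for a limited period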
  • Use --group_reporting to get aggregated results (especially for clat stats)
  • Compare IOPS and throughput against your provisioned limits — you should see the volume hitting the limits you set, not the instance bandwidth limit

Provision only what you need based on your peak I/O profile. The baseline is generous enough for many workloads — only increase when your metrics show you are hitting the limit.

You can adjust EBS IOPS and throughput values even on mounted, in-use volumes. The change takes some time to apply, and the duration varies (this may be AWS-specific).

Interpreting latency (clat) results

When running fio with --group_reporting, you get completion latency (clat) percentile tables. These numbers depend heavily on test parameters — particularly iodepth and numjobs — so absolute values are only meaningful in context.

What clat measures: The time from when an I/O request is submitted to the device until it completes. Higher queue depths mean more requests waiting, so latency increases. A p99 of 100ms under iodepth=64 with 16 jobs is not the same as p99 of 100ms under iodepth=1.
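
The dependence on queue depth follows from Little's Law: with the queues kept full, mean latency is roughly the number of outstanding requests divided by the completion rate. The sketch below applies that to the r6id.4xlarge run, assuming every job keeps its queue at the full iodepth for the whole test; the function name is made up for illustration.

```python
def expected_mean_latency_ms(iodepth: int, numjobs: int, total_iops: float) -> float:
    """Little's Law estimate: outstanding I/Os / completions per second, in ms."""
    outstanding = iodepth * numjobs
    return outstanding / total_iops * 1000


ebs_iops = 5_477 + 10_800    # read + write IOPS on EBS gp3
nvme_iops = 4_601 + 9_233    # read + write IOPS on local NVMe

print(f"EBS  at 64x16: {expected_mean_latency_ms(64, 16, ebs_iops):.1f} ms")
print(f"NVMe at 64x16: {expected_mean_latency_ms(64, 16, nvme_iops):.1f} ms")
print(f"EBS  at  1x1 : {expected_mean_latency_ms(1, 1, ebs_iops):.3f} ms")
```

The estimates of roughly 63 ms and 74 ms are in line with the measured medians, which suggests most of that latency is time spent queued behind 1,023 other requests rather than time on the device. The last line only shows how the same formula scales; a single outstanding request would not sustain that IOPS rate in practice.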

General guidance for database workloads under saturated conditions (high iodepth, many concurrent jobs — worst case):

| Percentile | What it tells you | Acceptable range | Concern threshold |
| --- | --- | --- | --- |
| p50 (median) | Typical request latency | < 100 ms | > 150 ms |
| p99 | Latency for 1-in-100 requests | < 200 ms | > 300 ms |
| p99.9 | Worst-case tail latency | < 250 ms | > 500 ms |
| p99.99 | Extreme outliers | < 500 ms | > 1,000 ms |

Important caveats:

  • These thresholds assume saturated conditions (high iodepth, many jobs). Under light load, latencies should be significantly lower — if you see p50 > 10ms under light load, something is wrong.
  • Async vs sync persistence matters. Applications with async disk writes tolerate higher disk latency because clients don’t wait for disk. Applications where clients block on disk I/O (e.g., synchronous database commits) need much tighter latency — halve the thresholds above.
  • Compare, don’t evaluate in isolation. The primary value of clat percentiles is comparing two storage options under identical test conditions. A 15ms difference at p99.9 between two options is negligible. A 10x difference indicates a real problem.
  • Watch for cliffs, not gradual increases. A smooth curve from p50 to p99.99 (e.g., 60ms → 80ms → 130ms → 160ms → 180ms) indicates predictable behaviour. A sudden jump (e.g., 60ms → 80ms → 130ms → 160ms → 900ms) at p99.99 indicates occasional stalls. Investigate whether the cause is device-level queuing, instance throttling, or exhaustion of a burst allowance.
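
The cliff check in the last caveat can be automated. In this sketch, a cliff is any step between adjacent percentiles where latency more than doubles; that 2x ratio and the percentile labels attached to the two example curves are assumptions made for illustration.

```python
def find_latency_cliffs(percentiles: dict[str, float],
                        max_ratio: float = 2.0) -> list[str]:
    """Return the percentile steps where latency grows by more than max_ratio."""
    labels = list(percentiles)  # relies on insertion order, lowest percentile first
    cliffs = []
    for prev, cur in zip(labels, labels[1:]):
        if percentiles[cur] > max_ratio * percentiles[prev]:
            cliffs.append(f"{prev} -> {cur}")
    return cliffs


smooth = {"p50": 60, "p90": 80, "p99": 130, "p99.9": 160, "p99.99": 180}
stalled = {**smooth, "p99.99": 900}

print(find_latency_cliffs(smooth))   # []
print(find_latency_cliffs(stalled))  # ['p99.9 -> p99.99']
```

Tune the ratio to your own tolerance; a stricter value such as 1.5 would flag the 80 ms to 130 ms step in the smooth curve as well.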

How to read clat in the context of EBS vs NVMe:

| Scenario | What it means |
| --- | --- |
| EBS p50 ≤ NVMe p50 | EBS handles typical load as well or better |
| EBS p99 within 20% of NVMe p99 | Tail latency difference is acceptable for most workloads |
| EBS p99.9 more than 2x NVMe p99.9 | Investigate; may indicate EBS network jitter under load |
| EBS p99.99 significantly worse | Likely sporadic, acceptable unless your application is latency-critical at extreme percentiles |
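
The first three rules can be applied mechanically to a pair of percentile sets. The sketch below encodes them as written and runs them against the read latency figures measured on r6id.4xlarge; the function name and message wording are illustrative.

```python
def compare_clat(nvme: dict[str, float], ebs: dict[str, float]) -> list[str]:
    """Apply the p50, p99, and p99.9 comparison rules to two clat percentile sets (ms)."""
    notes = []
    if ebs["p50"] <= nvme["p50"]:
        notes.append("p50: EBS handles typical load as well or better")
    if ebs["p99"] <= 1.2 * nvme["p99"]:
        notes.append("p99: within 20% of NVMe, acceptable for most workloads")
    if ebs["p99.9"] > 2 * nvme["p99.9"]:
        notes.append("p99.9: more than 2x NVMe, investigate jitter under load")
    return notes


nvme_read = {"p50": 61, "p99": 120, "p99.9": 134}
ebs_read = {"p50": 57, "p99": 133, "p99.9": 159}

for note in compare_clat(nvme_read, ebs_read):
    print(note)
```

For these figures only the first two notes are printed: the median is better on EBS, p99 is about 11% higher, and p99.9 is far from the 2x threshold.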

EBS cost model

gp3 pricing has three independent components (rates as of March 2026, eu-central-1 — verify current pricing):

| Component | Rate (per month) | Baseline (included free) |
| --- | --- | --- |
| Storage | $0.08 / GB | None |
| IOPS | $0.005 / IOPS | First 3,000 IOPS |
| Throughput | $0.06 / MB/s | First 125 MB/s |
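
The three components combine as shown in the sketch below. The rates and free baselines are copied from the table above and should be replaced with current pricing for your region; the function name is made up, and 1 TB is treated as 1,000 GB.

```python
STORAGE_PER_GB = 0.08      # $/GB-month, from the table above
IOPS_RATE = 0.005          # $/IOPS-month above the free baseline
THROUGHPUT_RATE = 0.06     # $/(MB/s)-month above the free baseline
FREE_IOPS = 3_000
FREE_THROUGHPUT = 125


def gp3_monthly_cost(size_gb: int, iops: int, throughput_mb_s: int) -> float:
    """Monthly cost in dollars of one gp3 volume at the rates above."""
    storage = size_gb * STORAGE_PER_GB
    extra_iops = max(0, iops - FREE_IOPS) * IOPS_RATE
    extra_throughput = max(0, throughput_mb_s - FREE_THROUGHPUT) * THROUGHPUT_RATE
    return round(storage + extra_iops + extra_throughput, 2)


configs = {
    "Baseline": (3_000, 125),
    "Right-sized example": (10_000, 500),
    "Old max": (16_000, 1_000),
    "New max": (80_000, 2_000),
}
for name, (iops, throughput) in configs.items():
    print(f"{name:<20} ${gp3_monthly_cost(1_000, iops, throughput):,.2f}/month")
```

Swapping in your own peak-derived IOPS and throughput values gives a quick estimate before changing any volume settings.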

1 TB volume — cost by provisioning level:

| Configuration | Storage | IOPS | Throughput | Total/month |
| --- | --- | --- | --- | --- |
| Baseline (3K IOPS / 125 MB/s) | $80 | $0 | $0 | $80 |
| Right-sized example (10K IOPS / 500 MB/s) | $80 | $35 | $22.50 | $137.50 |
| Old max (16K IOPS / 1,000 MB/s) | $80 | $65 | $52.50 | $197.50 |
| New max (80K IOPS / 2,000 MB/s) | $80 | $385 | $112.50 | $577.50 |

4 TB volume — cost by provisioning level:

| Configuration | Storage | IOPS | Throughput | Total/month |
| --- | --- | --- | --- | --- |
| Baseline (3K IOPS / 125 MB/s) | $320 | $0 | $0 | $320 |
| Right-sized example (10K IOPS / 500 MB/s) | $320 | $35 | $22.50 | $377.50 |
| Old max (16K IOPS / 1,000 MB/s) | $320 | $65 | $52.50 | $437.50 |
| New max (80K IOPS / 2,000 MB/s) | $320 | $385 | $112.50 | $817.50 |

Multi-node impact — multiply per-volume cost by node count:

| Cluster size | Baseline (1 TB) | New max (1 TB) | Wasted/month |
| --- | --- | --- | --- |
| 4 nodes | $320 | $2,310 | $1,990 |
| 8 nodes | $640 | $4,620 | $3,980 |

The IOPS component dominates the cost at high provisioning levels. Always determine your actual I/O profile before provisioning — most workloads need a fraction of the maximum.


This post is licensed under CC BY 4.0 by the author.