Franck Pachot

Posted on Jun 17

HorizonDB cache hierarchy: RAM, NVMe SSD, and multi-AZ storage behind PostgreSQL

#postgres #azure #cloud #database

HorizonDB extends PostgreSQL with disaggregated storage, making it appear to developers like PostgreSQL while offering cloud-native high availability, elasticity, and performance. Performance metrics, especially I/O, can reveal differences due to additional components such as local caches and remote storage along the read/write paths.

Three aspects can expose these internal workings at the PostgreSQL layer:

Cache utilization, observable via EXPLAIN BUFFERS, comparing shared buffer hits with read calls to retrieve pages.
Average read times from EXPLAIN or metrics help infer the storage layer: less than a microsecond for memory, tens of microseconds for local NVMe SSD cache, and about a millisecond for remote persistent storage.
Wait events, as I/O operations handled outside PostgreSQL's main code path, in extensions, may show different patterns in pg_stat_activity compared to traditional PostgreSQL setups.

This approach is similar to tests I previously performed on OCI and AWS. I insert 100 MB of data (~12,800 rows) into a PostgreSQL table every two minutes and run full table scans with EXPLAIN ANALYZE:


\timing on
-- create the table
drop table if exists large;
create table large ( filler text ) with (fillfactor=10);
-- insert 100 MB
insert into large
 select string_agg( chr((32+random()*94)::int) ,'' )
 from generate_series(1,1000) c, generate_series(1,12800) r
 group by r
\; 
-- check the size
select 'size', pg_table_size('large') , pg_size_pretty( pg_table_size('large') ) , now()
\; 
-- query with full table scan
explain (costs off, analyze, buffers) select * from large
-- repeat every two minutes
\watch i=120 c=1000

I log the execution plans, including the elapsed time for the sequential scan and the number of buffers accessed (shared buffer hits and reads). Some metrics are gathered in Azure, and I also run the PostgreSQL for VS Code extension dashboard, which shows other metrics exposed by PostgreSQL.

It starts with 100% shared buffer hits

The storage grows by 100 MB every two minutes as new inserts are added:

The inserted rows remain constant, but as the table size grows, the read workload increases because more rows are scanned during each execution:

Since this involves a full table scan and the table size grows, it reads an increasing number of blocks, all resulting in shared buffer hits without reading from files, as the inserted pages remain in shared buffers:

The sequential scan in PostgreSQL employs a small ring buffer, but this occurs only on cache misses. Since the pages are already present in the shared buffer from prior insertions, this does not apply here.

This continues, and the table size reaches the shared buffer size.

The inserts fill up the shared buffer (11 GiB)

I run on a HorizonDB instance with 16 GiB RAM, provisioned like in the previous post, and 11 GiB are allocated to the shared buffer:

postgres=> \dconfig shared*

                                        List of configuration parameters

            Parameter             |                                    Value
----------------------------------+------------------------------------------------------------------------------
 shared_buffers                   | 11241MB
 shared_memory_size               | 11776MB
 shared_memory_size_in_huge_pages | 5888
 shared_memory_type               | mmap

Until I added 11 GB, the sequential scan only involved shared hit buffers and did not perform any reads (1433600 x 8KB blocks equals 11,200 MB):


               Tue 16 Jun 2026 10:33:01 AM GMT (every 120s)

 ?column? | pg_table_size | pg_size_pretty |              now
----------+---------------+----------------+-------------------------------
 size     |   11747319808 | 11 GB          | 2026-06-16 10:33:01.555657+00
(1 row)

            Tue 16 Jun 2026 10:33:01 AM GMT (every 120s)

                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on large (actual time=0.005..507.198 rows=1433600 loops=1)
   Buffers: shared hit=1433600 dirtied=12800
 Planning Time: 0.013 ms
 Execution Time: 551.265 ms
(4 rows)

Time: 4724.370 ms (00:04.724)

The next run, with 100 MB more data, begins to show some buffers read (16071 x 8KB = 125MB):

               Tue 16 Jun 2026 10:35:01 AM GMT (every 120s)

 ?column? | pg_table_size | pg_size_pretty |              now
----------+---------------+----------------+-------------------------------
 size     |   11852201984 | 11 GB          | 2026-06-16 10:35:01.555654+00
(1 row)

This is consistent with the newly appended pages, 100 MB, displacing the same amount of older buffers once shared buffers were full.

            Tue 16 Jun 2026 10:35:01 AM GMT (every 120s)

                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on large (actual time=0.007..624.128 rows=1446400 loops=1)
   Buffers: shared hit=1430329 read=16071 dirtied=12800 written=753
 Planning Time: 0.022 ms
 Execution Time: 669.210 ms
(4 rows)

Time: 4667.335 ms (00:04.667)

In vanilla PostgreSQL, sequential scans use a bulk-read strategy with a small ring of shared buffers, typically 256KB, to avoid flooding the main shared-buffer cache. With buffered I/O, pages not found in shared buffers may still be served from the operating system page cache, which is the usual double-buffering behavior. In such an experiment, the read (cache misses) increase by 100 MB each time due to buffer evictions from newly inserted data. I observed that on other managed services, asymptotic throughput decayed while cache evictions increased:

In HorizonDB, where the caching layers are different, we do not observe the same gradual decay. After the table scan size exceeded the shared-buffer size, subsequent scans were mostly reported as reads by PostgreSQL, indicating that most older pages were no longer in shared buffers, while only the most recently inserted 100 MB remained as hits. The next run reads the full table via disk reads (1446401 x 8KB = 11300 MB), with only the last 100 MB served from the shared buffer (12799 x 8KB = 100 MB):

               Tue 16 Jun 2026 10:37:01 AM GMT (every 120s)

 ?column? | pg_table_size | pg_size_pretty |              now
----------+---------------+----------------+-------------------------------
 size     |   11957084160 | 11 GB          | 2026-06-16 10:37:01.556217+00
(1 row)

             Tue 16 Jun 2026 10:37:01 AM GMT (every 120s)

                              QUERY PLAN
-----------------------------------------------------------------------
 Seq Scan on large (actual time=2.825..82165.728 rows=1459200 loops=1)
   Buffers: shared hit=12799 read=1446401 dirtied=12949
 Planning Time: 0.019 ms
 Execution Time: 82219.965 ms
(4 rows)

Time: 86399.819 ms (01:26.400)

PostgreSQL BUFFERS only shows the shared-buffer view. A read is a miss from shared buffers, but in a disaggregated architecture, it may still be served by a local cache before reaching durable remote storage.

Let's do some maths to estimate the storage layer of these buffer reads, considering that most of the per-buffer processing time is dominated by access latency:

Shared buffer hits: reading 11 GB in 507ms results in roughly 0.00038 ms per I/O (~0.38 µs), indicative of RAM access.
Buffer reads: reading 11 GB in 1 minute and 26 seconds translates to 134 MB/s, approximately 16,000 IOPS with 8KB I/Os. Since the query is single-threaded and has no parallelism, the average I/O time is about 0.0568 ms (~57 µs). This reflects reading from the local NVMe SSD cache.

I've highlighted the execution plans at the key inflection point. We can review those metrics throughout the entire run using the VS Code dashboard.

Because this is the only table in the database, the total size increases by 100 MB every two minutes, reaching 11 GB:

As table scans increase, shared buffer hits or reads also rise. When the table size hits 11 GB, all reads are served by storage (read, no longer from shared buffers), replacing the earlier 100% cache hits.

The cache hit ratio shows when it falls below 100%:

The query execution time increases from 500ms with RAM to 1 minute with an SSD:

Here is another EXPLAIN showing when those metrics were taken:

               Tue 16 Jun 2026 10:55:01 AM GMT (every 120s)

 ?column? | pg_table_size | pg_size_pretty |              now
----------+---------------+----------------+-------------------------------
 size     |   12901031936 | 12 GB          | 2026-06-16 10:55:01.555546+00
(1 row)

             Tue 16 Jun 2026 10:55:01 AM GMT (every 120s)

                              QUERY PLAN
-----------------------------------------------------------------------
 Seq Scan on large (actual time=0.114..71725.275 rows=1574400 loops=1)
   Buffers: shared hit=12800 read=1561600 dirtied=12800
 Planning Time: 0.022 ms
 Execution Time: 71781.188 ms
(4 rows)

Time: 76089.679 ms (01:16.090)

Reading 12 GB in 1 minute and 12 seconds results in approximately 166 MB/s, corresponding to around 20,000 IOPS with 8 KB I/O operations and an average of 0.046 ms per I/O (~46 microseconds). This aligns with reads served from the local NVMe SSD cache rather than the remote durable storage path.

Even if these reads are not counted as cache hits by PostgreSQL metrics—which only track the shared buffers—they still constitute a highly effective local cache, achieving 10 GB reads in just one minute. This provides faster I/O without requiring larger compute instances.

Next cache level: local NVMe SSD

While the run was ongoing, I observed the query duration:

There were some peaks in response time that temporarily reduced throughput in inserted and fetched rows:

The same is visible from the block reads:

I gathered the execution plans from that period, showing a single run at 11:21 UTC (13:21 CET on the dashboard), with reduced throughput—reading 11 GB over 5 minutes instead of 1 minute.

             Tue 16 Jun 2026 11:17:01 AM GMT (every 120s)

                              QUERY PLAN
-----------------------------------------------------------------------
 Seq Scan on large (actual time=0.111..77440.938 rows=1715200 loops=1)
   Buffers: shared hit=14286 read=1700914 dirtied=12800 written=12555
 Planning Time: 0.022 ms
 Execution Time: 77503.429 ms
(4 rows)

Time: 81571.846 ms (01:21.572)

INSERT 0 12800

              Tue 16 Jun 2026 11:19:01 AM GMT (every 120s)

 ?column? | pg_table_size | pg_size_pretty |             now
----------+---------------+----------------+-----------------------------
 size     |   14159659008 | 13 GB          | 2026-06-16 11:19:01.5558+00
(1 row)

             Tue 16 Jun 2026 11:19:01 AM GMT (every 120s)

                              QUERY PLAN
-----------------------------------------------------------------------
 Seq Scan on large (actual time=0.109..62897.716 rows=1728000 loops=1)
   Buffers: shared hit=345599 read=1382401
 Planning Time: 0.022 ms
 Execution Time: 62958.979 ms
(4 rows)

Time: 66659.325 ms (01:06.659)

INSERT 0 12800

               Tue 16 Jun 2026 11:21:01 AM GMT (every 120s)

 ?column? | pg_table_size | pg_size_pretty |              now
----------+---------------+----------------+-------------------------------
 size     |   14264598528 | 13 GB          | 2026-06-16 11:21:01.555743+00
(1 row)

              Tue 16 Jun 2026 11:21:01 AM GMT (every 120s)

                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on large (actual time=4.852..288105.348 rows=1740800 loops=1)
   Buffers: shared hit=313478 read=1427322 dirtied=12800 written=22539
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.029 ms
 Execution Time: 288179.656 ms
(6 rows)

Time: 292242.942 ms (04:52.243)

INSERT 0 12800

               Tue 16 Jun 2026 11:25:53 AM GMT (every 120s)

 ?column? | pg_table_size | pg_size_pretty |              now
----------+---------------+----------------+-------------------------------
 size     |   14369480704 | 13 GB          | 2026-06-16 11:25:53.798728+00
(1 row)

              Tue 16 Jun 2026 11:25:53 AM GMT (every 120s)

                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on large (actual time=30.632..87584.215 rows=1753600 loops=1)
   Buffers: shared hit=14238 read=1739362 dirtied=12800 written=11000
 Planning Time: 0.015 ms
 Execution Time: 87647.389 ms
(4 rows)

Time: 91315.980 ms (01:31.316)

Doing the maths, 1,427,322 reads in 288,105 milliseconds yields an average of roughly 288,105 ms / 1,427,322 ≈ 0.202 ms per read. This indicates that the scan is no longer fully served by the fastest local SSD-cache path, and some reads fall through to a slower storage path.

I reviewed the HorizonDB compute instance metrics in the Azure portal. The inserted and returned tuples align with our observations. Variations are expected due to the bursty workload: the query executes for about a minute, then pauses for two minutes, causing the 10-second metric buckets to capture the active rate during execution and near-zero during idle times.

The pattern shifts when the query time matches the job's 2-minute wait time. Here is the query time:

The memory usage has been rising until the shared buffers reached their allocated size (Linux allocates lazily):

Finally, it is interesting to observe the network I/O from the compute instance, which relates to the read and write operations to the storage:

The slower run can be explained by some reads taking a lower cache tier or being delayed by contention/throttling. The correlation with network activity makes remote storage involvement plausible, possibly due to SSD cache misses.

Returning to the VS Code dashboard, we can also observe the statistics about vacuum, checkpoint, and WAL:

On the dashboard, the average active-session timeline is visible, along with wait-event names for reads. These suggest that reads are handled by an extension rather than the PostgreSQL core code, as they appear as HorizonDB_SSDCache_Read wait events under the Extension wait type.

Some observations of HorizonDB_Storage_Read indicate longer read times and increased network activity.

Conclusion

These experiments show that HorizonDB behaves like PostgreSQL at the SQL layer while exposing the effects of a multi-tier storage architecture through latency, throughput, and wait events.

With a monotonically increasing table size scanned, we have observed the following phases:

Phase	PostgreSQL view	Effective per 8 KiB block	HorizonDB reality
Fits in shared buffers	`shared hit`	~0.35–0.38 µs	PostgreSQL shared-buffer residency
After shared-buffer overflow, normal runs	mostly `read`	~46–57 µs	Local low-latency cache below PostgreSQL
“Slower” peaks	mostly `read`	~0.20 ms	Slower storage path, including remote-storage reads

The configuration (large shared_buffers and effective_cache_size equal to it) shows that HorizonDB minimizes reliance on the OS page cache and instead uses a multi-tier caching strategy, with a fast NVMe SSD in the stateless compute node, to reduce reads from durable remote storage.

From PostgreSQL’s point of view, the scan is still a normal sequential scan, and BUFFERS still reports only the PostgreSQL shared-buffer state. When the table fits in shared buffers, the scan is served as shared hits and completes in a few hundred milliseconds. Once the table grows beyond shared buffers, PostgreSQL reports mostly read buffers. In a disaggregated storage architecture, those reads do not necessarily mean cold remote storage. They may be served from a lower cache tier beneath PostgreSQL.

The effective per-block timings show three distinct regimes in this test: sub-microsecond effective access times when pages are already in shared buffers, tens of microseconds per PostgreSQL block during normal post-overflow scans, and occasional slower periods of a few tenths of a millisecond per block. These slower periods correlate with longer query times and increased network activity, which makes lower-tier reads or storage-path delays plausible, although this experiment alone cannot distinguish cache misses from contention, throttling, or other synchronization effects.

The important lesson is that PostgreSQL cache-hit ratio alone is not enough to understand performance in a disaggregated system. A drop from shared hit to read may look dramatic in PostgreSQL metrics, but it means “not in shared buffers”, not necessarily “read from remote durable storage” in HorizonDB. The observed latency and wait events indicate an additional cache tier below PostgreSQL, where normal reads are served from a fast local NVMe SSD cache, while slower peaks use a slower storage path.

HorizonDB is in preview on Azure, so those metrics may improve, and your feedback is welcome.