Running PostgreSQL on Compression-enabled ZFS

“Does PostgreSQL support compression?” is a common question we get from our customers, and it’s easy to see why they’re asking. Many of them generate and collect large volumes of log and event stream data and store it in text formats such as JSON or CSV. Typically these files are then compressed using a codec like gzip, and compression ratios approaching 3x-4x are not uncommon. The problem is that you sacrifice these storage savings as soon as you load the data into a database that doesn’t support compression.

Unfortunately, the answer to this question is neither a clear cut “yes” or “no”. If certain criteria are satisfied PostgreSQL’s storage layer will transparently compress variable length types using a mechanism known as The Oversized-Attribute Storage Technique, or TOAST. The problem is that TOAST’s compression won’t kick in unless the size of a row exceeds the TOAST_TUPLE_THRESHOLD value (which is set to 2KB on most systems), and even when this happens TOAST will only compress the variable length types contained in the row. TOAST does a good job of satisfying its intended purpose, which is to avoid the overhead and complexity of row-chaining by compressing a row to make it fit within one of PostgreSQL’s 8KB buffer cache pages, but it’s a poor substitute for a general purpose compression layer.

Enter ZFS

In the absence of general purpose compression support in PostgreSQL we began to wonder if we could achieve some of the same benefits by leveraging compression at the filesystem layer. In the past decade several file systems have added support for transparent compression. Currently the most mature option is ZFS, an open-source filesystem originally designed at Sun Microsystems. In addition to transparent, in-line compression with a choice of several codecs (gzip, LZJB, and LZ4), ZFS also supports a bunch of other powerful features including software RAID, virtual storage pools, snapshots, deduplication and encryption. ZFS’s biggest limitation is its CDDL license which prevents it from being distributed with the Linux kernel. Recently, this situation has improved significantly as a result of the ZFS on Linux Project which provides ZFS packages for many of the major Linux distributions.

CitusDB on ZFS

In order to verify our hunch that ZFS is a viable substitute for native compression support in the database, we decided to test the performance and compression characteristics of CitusDB running on top of ZFS and compare this to the default configuration that uses ext3. Further, we focused on a workload that was completely bound by disk I/O in our experiments.

We ran the tests on EC2 using two c1.xlarge instances initialized with Ubuntu 12.04 LTS. We installed CitusDB 2.0 on both nodes and then arranged the nodes in a 1:1 master/worker configuration. For sample data we used 38GB of event logs from the GithubArchive dataset. We converted the sample data from JSON to CSV format using a script similar to one provided in the GithubArchive project, and then loaded the results into CitusDB using the \STAGE command and this schema. After the load operation finished we calculated the compression ratio by measuring the size of the data directory on the worker node.

The results indicate that ZFS achieved a 2x compression ratio using the default LZJB compression codec, and close to a 4x compression ratio using the gzip codec.

Before we get carried away with these results we need to quantify the impact compression has on performance. Running ZFS with compression enabled forces the file system do additional work on top of every disk read and write operation. It’s reasonable to assume that this will have a negative impact on overall performance, but the question we really need to answer is whether the performance impact is so significant that it outweighs the benefits. In order to do this we devised a simple benchmark composed of seven queries, which we then executed against the 38GB dataset we loaded in the previous step. It’s important to note that the CitusDB worker nodes will process the shards in parallel, which implies that we are also testing the ability of the file system or kernel’s IO scheduler to handle concurrent read requests.

These results are surprising to say the least. We know that enabling compression forces the CPU to do additional work on top of every read and write operation, so how is it possible for Query 1 to run nearly four times faster when the underlying data is compressed with gzip? The explanation for this counterintuitive result hinges on the fact that these queries are IO bound. In other words, the primary bottleneck limiting the performance of these queries is not the speed of the CPU but rather the rate at which data can be read from the disk into the kernel’s buffer cache. With compression enabled these operations take less time because less data needs to travel between the disk and the buffer cache. In addition, seek times are lower because compression reduces the physical distances between logical blocks of data on the disk, and the disk buffer’s hit rate improves since it is able to store more data in the same amount of space (note that we’re talking about the hard drive’s buffer -- this last advantage does not apply to the buffer cache since ZFS decompresses blocks before saving them in the cache).

While doing background research for this blog post we also got a chance to investigate some of the other features besides compression that differentiate ZFS from older file system architectures like ext3. One of the biggest differences is ZFS’s approach to scheduling disk IOs which employs explicit IO priorities, IOP reordering, and deadline scheduling in order to avoid flooding the request queues of disk controllers with pending requests. We were curious to see how well these techniques work in practice. To do that we dropped the old githubarchive table and loaded a new copy, this time splitting the table into 23 shards instead of 2. The total amount of data that each query runs against remains unchanged, but CitusDB will now divide the work between 23 processes instead of 2.

The first thing we need to point out is that the horizontal axis on this chart is twice as large as the one in the previous section. The second thing worth noting is that this test hosed ext3. Queries which took on average 350 seconds to execute with two processes now take roughly twice as long when run with 23 processes. These results are disappointing but won’t surprise anyone who is familiar with the performance characteristics of sequential and random reads on conventional file systems. Disks can handle sequential reads much faster than random reads since the former do not require the disk to reposition the read head. In both test runs the individual CitusDB worker processes made sequential read requests, but this characteristic disappears and things start to look random if the IO scheduler simply enqueues IO requests in the order in which they are received. In contrast ZFS’s IO scheduler attempts to reorder IO operations in an effort to limit the amount of time the disk spends seeking from one block to the next, and the results we obtained here support the conclusion that ZFS’s approach actually works.

So what are the takeaways from this experiment? First and foremost we succeeded in proving our hypothesis that ZFS makes it possible to achieve 3-4x compression ratios for event stream and log datasets. Second, we demonstrated that ZFS and compression actually improves performance when queries are IO bound. Finally, we also determined that ZFS on Linux is maturing and will likely become a viable option for enterprise deployments in the future.

Carl wishes to thank Metin Doslu for contributing background research and results that are used in this article.

About

CitusDB is a scalable analytics database that's built on top of PostgreSQL.

In this blog, we share our ideas and experiences on databases and distributed systems.