Disk caching in the year 2020

In the previous blog post I went over the design of storage systems and how they protect themselves from data loss in the face of disk failure or unclean shutdowns. If you’re putting together the hardware yourself, modern Linux systems and modern hardware give you endless cost, performance, and durability tradeoffs for whatever system you want to build. But 2020 also gives us the cloud, and storage systems in the cloud whose durability far exceeds whatever you could build yourself. Cloud providers also have the scale to amortize the insane costs of these exotic, durable, and yet still performant system designs across their broad user bases, bringing down prices and making them accessible to us mortals. In the year 2020, if you’re in the cloud, durability just isn’t a concern anymore, so it’s time to start focusing on performance.

Caching basics in 2020

2020 also brings affordable, reliable, and speedy SSDs, and if you’re designing a system yourself you want some way to take advantage of SSD performance. Building everything out of SSDs isn’t an awful choice: it’s simple and probably won’t even cost that much. However, most storage systems spend their lives with some bulk of data that is rarely read or written, which is an opportunity to reduce marginal costs with tiered storage. If you can keep cold data on slower but cheaper magnetic disks and use SSDs where it counts, tiered storage gives you the speed benefits of SSDs when you need them most and the cost benefits of magnetic disks the rest of the time.

Tiered storage introduces complexity back into the storage system, increasing operational overhead and the risk of data loss. If you can implement tiered storage at the application layer you have some of the best tools available for taming that complexity. For example, a filer could keep some file shares known to be cold, like backups, directly on magnetic storage and others known to be hot directly on SSD storage. A database could have tablespaces for slow and fast storage and assign cold and hot tables to each. You could even keep table indexes on fast storage with reduced reliability, knowing that a disk failure in the fast storage won’t cause unrecoverable data loss. These are all simple systems, with well-understood failure modes and controlled operational complexity. They also extract the most benefit from SSD performance because their simplicity skips the CPU and I/O overhead introduced by more complex systems.
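
To make the database example concrete, here is roughly what it could look like in PostgreSQL; the mount points, tablespace names, and table are hypothetical, and the directories are assumed to already exist on the right disks and be owned by the postgres user:

```
# Hypothetical PostgreSQL sketch: /mnt/ssd and /mnt/hdd are assumed to be
# mounted on SSD and magnetic storage respectively.
psql <<'SQL'
CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pgdata';
CREATE TABLESPACE slow_hdd LOCATION '/mnt/hdd/pgdata';

-- Bulk data that is rarely touched lives on magnetic storage...
CREATE TABLE archive_events (id bigint, payload text) TABLESPACE slow_hdd;

-- ...while the index it is queried through lives on the SSD. Losing the SSD
-- means rebuilding an index, not losing data.
CREATE INDEX archive_events_id_idx ON archive_events (id) TABLESPACE fast_ssd;
SQL
```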

But what if your application layer doesn’t give you tiered storage? What if you can’t afford the operational complexity required to maintain it yourself? Deciding where data goes initially is easy, but moving it around later is harder. And assumptions change, with cold data becoming warm, so you need continuous feedback to make and revisit your storage decisions. Caching is the next choice: the system decides automatically where to keep data. It’s usable anywhere and can bring its benefits to every storage system, but it comes at the cost of potentially suboptimal use of the fast storage and a potentially increased risk of data loss.

Caching systems typically implement multiple caching algorithms, giving you more tradeoffs between speed, durability, and cost. Writeback caching writes all data to fast storage first and later copies it asynchronously to slow storage. It gives the best performance: reads can see the newest writes immediately and writes go at the speed of the fast pool. But should your fast storage suffer data loss before the asynchronous copy finishes, you’ll have permanent data loss. Your system’s performance also hinges on the intelligence of the writeback cache algorithm more than with the other algorithms. Imagine a system that does backups: if the writeback cache can’t determine that backup writes are cold and shouldn’t fill up the cache, you’ll wind up evicting all the useful data in favor of extremely cold data. This logic can’t be simplified or formalized for general use and requires tuning and measurement.

Writearound caching sends writes directly to slow storage; only reads make use of the cache. It’s a blunt tool, unable to improve write speeds or to cache recently written data, but it never has to implement the dark-art algorithms that writeback does to get good performance against pathological workloads, and you won’t lose data if your cache disk dies. There is also writethrough caching, where writes go to both the fast and slow disks at the same time, improving the cache hit rate for recently written data but still limiting your write speed to the speed of the slow disk.

The previous blog post introduced the RAID write hole and a few techniques to close it. If you’re building your own storage system you are probably considering RAID and using one of those techniques to improve the durability of the system. Journaling is a well-supported and reliable way to close the write hole, and under the hood these journals are often just writeback/writethrough/writearound caches, with all of the benefits and drawbacks described above. In other words, a RAID array of magnetic disks with a large enough SSD journal is also a RAID array with an SSD cache. Journal disk failure is still a threat, putting the less aggressive cache algorithms into strong consideration, because an unfortunate disk failure in a writeback journal means losing all data on the entire array.
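
To make that concrete, here’s roughly what a journaled md RAID 5 array could look like; the device names are assumptions, and the journal mode switch depends on your kernel version:

```
# Sketch: a 4-disk RAID 5 array of magnetic disks with an SSD journal device
# closing the write hole.
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      --write-journal /dev/nvme0n1 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.ext4 /dev/md0

# On newer kernels the journal behaves as a writethrough or writeback cache
# depending on the mode selected here.
cat /sys/block/md0/md/journal_mode
```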

The storage available to you in the cloud shifts your priorities the other way. Durability is a given, as discussed earlier, and you’re now looking to squeeze what performance you can out of your design. More aggressive caching algorithms like the writeback cache are now in scope and should be seriously considered, even for ordinary cloud setups.

Finally, I want to cover the Linux buffer and page caches, just for completeness. Your application's I/O hits these first and all of the other cache systems later; if a request is satisfied here, the other disk caching code never runs at all. A 2017 summary of the history, current state, and future of these Linux caches is on LWN. The page cache carves system memory up into equally sized pages, typically 4 kilobytes, and tracks how each page is used. A page can cache a region of data on a disk, meaning that the page's contents directly correspond to 4 kilobytes on a disk somewhere. A read request fills that cached page and future reads can be serviced straight from it. A write updates the cached page and marks it dirty, signalling the kernel to write the corresponding data back to the disk at its leisure. The buffer cache is a legacy API, internal to the kernel, implemented on top of the page cache. It works with smaller units of data, called blocks, that are mapped onto the relevant page cache entries for each file; reads and writes to blocks are passed through to the page cache subsystem. The buffer cache was only truly separate from the page cache in much older kernels. The modern API for I/O is built around I/O requests and the bio struct, instead of stateful buffer cache blocks, and allows I/O requests to be batched, reordered, or executed in parallel.
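
You can watch the page cache do this from userspace; a quick sketch:

```
# How much file data the page cache currently holds, and how much of it is
# dirty or being written back right now.
grep -E '^(Cached|Dirty|Writeback):' /proc/meminfo

# Ask the kernel to flush dirty pages to disk now rather than at its leisure.
sync
```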

Disk caching in 2020

So what cache options are available to us on a modern system? As mentioned in the previous section, you can always just buy a bunch of RAM and you’ll get some in-memory caching provided by the kernel's page cache out of the box. If you’re using a RAID array there are ways to use the array’s journal as a cache. Modern kernels come with more exotic tools, though, and I will cover a few of them.

bcache, introduced in 2010, ships with most kernels. To use bcache you format your backing (slow) disk or logical volume with bcache and create whatever filesystem you want on top of the resulting bcache device. You then enable caching on the backing device by formatting a cache device (your SSD) and attaching it to the backing disk via sysfs. bcache implements writeback, writethrough, and writearound caching, can switch between cache algorithms at runtime, and has its own logic to detect access patterns, like large sequential reads, that shouldn’t be cached. bcache also exposes plenty of statistics and tunables for those who need to measure the performance of their system. Failure of the cache disk in writeback mode results in data loss; however, you still get a saving throw with fsck, as bcache keeps the filesystem data on the backing store in its correct order. I’m very optimistic about bcache: I think it’s a good option both for those rolling their own hardware and in the cloud, and part 3 of this blog series will go through a bcache setup in the cloud.
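
A minimal setup sketch, assuming /dev/sdb is the backing magnetic disk, /dev/nvme0n1 is the cache SSD, and udev handles device registration as it does on modern distributions:

```
make-bcache -B /dev/sdb        # format the backing device; /dev/bcache0 appears
make-bcache -C /dev/nvme0n1    # format the cache device; prints the cache set UUID

# Attach the cache set to the backing device and pick a cache algorithm,
# both through sysfs. CSET_UUID is the UUID printed by make-bcache -C above.
echo "$CSET_UUID" > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode

mkfs.ext4 /dev/bcache0         # then create whatever filesystem you want on top
```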

Linux’s LVM also surfaces a device mapper target called dm-cache, exposed as lvmcache, that provides similar caching on top of logical volumes. It’s configured through the typical LVM command line tools and ships with most kernels. dm-cache offers writethrough and writeback modes and the ability to safely stop caching on a logical volume. For whatever reason lvmcache doesn’t seem to be nearly as popular as bcache. It’s not that much more complex to get started with lvmcache than with bcache, but I suspect the overhead of wading through the massive LVM documentation and familiarizing yourself with LVM terminology is enough to keep many away.
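
For comparison, a rough lvmcache sketch, assuming a volume group vg0 that already holds a slow logical volume named data, an SSD at /dev/nvme0n1, and an LVM recent enough to support the --cachevol form:

```
vgextend vg0 /dev/nvme0n1                        # add the SSD to the volume group
lvcreate -n fastcache -L 100G vg0 /dev/nvme0n1   # carve a cache volume out of the SSD
lvconvert --type cache --cachevol fastcache \
          --cachemode writethrough vg0/data      # start caching vg0/data

# To safely stop caching later, flushing anything dirty back to the slow volume:
lvconvert --uncache vg0/data
```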

ZFS has ARC, its own sophisticated cache that is heavily tied into the overall design of ZFS. Writes are initially kept in RAM (the ARC) and asynchronously committed to disk. L2ARC periodically grabs the data most likely to be evicted from ARC and persists it to its own disk. Later reads that miss in ARC can then be served out of L2ARC if the data exists there. This is similar to writearound caching in that it only caches previously read data, and the failure of the L2ARC disk doesn’t risk data loss. Your workload only benefits from L2ARC when the working set of hot reads is larger than your ARC cache - not because L2ARC is bad, but because you just don’t need it until ARC is exhausted.
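
Adding an L2ARC device to an existing pool is a one-liner; a sketch assuming a pool named tank and a spare NVMe SSD:

```
# Attach the SSD to the pool as a cache (L2ARC) device.
zpool add tank cache /dev/nvme0n1

# Watch ARC and L2ARC hit rates once a second to see if it is earning its keep.
arcstat 1
```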

Another ZFS feature that comes up when discussing ZFS performance is the ZFS Intent Log, or ZIL. It’s a journal used to guarantee the durability of synchronous writes of filesystem metadata and user data, and it is not a cache. To improve the overall speed of writes and reduce the I/O load on the main ZFS disks, the ZIL is often moved to a fast SSD in a configuration called a SLOG. Even in this configuration the ZIL is still not a cache: it is only ever read when the filesystem has had an unclean shutdown and has to be restored to a consistent state. It comes up because, like caching, it is a way for users to improve their ZFS performance. By adding a SLOG you get the improved write performance of writeback caching, but only the improved write performance; with L2ARC you get faster reads on cache hits, but only on cache hits. To take advantage of both SSD read and write performance in ZFS you have to add both a SLOG and an L2ARC to your filesystem, requiring at least one SSD for the L2ARC and two for the SLOG.
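
A SLOG sketch with the same assumed pool and devices; the mirror is there because an unmirrored SLOG that dies at the wrong moment takes recently logged synchronous writes with it:

```
# Move the ZIL onto a mirrored pair of SSDs (a SLOG).
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# The new "logs" vdev shows up in the pool layout.
zpool status tank
```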

I get the feeling that ZFS was really intended for general purpose filesystems and filers with general purpose workloads. The design is optimized for write durability wherever it can be (a good, safe default!) and does a good job getting there. A dedicated ZFS appliance can soak up all the RAM you can throw at it and can continue to scale with separate, additional devices for the ZIL and the L2ARC. And even if you were to put databases or other applications that carefully micromanage their writes onto a ZFS filesystem, you’d still get SSD write improvements from a SLOG on an SSD. But suppose you had a database on ZFS and you wanted to keep the entire index on fast storage. This seems like a good use case for L2ARC, as the indexes of a non-trivial database will exceed the size of RAM. But because L2ARC only caches previously read data you’d have to come up with some sort of cache warming to get your index onto the SSD, and only very recently did L2ARC gain the ability to persist across clean system restarts. It makes more sense to put the indexes in their own tablespaces explicitly located on SSD disks. ZFS gives you the tools to do this, but there is no real set-it-and-forget-it architecture like with a true writethrough SSD cache. The combination of ZFS and Linux also seems geared towards filers. The Linux kernel still maintains its normal page cache alongside ARC, and the same data can be cached redundantly in both places. There is also the risk that a large page cache soaks up cache hits first and ARC rarely gets a chance to grow. Systems that exhibit either behavior have to set a large minimum ARC size to keep the page cache small and force hits into ARC - something you can really only get away with on a device dedicated to storage, like a filer.
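
Pinning a minimum ARC size is done through a ZFS module parameter; a sketch using an arbitrary 16 GiB floor chosen purely for illustration:

```
# Take effect immediately on a running system.
echo $((16 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_min

# Persist across reboots (16 GiB = 17179869184 bytes).
echo 'options zfs zfs_arc_min=17179869184' >> /etc/modprobe.d/zfs.conf
```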

Btrfs does not yet have its own caching functionality.

For the next and final blog post in this series, I’m going to walk through what bcache looks like in the cloud and try and get some benchmarks.