Durability in the year 2020

Submitted by gil on Thu, 10/08/2020 - 9:13am

It is the year 2020 and we still don’t have great answers to data durability in the face of unclean shutdowns. Unclean shutdowns are things like power outages, system faults and unlucky kernel panics, and preventing data loss when they happen is a hard problem. I’m going to talk through a few ways these can cause data loss in this blog post, but you can probably come up with new ones on your own - exotic failures that no system will ever handle correctly. In the year 2020, the conventional wisdom is still true: always take backups, and RAID is no substitute for backups.

How do we deal with these unclean shutdowns? If minimizing data loss from unclean shutdowns is in scope for your system, you should seriously consider putting time and money into functionality that minimizes the chance of unclean shutdowns. If you fear power outages, invest in battery-powered uninterruptible power supplies. If you fear kernel panics, run stable distro kernels with simple, widely tested software like ext4/xfs and mdraid.

Even in the face of these unclean shutdowns, some systems do a better job than others at dodging fate, getting lucky and persisting good data. There’s RAID level 1, a classic simple design: data is written to two disks simultaneously and writes are acknowledged when both disks finish writing. If the application layer is properly using write barriers and not progressing until the write completes, you won’t commit a transaction and then lose its data afterwards. But if you get unlucky with an unclean shutdown gone very wrong you could have some exotic data loss. The recovery mode for an unclean RAID 1 array as done by md is to copy the unclean blocks from the first available device in the array to the rest of the drives. Should the unclean shutdown cause any issues with the data written to that first drive, you’ll have undetected corruption.
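The application-side half of that bargain can be sketched in a few lines of Python. This is a minimal sketch, assuming a POSIX-like system; the helper name durable_append is made up for illustration. The idea is simply that the function does not return, and so the caller cannot acknowledge a transaction, until os.fsync reports that the data has been handed to stable storage.

```python
import os

def durable_append(path, record):
    """Append record (bytes) to path and return only after fsync succeeds.

    Hypothetical helper for illustration: the caller treats a normal
    return as "safe to acknowledge the transaction".
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(record)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Block until the kernel reports the data has reached the device.
        os.fsync(fd)
    finally:
        os.close(fd)
```

A real system would also need to fsync the containing directory after creating a new file, and it still depends on the disk honoring the flush. The sketch only shows the ordering: write, flush, and only then report success.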

RAID levels 3, 4 and 5 introduce parity data that can check a stripe for consistency and recreate the data of a bad or missing disk, avoiding some of the corruption problems with RAID level 1. But they have their own weakness - the famous write hole, where data previously written safely to the array is damaged because of an unclean shutdown during a later write. These RAID levels write data out as stripes, larger chunks of data that can potentially contain unrelated older data alongside the new data you’re trying to write. Any problem that happens while writing out a stripe risks glitching the entire stripe, resulting in data loss.
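The write hole is easier to see with actual bytes. Here is a toy sketch, assuming a single stripe of three data blocks and one XOR parity block; the block contents and the xor_blocks helper are invented for illustration and have nothing to do with md's real on-disk layout.

```python
from functools import reduce

def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across three data disks plus one parity disk.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Normal operation: any single lost block can be rebuilt from the others.
assert xor_blocks(d0, d2, parity) == d1

# Write hole: new data for d0 reaches its disk, but power is lost
# before the matching parity update is written.
d0 = b"ZZZZ"           # the new data landed
stale_parity = parity  # the parity update never landed

# Later the disk holding d1 dies. d1 was never touched by the write,
# yet rebuilding it from the stale parity no longer gives back b"BBBB".
rebuilt_d1 = xor_blocks(d0, d2, stale_parity)
assert rebuilt_d1 != b"BBBB"
```

The damaged block is the old, untouched one, which is exactly what makes the write hole so unpleasant.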

Linux’s md has picked up a few solutions for closing the write hole in recent years. It now supports a journal, where the array’s write status is continually written to a separate disk. If an unclean shutdown happens, it is now possible to identify the stripes in the array that were mid-write and unwind or correct the array’s data by replaying the transaction log on the journal disk. And the journal doesn’t just close the write hole, it can also help speed up array writes by optionally using the journal as a write-back cache. In this mode the actual data in the write request is first committed to the journal disk and only later flushed to the array. If you use a speedy SSD for your journal disk you now get much faster write speeds and the same resistance to individual disk failure as a RAID array.
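A toy model makes the replay idea concrete. This is only a sketch under simplifying assumptions: one stripe, an in-memory list standing in for the journal disk, and a made-up JournaledStripe class. It is not how md lays out its log, just the ordering that closes the write hole: log first, touch the array second, replay after a crash.

```python
from functools import reduce

def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

class JournaledStripe:
    """Toy model: one stripe (data blocks + parity) plus a journal device."""

    def __init__(self, data_blocks):
        self.data = list(data_blocks)
        self.parity = xor_blocks(*self.data)
        self.journal = []  # stands in for the separate journal disk

    def write(self, index, block, crash_before_parity=False):
        new_data = list(self.data)
        new_data[index] = block
        new_parity = xor_blocks(*new_data)
        # 1. Record the intended data and parity in the journal first.
        self.journal.append((index, block, new_parity))
        # 2. Only then touch the array itself.
        self.data[index] = block
        if crash_before_parity:
            return  # simulated power loss: parity on the array is stale
        self.parity = new_parity
        # 3. The stripe is consistent again, so drop the journal entry.
        self.journal.pop()

    def recover(self):
        """Replay journal entries left behind by an unclean shutdown."""
        for index, block, new_parity in self.journal:
            self.data[index] = block
            self.parity = new_parity
        self.journal.clear()

    def rebuild(self, lost_index):
        """Reconstruct one data block from the others plus parity."""
        others = [b for i, b in enumerate(self.data) if i != lost_index]
        return xor_blocks(*others, self.parity)

stripe = JournaledStripe([b"AAAA", b"BBBB", b"CCCC"])
stripe.write(0, b"ZZZZ", crash_before_parity=True)
stripe.recover()
assert stripe.rebuild(1) == b"BBBB"  # the untouched block survives
```

Skipping the recover() call leaves the stale parity in place and the rebuild returns garbage, which is the unjournaled write hole again.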

But if you have to use a separate disk for the journal aren’t you back to square one? What happens if the journal’s disk fails, either from normal operations or as part of an unclean shutdown? If the journal disk fails you will have data loss. Mirroring the journal with RAID level 1 is certainly a solution here, and it seems like the failure modes for RAID 1 are rare enough, especially for modern SSDs, that it makes a real improvement to the overall reliability of the system. In recent times the improved software support and hardware availability of persistent memory (PMEM) means that you could build a system with a fast, reliable journal disk that survives power outages. On Linux, PMEM hardware is exposed to the system as an ordinary block device, can be put into a RAID array of its own, and used by mdadm as a journal device, giving you a journal resistant to the failure of any one PMEM DIMM.

md also gained the Partial Parity Log (PPL) in recent years, a new technique for writing RAID level 5’s parity data. It incorporates parity data about the previous state of a stripe into the write done for the updated state of the stripe, allowing the previously written data to be recovered in case of failure and closing the write hole. There is still an opportunity for data loss with newly written data, as the PPL only calculates its extra parity data for old data in the stripe and not the new write. On the flip side, the PPL doesn’t require a separate journal disk as it writes the extra parity data directly into the array, sidestepping a potential I/O bottleneck in the journal disk and any durability issues with it. The cost is that this extra parity data must be computed, written and read, which increases the I/O and CPU load of writes and lowers write performance overall.
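Here is a rough sketch of the idea, assuming the partial parity is the XOR of the chunks in a stripe that the current write leaves untouched. The chunk contents and the xor_blocks helper are invented for illustration, and the sketch ignores where and how md actually stores the log.

```python
from functools import reduce

def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A stripe with three data chunks; the incoming write modifies only chunk 0.
old = [b"AAAA", b"BBBB", b"CCCC"]
modified = {0}

# Partial parity, logged before the stripe write: XOR of the chunks
# this write does NOT modify.
partial_parity = xor_blocks(*(c for i, c in enumerate(old) if i not in modified))

# Crash mid-write: chunk 0 could be old, new, or torn. Say it is torn.
on_disk = [b"ZZAA", b"BBBB", b"CCCC"]

# Recovery: recompute parity from the logged partial parity plus whatever
# actually ended up on disk in the modified chunks.
recovered_parity = xor_blocks(
    partial_parity, *(on_disk[i] for i in sorted(modified))
)

# The untouched chunks are protected again: losing the disk with chunk 1
# still lets it be rebuilt correctly.
rebuilt_chunk1 = xor_blocks(on_disk[0], on_disk[2], recovered_parity)
assert rebuilt_chunk1 == b"BBBB"
```

Note that chunk 0 stays torn. The log only protects the old data sharing the stripe, which is the remaining exposure for newly written data mentioned above.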

The ZFS filesystem, known for its rampant layering violations, actually does a pretty good job at taking advantage of its situation and providing a durable filesystem in the face of unclean shutdowns. ZFS on Linux is mature, stable, and used by many. Its RAID-Z design closes the write hole by having variably-sized stripes, shrunk down to the size of each write. Previously written data won’t be included in a stripe and thus can’t be damaged during writes of new data. The filesystem’s metadata is also mutated with copy-on-write semantics instead of in place, so after an unclean shutdown the filesystem can be rolled back to a previously good state. It’s just a really solid design and it delivers robustness without any of the elaborate workarounds described earlier. These design choices come with the tradeoff of decreased performance, but the cost is often not large enough to turn use cases away from it. If you are going to be maintaining hardware yourself, ZFS should be on your radar.
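The copy-on-write part is worth a tiny illustration. This is a toy sketch, not ZFS’s actual on-disk structure: the CowStore class is made up, a dict stands in for the disk, and a single integer stands in for the root pointer. It only shows why never overwriting in place gives you a consistent state to fall back to.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self, initial):
        self.blocks = {0: dict(initial)}  # block id -> contents
        self.root = 0                     # id of the committed state
        self.next_id = 1

    def update(self, changes, crash_before_commit=False):
        # Write the new version into a freshly allocated block.
        new_id = self.next_id
        self.next_id += 1
        self.blocks[new_id] = {**self.blocks[self.root], **changes}
        if crash_before_commit:
            return  # simulated power loss: root still names the old block
        # Commit by switching the root pointer in a single step.
        self.root = new_id

    def read(self):
        return self.blocks[self.root]

store = CowStore({"size": 1})
store.update({"size": 2}, crash_before_commit=True)
assert store.read() == {"size": 1}  # old state is intact after the crash
store.update({"size": 3})
assert store.read() == {"size": 3}
```

The half-written block from the interrupted update is simply garbage that nothing points to, so there is nothing to repair, only space to reclaim.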

Finally, btrfs is still hanging around. I remember reading about it on lwn.net in ’07, my freshman year of college, and thinking that this was the filesystem of the future, comparable to ZFS on features but GPLed and developed in-tree. But as the years went by the team never really pulled it together, despite corporate backing from a few places, and btrfs has left a trail of users burnt by bugs and big unfixed issues even in cutting-edge kernels. One mailing list post from June of 2020 lists a shocking number of workarounds for data loss bugs. I really do not trust btrfs to do well in any unclean shutdown.

Durability in the face of unclean shutdowns and other issues is an important thing to have, but the cost/performance/durability tradeoff can often lean in favor of reducing the probability of unclean shutdown with battery backup instead of complicating the design of the storage system. Many businesses and individuals also no longer manage their own hardware and operate in the cloud. The cloud block devices used for bulk storage are internally redundant and extremely durable. Unclean shutdowns rarely happen as cloud hardware runs in datacenters with uninterruptible power systems and live migration of server instances. In the cloud you have endless hardware available to you and it's extremely durable, swinging the system design pendulum away from durability and back towards performance. In the next post I’ll explore the performance tradeoffs of storage systems.