Monday, 24 March 2014

Warning: BBWC may be bad for your health

Over the past few days I've been looking again at I/O performance and reliability. One technology keeps cropping up: Battery Backed Write Caches. Since we are approaching World backup day I thought I'd publish what I've learnt so far. The most alarming thing in my investigation is the number of people who automatically assume that a BBWC assures data integrity. Unfortunately the internet is full of opinions and received wisdom, and short on facts and measurement. However if we accept that re-ordering of the operations which make up a filesystem transaction on a journalling filesystem undermines the integrity of the data, then the advice from sources who should know better is dangerous.

Write barriers are also unnecessary whenever the system uses hardware RAID controllers with battery-backed write cache. If the system is equipped with such controllers and if its component drives have write caches disabled, the controller will advertise itself as a write-through cache; this will inform the kernel that the write cache data will survive a power loss.”

There's quite a lot here to be confused about. The point about RAID controllers is something of a red herring. There's a lot discussion elsewhere about software vs hardware RAID – and the short version is that modern computers have plenty of CPU to handle the RAID without a measurable performance impact, indeed many of the cheaper devices offload the processing work to the main CPU. On the other hand hardware RAID poses 2 problems:

  1. The RAID controller must write its description of the disk layout to the storage – it does this in a propretary manner meaning that if (when?) the controller fails, you will need to source compatible hardware (most likely the same hardware) to access the data locked away in your disks
  2. While all hardware RAID controllers will present the configured RAID sets to the computer as simple disks, your OS need visibility of what's happenning beyond the controller in order to tell you about failing drives. Only a small proportion of the cards currently available are fully supported in Linux.

It should however be possible to exploit the advantages (if any) offerred by the non-volatile write cache without using the hardware RAID functionality. It's an important point that the on-disk caches must be disabled for any assurance of data intregrity. But there an omission in the statement above which will eat your data.

If you use any I/O scheduler other than 'noop' then the writes sent to the BBWC will be re-ordered. That's the point of I/O scheduler. Barriers (and more recently FUA) provide a mechanism for write operations to be grouped into logical transactions within which re-ordering has no impact on integrity. Without such boundaries, there is no guarantee that the 'commit' operation will only occur after the data and meta data changes are applied to the non-volatile storage.

Even worse than losing data is that your computer wont be able to tell you that it's all gone horribly wrong. Journalling filesystem are a mechanism for identifying and resolving corruption events arising due to incomplete writes. If the transaction mechanism is compromised from out-of-sequence writes, then the filesystem will most likely be oblivious to the corruption and report no errors on resumption.

For such an event to lead corruption, it must occur when a write operation is taking place - since writing to the non-volatile storage should be much faster than to disk, and that in most systems reads are more comon than writes, writes will only be occurring for a small proportion of the time. But with faster/more disks and write intensive applications this differential decreases.

When Boyd Stephen Smith Jr said on the Debian list that a BBWC does not provide the same protection as barriers he got well flamed. He did provide links that show that the barrier overhead is not really that big.

A problem with BBWC is that the batteries wear out. The better devices will switch into learning mode to measure the battery health either automatically or on demand. But when they do so, the cache ceases to operate as non-volatile storage and the device changes it's behaviour from write-back to write through. This has a huge impact on both performance and reliability. Low end devices won't know what state the battery is in until the power is cut. Hence it is essential to choose a device which is fully supported under Linux.

Increasingly BBWC is being phased out in favour of write caches using flash for non-volatile storage. Unlike the battery-backed RAM there is no learning cycle. But Flash wears out with writes. The failure modes for these devices are not well understood.

There's a further consideration for the behaviour of the system when its not failing: The larger the cache on the disk controller or on disk, the more likely that writes will be re-ordered later anyway – so maintaining a buffer in the host systems memory and re-ordering the data there just means adding more latency before data is in non-volatile storage. NOOP will be no slower and should be faster most of the time.

If we accept that a BBWC with a noop scheduler should ensure the integrity of our data, then is there any benefit from enabling barriers? According to RedHat, we should disable them because the BBWC makes them redundant and...
enabling write barriers causes a significant performance penalty.”
Wait a minute. The barrier should force the flush of all data held in non-volatile memory. But we don't have any non-volatile memory after the OS/filesystem buffer. So the barrier should not be causing any significant delay. Did Redhat get it wrong? Or are the BBWC designers getting it wrong and flushing the BBWC at barriers?

Could the delay be due to moving data from the VFS to the I/O queue? We should have configured our system to minimize the size of the write buffer (by default 2.6.32 only starts actively pushing out dirty pages when the get to 10% of the RAM – that means you could have 3Gb of data in volatile storage on a 32Gb box). However, many people also report performance issues with Innodb + BBWC + barriers, and the Innodb engine should be configured to use O_DIRECT, hence we can exclude a significant contribution to performance problems from the VFS cache.

I can understand that the people developing BBWC might want to provide a mechanism for flushing the write back cache – if the battery is no longer functional or has switched to “learning mode” then the device needs to switch to write through mode. But its worth noting that in this state of operation, the cache is not operating a non-volatile store!

Looking around the internet, it's not just Redhat who think that a BBWC should be used with no barriers and any IO scheduler:

The XFS FAQ states
“it is recommended to turn off the barrier support and mount the filesystem with "nobarrier",”
Percona say you should disable barriers and use a BBWC but don't mention the I/O scheduler in this context. The presentation does later include an incorrect description of the NOOP scheduler.

Does a BBWC add any value when your system is running off a UPS? Certainly I would consider a smart UPS to be the first line of defence against power problems. In addition to providing protection against over-voltages it should be configured to implement a managed shutdown of your system, meaning that transactions at all levels of abstraction will be handled cleanly and under the control of the software which creates them.

Yes, a BBWC does improve performance and reliability (in combination with the noop scheduler, a carefully managed policy for testing and monitoring battery health and RAID implemented in software). It is certainly cheaper than moving a DBMS or fileserver to a two-node cluster, but the latter provides a lot more reliability (some rough calculations suggest about 40 times more reliable). If time and money are no object then for the best performance, equip both nodes in the cluster with BBWC. But make sure they are all using the noop scheduler.

Further, I would recommend testing the hardware you've got – if you see negligible performance impact with barriers and a realistic workload then enable the barriers.