Over the past few days I've been looking again at I/O performance and reliability. One technology keeps cropping up: Battery Backed Write Caches (BBWC). Since we are approaching World Backup Day I thought I'd publish what I've learnt so far. The most alarming thing in my investigation is the number of people who automatically assume that a BBWC assures data integrity.
Unfortunately the internet is full of opinions and received wisdom, and short on facts and measurement. However, if we accept that re-ordering of the operations which make up a filesystem transaction on a journalling filesystem undermines the integrity of the data, then the advice from sources who should know better is dangerous.
From the Red Hat Storage Administration Guide:
“Write barriers are also unnecessary whenever the system uses hardware RAID controllers with battery-backed write cache. If the system is equipped with such controllers and if its component drives have write caches disabled, the controller will advertise itself as a write-through cache; this will inform the kernel that the write cache data will survive a power loss.”
There's quite a lot here to be confused about. The point about RAID controllers is something of a red herring. There's a lot of discussion elsewhere about software vs hardware RAID – and the short version is that modern computers have plenty of CPU to handle the RAID without a measurable performance impact; indeed many of the cheaper hardware RAID devices offload the processing work to the main CPU. On the other hand, hardware RAID poses two problems:
- The RAID controller must write its description of the disk layout to the storage – it does this in a proprietary manner, meaning that if (when?) the controller fails, you will need to source compatible hardware (most likely the same hardware) to access the data locked away in your disks.
- While all hardware RAID controllers will present the configured RAID sets to the computer as simple disks, your OS needs visibility of what's happening beyond the controller in order to tell you about failing drives. Only a small proportion of the cards currently available are fully supported in Linux.
It should however be possible to exploit the advantages (if any) offered by the non-volatile write cache without using the hardware RAID functionality. It's an important point that the on-disk caches must be disabled for any assurance of data integrity. But there is an omission in the statement above which will eat your data.
If you use any I/O scheduler other than 'noop' then the writes sent to the BBWC can be re-ordered. That's the point of an I/O scheduler. Barriers (and more recently FUA) provide a mechanism for write operations to be grouped into logical transactions within which re-ordering has no impact on integrity. Without such boundaries, there is no guarantee that the 'commit' operation will only occur after the data and metadata changes are applied to the non-volatile storage.
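
To see which scheduler each disk is actually using, a minimal sketch follows. It assumes a Linux system that exposes `/sys/block/<device>/queue/scheduler`, where the active scheduler is shown in square brackets (for example `noop deadline [cfq]`). The function names are my own, and the names of the available schedulers differ between kernel versions.

```python
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler shown in [brackets], e.g. 'noop deadline [cfq]' -> 'cfq'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Some devices report a single name (such as "none") without brackets.
    return sysfs_text.strip()


def report_schedulers(sys_block: str = "/sys/block") -> dict:
    """Map each block device to its active scheduler (empty if sysfs is absent)."""
    result = {}
    root = Path(sys_block)
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        try:
            text = (dev / "queue" / "scheduler").read_text()
        except OSError:
            continue  # device without a scheduler file, or not readable
        result[dev.name] = active_scheduler(text)
    return result


if __name__ == "__main__":
    for name, sched in report_schedulers().items():
        note = "" if sched in ("noop", "none") else "  <-- may re-order writes"
        print(f"{name}: {sched}{note}")
```

Any device flagged here is one where the host may re-order writes before they reach the controller's cache.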
Even worse than losing data is that your computer won't be able to tell you that it's all gone horribly wrong. Journalling filesystems are a mechanism for identifying and resolving corruption events arising from incomplete writes. If the transaction mechanism is compromised by out-of-sequence writes, then the filesystem will most likely be oblivious to the corruption and report no errors on resumption.
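
The failure mode described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: a transaction is reduced to three writes (`data`, `metadata`, `commit`), the power can fail after any number of them have reached stable storage, and an outcome counts as silently corrupt when the commit record is present but something it covers is missing. No real filesystem is this simple.

```python
from itertools import permutations

WRITES = ("data", "metadata", "commit")


def corrupt_states(orderings):
    """Count (ordering, power-cut point) pairs where the commit record is on
    stable storage but a data or metadata write it covers is not."""
    bad = 0
    for order in orderings:
        for cut in range(len(order) + 1):  # power fails after `cut` writes
            persisted = set(order[:cut])
            if "commit" in persisted and persisted != set(WRITES):
                bad += 1
    return bad


# No barrier: the scheduler may emit the three writes in any order.
unordered = list(permutations(WRITES))
# Barrier before the commit: data and metadata may swap, commit is always last.
with_barrier = [order for order in unordered if order[-1] == "commit"]

print(corrupt_states(unordered), "silently corrupt outcomes without a barrier")
print(corrupt_states(with_barrier), "silently corrupt outcomes with a barrier")
```

In this model the unordered case has 6 outcomes where the journal claims a transaction is complete when it is not, and the barrier case has none.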
For such an event to lead to corruption, it must occur while a write operation is taking place. Since writing to the non-volatile storage should be much faster than writing to disk, and since in most systems reads are more common than writes, writes will only be occurring for a small proportion of the time. But with faster/more disks and write-intensive applications this differential decreases.
When Boyd Stephen Smith Jr said on the Debian list that a BBWC does not provide the same protection as barriers he got well flamed. He did provide links that show that the barrier overhead is not really that big.
A problem with BBWC is that the batteries wear out. The better devices will switch into learning mode to measure the battery health either automatically or on demand. But when they do so, the cache ceases to operate as non-volatile storage and the device changes its behaviour from write-back to write-through. This has a huge impact on both performance and reliability. Low-end devices won't know what state the battery is in until the power is cut. Hence it is essential to choose a device which is fully supported under Linux.
Increasingly BBWC is being phased out in favour of write caches using flash for non-volatile storage. Unlike battery-backed RAM, there is no learning cycle. But flash wears out with writes. The failure modes for these devices are not well understood.
There's a further consideration for the behaviour of the system when it's not failing: the larger the cache on the disk controller or on disk, the more likely that writes will be re-ordered later anyway – so maintaining a buffer in the host system's memory and re-ordering the data there just means adding more latency before data is in non-volatile storage. NOOP will be no slower and should be faster most of the time.
If we accept that a BBWC with a noop scheduler should ensure the integrity of our data, then is there any benefit from enabling barriers? According to Red Hat, we should disable them because the BBWC makes them redundant and...
“enabling write barriers causes a significant performance penalty.”
Wait a minute. The barrier should force the flush of all data held in volatile memory. But we don't have any volatile memory after the OS/filesystem buffer. So the barrier should not be causing any significant delay. Did Red Hat get it wrong? Or are the BBWC designers getting it wrong and flushing the BBWC at barriers?
Could the delay be due to moving data from the VFS to the I/O queue? We should have configured our system to minimize the size of the write buffer (by default 2.6.32 only starts actively pushing out dirty pages when they get to 10% of the RAM – that means you could have 3GB of data in volatile storage on a 32GB box). However, many people also report performance issues with InnoDB + BBWC + barriers, and the InnoDB engine should be configured to use O_DIRECT, hence we can exclude a significant contribution to performance problems from the VFS cache.
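
The dirty-page arithmetic above can be checked with a short sketch. It assumes the threshold is exposed at `/proc/sys/vm/dirty_background_ratio` and, for simplicity, applies the percentage to total RAM; the kernel actually works from the memory it considers available for dirty pages, so treat the result as an upper-end estimate. The function names are my own.

```python
GIB = 1024 ** 3


def dirty_background_bytes(ram_bytes: int, ratio_percent: int = 10) -> int:
    """Approximate amount of dirty data allowed before background writeback starts."""
    return ram_bytes * ratio_percent // 100


def read_ratio(path: str = "/proc/sys/vm/dirty_background_ratio", default: int = 10) -> int:
    """Read the configured ratio, falling back to `default` if it is unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return default


if __name__ == "__main__":
    ratio = read_ratio()
    estimate = dirty_background_bytes(32 * GIB, ratio)
    print(f"ratio {ratio}% of 32 GiB is roughly {estimate / GIB:.1f} GiB of dirty pages")
```

With the 10% default on a 32 GiB machine this works out to roughly 3.2 GiB that can sit in volatile memory before the kernel starts writing it out in the background.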
I can understand that the people developing BBWC might want to provide a mechanism for flushing the write-back cache – if the battery is no longer functional or has switched to “learning mode” then the device needs to switch to write-through mode. But it's worth noting that in this state of operation, the cache is not operating as a non-volatile store!
Looking around the internet, it's not just Red Hat who think that a BBWC should be used with no barriers and any I/O scheduler:
The XFS FAQ states
“it is recommended to turn off the barrier support and mount the filesystem with "nobarrier",”
Percona say you should disable barriers and use a BBWC but don't mention the I/O scheduler in this context. The presentation does later include an incorrect description of the NOOP scheduler.
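
If you have inherited a system configured along these lines, a quick way to find out is to look at the mount options. The sketch below assumes the `/proc/mounts` format (device, mount point, filesystem type, options) and only looks for the `nobarrier` and `barrier=0` spellings on ext3, ext4 and XFS; the option names vary by filesystem and kernel version, and some newer kernels no longer accept them at all. `barriers_disabled` is a name I made up for illustration.

```python
def barriers_disabled(mounts_text: str) -> dict:
    """Map mount point -> True if a barrier-disabling option is present."""
    status = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a well-formed mount entry
        _device, mountpoint, fstype, options = fields[:4]
        if fstype not in ("ext3", "ext4", "xfs"):
            continue
        opts = set(options.split(","))
        status[mountpoint] = bool(opts & {"nobarrier", "barrier=0"})
    return status


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as handle:
            text = handle.read()
    except OSError:
        text = ""
    for mountpoint, disabled in barriers_disabled(text).items():
        print(f"{mountpoint}: {'barriers DISABLED' if disabled else 'no disabling option seen'}")
```

A mount point reported without a disabling option is not proof that barriers are working end to end, only that nobody has switched them off at mount time.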
Does a BBWC add any value when your system is running off a UPS? Certainly I would consider a smart UPS to be the first line of defence against power problems. In addition to providing protection against over-voltages it should be configured to implement a managed shutdown of your system, meaning that transactions at all levels of abstraction will be handled cleanly and under the control of the software which creates them.
Yes, a BBWC does improve performance and reliability (in combination with the noop scheduler, a carefully managed policy for testing and monitoring battery health, and RAID implemented in software). It is certainly cheaper than moving a DBMS or fileserver to a two-node cluster, but the latter provides a lot more reliability (some rough calculations suggest about 40 times more reliable). If time and money are no object then for the best performance, equip both nodes in the cluster with BBWC. But make sure they are all using the noop scheduler.
Further, I would recommend testing the hardware you've got – if you see negligible performance impact with barriers and a realistic workload (fio, https://github.com/axboe/fio, is one tool for this) then enable the barriers.
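
For a very rough first look before setting up a proper benchmark, something like the following can be run on the same filesystem mounted with and without barriers. It is a crude sketch, not a realistic workload: it assumes that timing small writes each followed by `fsync` is a reasonable proxy for commit cost, and the function name and default sizes are arbitrary choices of mine.

```python
import os
import tempfile
import time


def timed_fsync_writes(directory: str, count: int = 200, block_size: int = 4096) -> float:
    """Write `count` blocks, calling fsync after each one; return elapsed seconds."""
    payload = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)  # ask the kernel to push this write to stable storage
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    target = tempfile.gettempdir()  # point this at the filesystem under test
    elapsed = timed_fsync_writes(target)
    print(f"200 fsync'd 4 KiB writes in {target}: {elapsed:.3f}s")
```

If the two timings are close, that is a hint (no more than that) that barriers are costing little on your hardware and a fuller test is worth doing.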