Monday, 24 March 2014

Warning: BBWC may be bad for your health


Over the past few days I've been looking again at I/O performance and reliability. One technology keeps cropping up: the Battery Backed Write Cache (BBWC). Since we are approaching World Backup Day I thought I'd publish what I've learnt so far. The most alarming thing in my investigation is the number of people who automatically assume that a BBWC assures data integrity. Unfortunately the internet is full of opinions and received wisdom, and short on facts and measurement. However, if we accept that re-ordering the operations which make up a filesystem transaction on a journalling filesystem undermines the integrity of the data, then the advice from sources who should know better is dangerous.


Take, for example, this advice from Red Hat's storage documentation:
“Write barriers are also unnecessary whenever the system uses hardware RAID controllers with battery-backed write cache. If the system is equipped with such controllers and if its component drives have write caches disabled, the controller will advertise itself as a write-through cache; this will inform the kernel that the write cache data will survive a power loss.”

There's quite a lot here to be confused about. The point about RAID controllers is something of a red herring. There's a lot of discussion elsewhere about software vs hardware RAID – the short version is that modern computers have plenty of CPU to handle the RAID without a measurable performance impact; indeed, many of the cheaper devices offload the processing work to the main CPU anyway. Hardware RAID, on the other hand, poses two problems:

  1. The RAID controller must write its description of the disk layout to the storage – it does this in a proprietary manner, meaning that if (when?) the controller fails you will need to source compatible hardware (most likely the same hardware) to get at the data locked away in your disks
  2. While all hardware RAID controllers will present the configured RAID sets to the computer as simple disks, your OS needs visibility of what's happening beyond the controller in order to tell you about failing drives. Only a small proportion of the cards currently available are fully supported in Linux.

It should, however, be possible to exploit the advantages (if any) offered by the non-volatile write cache without using the hardware RAID functionality. It is an important point that the on-disk caches must be disabled for any assurance of data integrity. But there is an omission in the statement above which will eat your data.

If you use any I/O scheduler other than 'noop' then the writes sent to the BBWC will be re-ordered – that is the whole point of an I/O scheduler. Barriers (and, more recently, FUA) provide a mechanism for grouping write operations into logical transactions within which re-ordering has no impact on integrity. Without such boundaries, there is no guarantee that the 'commit' operation will only occur after the data and metadata changes have reached the non-volatile storage.
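As a rough sketch of the knobs involved (the device name is illustrative – adjust for your own setup, and note that some RAID controllers need their own tools to manage the disk caches):

# select the no-op elevator for the block device behind the BBWC
echo noop > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler    # the active scheduler is shown in [brackets]

# and make sure the drive's own (volatile) write cache is switched off
hdparm -W 0 /dev/sda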

Even worse than losing data is that your computer won't be able to tell you that it's all gone horribly wrong. Journalling filesystems are a mechanism for identifying and resolving corruption arising from incomplete writes. If the transaction mechanism is compromised by out-of-sequence writes, then the filesystem will most likely be oblivious to the corruption and report no errors on resumption.

For such an event to lead to corruption, it must occur while a write operation is taking place. Since writing to the non-volatile cache should be much faster than writing to disk, and since in most systems reads are more common than writes, writes will only be in flight for a small proportion of the time. But with faster/more disks and write-intensive applications this differential decreases.

When Boyd Stephen Smith Jr said on the Debian list that a BBWC does not provide the same protection as barriers he got well flamed. He did, however, provide links showing that the overhead of barriers is not really that big.

A problem with BBWC is that the batteries wear out. The better devices will switch into a learning mode to measure the battery health, either automatically or on demand. But while they do so, the cache ceases to operate as non-volatile storage and the device changes its behaviour from write-back to write-through. This has a huge impact on both performance and reliability. Low-end devices won't know what state the battery is in until the power is cut. Hence it is essential to choose a device which is fully supported under Linux, so that the battery state can be monitored.

Increasingly, BBWC is being phased out in favour of write caches which use flash for the non-volatile storage. Unlike battery-backed RAM there is no learning cycle, but flash wears out with writes – and the failure modes of these devices are not well understood.

There's a further consideration for the behaviour of the system when it's not failing: the larger the cache on the disk controller or on the disk, the more likely that writes will be re-ordered there anyway – so maintaining a buffer in the host system's memory and re-ordering the data there just adds latency before the data reaches non-volatile storage. NOOP will be no slower, and should be faster most of the time.

If we accept that a BBWC with a noop scheduler should ensure the integrity of our data, then is there any benefit from enabling barriers? According to Red Hat, we should disable them because the BBWC makes them redundant and...
“enabling write barriers causes a significant performance penalty.”
Wait a minute. A barrier forces a flush of everything held in volatile caches down to non-volatile storage. But with the on-disk caches disabled, there is no volatile cache beyond the OS/filesystem buffer – the BBWC itself is the non-volatile storage – so the barrier should not be causing any significant delay. Did Red Hat get it wrong? Or are the BBWC designers getting it wrong, and flushing the BBWC at barriers?

Could the delay be due to moving data from the VFS cache to the I/O queue? We should have configured our system to minimize the size of the write buffer (by default, 2.6.32 only starts actively pushing out dirty pages when they reach 10% of RAM – that means you could have over 3Gb of data in volatile storage on a 32Gb box). However, many people also report performance issues with InnoDB + BBWC + barriers, and the InnoDB engine should be configured to use O_DIRECT, hence we can exclude the VFS cache as a significant contributor to the performance problems.
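For reference, the knobs in question are the kernel's dirty-page thresholds – a minimal sketch, with values that are purely illustrative:

# start background writeback well before the 10% default is reached
sysctl -w vm.dirty_background_ratio=1
# throttle writers before too much dirty data builds up in RAM
sysctl -w vm.dirty_ratio=5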

I can understand that the people developing BBWC might want to provide a mechanism for flushing the write-back cache – if the battery is no longer functional, or has switched to “learning mode”, then the device needs to switch to write-through mode. But it's worth noting that in this state of operation the cache is not operating as a non-volatile store!

Looking around the internet, it's not just Red Hat who think that a BBWC should be used with no barriers and any I/O scheduler:

The XFS FAQ states:
“it is recommended to turn off the barrier support and mount the filesystem with "nobarrier"”
Percona say you should disable barriers and use a BBWC, but don't mention the I/O scheduler in this context. The presentation does, later on, include an incorrect description of the NOOP scheduler.

Does a BBWC add any value when your system is running off a UPS? Certainly I would consider a smart UPS to be the first line of defence against power problems. In addition to providing protection against over-voltages it should be configured to implement a managed shutdown of your system, meaning that transactions at all levels of abstraction will be handled cleanly and under the control of the software which creates them.

Yes, a BBWC does improve performance and reliability (in combination with the noop scheduler, a carefully managed policy for testing and monitoring battery health, and RAID implemented in software). It is certainly cheaper than moving a DBMS or fileserver to a two-node cluster, but the latter provides a lot more reliability (some rough calculations suggest about 40 times more). If time and money are no object then, for the best performance, equip both nodes in the cluster with a BBWC – but make sure they are both using the noop scheduler.

Further, I would recommend testing the hardware you've got – if you see a negligible performance impact with barriers under a realistic workload (fio, https://github.com/axboe/fio, is a good tool for generating one) then enable the barriers.
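Something along these lines gives a reasonably write-intensive, fsync-heavy job for comparing a barrier mount against a nobarrier mount (a sketch only – the path, size and runtime are illustrative):

# run once with the filesystem mounted with barriers (the default)...
fio --name=barriertest --directory=/mnt/test --rw=randwrite --bs=4k --size=1g \
    --ioengine=libaio --direct=1 --fsync=1 --runtime=60 --time_based
# ...then remount with -o nobarrier, repeat, and compare IOPS and latencies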


Monday, 6 January 2014

Transactional websites and navigation


There are lots of things that make me happy. In my professional life, it's getting stuff done, helping other people or learning something new. Recently I learnt something which is probably widely known but which I'd managed to miss all these years. I'm so inspired that I'm going to share it with you all.

A lot of transactional websites crumble when you dare to do something as reckless as use the back button or open more than one window on the site. The underlying reason is that the developer is storing data relating to the transaction – i.e. specific to a navigation event in a single window – in the session, which is common to all the windows. A very poor way to mitigate the problem is to break the browser functionality by disabling the back button or interaction via a second window. I must admit to having used this in the past to mitigate the effects of problems elsewhere in the code (alright, if you must know – I bastardized Brooke Bryan's back-button detector, and as for the new window... well, the history length is zero).

But how should I be solving the problem?

The obvious solution is to embed all the half-baked ingredients of a transaction in each html page sent to the browser and send the updated data model back to the server on navigation. This can work surprisingly well as long as the data on the browser is cleared down between sessions. But with increasingly complex datasets this becomes rather inefficient, particularly on slow connections. Further, there are times when we want the transaction state to reflect the session state: consider a shopping basket – if a user fills a shopping basket then ends their session, we might want to retain the data about what they put in it, but we might also want to release any stock reserved by the act of adding it to the basket. Often the situation arises where we end up with (what should be) the same data held in more than one place (browser and server). At some point the representations of the truth will diverge – and at that point it all goes rather pear-shaped.

A while back I created a wee bit of code for point-and-click form building – PfP Studio. Key to its utility was the ability to treat a collection of widgets (a form) as a widget itself, and the easiest way to achieve that was to support multiple windows. When I first wrote this, I decided that the best way to handle the problem was to add partitioned areas to the session – one for each window. This depended on the goodwill of the user to open a new window via the functionality in the app rather than the browser chrome: each window had to carry an identifier (the “workspace”) across navigation events. Effectively I was rolling my own session handling with transids.


This has a number of issues – the PHP site warns about leaking authentication tokens, but there's also a lot of overhead when you start having to deal with javascript-triggered navigation and PRGs (POST/redirect/GET).

Then the other day I discovered that the browser window.name property was writeable in all major browsers! Hallelujah! Yes, it still means that you need to do some work to populate links and forms, but it's a lot simpler than my previous efforts – particularly with a javascript heavy site.

Any new window (unless it has been explicitly given a name in an href link or via window.open) has an empty string as the name – hence you simply set a random value if it's empty – and the value persists even if you press the back button.

While I think I've explained what I'm trying to say, a real example never goes amiss:

// window.name survives navigation within the window/tab but is not shared
// with other windows - which makes it a handy per-window identifier
if (''==window.name) {
   // first visit in this window: give it a (reasonably) unique random name
   var t=new Date();
   window.name=t.getMilliseconds()+Math.random();
}


Tuesday, 1 October 2013

Daily Mail Fail


What looked like an interesting link appeared in my inbox the other day, so I followed it to read the article. The link in question was to a page on the www . thisismoney . co . uk site - owned and operated by the Daily Mail and proud to describe itself as "Financial Website of the year".

I did not expect the Daily Mail to let the facts get in the way of a good story – and this did little to improve my impression of them, however I was surprised at how poor the performance was....and then discovered how poor they really were at IT services.

I noticed that the content continued to load for some time after landing on the page.

Broadbandspeedchecker.co.uk clocks my download speed at 44.95 Mb/s, not bad, although the latency from Maidenhead seems high at 168ms RTT. But the page from the Daily Mail took 47.42 seconds to get to the onload event then continued downloading stuff for a further 42 seconds: 1 minute and 19 seconds to download a single page?

There was only 1.4Mb of data in total, but split across no less than 318 requests across 68 domains, including 12 404s from *.dailymail.co.uk, erk!

But digging further I found that the site did not just perform badly – it's probably illegal.

In addition to (what appears to be) the usual 4 Google Analytics cookies, my browser also acquired session cookies from .thisismoney.co.uk, .rubiconproject.com, b3-uk.mookie1.com (x2), .crwdcntrl.net (x2) and.......129 cookies with future expiry dates.

FFS!

(a full list appears below)

For the benefit of any readers outside the European Union, member countries must all implement a set of LAWS (not rules, or guidelines) regarding the use of any data stored on a computer, including cookies. In the UK, these are described by the Privacy and Electronic Communications (EC Directive) (Amendment) Regulations 2011, which websites were required to implement in 2012.

Did the Daily Mail inform me that it was going to store these cookies?

No

Did the Daily Mail ask for my consent to store these cookies?

No

Did the Daily Mail provide any information about cookies on the page?

No

Did the Daily Mail provide a link to their privacy policy on the page?

Yes, in teeny-weeny text – the very last visible element on the page.

Did the Daily Mail offer me a chance to opt-out of accepting the cookies?

No

Is this a world record?

Maybe?



In the absence of any means to tell the Daily Mail I don't want their cookies via their website, I thought I would use the method built into my browser (although the cookie law does require that I should not have to jump through these hoops for compliance). So I enabled the do-not-track feature in Firefox, deleted the cookies and cache, hit the reload button, and waited a further 44 seconds (my ISP has transparent caching).....


Can you guess what happened next?


All the cookies came back again.

The challenge

Do you know of a worse site than this for dumping cookies? Add a comment and a link to your analysis and I'll publish it.

Monday, 16 September 2013

Zend Optimizer Plus - still not following the party line

Having previously failed to find a significant difference in benchmarks between PHP 5.3 and 5.5, I did succeed in establishing that the optimizer was producing code which ran slightly faster (about 2% with DokuWiki).

This time I tried ramping up the concurrency to see if PHP could deliver on its performance promises.

Running

ab -n 12000 -c $X http://localhost/src/doku.php

For $X in [10,50,100,200,300,400,500,600,700,800]  I got....
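That is, roughly this loop (the output files are just for illustration):

for X in 10 50 100 200 300 400 500 600 700 800; do
    ab -n 12000 -c $X http://localhost/src/doku.php > ab-c$X.log
done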

For fewer than 500 concurrent connections, PHP 5.3.3 is slower - but only by about 5%. At more than 500 concurrent requests, 5.5.1 is slower!

I'm just no good at this benchmarking business.

Both were running from nginx+php-fpm with the same (common) config. 5.5.1 had full ZOP+ optimization enabled.

Wednesday, 4 September 2013

Zend Optimizer Plus - trying to do it right

Reading further (I didn't bookmark the link and can't find it now), alongside benchmarks showing ZOP+ to be around 20% faster for "real world" applications there's also mention of a big reduction in memory usage.

I'm quite prepared to believe that better use of memory is possible - the runtime footprint of PHP code seems to be around 8 times the footprint of the script on disk - so there's plenty of scope for improvement. I was running my tests with 100 concurrent connections - not nearly saturating my machine. I expect that running with a much higher load / less memory would translate into the performance improvements reported elsewhere - more testing required?

Meanwhile I had another look at the optimizer. I repeated the setup from last time, running with PHP 5.5.1 and ZOP+ with filemtime checking off, fetching a single Dokuwiki page. The control test with full optimization - as per the previous run - is giving slightly different results from last time. Since I'm running this on my home machine, and it's also running lots of other things like X, KDE, my browser and mail client, it's possible that the system isn't in exactly the state it was in when I ran the previous tests.

Full Optimization: 

opcache.optimization_level=0xffffffff

6.334 ms/req

Optimization disabled : 

opcache.optimization_level=0

6.452 ms/req

Repeating the test several times gave consistent results: about a 2% improvement in speed.

Not revolutionary, but every little helps - and in fairness Dokuwiki already has good performance optimization in the PHP code.

When time allows I'll go back and look at memory / CPU usage while running ZOP+ vs APC.

Tuesday, 27 August 2013

Doing it wrong again - this time with Zend Optimizer plus

The guys at PHP have now committed to shipping Zend Optimizer Plus with future releases of PHP, so I thought I'd have a play around with it.

tl;dr

While I normally run my PHP from Apache + mod_php, for the purposes of this exercise it was easier for me to set up nginx / php-fpm using PHP 5.3.3/APC 3.1.2 and 5.5.1/Zend Opcache 7.0.2dev. Both were compiled from source using default settings on a 32-bit PCLinuxOS 2012 distribution (kernel 2.6.38.8) on a dual AMD 4200 machine with 2Gb of memory.

Tests were run using ab on localhost; each test was run by first seeding the opcode cache (ab -n 2 -c 2) then taking a measurement (ab -n 1000 -c 100).
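Concretely, for each configuration the sequence was along these lines (URLs illustrative – the Dokuwiki URL from the earlier post is shown):

ab -n 2 -c 2 http://localhost/src/doku.php        # seed the opcode cache
ab -n 1000 -c 100 http://localhost/src/doku.php   # then take the measurement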

For both APC and ZOP+, tests were run with and without checking for modified timestamps on PHP source files.

Final scores

My results showed Zend Optimizer Plus to be no faster than APC. Although the numbers are within the margin of error, if anything ZOP+ was slower.

Config                      Application  Timestamps  Time per request (ms)
PHP 5.3.3 + APC             Dokuwiki     yes         6.274
PHP 5.3.3 + APC             Dokuwiki     no          6.271
PHP 5.5.1 + ZOP+            Dokuwiki     yes         6.643
PHP 5.5.1 + ZOP+            Dokuwiki     no          6.293
PHP 5.3.3 + APC + mysql     Wordpress    yes         36.012
PHP 5.3.3 + APC + mysql     Wordpress    no          35.868
PHP 5.5.1 + ZOP+ + mysqlnd  Wordpress    yes         35.905
PHP 5.5.1 + ZOP+ + mysqlnd  Wordpress    no          35.978
Static HTML from Dokuwiki   Dokuwiki     n/a         0.171
Static HTML from Wordpress  Wordpress    n/a         0.182

This seems to fly in the face of what is currently being reported elsewhere.

More stuff about my methodology

I wanted to create a reasonably realistic test, hence using two off-the-shelf Content Management Systems. Wordpress uses a MySQL backend for its data, and there is a further difference between the APC and ZOP+ configurations: the former uses libmysqlclient while the latter was built with mysqlnd (which meant I had to rewrite the database class to use the mysqli_ functions in place of mysql_). The effects on performance are complex and tied to the level of concurrency, but at 100 concurrent HTTP requests I was expecting this to be minimal. Dokuwiki, on the other hand, uses file-based storage.

Other reviews

The PHP wiki page about the change links to a spreadsheet showing what look like impressive stats. The stats are reported as requests per second for various configurations.


Some more reviews:

https://managewp.com/boost-wordpress-performance-zend-optimizer - doesn't show response times or requests/second, but does say that load and memory usage were lower, implying greater capacity using ZOP+ compared with APC

http://halfelf.org/2013/trading-apc-for-zend/ reports a similar reduction in CPU and memory usage, but again no response times.

http://www.ricardclau.com/2013/03/apc-vs-zend-optimizer-benchmarks-with-symfony2/ again gives results in req/s, showing an improvement of around 10-15%

http://massivescale.blogspot.co.uk/2013/06/php-55-zend-optimiser-opcache-vs-xcache.html compared ZOP+ and Xcache, finding an approx 15-20% improvement in req/s and a similar reduction in response times with Joomla.

The optimizer bit(s)

PHP opcode caches have been around for a long time. ZOP+ brings something new: a code optimizer. Since it is still generating opcodes, it doesn't apply the CPU-specific tweaks that a native code compiler does. Despite the 32-bit integer used to set the optimizer flags, only 6 flags are recognized by the optimizer (and the last pass only cleans up the debris left by the first 5). The optimizations are mostly substitutions - replacing PHP's built-in constants with literals, post-increment with pre-increment, compile-time type-juggling of built-in constants and such like. There is no inlining of functions in loops. No branch order prediction. Having dug through the code, I was not expecting the optimizer to deliver revolutionary speed improvements.

And yet, Dimitry's spreadsheet shows 'ZF Test (ZF 1.5)' going from 158 req/s to 217 req/s!

I presume this refers to the Zend Framework. While this is far from speedy, I find it astonishing that the performance of the code should improve so much with a few relatively simple tweaks to the opcodes - it rather suggests that there is huge scope for optimizing the code by hand. Although I also note that the performance of 'Scrom (ZF App)' only improves by around 8%.

What am I doing wrong?

The consistent difference (apart from the opcode cache) in my experiment was using different versions of PHP - to a certain extent I'm not really comparing like-for-like. I can only assume that if I ran APC against PHP 5.5.1 and/or ZOP+ with PHP 5.3.3 I would see a very different story. However if you are seeking optimal performance at low load levels (rather than optimal capacity) then there seems to be little incentive to apply this upgrade.

There are anecdotal reports of stability issues with APC on PHP 5.4+; there may be sound technical and economic reasons why APC is not being actively maintained and for ZOP+ to be a better strategic choice.

A clear choice?

I can live without APC's support for user-data caching. But the elephant in the room is the fact that ZOP+ does not reclaim memory: if your code base is larger than the cache size, or the cache fills up with old versions of code, it forces a full flush and re-initialization. This should not be a problem for sites with dedicated devops personnel managing releases to a small number of applications using a continuous deployment strategy. However, for the rest of us there needs to be a significant performance advantage with ZOP+ to make this a price worth paying.

Saturday, 17 August 2013

Starting a new website - part 3

So the decision was made: I would stick with Dokuwiki and use PJAX for loading the pages.

A bit of coding and hey presto...www.scottishratclub.co.uk

(live site is still running an incomplete version of the code - note to self - get the current version deployed)

In order to structure the Javascript changes nicely, keep everything tidy and fit in with Dokuwiki too, the functionality is split across a syntax plugin (for implementing widgets, including initializing PJAX, fixing the problems introduced by deferring loading of the javascript and accommodating a strict Content Security Policy). This then places some constraints on how further widgets are implemented, so it's really a framework (yeuch!). Anyway, the plugin is called Jokuwiki.

In order to use PJAX, the source page needs to be slightly modified (but it's JUST 5 LINES OF CODE!):


// the X-PJAX header is only present on requests made by the PJAX javascript
if (!isset($_SERVER['HTTP_X_PJAX']) || 'true'!=$_SERVER['HTTP_X_PJAX']){
....top part of page (only sent for a full page load)
} ?><div id="pjax-container">
....stuff loaded via pjax (always sent)
</div><?php
if (!isset($_SERVER['HTTP_X_PJAX']) || 'true'!=$_SERVER['HTTP_X_PJAX']){
....bottom part of page (only sent for a full page load)
}

But just to make it really easy, I published a template too. Not the one I used on my website - but if anyone wants it... let me know.

The impact of PJAX on performance is rather large:


Of course it had to be deployed to the site. So I dusted down pushsite and fired it up with a recipe for deploying the site. About 50 files uploaded, then I stopped getting responses from the remote system. I ran it again.....a further 20. The socket was still connected but nothing was happening. Switching to passive mode didn't help. Adding throttling didn't help. I spent several hours battling with it and gave up. Same story the following day - so I logged a call with the service provider. The following day, they suggested using a different FTP server.....same problem. They said they'd get back to me.

Since I had no ssh access, I couldn't unpack a tarball from the shell - doing it via a PHP script invoked from the web would have meant I'd have to spend just as much time fixing the permissions as uploading the stuff by hand. But a bit of rummaging around in cpanel and I found that there was a back/restore option running over HTTP - so I download a backup, unpacked it, overwrote the backed-up website with my new site, packed it up and restored it onto the server. Job done.