Wednesday 29 November 2017

Unexpected PHP/MySQL failure!

Somehow my life/code is never as simple as the examples on w3schools.

Once again I should to explain a lot of the context (and apologize for some of it) before I get to the point. But the short version is both the mysql and mysqli extensions seem to cast all mysql data as strings.

Normally this goes unnoticed when you're working with PHP - its dynamic type conversion handles all the bumps. But not this time.

I wrote an application for tracking data quality some time ago. This runs various rules against several large datasets and collects the exceptions in a single polymorphic table which looks something like:

rule VARCHAR(20)
yearmonth INTEGER
PK1 INTEGER
PK2 VARCHAR(30)
PK3 VARCHAR(30)
data CLOB
age INTEGER

Each row of output from each rule is stored in a record in this table. Periodically the historic data is deleted.

Yes, I confess I'm holding a structured opaque data blob in a relational database! But by synthesizing PK1, PK2 and PK3 from the rule data it means I can compare month on month outputs to get the age of the exception. And I have a reliable unique identifier for each record in the (very large) table.

All well and good. This is currently tracking 320 rules and the application maintains summary level data. It has been running for about 8 years. But users as *so* demanding! They wanted to see one of the rules (customer whose email addresses pass a regex check but fail MX lookup) decorated with information come a separate database. Not such a big deal. The application has the capability to mailmerge the output of one ruleset to create a query for another (clever, eh?). So all that was needed was a bit of configuration.

But now the story takes a twist. The number of records involved is.....lets just say a lot. This all runs on a very old version of MySQL (because that's how RHEL rolls its distro, and everyone hates systemd) which was originally configured to give great performance on low latency storage (i.e. MyISAM) but now runs on a SAN with very high latency (but massive bandwidth).

Because of these considerations, along with the fact that I rarely rebuild the indexes, joining the output of the two rules was HORRENDOUSLY slow. Its supposed to be available at the click of a button, but I got bored waiting after one hour. This version of MySQL does not do merge joins. So I wrote my own in PHP. This took around 5 seconds to retrieve the data. But huge chunks of the data were missing! I eventually tracked this down to the fact that the mysql PHP extension was converting the PK1 integer attribute to a string. So "110" was greater than "1000" and my code was discarding lots of rows when trying to join the two datasets.

I can understand that there would be a cost if the library had to call back to the DBMS to get the type of each attribute in a resultset, but looking at the C API, it looks like the client lib is returning a correctly typed attribute, but the PHP engine is casting it as a string!

But, of course, the MySQL extension has long been deprecated. So I decided to see of MySQLi was similarly afflicted. After a brief diversion when I discovered about SO_PEERCRED I found that MySQLi does the same thing.