Somehow my life/code is never as simple as the examples on w3schools.
Once again I should to explain a lot of the context (and apologize for some of it) before I get to the point. But the short version is both the mysql and mysqli extensions seem to cast all mysql data as strings.
Normally this goes unnoticed when you're working with PHP - its dynamic type conversion handles all the bumps. But not this time.
I wrote an application for tracking data quality some time ago. This runs various rules against several large datasets and collects the exceptions in a single polymorphic table which looks something like:
rule VARCHAR(20)
yearmonth INTEGER
PK1 INTEGER
PK2 VARCHAR(30)
PK3 VARCHAR(30)
data CLOB
age INTEGER
Each row of output from each rule is stored in a record in this table. Periodically the historic data is deleted.
Yes, I confess I'm holding a structured opaque data blob in a relational database! But by synthesizing PK1, PK2 and PK3 from the rule data it means I can compare month on month outputs to get the age of the exception. And I have a reliable unique identifier for each record in the (very large) table.
All well and good. This is currently tracking 320 rules and the application maintains summary level data. It has been running for about 8 years. But users as *so* demanding! They wanted to see one of the rules (customer whose email addresses pass a regex check but fail MX lookup) decorated with information come a separate database. Not such a big deal. The application has the capability to mailmerge the output of one ruleset to create a query for another (clever, eh?). So all that was needed was a bit of configuration.
But now the story takes a twist. The number of records involved is.....lets just say a lot. This all runs on a very old version of MySQL (because that's how RHEL rolls its distro, and everyone hates systemd) which was originally configured to give great performance on low latency storage (i.e. MyISAM) but now runs on a SAN with very high latency (but massive bandwidth).
Because of these considerations, along with the fact that I rarely rebuild the indexes, joining the output of the two rules was HORRENDOUSLY slow. Its supposed to be available at the click of a button, but I got bored waiting after one hour. This version of MySQL does not do merge joins. So I wrote my own in PHP. This took around 5 seconds to retrieve the data. But huge chunks of the data were missing! I eventually tracked this down to the fact that the mysql PHP extension was converting the PK1 integer attribute to a string. So "110" was greater than "1000" and my code was discarding lots of rows when trying to join the two datasets.
I can understand that there would be a cost if the library had to call back to the DBMS to get the type of each attribute in a resultset, but looking at the C API, it looks like the client lib is returning a correctly typed attribute, but the PHP engine is casting it as a string!
But, of course, the MySQL extension has long been deprecated. So I decided to see of MySQLi was similarly afflicted. After a brief diversion when I discovered about SO_PEERCRED I found that MySQLi does the same thing.
Wednesday, 29 November 2017
Saturday, 22 July 2017
Validate Input : Escape Output
When training
inexperienced programmers, there are 2 mantras I always need to
repeat:
-
You don't write code for computers to understand; you write it for humans to understand.
-
validate input : escape output
Since this the
latter has cropped up a few times recently, I have documented the
reasoning here.
Both validation and
escaping are about protecting your system against attack and ensuring
good data quality.
Validate input
Validating input
means checking that the input falls within the acceptable boundaries
for your application. Does it look like the right data type? Is it
within a sensible range? Does it contain malware?
The first 2 examples
are there to avoid processing data which is obviously wrong. People
make mistakes when entering data – and prompting them to fix the
error is massively cheaper than trying to find and fix the bad data
later. Checking for malware might be as simple as rejecting content
which looks like HTML markup, or you might test the content against a
complete AP14 gateway with multiple heuristic malware checkers. The
point is that your code must decide whether to accept the data and
process it or reject it .
Why not transform input?
If your underlying
platform is vulnerable to buffer overflows or similar attacks, then
by the time the thread of execution reaches your code your system is
already compromised.
Later in your
processing you may well transform the data, but not at the point
where it enters your code. You need to ensure that you know how the
data is represented anywhere you are processing it; the simplest way
to do that here is to leave it in as raw a state as possible.
Exceptions
Validating input
In some cases you
may choose to reject the data silently, or re-route it to a honeypot
– but that is a very esoteric edge case. And both are still forms
of validation.
Transforming input
Again, there are
some edge cases where it is necessary to modify the input in order to
validate it, for example:
-
your code expects data encrypted with specific keys
-
the input data has tamper resistance added which should be removed before processing (such as anti-csrf tokens)
Escape output
Escaping output means transforming it to a form where:-
the original data is recoverable in a suitable
state/representation for further processing
-
the content of the data does no interfere with the control
channel for the data
Any time is leaving your code and going somewhere is it should be rendered in an appropriate format for the receiving process. Sometimes a single script may have multiple output vectors – a local record in a database, a notification email to the user, html to the browser. Each need different representations of the data.
Hence to ensure the right representation of the data for the target, the transformation should be near to the point where it is output – both in the sequence of execution and the in the structure of the code.
While a few of the transformation functions have both an encode and decode implementation in PHP, it doesn't make any sense to try to reverse the encoding of the data for one output channel in order to write it to another within the scope of a single script. All the programming languages I have used are similarly asymmetric.
Exceptions
The only exception to the "Escape output" rule is If the data channel is independent from the control channel - for example when writing to a local file (although even then, you need to be careful not to write en EOF character as data to a file opened for text).
Monday, 6 March 2017
Today I saved the planet
Well, prolonged its demise a little.
My teenage son, despite repeated requests, frequently walks away from the computer leaving it running. And now that I've got a shiny new and very quiet computer it is not always obvious that it has been left running.
While formerly I would just changed the program the window manager invoked as a screensaver Linux systems have got a lot more convoluted. Whether this complexity is a good or necessary thing - I will reserve judgement.
Anyway, to cut a long story short, the solution for me was to use xautolock which, once started, monitors for a period of inactivity then runs a program. Setting it to start automatically on my Mint / mate box (no XClients / Xsession) was a matter of adding a .desktop entry to the XDG autostart directory:
[Desktop Entry]
Type=Application
Exec=xautolock -detectsleep -time 50 -locker "/sbin/shutdown -h now"
Hidden=false
X-MATE-Autostart-enabled=true
Name[en_GB]=xautolock
Name=xautolock
Comment[en_GB]=
Comment=
My teenage son, despite repeated requests, frequently walks away from the computer leaving it running. And now that I've got a shiny new and very quiet computer it is not always obvious that it has been left running.
While formerly I would just changed the program the window manager invoked as a screensaver Linux systems have got a lot more convoluted. Whether this complexity is a good or necessary thing - I will reserve judgement.
Anyway, to cut a long story short, the solution for me was to use xautolock which, once started, monitors for a period of inactivity then runs a program. Setting it to start automatically on my Mint / mate box (no XClients / Xsession) was a matter of adding a .desktop entry to the XDG autostart directory:
[Desktop Entry]
Type=Application
Exec=xautolock -detectsleep -time 50 -locker "/sbin/shutdown -h now"
Hidden=false
X-MATE-Autostart-enabled=true
Name[en_GB]=xautolock
Name=xautolock
Comment[en_GB]=
Comment=
Subscribe to:
Posts (Atom)