When training inexperienced programmers, there are 2 mantras I always need to repeat:
You don't write code for computers to understand; you write it for humans to understand.
validate input : escape output
Since this the latter has cropped up a few times recently, I have documented the reasoning here.
Both validation and escaping are about protecting your system against attack and ensuring good data quality.
Validating input means checking that the input falls within the acceptable boundaries for your application. Does it look like the right data type? Is it within a sensible range? Does it contain malware?
The first 2 examples are there to avoid processing data which is obviously wrong. People make mistakes when entering data – and prompting them to fix the error is massively cheaper than trying to find and fix the bad data later. Checking for malware might be as simple as rejecting content which looks like HTML markup, or you might test the content against a complete AP14 gateway with multiple heuristic malware checkers. The point is that your code must decide whether to accept the data and process it or reject it .
Why not transform input?
If your underlying platform is vulnerable to buffer overflows or similar attacks, then by the time the thread of execution reaches your code your system is already compromised.
Later in your processing you may well transform the data, but not at the point where it enters your code. You need to ensure that you know how the data is represented anywhere you are processing it; the simplest way to do that here is to leave it in as raw a state as possible.
In some cases you may choose to reject the data silently, or re-route it to a honeypot – but that is a very esoteric edge case. And both are still forms of validation.
Again, there are some edge cases where it is necessary to modify the input in order to validate it, for example:
your code expects data encrypted with specific keys
the input data has tamper resistance added which should be removed before processing (such as anti-csrf tokens)
Escape outputEscaping output means transforming it to a form where:
the original data is recoverable in a suitable
state/representation for further processing
the content of the data does no interfere with the control
channel for the data
Any time is leaving your code and going somewhere is it should be rendered in an appropriate format for the receiving process. Sometimes a single script may have multiple output vectors – a local record in a database, a notification email to the user, html to the browser. Each need different representations of the data.
Hence to ensure the right representation of the data for the target, the transformation should be near to the point where it is output – both in the sequence of execution and the in the structure of the code.
While a few of the transformation functions have both an encode and decode implementation in PHP, it doesn't make any sense to try to reverse the encoding of the data for one output channel in order to write it to another within the scope of a single script. All the programming languages I have used are similarly asymmetric.
The only exception to the "Escape output" rule is If the data channel is independent from the control channel - for example when writing to a local file (although even then, you need to be careful not to write en EOF character as data to a file opened for text).