Saturday 6 February 2010

DokuWiki blogging

I've used DokuWiki as a base for a few PHP projects, since it provides access control, navigation and other such niceties, combined with a very capable wiki CMS for publishing static content.

Adding your code to it can be as simple as editing a page and entering:

<php>
include("myPhpCode.inc.php");
</php>

There are some complications around when the headers are sent versus when the contents of the page are generated, and around when the session is written back - but nothing insurmountable.
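
The usual tools here are PHP's output buffering and explicit session control. For instance (a generic sketch of the technique, not anything DokuWiki-specific):

// buffer all output so that header() or setcookie() calls made by the
// included code still work, however late in page generation they happen
ob_start();
session_start();

include("myPhpCode.inc.php");

session_write_close(); // write the session back and release its lock
ob_end_flush();        // headers go out now, followed by the buffered page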

So when my boss tasked me with coming up with a simple CMS solution which could provide RSS and Atom feeds, needed authentication for adding content, and needed to have "lists of things", using a blog within DokuWiki seemed like an obvious solution.

DokuWiki has a plugin, 'blogtng', intended to replace the older 'blog' plugin. Both use files for holding the content, which was not ideal: I expect that we'll be adding a lot of stuff in there, will require the ability to edit/remove old stuff, and may also need to actively prune old data for performance/storage reasons. However blogtng uses a database for indexing the content. Great, I thought.

And then I discover that the PHP from RHEL5 is compiled without sqlite support. That's odd - but I'm not disheartened - I go and find our local software guardian and ask for the installation media. There then follows much running around in the manner of the Keystone Cops - indeed, if this were a movie rather than a blog it would undoubtedly be shown speeded up and accompanied by Yakety Sax, à la Benny Hill. Still no CD-ROM; however, as far as I can tell, Red Hat never implemented the sqlite extension. There are some third-party implementations for RHEL :) except for version 5 :(

So then I try to rewrite blogtng using MySQL, which proves to be messy. Although it appears to have been written with the notion of some abstraction from the underlying database, direct calls to the sqlite extension occur throughout the code. So first I rewrite all of these to go through the abstraction layer. Next, I discover that sqlite is essentially untyped - it will happily accept columns with no datatype at all - so I have to reverse engineer the intended datatypes and update all the DDL scripts. Then I find that all the DML needs to be changed too. I also find that it seems to be creating blog entries whenever I view a page (even before I've added the page markup to create any blogs).

RHEL5 does, however, support the PDO extension with the sqlite driver, and now that all the database connection code was in one place it was fairly easy to port blogtng back to that and restore the original DDL and DML. It seems I can now create blogs, blog entries and comments, but I'm still getting blog entries created randomly - so I go back and check the repository: it's a known bug. Grrrr.
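
Connecting via PDO is straightforward. A minimal sketch (the database path, table and columns here are illustrative - they're not blogtng's actual schema):

$db = new PDO('sqlite:/var/www/dokuwiki/data/meta/blogtng.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $db->prepare('SELECT pid, title FROM entries WHERE blog = ?');
$stmt->execute(array('default'));
foreach ($stmt as $row) {
    echo $row['title'], "\n";
}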

PHP and long running processes

It seems this question keeps coming up on the PHP newsgroups and, now that I've plugged into Stack Overflow, I keep seeing it there too:

How do I start a PHP program which takes a long time to complete, and how do I track its progress?

While these tend to attract lots of replies, the replies are usually wrong.

The first thing to consider is that you need to separate the thing which takes a long time from its initiation, the ongoing monitoring, and whatever final reporting is required.

Since we're talking about PHP, it's fair to assume that in most cases the initiation will be a PHP script running in a webserver. However, this is not a good place to keep a long-running program:

1) webservers are all about turning around requests quickly - indeed most have failsafe mechanisms to prevent one request hanging about too long.

2) the webserver ties the request both to the execution of the script and to the client socket connection. Typically, not having to keep a browser window open somewhere waiting for the job to complete is one of the objectives of the exercise. Although the dependence on the client connection can be reduced via ignore_user_abort(), that was never its intended purpose.

3) long-running typically means it will have quite different resource requirements from a typical web page script - e.g. lots of file handles being opened and closed, and more memory being consumed.

Most commentators come back with the suggestion of spawning a separate thread of execution, either using fork or via the shell. The former obviously does not help if the interpreter is running as a module - you're just going to fork the webserver process, solving none of the web-related issues and creating a whole lot of new ones.

You certainly need to create a new process.

The obvious type of process to create would be a standalone PHP interpreter to process the long-running job. So is there a standalone interpreter available to the webserver? The prospective implementor would need to check (and also whether the webserver runs chrooted). So let's assume there is; our coder writes:


print shell_exec('/usr/bin/php -q longThing.php &');


A brave attempt. However they will soon find that this doesn't behave as well as they expected and keeps stopping. Why? Because although the process they created runs concurrently with the PHP which created it, it is still a child of that process.

Now this is where it starts to get complicated. In our example above, the webserver process finishes with the user's script immediately after it creates the new process - however it will probably hang around waiting to be assigned a new request to deal with. At some point the controller for the webserver processes will decide to terminate it - either as a matter of policy, because it has dealt with a certain number of requests (for Apache: MaxRequestsPerChild), or because there are too many idle processes (Apache's MinSpareServers). But a webserver process should not stop until all its child processes have terminated. How this is dealt with varies by operating system and, of course, webserver. Regardless, the coder has created a situation which should not have arisen.

But on a Unix system there are lots of jobs which run independently for long periods of time. They achieve this by daemonizing (see the sketch after this list):

1) the program is first started, say as pid 1234; it calls fork, creating say pid 1235, and then pid 1234 exits
2) pid 1235 will become the daemon - it closes all its open fds, including those for stdin, stdout and stderr
3) pid 1235 now calls setsid(), which dissociates the process from the tree of processes which led to its creation (and typically makes it a child of the 'init' process)
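
In PHP these steps look something like this (a rough sketch of my own, assuming the posix and pcntl extensions are loaded, with only minimal error handling):

<?php
// step 1: fork, and let the original process exit
$pid = pcntl_fork();
if ($pid == -1) {
    die("fork failed\n");
} elseif ($pid > 0) {
    exit; // the original process (our 'pid 1234') ends here
}

// step 2: the child closes the file descriptors it inherited
fclose(STDIN);
fclose(STDOUT);
fclose(STDERR);

// step 3: dissociate from the process tree which created us
posix_setsid();

// ... now do the long-running work, e.g.
include 'longThing.php';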

As the sketch shows, you can do all this in a PHP script. However, in my experience it's usually a lot simpler to ask an existing daemon to run the script for you:


print `echo /usr/bin/php -q longThing.php | at now`;


But how do you get progress information? Simple: just get your long-running script to report its progress to a file or a database, and use another, web-based script to read the progress / show the final result.
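
For instance (a bare-bones sketch - the progress file path and format are my own choices):

// in longThing.php - record progress somewhere the web script can see it
$total = 100;
for ($done = 1; $done <= $total; $done++) {
    sleep(1); // stand-in for a unit of real work
    file_put_contents('/tmp/longThing.progress', "$done/$total");
}

// in a separate, web-based script - read it back
header('Content-Type: text/plain');
echo file_exists('/tmp/longThing.progress')
    ? file_get_contents('/tmp/longThing.progress')
    : 'not started yet';

In a real system you'd probably key the progress records by a job id, so that several jobs can run at once.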

Troubleshooting (updated Sep 2014)

Following on from the feedback I've received, there are a couple of things to check if it doesn't go according to plan.

The atd has its own permissions system, implemented via /etc/at.allow and /etc/at.deny - see your man pages for more info.

On Red Hat machines, the apache uid is configured with /bin/nologin as its shell - this means at will silently discard any jobs submitted by that user, hence a more complete solution is:

putenv("SHELL=/bin/bash");
print `echo /usr/bin/php -q longThing.php | at now 2>&1`;

A note about systemd (updated Jun 2016)

The latest "feature" to be announced for systemd is that it will kill user processes when they log out. I don't currently have a machine running systemd to test what impact this might have, but since the apache user never logs in - never mind logs out - and I recommend using atd to invoke the process (which is specifically designed to run a program regardless of whether the user is logged in), I don't expect a negative impact on my solution.

Not quiet times

It's been a while since my last post, since $ork, in their wisdom, have blocked access to a huge list of sites including this one. While exorcising my demons in public does not directly improve my productivity, it does provide a release (a few of my colleagues, having been privy to some of my recent frustrations, have expressed a concern that I may go postal at any moment!).

Anyway, to catch up.....

Car / Mill Motors (Paisley):
1) Trading Standards were worse than useless; waiting for their advice cost me several weeks which could have been better spent on other things. The advice they gave me was incomplete and substantially wrong in several places.

2) The credit card company denied all responsibility - so I went to the Financial Ombudsman, who agreed that there was a problem with the car and that the credit card company should have resolved it.

The end result is that I got a full refund for the repairs.