Some Thoughts on Work Queues
Tuesday
Nov 03, 2009
7:36 pm
GitHub has recently posted a wonderful article about their history with job queues. It is a superb read, and I recommend looking at both it and the README file for Resque, GitHub's new Redis-backed job queue, which is replacing the venerable delayed_job. In this post I am going to look briefly at some of the concepts behind worker queues in an attempt to better explain my hopes for Updater, my delayed_job clone for DataMapper.
First, a bit about why we need worker queues. There are two obvious reasons and one not-so-obvious reason to use a worker queue. First, they are used to complete resource-intensive processing outside of a web request. Once enough data is available to give reasonable feedback to the user, the web server can respond and go on to the next request while the worker queue deals with processing the result. Second is the situation where some event needs to happen but will not be triggered by a user. GitHub seems to run a rather small number of such jobs, but once a queue exists, doing something at or slightly after a certain time is much easier than it is with cron, at, or other general-purpose tools. Finally, worker queues are useful for fixing things that go wrong out of sight of the end user.
Having looked at why jobs end up on the queue, I'm going to spend the rest of the time talking about the methods for getting them off. One of the first things a developer ought to consider in a worker queue is how often it will be used. My experience has been that once they're in place, they become a good solution for a lot of issues that might otherwise have been handled by some other system. Nonetheless, there will be applications and use cases that hit their queue once a day and others that will need to run 50,000 jobs every few seconds (à la GitHub). I think one of the biggest tricks is writing a system that has low overhead for the first case and scales to the second. Having said this, one of the things I don't like about either delayed_job or Resque is their reliance on workers polling the queue. Under a heavy load, a polling worker will usually receive a job, but under a light load (isn't that most of us?) this behavior is a huge waste of resources. I much prefer the method Unicorn uses, where workers wait until a job comes in rather than asking over and over for something. I am therefore working to rewrite Updater to use the select method (POSIX only) to dispatch new jobs to the workers. I also put the workers to sleep until the next job I know needs to happen is ready, then use process signals (USR2) to wake up a process when a new job enters the queue.
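To make the difference from polling concrete, here is a minimal sketch of the "sleep until woken or until the next job is due" idea. It is written in Python rather than Ruby and is not Updater's actual code; the function name wait_for_work and the use of a pipe as the wake-up channel are my own assumptions for illustration. In a real worker, a handler for USR2 could write the wake-up byte.

```python
import os
import select
import time


def wait_for_work(wake_fd, next_due, now=None):
    """Block until wake_fd is readable or the next scheduled job is due.

    Returns "woken" if something was written to the wake-up pipe,
    or "due" if the wait for the next scheduled job ran out.
    (Hypothetical helper, not part of any library.)
    """
    now = time.time() if now is None else now
    # A timeout of None means "no job scheduled": sleep until woken.
    timeout = None if next_due is None else max(0.0, next_due - now)
    readable, _, _ = select.select([wake_fd], [], [], timeout)
    if readable:
        os.read(wake_fd, 4096)  # drain the wake-up bytes
        return "woken"
    return "due"


# Hypothetical wiring: the enqueuing side writes one byte to wake the worker.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"x")

# A byte is already waiting, so this returns immediately.
first = wait_for_work(read_fd, next_due=None)

# Nothing is waiting now, so this sleeps for about 50 ms and then returns.
second = wait_for_work(read_fd, next_due=time.time() + 0.05)
```

The worker uses no CPU while it waits, whether the queue sees one job a day or thousands a second.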
Another issue is what to do about job failures. Does one try the job again? How often, and after how long? Do you keep a record of the failure? What about success? Delayed_job takes the opinion that the job should be retried a number of times in the hope that the error is transitory. If it fails too many times, it is left for dead on the queue in hopes that the SysAdmin will come by and find out why it died. My personal belief is that what to do after a failure is too heavily dependent on the system and the reason for the failure to be handled directly by the queuing system. The way I have built Updater, each job can have a link to another job that will only be run in case of failure. This job is then called with information about the failed job and can pass this back into the application for further instruction. Of course there are a lot of common situations that do not need to hit application code, e.g. "If the remote connection is down, just try again in 5 minutes." To handle this I am in the process of writing a library of standard error handlers that can be loaded into the queue and referenced by other jobs. I am also working on adding links to jobs that will be called upon success and jobs that the queue will ensure are called regardless of how the job completed.
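The linking scheme described above might look something like the following sketch. Again this is Python, not Updater's Ruby, and every name in it (Job, run, retry_in_five_minutes, the schedule list) is hypothetical and chosen only to illustrate the idea of failure, success, and ensure links.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    action: Callable[..., object]
    on_failure: Optional["Job"] = None  # run only if action raises
    on_success: Optional["Job"] = None  # run only if action returns
    ensure: Optional["Job"] = None      # run either way


def run(job, *args):
    """Run a job, then follow whichever links apply."""
    try:
        result = job.action(*args)
    except Exception as error:
        if job.on_failure is not None:
            # The handler receives the failed job and the error.
            run(job.on_failure, job, error)
    else:
        if job.on_success is not None:
            run(job.on_success, job, result)
    finally:
        if job.ensure is not None:
            run(job.ensure, job)


# Stand-in for the real queue: (delay in seconds, job) pairs.
schedule = []


def retry_in_five_minutes(failed_job, error):
    schedule.append((300, failed_job))


# A "standard handler" that many jobs can reference.
RETRY_LATER = Job(action=retry_in_five_minutes)


def fetch_remote():
    raise ConnectionError("remote connection is down")


log = []
fetch = Job(
    action=fetch_remote,
    on_failure=RETRY_LATER,
    ensure=Job(action=lambda job: log.append("done")),
)
run(fetch)
```

After the run, the failed job has been rescheduled for five minutes later and the ensure job has recorded its completion, all without the queue itself knowing anything about retry policy.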
Finally, I want to point out the differing views about how to split up the job queue. Both defunkt (maintainer of Resque) and tobi (maintainer of delayed_job) now agree that it is better to be able to tag a job with a label rather than partitioning by priority. I am much more interested in the small and sparse end of the spectrum, so I don't have GitHub's experience with huge, disparate queues. What I do want to be able to do is let an instance ensure that an event is scheduled for it. For example, a blog post that is scheduled to go live some time in the future should be able to find the job that will push it to the live page and alter or destroy that job if the author changes their mind. Again, GitHub's load is mostly fire-and-forget, while both AntaresTrader and Blog end up attaching jobs to objects to be run some time in the future.
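As a rough illustration of attaching jobs to objects, the sketch below keys each scheduled job by its target and method so the target can later find, move, or cancel it. It is Python and purely hypothetical: the JobQueue class, its method names, and the ("Post", 42) key are assumptions, not Updater's interface.

```python
from datetime import datetime, timedelta


class JobQueue:
    """Toy in-memory queue keyed by (target, method)."""

    def __init__(self):
        self._jobs = {}  # (target, method) -> run_at

    def schedule(self, target, method, run_at):
        # Scheduling again for the same target and method replaces
        # the old time, so each object has at most one such job.
        self._jobs[(target, method)] = run_at

    def find(self, target, method):
        """Return the scheduled time, or None if no job exists."""
        return self._jobs.get((target, method))

    def cancel(self, target, method):
        """Remove the job; return True if there was one."""
        return self._jobs.pop((target, method), None) is not None


queue = JobQueue()
post = ("Post", 42)  # stand-in for a model class and record id
go_live = datetime(2009, 11, 10, 9, 0)

queue.schedule(post, "publish", go_live)

# The author pushes publication back a day.
queue.schedule(post, "publish", go_live + timedelta(days=1))
moved_to = queue.find(post, "publish")

# The author changes their mind entirely.
cancelled = queue.cancel(post, "publish")
remaining = queue.find(post, "publish")
```

A fire-and-forget queue has no need for this lookup, which is why the two styles of workload pull the design in different directions.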
I hope to get version 0.3 of Updater out tomorrow. Both blog comments and further progress on AntaresTrader are waiting on a stable version of this technology. I hope to be able to post with more specificity about what I have done then. In the meantime I want to end with a thank you to the GitHub staff, not only for the great service their product provides directly to the community, but also for being willing to share their experience using these different technologies. It is of immense help to those of us who don't get to run a large site on our own equipment.