In June 2015 we switched our background job processing framework from Delayed::Job (DJ) to Sidekiq. Overnight, we got 10x the workers at a quarter of the memory footprint. We went from four DJ workers in four processes to 45 Sidekiq workers in a single multi-threaded process. This was a huge boost to our scalability, and it made us think that our background job framework would be set for at least another year. However, CoverHound grew so much in 2015 that we had to scale Sidekiq just six months later.
In December 2015 we began to see periodic pileups in one of our Sidekiq queues. Within the space of ten minutes, the queue would go from 0 to 4,000 jobs, and then it would eventually drain back down to 0.
What was causing this backup, and how could we fix it?
As part of debugging, we first made sure that our background job server wasn't maxed out on CPU or memory during the queue backup time. We found that the CPU and memory were well under 50% usage, so we ruled out server load as the potential cause (though that turned out to be a little misleading).
Our second step was to analyze the jobs in the backed-up queue. We thought that perhaps a slow database query was making each job run very slowly, and that the slow jobs were causing the queue to accumulate. We put debugging messages throughout the job to measure how long each section was taking. We initially focused on the areas of the code involving database queries and saw nothing stand out. Then we came across timings that were off in code that did not involve any database access. Methods that would normally run in a tenth of a second were taking up to 12 seconds during one of the slowdowns.
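To illustrate the kind of timing instrumentation described above, here is a minimal sketch. The `timed` helper and the section label are hypothetical names made up for this example, not code from our job; the idea is simply to wrap each section of a job and log how long it took.

```ruby
# Hypothetical helper: runs the block, writes how long it took,
# and returns whatever the block returned.
def timed(label, out = $stdout)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
ensure
  # The ensure clause logs the timing even if the block raises.
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  out.puts(format("[timing] %s took %.3fs", label, elapsed))
end

# Example: wrap a section that does no database work.
total = timed("sum_values") { (1..1_000).sum }
```

Using a monotonic clock rather than `Time.now` keeps the measurement from being skewed by system clock adjustments. Comparing the logged durations between a normal period and a backup is what would surface a 0.1 second method taking 12 seconds.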
The huge increase in the amount of time that a given function would take made us suspect that Sidekiq was hitting some kind of server limit, and this pointed us back to server performance graphs. We noticed that although the CPU load was only at 33%, it was consistent at that level during the entire queue backup. Before and after that, the CPU varied between 5% and 33%.
This realization led us to take a closer look at the Sidekiq documentation, where we found that each Sidekiq process uses only a single CPU core. We had hit the limit of what a single core could process at one time, and we would need to add more Sidekiq processes. Fortunately, the Capistrano and Sidekiq integration made this easy. We just had to add a single line to our configuration file:
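The exact line is not reproduced in this post. As a hedged sketch, assuming the capistrano-sidekiq gem (whose releases from that period read a `:sidekiq_processes` setting) and a process count of two, it would look something like this:

```ruby
# config/deploy.rb (assumed location; not confirmed by this post)
# Assumes the capistrano-sidekiq gem; newer releases of that gem may
# configure the process count differently.
set :sidekiq_processes, 2
```

Each additional process gets its own Ruby interpreter, so it can be scheduled on a different core from the first.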
We then updated our system monitoring infrastructure to correctly monitor the two individual running instances, and we haven't had the problem since. Not only did moving to Sidekiq make scaling and growth possible, but it also improved the user experience for our customers, and that's one of our highest priorities.
Do you have experience with Sidekiq or job processing frameworks? We'd love to hear from you. Are you interested in what we are working on, and want to join the team? Check out our job listings in San Francisco and Southern California.