Earlier today, we brought 230,000+ articles online under a new blog called Tucson Citizen Morgue, Part 2. Most of these articles have never been seen on the web before and come from an internal digital archive. There is some overlap with Tucson Citizen Morgue, Part 1. Part 1 contains photos, polls, slideshows, and more (shiny!). Part 2 is just text, with the occasional hyperlink and a bit of bold/italic formatting (less shiny!).
I’m writing this post for those interested in the technical aspects of publishing a large archive of backdated content – particularly with WordPress. I was the primary person responsible for the technical side of this deployment. Both archives are served through WordPress, and with approximately 350,000 articles spanning two blogs, the challenges involved are fairly intense. The technology and time required is where things get interesting. Bringing each set of content online was a similar process that took roughly the same amount of time, so the approach appears to be a fairly universal solution for importing massive amounts of content into a CMS.
The first thing I did was back up the content to be imported. The source databases and servers were/are going away, so I wanted a more permanent clone to reference if the need ever arose. You can never have too many backups anyway, and hard drive space is cheap. Plus, referencing a local database ended up being a LOT faster than pulling rows across the network, and it is nice to know that you can freely lock the tables for increased performance.
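To give a feel for the cloning step, here is a minimal sketch of a batch copy from the source database to a local clone – the hostnames, credentials, and `stories` table layout are hypothetical stand-ins for the real archive schema:

```php
<?php
// Sketch: clone a source table into a local MySQL database in batches.
// Hostnames, credentials, and the `stories` schema are hypothetical.
$src = new PDO('mysql:host=old-server;dbname=archive;charset=utf8', 'user', 'pass');
$dst = new PDO('mysql:host=localhost;dbname=archive_clone;charset=utf8', 'user', 'pass');

$insert = $dst->prepare(
    'INSERT INTO stories (id, headline, byline, body) VALUES (?, ?, ?, ?)'
);

$offset = 0;
$batch  = 1000; // keep any single query from hogging the source server
do {
    $rows = $src->query(
        "SELECT id, headline, byline, body FROM stories LIMIT $batch OFFSET $offset"
    )->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        $insert->execute(array_values($row));
    }
    $offset += $batch;
} while (count($rows) === $batch);
```

Once the clone exists, every later script points at localhost and never touches the doomed source servers again.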
The next thing I did was analyze existing blogs in our WordPress setup to learn how posts were stored and what each field in the database did. This didn’t take too long, mostly because I had been messing around with the WP tables directly for a while. The goal was to insert rows directly into the database and make WordPress think it had inserted them itself, without using any WordPress functions. This is harder than it sounds because of permalinks – getting those right outside of WordPress turns out to be quite difficult, and an incorrect permalink calculation leads to a hard-to-diagnose HTTP 404 error.
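For illustration, here is roughly what a direct insert looks like. The column list is WordPress’s real `wp_posts` schema; everything else – the blog-106 table prefix, the simplified slug helper (WordPress’s own `sanitize_title()` handles far more edge cases), and the GUID handling – is an assumption for the sketch:

```php
<?php
// Sketch: insert one backdated article directly into blog 106's posts table.
// $db is a PDO handle; make_slug() is a simplified stand-in for sanitize_title().
function make_slug($title) {
    $slug = strtolower(trim($title));
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    return trim($slug, '-');
}

$stmt = $db->prepare(
    "INSERT INTO wp_106_posts
        (post_author, post_date, post_date_gmt, post_content, post_title,
         post_excerpt, post_status, comment_status, ping_status, post_name,
         to_ping, pinged, post_modified, post_modified_gmt,
         post_content_filtered, post_parent, guid, menu_order, post_type,
         post_mime_type, comment_count)
     VALUES (?, ?, ?, ?, ?, '', 'publish', 'closed', 'closed', ?,
             '', '', ?, ?, '', 0, '', 0, 'post', '', 0)"
);
$stmt->execute(array(
    $author_id, $date, $date, $body, $headline, make_slug($headline), $date, $date,
));

// WordPress normally stores "?p={ID}" as the GUID, which can only be
// patched in once the auto-increment ID exists:
$new_id = $db->lastInsertId();
$db->exec("UPDATE wp_106_posts SET guid = 'http://tucsoncitizen.com/morgue/?p=$new_id'
           WHERE ID = $new_id");
```

Under a date-based permalink structure, the URL WordPress serves is derived from `post_name` plus `post_date` – get the slug calculation even slightly wrong and you earn one of those hard-to-diagnose 404s.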
After that, I started work on analyzing the data in the database. My primary goals here were to:
- Look for strange characters not in UTF-8 format – the target character set – and figure out how to deal with them.
- Figure out who wrote each article (i.e. the byline) and come up with a clean way to get them into the ‘wp_users’ table.
- Clean up rogue HTML.
- Import odd content such as slideshows (Part 1 only).
- Deal with “out of memory” issues.
That last one might come as a surprise until you learn that I was writing all the import scripts in PHP and occasionally hitting the configured memory limit of 128MB of RAM. PHP tends to leak RAM like a sieve and shows its creaky side on even medium-sized data processing projects. However, PHP has one of the best HTML cleanup libraries in existence (HTMLPurifier), which I needed in order to handle that third bullet point.
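As a sketch of how that cleanup can look, here is HTMLPurifier run over articles in batches. The batching helpers are hypothetical, but `HTMLPurifier_Config` and `purify()` are the library’s actual API:

```php
<?php
require_once 'HTMLPurifier.auto.php';

// Part 2 allows only text, links, and bold/italic, so the whitelist is tiny.
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,br,a[href],strong,em,b,i');
$purifier = new HTMLPurifier($config);

// fetch_article_batch() and save_clean_body() are hypothetical helpers.
while ($batch = fetch_article_batch(500)) {
    foreach ($batch as $id => $html) {
        save_clean_body($id, $purifier->purify($html));
    }
    unset($batch);
    gc_collect_cycles(); // claw back leaked memory between batches
}
```

Working in small batches and forcing garbage collection between them is the kind of thing that keeps a long-running script under a 128MB ceiling.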
One of the most difficult things to do was figure out who wrote each article. WordPress uses that information to build a link to all articles by an author, and our stats show that users like clicking those links, so accuracy here is pretty critical. String parsing is hard from a computer’s perspective, and the bylines were a hodgepodge of strings mashed up with human error – we are talking over a decade of stories here, so human error creeps in pretty frequently. I wrote a special script to help me review every byline and came up with a giant array of mappings that fixed them all.
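The mapping itself was nothing fancier than a big associative array consulted after light normalization. A toy version – the entries here are invented examples, not real bylines:

```php
<?php
// Map known-bad byline variants to a canonical author name.
$byline_map = array(
    'JANE DOE, TUCSON CITIZEN' => 'Jane Doe',
    'JANE DOE/STAFF'           => 'Jane Doe',
    'JNE DOE'                  => 'Jane Doe', // typo in the source data
    'CITIZEN STAFF REPORT'     => 'Tucson Citizen Staff',
);

function normalize_byline($raw, array $map) {
    // Strip a leading "BY ", collapse case, then consult the map.
    $key = strtoupper(trim(preg_replace('/^BY\s+/i', '', $raw)));
    return isset($map[$key]) ? $map[$key] : null; // null = needs human review
}
```

Anything that came back null went into the review script for a human (me) to classify, and the array grew until nothing was left unmapped.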
Fortunately, the main content of Morgue, Part 1 was already in UTF-8 for the most part, with minor issues in a few early entries that were easy enough to fix. Morgue, Part 2 was a different story altogether. Some stories used UTF-8. Some used ISO-8859-1 (or, more often, its Windows-1252 superset). Some used Mac-Roman (the Tucson Citizen was a newspaper, after all). And a bunch of them had mixed character sets – thanks to a character set “conversion” that happened years ago. Basically, a complete disaster. I ended up writing a special script that let me analyze the “weird characters” in a web browser and build a 500KB file to “fix” every last character set issue. I’m pretty sure I got some things wrong. For example, baseball articles with “5[something] innings” ended up turning into just “5′ innings”. I’m also pretty sure I got some things right – for instance, the accented characters in French words throughout many food/restaurant review articles.
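Here is a simplified sketch of that kind of conversion logic. The detection order is an assumption, and it carries an honest caveat: single-byte charsets like Windows-1252 rarely make `iconv()` fail outright, which is exactly why a hand-built replacement file is needed for the mixed-charset rows:

```php
<?php
// Sketch only: convert a story body to UTF-8.
function to_utf8($text) {
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8; leave it alone
    }
    // WINDOWS-1252 is a superset of ISO-8859-1; MACINTOSH is Mac-Roman.
    foreach (array('WINDOWS-1252', 'MACINTOSH') as $charset) {
        $converted = @iconv($charset, 'UTF-8//TRANSLIT', $text);
        if ($converted !== false) {
            return $converted;
        }
    }
    return $text; // flag for manual review
}

// Mixed-encoding rows still need targeted byte-level replacements,
// e.g. a mangled right single quote (this entry is illustrative):
$fixes = array("\xC2\x92" => "\xE2\x80\x99");
$body  = strtr(to_utf8($body), $fixes);
```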
I ended up writing multiple import scripts to get the content into the format it is in today. The last stage was always the most painful: insert all the content into the database. The last-stage script for Part 2, for example, took approximately four hours to complete from start to finish, and there is nothing more boring than watching a script run and run and run…. During this stage, headlines and article bodies are parsed and cleaned up, permalinks calculated, tags and categories determined, authors mapped, and everything finally inserted into the various tables in one pass. If the script died two-thirds of the way through, I opted to fix the problem and restart from the beginning. For Part 1, polls, slideshows, and other content were transferred/imported too.
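Structurally, the last stage looked something like the sketch below. The helper names are hypothetical; the important design choice is wiping the previous attempt’s rows first, which is what makes “fix the bug and restart from the beginning” safe:

```php
<?php
// Start from a clean slate so a re-run never duplicates rows.
$db->exec("DELETE FROM wp_106_posts WHERE post_type = 'post'");

foreach (load_all_articles() as $article) {   // hypothetical loader
    $post = array(
        'title'  => clean_headline($article['headline']),
        'body'   => purify_body($article['body']),
        'slug'   => make_slug($article['headline']),
        'author' => map_byline($article['byline']),
        'tags'   => derive_tags($article),
    );
    // Inserts into wp_posts, wp_terms, wp_term_relationships, etc.
    insert_post_rows($db, $post);
}
```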
One interesting aspect of Part 1 was creating redirects for inbound links. Before Part 1 went online, an old story link found on the web led to a (mostly) broken article, and search engines had a tough time figuring things out. Permalinks made for a tricky scenario – especially regarding performance – so a special database table was set up to map each old link to its correct WordPress-friendly permalink. When an inbound link comes in for an old story, a special script determines exactly where to redirect and executes the redirect. From the user’s perspective, it happens instantaneously. From our perspective, we maintain our link equity with the major search engines.
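The redirect shim itself is tiny. A sketch, assuming a mapping table called `legacy_redirects` (the table and column names here are mine, not the production ones):

```php
<?php
// Look up the requested legacy path and 301 to the WordPress permalink.
$old_path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

$stmt = $db->prepare('SELECT new_url FROM legacy_redirects WHERE old_path = ? LIMIT 1');
$stmt->execute(array($old_path));
$new_url = $stmt->fetchColumn();

if ($new_url !== false) {
    header('Location: ' . $new_url, true, 301); // permanent redirect
    exit;
}
header('HTTP/1.1 404 Not Found');
```

Using a 301 (permanent) rather than a 302 is what tells search engines to transfer the old URL’s standing to the new one.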
Part 1 also taught us a few lessons about performance. First, important sections of the website are built by a cron script that runs every five minutes. For example, the homepage pulls content from all of the blogs – cached content, actually, because calculating it in real time is a very expensive operation. As soon as Part 1 came online, the cron scripts started taking longer than their five-minute window – effectively an infinite loop – and the server crashed. Whoops! So we now exclude really large blogs from consideration for most processing.
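The exclusion is as blunt as it sounds. A sketch with hypothetical helper names and IDs (the post only names blog 106; the second ID is a placeholder):

```php
<?php
// Blogs too big to aggregate are skipped when rebuilding cached sections.
$huge_blogs = array(106, 107); // the two Morgue blogs (107 is a placeholder)

foreach (get_all_blog_ids() as $blog_id) {   // hypothetical helper
    if (in_array($blog_id, $huge_blogs, true)) {
        continue; // never scan the archives on the five-minute cron
    }
    aggregate_recent_posts($blog_id);        // hypothetical helper
}
write_homepage_cache();                      // hypothetical helper
```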
The performance of Part 1 introduced us to some significant show-stopper problems and set back the launch date for Part 2. Part 1 only had 120,000 articles in it, but it severely impacted the server. We have learned that WordPress starts to buckle under load once a blog contains more than 15,000 posts, and I am fairly confident we are the only website in existence with a single WordPress blog exceeding 100,000 posts. So, for the longest time, we were up against a severe technical hurdle. When Giffords was shot, the Tucson Citizen went down hard, and “blog 106” (aka /morgue/) was the culprit: suddenly everyone was searching for Giffords-related articles, which took them to the Morgue, Part 1 blog. Taking that blog offline for a few days alleviated the situation and allowed visitors to view the rest of the site – sort of. The site was still up and down over the next few days, and whenever Part 1 was re-enabled, the entire site went down with it. This happened even though the site was caching content. We’ve learned that it doesn’t matter how good the caching scheme is if huge numbers of visitors are requesting content from a blog with more than 15,000 posts in it.
So, what we discovered was a major scaling issue. We don’t really have the money for multiple servers, but after pulling many teeth and being rather annoying, I managed to get IT to give me one more server for use as an “emergency backup server” and for testing purposes. The backup server isn’t as nice as the fancy-schmancy Intel Xeon quad-core with oodles of RAM that the main server is, but we can theoretically bring the backup server online within about 15 minutes of a total failure of the main server. Anyway, I digress. Back to the scaling issue:
There are some interesting aspects of these two blogs that I eventually realized we could exploit. First, they are archive content blogs: they will never have another post or anything else done with them ever again. Well, maybe. Part 1 is missing video because a power surge fried the video server before we got to extracting that content, BUT the videos do still theoretically exist on backup DVDs – recovering that content is a separate project. Anyway, for the moment, the blogs are static, “read-only” content (i.e. no comments allowed).
WordPress.com is the largest installation of WordPress in existence. It runs on a massive server farm and uses a little-known plugin called HyperDB that spreads database requests across multiple servers to alleviate WordPress’s performance issues. HyperDB has very few users – essentially just those wrestling with WordPress scalability – so support is very limited. It took a while and some fiddling, but I eventually figured out how to repurpose our backup server’s MySQL database as the primary SQL query engine for both Morgue blogs. Basically, HyperDB determines whether a request is for one of those two blogs and, if so, uses the backup server’s database to handle it on a read-only basis. A perfect fit for these two blogs. So if another Giffords-like event happens (hopefully nothing like it happens again!), the main server has a better chance of not going down under the weight of two large blogs. It will also be nice to have the main server’s “slow MySQL query log” free of notifications about these blogs.
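The routing boils down to a few lines in HyperDB’s db-config.php. `add_database()` and `add_callback()` are HyperDB’s documented API; the hostname, credentials, and the second blog ID are placeholders (and the default/global dataset definition is omitted here):

```php
<?php
// Read-only dataset served by the backup server's MySQL instance.
$wpdb->add_database(array(
    'host'     => 'backup-server',  // placeholder hostname
    'user'     => 'wpuser',
    'password' => 'secret',
    'name'     => 'wordpress',
    'write'    => 0,                // never send writes here
    'read'     => 1,
    'dataset'  => 'morgue',
));

// Route any table belonging to the two archive blogs to that dataset.
// Multisite tables are named "{$base_prefix}{$blog_id}_*".
function morgue_callback($query, $wpdb) {
    if (preg_match("/^{$wpdb->base_prefix}(106|107)_/i", $wpdb->table)) {
        return 'morgue';
    }
}
$wpdb->add_callback('morgue_callback');
```

Because the blogs are frozen, a read-only dataset on a second box costs us nothing in consistency and buys back the main server’s headroom.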