Moving to WordPress – Part 1: Content

As I mentioned in a previous post, I am done with Movable Type, and I’m in the process of moving this blog to WordPress. That turns out to be a lot easier said than done.

(Warning: Much technical computer geekery ahead.)

WordPress has a Movable Type import tool, but it doesn’t solve what I consider to be the most important problem: It doesn’t preserve permalinks. Over the years, many people have linked to posts on Windypundit (Google Webmaster Tools reports over 90,000 inbound links), and I want to make sure that as many as possible continue to work. Not only is it basic etiquette not to break links, but having working links to my site is also important to maintaining what passes for my search engine rankings.

I searched for other tools and methods, but I couldn’t find anything that would solve the permalinks problem without introducing other problems of some kind. Eventually, out of frustration and a desire to learn new tricks, I settled on a rather crazy course of action: porting my blog to WordPress by writing my own program.

I chose to write it in C#, mostly because I’d just been hired for a job programming in C# and I figured writing the importer would be a good way to learn C#. Also, I already owned Microsoft Visual Studio 2010 Professional and I’m comfortable with the development environment.

That was a year ago. I got in enough C# practice that I had no trouble when I started the job, but I kind of lost interest in writing the blog migration tool. Development slowed to a crawl.

A few weeks ago, however, I started getting a lot of spam comments that weren’t being caught by my anti-spam tools. Knowing that I’d have much better tools if I switched to WordPress, I decided it was in my best interest to get the job done.

The main tool I wrote is the BlogMigrator, which pulls all of my posts out of Movable Type and generates a WordPress extended RSS (WXR) file suitable for import into WordPress.
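To give a feel for what a WXR file contains, here is a minimal sketch of building one post entry. It is written in Python rather than the C# the BlogMigrator actually uses, and it is not the real tool's code. The namespace URIs follow the WXR 1.2 export format, and the dictionary keys (title, author_login, html, and so on) are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# Namespaces used by WordPress export files (WXR 1.2 assumed here).
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix: str, name: str) -> str:
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{NS[prefix]}}}{name}"


def wxr_item(post: dict) -> ET.Element:
    """Build one <item> element from a post dictionary (hypothetical keys)."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = post["title"]
    ET.SubElement(item, q("dc", "creator")).text = post["author_login"]
    ET.SubElement(item, q("content", "encoded")).text = post["html"]
    ET.SubElement(item, q("wp", "post_id")).text = str(post["id"])
    ET.SubElement(item, q("wp", "post_date_gmt")).text = post["date_gmt"]
    ET.SubElement(item, q("wp", "post_name")).text = post["slug"]
    ET.SubElement(item, q("wp", "status")).text = post["status"]
    ET.SubElement(item, q("wp", "post_type")).text = "post"
    return item
```

A real export wraps these items in an RSS channel with author and category definitions, and usually puts the post body in a CDATA section. ElementTree escapes the markup instead, which is equivalent XML.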

It starts by downloading all the posts from the current Windypundit website, which it does by connecting directly to the MySQL database that Movable Type uses to store all my posts. It queries for all the authors, categories, posts, and comments in the blog, including all pages (such as the About pages) and all draft posts, loads the results into an ADO.NET DataSet, and saves that to a file. From then on, the migrator just loads blog posts from the file, avoiding the time-consuming download from the server. If I want the latest stuff instead of my locally cached copy, I just delete the DataSet file and the BlogMigrator downloads a new one next time I run it.
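The cache-or-download logic is simple enough to sketch. This is a Python stand-in, not the actual C# code: it uses a JSON file where the real tool saves an ADO.NET DataSet, and the download argument is a hypothetical callable standing in for the database queries.

```python
import json
from pathlib import Path
from typing import Callable


def load_blog_data(cache_path: Path, download: Callable[[], dict]) -> dict:
    """Return blog data from the local cache, downloading only if it is missing."""
    if cache_path.exists():
        # Cache hit: skip the slow trip to the database server.
        return json.loads(cache_path.read_text(encoding="utf-8"))
    # Cache miss: run the (hypothetical) download and save the result.
    data = download()
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Deleting the cache file is the only invalidation mechanism, which matches the "delete it and run again" workflow described above.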

Next, the program builds an in-memory model of the Movable Type blog, including all authors, categories, blog entries, template maps, and comments. This is a fairly mechanical process. After that’s done, it iterates over all the blog entries and generates a report of the location and publishing status of every post. I use these reports (and others) as input to the iterative development process.

The next step is to traverse the Movable Type blog model and build a matching WordPress blog model. All of the basic concepts are the same, but there are a lot of little details that change, including the names of the data fields, and I try to follow the naming conventions of each blog technology as much as I can. (E.g., Movable Type authors have a single name field; WordPress authors have first and last names.) Among the steps of the conversion are splitting the author name into first and last names, generating unique IDs for each post, merging main and extended post text, converting the URL format for the post, and converting from local time to universal time.
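Three of those conversion steps can be sketched in a few lines each. This is an illustrative Python version, not the BlogMigrator's C# code. The function names are made up, splitting at the first space is an assumed rule, and joining the two text fields with WordPress's more marker is one plausible way to merge them rather than a confirmed detail.

```python
from datetime import datetime, timezone, tzinfo


def split_author_name(name: str) -> tuple[str, str]:
    """Split a single display name into (first, last) at the first space."""
    first, _, last = name.strip().partition(" ")
    return first, last


def merge_post_text(main: str, extended: str) -> str:
    """Join main and extended text with the <!--more--> marker (assumed)."""
    if not extended.strip():
        return main
    return f"{main}\n<!--more-->\n{extended}"


def to_utc(local: datetime, tz: tzinfo) -> datetime:
    """Treat a naive timestamp as local time in tz and convert it to UTC."""
    return local.replace(tzinfo=tz).astimezone(timezone.utc)
```

Passing a zoneinfo.ZoneInfo object as tz would account for daylight saving time. A fixed offset is fine for a quick test but is wrong for half the year in most places.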

The next pass looks into the actual content of each post and uses the HTML Agility Pack to analyze the HTML and catalog every element and all the class and style attributes. It also generates reports of which posts each of those items is used in. I’ve been using those reports to make iterative modifications to the program. For example, some posts have embedded HTML class junk that was introduced when blog authors cut-and-pasted from Microsoft Word. By finding these classes in the reports, I was able to modify the program to strip them out. I have a whole collection of whitelists, blacklists, and replacement tables.
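The cataloging idea looks roughly like this. The sketch uses Python's standard html.parser module in place of the HTML Agility Pack, so it is a stand-in for the technique and not the real code. The class and method names are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class UsageCatalog(HTMLParser):
    """Record which posts use each element name and each class name."""

    def __init__(self):
        super().__init__()
        self.post_id = None
        self.elements = defaultdict(set)  # tag name -> ids of posts using it
        self.classes = defaultdict(set)   # class name -> ids of posts using it

    def handle_starttag(self, tag, attrs):
        self.elements[tag].add(self.post_id)
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute can hold several space-separated names.
                for cls in value.split():
                    self.classes[cls].add(self.post_id)

    def catalog(self, post_id, html):
        """Parse one post's HTML and add its tags and classes to the catalog."""
        self.post_id = post_id
        self.reset()
        self.feed(html)
        self.close()
```

After every post has been fed through catalog(), a class such as Word's MsoNormal shows up in the classes table along with the list of posts that need cleaning.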

In other cases, where the reports showed that a strange class or misspelled element was used in only a handful of posts, I’ve just gone back into the blog on Movable Type to fix the problem at the source, which is easier than adding code to fix the problem. Then I re-import the database and re-run the BlogMigrator to confirm the problem is fixed. (For problems fixed by the tool, there are before- and after-cleanup reports so I can verify the problem is gone.)

Another thing I had to handle was custom tags. Movable Type allows you to create custom HTML-like tags for your blog by writing a little PHP and/or Perl code, and I had built a few of them over the years. I had to modify the HTML Agility Pack to recognize them as legitimate tags, and then I had the BlogMigrator replace them either by generating raw HTML or by rewriting them into custom WordPress shortcodes (which are a similar concept to MT’s custom tags). Fortunately, I was mostly able to implement the shortcodes by reusing the PHP code I had already written.
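As a rough illustration of the tag-to-shortcode rewrite, here is a Python sketch that uses regular expressions. The real tool works on the parsed HTML tree from the HTML Agility Pack, so this swaps in a simpler technique. The tag name pullquote in the usage note is invented for the example.

```python
import re


def tag_to_shortcode(html: str, tag: str) -> str:
    """Rewrite <tag attrs>...</tag> as [tag attrs]...[/tag]."""
    name = re.escape(tag)
    # Opening tag: keep any attributes exactly as written.
    html = re.sub(
        rf"<{name}(\s[^>]*)?>",
        lambda m: f"[{tag}{m.group(1) or ''}]",
        html,
        flags=re.IGNORECASE,
    )
    # Closing tag.
    html = re.sub(rf"</{name}\s*>", lambda m: f"[/{tag}]", html, flags=re.IGNORECASE)
    return html
```

Calling tag_to_shortcode on a post containing a pullquote element with a by attribute turns it into the bracketed shortcode form with the same attribute, leaving the surrounding HTML alone. Regular expressions are fragile on messy markup, which is a good reason to do this on a parsed tree instead.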

The BlogMigrator also catalogs all the links and images in each post, generating CSV files that I can view in Excel. Each link is identified as internal — back to Windypundit — or external depending on the hostname. Internal links are further classified by checking whether they point to a known Windypundit post URL or something else, such as an image or sound file.
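The classification rule can be sketched as follows. This is Python rather than the tool's C#, the category labels are made up, and blog.example.com in the usage note is a placeholder hostname. Treating links with no hostname (relative links) as internal is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def classify_link(url: str, site_hosts: set, post_urls: set) -> str:
    """Label a link as external, internal-post, or internal-other."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host and host not in site_hosts:
        return "external"
    # Ignore any #fragment when matching against known post URLs.
    bare = urlunsplit(parts._replace(fragment=""))
    return "internal-post" if bare in post_urls else "internal-other"
```

With site_hosts set to {"blog.example.com"}, a link to a known post URL comes back as internal-post even when it ends in #comments, while an image on the same host comes back as internal-other.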

I then have a separate AssetDownloader program that reads in these report files and downloads all the assets on the site and builds a directory structure for them. It filters out file types that are not static assets, such as links to .php files. I can upload all the files in that directory to the new website so the internal links work, although the program rewrites them with a new top-level subdirectory so they won’t collide with the new blog’s native assets. It also cleans up problems like replacing spaces in the URLs with underscores.
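The path handling in that step might look something like this Python sketch (not the AssetDownloader's actual C# code). The prefix name legacy is a placeholder for the new top-level subdirectory, and skipping URLs with no file extension is an added assumption. Only the .php filter comes from the description above.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import unquote, urlsplit

# Suffixes that are not static assets; "" (no extension) is an assumption.
NON_ASSET_SUFFIXES = {".php", ""}


def asset_path(url: str, prefix: str = "legacy") -> Optional[str]:
    """Map an asset URL to its new relative path, or None if it is not an asset."""
    path = unquote(urlsplit(url).path)  # decode %20 and friends
    if PurePosixPath(path).suffix.lower() in NON_ASSET_SUFFIXES:
        return None
    cleaned = path.lstrip("/").replace(" ", "_")
    return f"{prefix}/{cleaned}"
```

The same function can drive both the download directory layout and the link rewriting, so the files on disk and the URLs in the posts stay in agreement.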

I then have a third program, the Probulator, that reads the link report, rewrites every link to point at the new blog, and tests the link to make sure it works. The first time, it found 61 broken internal links, due to basename shortening, badly formed URLs, embedded spaces, and so on. I went back to fix (or remove) the links.

It also found a couple of dozen links that were broken because my program was using the date of publication of a post to generate the URL and Movable Type was using the date of creation (or vice versa, I can’t remember).

The Probulator also tries to download a copy of every blog post by using its original URL — except for the hostname, which it rewrites to refer to my test server. This serves as a test for broken links, and it also provides local copies for further analysis by two more programs.
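The hostname swap and the broken-link check can be sketched like this in Python. It is a simplified stand-in for the Probulator, not its real code. The fetch_status argument is a hypothetical callable that returns an HTTP status code for a URL, which keeps the sketch free of network code.

```python
from typing import Callable, Iterable
from urllib.parse import urlsplit, urlunsplit


def retarget(url: str, test_host: str) -> str:
    """Keep the path, query, and fragment, but point the URL at test_host."""
    return urlunsplit(urlsplit(url)._replace(netloc=test_host))


def find_broken(
    urls: Iterable[str], test_host: str, fetch_status: Callable[[str], int]
) -> list:
    """Return (original, retargeted, status) for every URL that is not a 200."""
    broken = []
    for url in urls:
        target = retarget(url, test_host)
        status = fetch_status(target)
        if status != 200:
            broken.append((url, target, status))
    return broken
```

Because only the hostname changes, a 200 from the test server means the original permalink will resolve once the real domain points at the new blog.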

The SanityChecker program examines each blog post for odd bits of HTML that might not format correctly. For example, all post content should be a list of a limited set of tags — <p>, <ul>, <ol>, <h5>, <h6> — or <blockquote>, which should contain a list of the same set of tags. The program reported anything that did not match that pattern. This uncovered a flaw in my implementation of one of the shortcode replacements for custom tags. It also found a bunch of posts which had been authored using mangled HTML. (God bless Joel Rosenberg’s memory, but he didn’t know a damned thing about HTML.) I had to go back to the Movable Type version of the blog to clean those up before importing.
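The top-level structure check can be approximated with a small stack-based parser. This Python sketch uses html.parser and is not the SanityChecker's real code. Allowing blockquotes to nest inside each other is an assumption, and the list of void tags is a partial one picked for the example.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "ul", "ol", "h5", "h6"}
VOID_TAGS = {"br", "img", "hr", "input", "embed"}  # tags with no closing tag


class TopLevelChecker(HTMLParser):
    """Flag anything at the top level (or directly in a blockquote) that is
    not one of the allowed block tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def _at_block_level(self):
        # True at the top level or when nested only inside blockquotes.
        return all(t == "blockquote" for t in self.stack)

    def handle_starttag(self, tag, attrs):
        if self._at_block_level() and tag not in BLOCK_TAGS and tag != "blockquote":
            self.problems.append(tag)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip() and self._at_block_level():
            self.problems.append("#text")


def check(html: str) -> list:
    """Return the list of offending tag names ('#text' for loose text)."""
    checker = TopLevelChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

An empty result means the post fits the expected pattern. Anything else goes into the report for a closer look.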

The last program is the LinkVerifier, which finds all the links in every post and makes sure that all the internal ones still work. A few of the links are to blog-engine-specific resources, such as category archives and author about pages, that can’t be easily mapped. I’ll keep a list of those so I can go in and fix them later.

At this point, my process for a full import goes something like this:

  • Delete the database cache file (if I’ve changed something on the Windypundit site and I want to re-download everything).
  • Run the BlogMigrator.
  • Run the AssetDownloader.
  • Zip up the downloaded assets, upload them to the new blog host, and extract them into the proper directory.
  • On the new blog host, restore a backup of the WordPress database that has all the configuration items set but doesn’t have any posts in it.
  • Use the WordPress importer to upload the WXR file from the BlogMigrator that contains all the posts.
  • Hit the blog homepage to verify that it’s working.
  • Run the Probulator, check the reports for missing items.
  • Run the SanityChecker, check the reports for problems.
  • Run the LinkVerifier, check the reports for problems.
  • Go fix some problems and try the process again.

I’ve pretty much been doing that in my spare time for the past couple of weeks, and I think I’m almost done. I’ll probably roll out the live site in the next few days.
