Restoring from Backups
Wednesday
Sep 02, 2009
9:40 pm
Backups are only as good as your ability to restore them. This is one of the axioms I learned working tech support for a county government office. I also majored in Physics, which meant collecting data now and analyzing it later. I bring this up because today's job was adding the restore feature that reads the files written by the backup feature I implemented a couple of weeks ago. Now that I have data that is actually worth keeping (and keeping in the same place), backup and restore are suddenly not just another feature; they are important.
Today's lesson is a corollary to that lesson of so long ago: Restore is always trickier than Backup.
In this case I chose to save everything into one big XML file. If you read the last post you will know that I finally figured out how to get it to save without having to right click and SaveAs. XML is a nice format for a lot of reasons. There are great tools like Nokogiri that can parse it quickly. It is very simple to create with tools like HAML that I already use extensively, and it's plain text, so all those nifty Unix command line tools like awk, grep and ruby -n will work on it.
The first big drawback is that it does not play nicely with whitespace without a conscious effort, and I write my posts in Markdown, which is whitespace sensitive. The other is that certain characters must be escaped so they do not carry special meaning: <, > and & in particular. Easily solved on the backup side with HAML's tilde (~) and the h function, right?
Well, all the data went in just fine, but it was a mess to get back out. WordPress solves this problem by wrapping everything in CDATA blocks. If HAML can generate these I have not discovered how, and while Nokogiri can parse these things, it does not have an obvious way of stripping out the CDATA guards. (Again, there may be a backwater function I just have not seen.) What I ended up with was a blob of HTML-encoded data that I needed to decode without losing the whitespace.
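For comparison, the CDATA approach can be sketched by hand in plain Ruby. This is only an illustration, not what HAML, Nokogiri or WordPress actually do internally: `cdata_wrap` and `cdata_unwrap` are hypothetical helper names, and the one subtlety is that the sequence `]]>` cannot appear inside a CDATA section, so it has to be split across two sections.

```ruby
# Hypothetical helpers for illustration; not part of HAML or Nokogiri.

# Wrap text in a CDATA section. A literal "]]>" would close the section
# early, so it is split across two adjacent sections.
def cdata_wrap(text)
  "<![CDATA[" + text.gsub("]]>", "]]]]><![CDATA[>") + "]]>"
end

# Strip the CDATA guards again, rejoining any sections split above.
def cdata_unwrap(xml)
  xml.scan(/<!\[CDATA\[(.*?)\]\]>/m).flatten.join
end

# Markdown with special characters and significant whitespace.
post     = "Use <em> & friends.\n\n    indented code line\n"
wrapped  = cdata_wrap(post)
restored = cdata_unwrap(wrapped)
```

Nothing inside the guards is escaped, so the markdown whitespace and the `<`, `>` and `&` characters come back exactly as they went in.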
For future reference, Merb::Parse.unencode is for HTTP-encoded strings, not HTML. There is a function that will encode (the ubiquitous h method), but it is expected that the only thing that would want to read the output is the user-agent on the other end. Having to eat my own dog food left me with CGI.unescapeHTML.
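A minimal sketch of the round trip using Ruby's standard cgi library. CGI.escapeHTML stands in here for the h helper, which is an assumption about what h does rather than a claim about Merb's internals, and the sample string is made up.

```ruby
require 'cgi'

# Markdown-style text: special characters plus significant whitespace.
markdown = "Compare a < b && b > c.\n\n    code block,  two spaces\n"

# Backup side: escape <, > and & (standing in for the h helper).
escaped = CGI.escapeHTML(markdown)

# Restore side: decode the entities; whitespace is left untouched.
unescaped = CGI.unescapeHTML(escaped)

# CGI.unescape, by contrast, is the URL (percent) decoder: a different job.
url_decoded = CGI.unescape("a%20%3C%20b")
```

The point of the contrast is that URL decoding and HTML entity decoding are separate operations, and only the latter undoes what the backup side did.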
All well and good, but then what are these unknown characters suddenly littering my older posts? Turns out that WordPress had used non-breaking spaces every time I put more than one space in a row. I learned to type in the old days (early 90's) when we were still expected to double-space after a period, so there are a lot of those little things everywhere in my writing. They didn't survive two trips through the encode-decode cycle and were producing garbage on the screen. In the end I had to just gsub(' ',' ') the whole mess (non-breaking space in, plain space out) to get rid of them.
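A sketch of that cleanup in present-day Ruby, assuming UTF-8 strings. The `"\u00A0"` escape postdates the Ruby of this post's era, and `strip_nbsp` is a hypothetical helper name. It also replaces the literal `&nbsp;` entity, since CGI.unescapeHTML only decodes a handful of named entities and leaves that one alone.

```ruby
# U+00A0 is the non-breaking space character.
NBSP = "\u00A0"

# Hypothetical helper: turn non-breaking spaces, whether raw characters
# or leftover "&nbsp;" entities, into plain spaces.
def strip_nbsp(text)
  text.gsub(NBSP, " ").gsub("&nbsp;", " ")
end

raw   = "First sentence.#{NBSP} Second sentence.&nbsp; Third."
clean = strip_nbsp(raw)
```

After the substitution the double spaces after each period are ordinary spaces again, which is all markdown needs.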
The good news is that I can now restore backup files. This is important as I need to rebuild the database on the next upgrade for reasons I have explained elsewhere. This is going to be a major project, and will probably have to wait till Friday.