Viewing Category: MachBlog

Dealing with UUID values in URLs

A few days ago, David Flinner posted a comment via Google+ about a blog post I made recently. I saw the ugly URL back to my blog and clicked on it for no particular reason. My instance of MachBlog threw an exception because the URL contained a UUID with a trailing dot. When MachBlog searched for the post matching that primary key, it came up empty. This sort of problem happens all the time, and I don't know why it is so difficult for spiders to pick up the proper URL. I know I should migrate someday (especially to something that could deal with human comments), but there are other things higher up on the todo list.

At any rate, I decided to take another look at my Apache URL rewriting rules and fix the issue. Previously, I was supporting a shortcut URL like /blog/UUID, which redirects to /machblog/index.cfm?event=showEntry&entryId=UUID. I was also checking to see that incoming URLs had a proper UUID (the right number of hex values separated in the correct positions by dashes). I made a change to the rules so that trailing characters would be hacked off. Here are the new rules:

# MachBlog shortcut
RewriteRule ^/blog/([-a-f0-9]{35}) /machblog/index.cfm?event=showEntry&entryId=$1 [NC,R,L]
# URL decode a doubly-encoded UUID
RewriteCond %{QUERY_STRING} entryId=([\w]{8})%2D([\w]{4})%2D([\w]{4})%2D([\w]{16}) [NC]
RewriteRule .* /machblog/index.cfm?event=showEntry&entryId=%1-%2-%3-%4 [NE,R,L]
# Truncate extra characters
RewriteCond %{QUERY_STRING} entryId=([-a-f0-9]{35})[^&] [NC]
RewriteRule .* /machblog/index.cfm?event=showEntry&entryId=%1 [NE,R,L]

Note that it's important to use the no-escape (NE) flag on the rewrite so that extra URL encoding isn't introduced.
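To see what the truncation rule does to a bad URL, here is a minimal Python sketch of the same pattern. It only illustrates the matching; the real work happens in mod_rewrite. The function name and the sample ID are made up for the example.

```python
import re

# Same shape as the rewrite pattern: 35 characters drawn from hex digits and dashes.
# Like the rule itself, this is a loose check and does not enforce dash positions.
ENTRY_ID = re.compile(r"[-a-f0-9]{35}", re.IGNORECASE)


def truncate_entry_id(value):
    """Return the leading 35-character ID, or None if the value does not start with one."""
    match = ENTRY_ID.match(value)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Hypothetical ID with the trailing dot that caused the exception.
    print(truncate_entry_id("0A1B2C3D-4E5F-6071-8293A4B5C6D7E8F9."))
```

Anything after the 35th character is simply dropped, which is the behavior the rule is after.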

MSNBot Madness

The MSN search engine (AKA Live Search, apparently) uses the MSNBot to crawl websites for content. For whatever reason, I see that it unnecessarily percent-encodes values in the query string, causing the dash character used in the GUID to be represented with %2D instead of -. Even worse, I see it making the same request again using a double encoding on the separator character in the GUID: %252D. When MachBlog uses the value from the URL to query the correct blog post, it encounters an error because the parameter isn't a standard 35-character GUID.
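The two layers of encoding are easy to see with Python's standard urllib.parse.unquote. This is only a sketch of the encoding itself, not anything MachBlog does; decode_fully is a hypothetical helper and the ID fragment is invented.

```python
from urllib.parse import unquote


def decode_fully(value, max_rounds=5):
    """Percent-decode repeatedly until the value stops changing (or max_rounds is hit)."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value


if __name__ == "__main__":
    doubly_encoded = "0A1B2C3D%252D4E5F"
    print(unquote(doubly_encoded))       # one round leaves %2D behind
    print(decode_fully(doubly_encoded))  # repeated rounds recover the dash
```

One round of decoding turns %252D into %2D, and a second round turns that into the dash.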

I don't see any other spiders making this error. However, I suspect what happened is that the MSNBot parsed the XML RSS feed like a normal HTML page, and added all the /rss/channel/item/link text nodes to the URL parse stack. MachBlog uses the URLEncodedFormat function when building the URL. This may have been changed in newer versions of MachBlog -- I didn't check. A fairly simple fix would be to check the format of URL.entryId before using it in a query.
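The format check mentioned above could look something like the following. In MachBlog it would be written in CFML against URL.entryId; this Python version is just a sketch of the idea, assuming the 8-4-4-16 layout of hex digits that the 35-character ID uses. The function name is hypothetical.

```python
import re

# Assumed layout: 8-4-4-16 hex digits separated by dashes (35 characters in total).
CF_UUID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{16}", re.IGNORECASE
)


def is_valid_entry_id(value):
    """True only when the whole value is a well-formed 35-character ID."""
    return CF_UUID.fullmatch(value) is not None


if __name__ == "__main__":
    print(is_valid_entry_id("0A1B2C3D-4E5F-6071-8293A4B5C6D7E8F9"))
    print(is_valid_entry_id("0A1B2C3D%2D4E5F%2D6071%2D8293A4B5C6D7E8F9"))
```

A value that fails the check can be rejected or redirected before it ever reaches the query.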

I decided to attack the problem at the web server instead. Some mod_rewrite rules match the pattern of a percent encoded GUID and break it into groups. The subsequent rule uses the groups to build a redirection.

RewriteCond %{QUERY_STRING} entryId=([\w]{8})%2D([\w]{4})%2D([\w]{4})%2D([\w]{16}) [NC]
RewriteRule .* /machblog/index.cfm?event=showEntry&entryId=%1-%2-%3-%4 [NE,R,L]
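For anyone who wants to try the pattern outside of Apache, here is a rough Python equivalent of the condition and rule. The regular expression and the target URL are taken from the rules above; the function name and the sample query string are invented. Like the rule, it only handles a single level of encoding (%2D).

```python
import re

# Mirrors the RewriteCond: four groups separated by a percent-encoded dash.
ENCODED_ID = re.compile(
    r"entryId=(\w{8})%2D(\w{4})%2D(\w{4})%2D(\w{16})", re.IGNORECASE
)


def redirect_target(query_string):
    """Return the redirect URL for an encoded ID, or None when the pattern does not match."""
    match = ENCODED_ID.search(query_string)
    if match is None:
        return None
    # Rebuild the ID from the captured groups, like %1-%2-%3-%4 in the RewriteRule.
    entry_id = "-".join(match.groups())
    return "/machblog/index.cfm?event=showEntry&entryId=" + entry_id


if __name__ == "__main__":
    query = "event=showEntry&entryId=0A1B2C3D%2D4E5F%2D6071%2D8293A4B5C6D7E8F9"
    print(redirect_target(query))
```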

What if the event was something other than "showEntry"? Well, assuming that the cause of this whole problem is that the XML RSS feed is being parsed as an HTML page, that's the only event specified.

A Fresh MachBlog

This is a brand new installation of MachBlog. :) I checked out /branches/1.1 and made several tweaks to tune it to my installation. Setting up the application server using Apache, Tomcat, and Open BlueDragon was a bit of a trick. After quite a bit of experimentation, I found a solution that I'm happy with.

The server is a Vivio Linux VPS running CentOS 4.7 and Apache 2.0. Unfortunately, the Extra Packages for Enterprise Linux 4 (EPEL) repository from the Fedora Project doesn't have a current OpenJDK. I downloaded Java JDK 1.6 directly from Sun. This older version of Apache also doesn't come with the mod_proxy_ajp module. I considered mod_jk, but figured that a plain ol' HTTP proxy would be fine. Apache is configured to read static files from a regular user's home directory, but proxy the CFML requests. The result of the proxy request is modified with the ProxyPassReverse directive so that response headers don't use the localhost:8080 server and port.

<VirtualHost *:80>
    DocumentRoot /home/user/webroot
    RewriteCond %{SCRIPT_FILENAME} \.cfm$
    RewriteRule /(.*)$ http://localhost:8080/$1 [P]
    ProxyPassReverse / http://localhost:8080/
</VirtualHost>
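As a rough model of what this virtual host does, the Python sketch below makes the same two decisions: where a request path goes, and how a backend redirect header is cleaned up. It is a simplification that ignores query strings and the rest of Apache's request handling. The function names and the public host name (www.example.com) are placeholders, not part of my configuration.

```python
BACKEND = "http://localhost:8080"
DOCUMENT_ROOT = "/home/user/webroot"


def route(path):
    """Decide how a request path is served: proxied to Tomcat or read from disk."""
    if path.endswith(".cfm"):
        return ("proxy", BACKEND + path)
    return ("static", DOCUMENT_ROOT + path)


def reverse_location(location, public_base="http://www.example.com"):
    """Rewrite a backend Location header so clients never see localhost:8080.

    public_base is a placeholder for the site's real host name.
    """
    if location.startswith(BACKEND + "/"):
        return public_base + location[len(BACKEND):]
    return location


if __name__ == "__main__":
    print(route("/machblog/index.cfm"))
    print(route("/images/logo.png"))
    print(reverse_location("http://localhost:8080/machblog/index.cfm"))
```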

Tomcat is configured to run as the ordinary user, rather than as a system service. Therefore, all the Tomcat and Open BlueDragon files are within the user's home directory, and owned by that user, which simplifies file permissions when the application server needs to write uploaded files. Any changes to this user's Tomcat or Open BlueDragon installation, including updates and crashes, aren't system-wide. The Tomcat root is /home/user/server/tomcat. All of the sample web applications have been removed from $TOMCAT/webapps. There is a host context file $TOMCAT/conf/Catalina/localhost/ROOT.xml that sets the docBase attribute to /home/user/server/openbluedragon.

I checked out the source for Open BlueDragon yesterday and built the WAR file to deploy manually. In this hosting configuration, I must add symbolic links for directories in $OPENBD (/home/user/server/openbluedragon) to real directories in $WEBROOT (/home/user/webroot), such as $OPENBD/MachII to $WEBROOT/MachII. There aren't many directories in the root, so it isn't much of an issue.
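I made the links by hand, but the step could be scripted. The following Python sketch assumes that every top-level directory in $WEBROOT should get a link of the same name in $OPENBD, which may be more than a given setup needs. The function name is hypothetical.

```python
from pathlib import Path


def link_webroot_dirs(webroot, openbd):
    """Create a symlink in openbd for each directory in webroot; return the names created."""
    created = []
    for source in sorted(Path(webroot).resolve().iterdir()):
        target = Path(openbd) / source.name
        # Skip plain files and anything that already exists in the Open BlueDragon root.
        if source.is_dir() and not target.exists() and not target.is_symlink():
            target.symlink_to(source, target_is_directory=True)
            created.append(source.name)
    return created


if __name__ == "__main__":
    # Paths from the post; adjust for a different layout.
    print(link_webroot_dirs("/home/user/webroot", "/home/user/server/openbluedragon"))
```

Running it a second time creates nothing new, since existing links are skipped.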

Note: Don't forget to add the JavaMail jar to $OPENBD/WEB-INF/lib. Again. :)

The following is an oversimplified, and perhaps quite useless, diagram of this hosting setup. I'm looking for a term for this arrangement. Possibly Ordinary User Tomcat and Open BlueDragon Installation, as opposed to Shared Server Tomcat and Open BlueDragon Installation. Unfortunately, these don't indicate where the static web files are located, and how CFML requests are handled.