Sunday, August 05, 2007

Getting REXML to play nicely with the non-English world

It's a large world out there, and not everyone uses the same character set. REXML does its best, converting everything into UTF-8 so you don't have to worry. Unfortunately, older versions of REXML (such as version 3.1.2.1, presently in Ubuntu) fail to correctly parse the encoding declaration of most XML feeds. Fortunately, the latest version (3.1.7 as of this writing) has fixed the regular expression so that it properly matches the encoding names. This is great for importing non-UTF-8 documents, but documents that declare themselves as UTF-8 are simply read in as-is, without being cleaned first. Most of the encoding and decoding in REXML is done using Iconv, so with a small patch against version 3.1.7 (based on Secure UTF-8 Output in rails) we can make REXML strip out invalid UTF-8 byte sequences.

This not only helps our application; it also ensures that any XML documents we produce actually follow their declared encoding, making the world a slightly better place.
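
For readers on a current Ruby, the cleaning step described above can be sketched without touching REXML at all. This is not the patch from this post: Iconv was removed from Ruby's standard library in 2.0, so the sketch swaps in String#scrub (Ruby 2.1 and later) instead. The helper name strip_invalid_utf8 is made up for illustration and is not part of REXML.

```ruby
# Hypothetical helper (not part of REXML): drop byte sequences that are
# not valid UTF-8, keeping everything else untouched.
def strip_invalid_utf8(text)
  # Work on a copy, labelled as UTF-8 without changing any bytes.
  utf8 = text.dup.force_encoding(Encoding::UTF_8)
  # String#scrub replaces each invalid sequence with the given string,
  # here the empty string, so invalid bytes are simply removed.
  utf8.scrub('')
end

# A valid "é" (0xC3 0xA9) followed by a stray 0xFF byte.
dirty = "caf\xC3\xA9 \xFFok".b
clean = strip_invalid_utf8(dirty)

puts clean.valid_encoding?  # prints "true"
puts clean                  # prints "café ok"
```

Running input through something like this before handing it to the parser, or over output before writing it, is one way to get the same effect as the patch; whether dropping bytes or substituting a replacement character is the better choice depends on the application.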

3 comments:

rishi said...
This comment has been removed by a blog administrator.
Anonymous said...

But how do you update just REXML? It seems to be tightly integrated into Ruby.

Holden Karau said...

@Anonymous
You shouldn't have to worry about this anymore. REXML has been fixed upstream, so applying this patch would be counterproductive.
