Thursday, January 21, 2010

Character Encoding Hell

Over the years, one of the perennial hassles that web services programmers grapple with is undoubtedly character encoding. One of the biggest contributing factors goes all the way back to the original definition of URL-encoding, which failed to specify how to deal with UTF-8 or other non-ASCII encodings outside the reserved character set. The current spec states that if you want to send UTF-8 strings, for example "てすと" (te-su-to), then you should percent-encode each byte of the UTF-8 byte sequence (the first character, て, becomes %E3%81%A6).
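To make the byte-by-byte rule concrete, here is a minimal Java sketch of it. The class and method names are made up for illustration, and the set of characters left unescaped is the "unreserved" set from RFC 3986; treat it as a rough demonstration, not a full URL encoder.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class PercentEncodeSketch {

    // Percent-encode every UTF-8 byte that is not an RFC 3986 unreserved character.
    static String percentEncodeUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean unreserved =
                    (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u3066\u3059\u3068" is てすと, written with escapes so the
        // source file's own encoding does not matter.
        System.out.println(percentEncodeUtf8("\u3066\u3059\u3068"));
        // prints %E3%81%A6%E3%81%99%E3%81%A8
    }
}
```

Each of the three characters takes three bytes in UTF-8, so the encoded form has nine %XX groups.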

Two excellent pages for test data, should you need to test some multi-byte UTF-8 characters, are:

Sadly for AJAX applications, actually generating these percent-encoded byte sequences is non-trivial, and doesn't seem to be readily available "off the shelf". I've had to resort to JS source such as this. Without such scripts, various attempts to use "off the shelf" JS methods produce wacky results, such as Unicode escape representations of the strings (like \uXXXX), which are completely useless for transmission in a URL.

On the server, there are also problems in receiving and properly decoding the bytes. One of the biggest is that Java web servers don't do it in a standard way. For example, when I was working on Tomcat 5.x, the Servlet getParameter method would assume ISO-8859-1 (Latin-1) encoding, which would garble any properly percent-encoded UTF-8 bytes. There is a well-hidden setting to switch the default to UTF-8.
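Here is a small Java sketch of what that garbling looks like, and of the usual after-the-fact repair. The class and method names are invented for illustration, and nothing here calls Tomcat; it only imitates a container that decodes UTF-8 bytes as Latin-1. If memory serves, the hidden setting is the URIEncoding attribute on the Connector element in server.xml, but check the docs for your Tomcat version.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class Latin1GarbleSketch {

    // What a container does when it wrongly assumes ISO-8859-1.
    static String decodeAsLatin1(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Recover the original: turn the garbled chars back into the same
    // bytes, then decode those bytes as UTF-8. This works because
    // ISO-8859-1 maps every byte value to exactly one char.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u3066\u3059\u3068"; // てすと
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        String garbled = decodeAsLatin1(wire);
        System.out.println(garbled.length());                 // prints 9
        System.out.println(repair(garbled).equals(original)); // prints true
    }
}
```

The repair is a workaround, not a fix: it only helps when you know the container decoded as Latin-1, so configuring the container properly is still the better route.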

On the other hand, Jetty *does* assume the URL is UTF-8 percent-encoded (love Jetty!!). So without some config tweaks, don't expect servlet containers to *uniformly* deliver properly decoded UTF-8 strings.

My latest installment of UTF-8 Hell comes as I implement some more REST capabilities. Here is the content of a message I just posted to Paul Sandoz of Jersey fame (man, that guy must never sleep; he is *on the ball* on the Jersey mailing list). Mad props to Jersey. It is just killer. But I digress:

Hi Paul,

I got to the bottom of this by trying to unmarshal the string in three different ways. As I already mentioned, the first way was just to call FormDataBodyPart.getValue().toString(). This produced the improperly decoded String.

Then I tried two other ways, both of which correctly unmarshalled the bytes from the POST. As supporting information, here is the CURL line I was testing with, and an excerpt from the CURL trace, showing the proper bytes being posted.

curl --trace traced -F line=てすと http://localhost:8080/nextdb/rest/geoff/testchars/lines/rowid/PK/1

The 9 bytes e3 81 a6 e3 81 99 e3 81 a8 (offsets 0x5b through 0x63 in the trace below) are the three Japanese characters.

=> Send data, 148 bytes (0x94)
0000: 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d ----------------
0010: 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 63 61 --------------ca
0020: 33 32 31 65 31 30 39 66 36 37 0d 0a 43 6f 6e 74 321e109f67..Cont
0030: 65 6e 74 2d 44 69 73 70 6f 73 69 74 69 6f 6e 3a ent-Disposition:
0040: 20 66 6f 72 6d 2d 64 61 74 61 3b 20 6e 61 6d 65  form-data; name
0050: 3d 22 6c 69 6e 65 22 0d 0a 0d 0a e3 81 a6 e3 81 ="line".........
0060: 99 e3 81 a8 0d 0a 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d ......----------
0070: 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d ----------------
0080: 2d 2d 2d 2d 63 61 33 32 31 65 31 30 39 66 36 37 ----ca321e109f67
0090: 2d 2d 0d 0a          

Both of the following two methods will properly unmarshal the correct String:

Method 1) use InputStream to get raw bytes

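    // (the rest of this call is cut off in the original post)
    InputStream is = theFormDataBodyPart.
    try {
        // Util.readInputStream is a helper that is not shown here
        byte[] bytes = Util.readInputStream(is, 1024 * 1024, 1024 * 1024 * 1024);
        log.debug("this many bytes: " + bytes.length);
        for (byte b : bytes) {
            // (loop body is missing from the original post)
        }
        // decode the raw bytes explicitly as UTF-8
        String s = new String(bytes, "UTF-8");
        return s;
    } catch (IOException ex) {
        throw new RuntimeException(ex.getMessage(), ex);
    }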
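For anyone who wants to try the raw-bytes approach without Jersey on the classpath, here is a self-contained sketch of the same idea. The class and method names are hypothetical, the plain read loop stands in for my Util.readInputStream helper, and a ByteArrayInputStream holding the 9 bytes from the curl trace stands in for the body part's stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Method 1, for illustration only.
class RawBytesDecodeSketch {

    // Read the whole stream, then decode the bytes explicitly as UTF-8.
    static String decodeUtf8(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException ex) {
            throw new RuntimeException(ex.getMessage(), ex);
        }
    }

    public static void main(String[] args) {
        // The 9 payload bytes from the curl trace.
        byte[] posted = {
            (byte) 0xe3, (byte) 0x81, (byte) 0xa6,
            (byte) 0xe3, (byte) 0x81, (byte) 0x99,
            (byte) 0xe3, (byte) 0x81, (byte) 0xa8
        };
        String s = decodeUtf8(new ByteArrayInputStream(posted));
        System.out.println(s.length());                      // prints 3
        System.out.println(Integer.toHexString(s.charAt(0))); // prints 3066
    }
}
```

The point is that the charset is named explicitly at the moment the bytes become a String, so nothing depends on the platform or container default.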

Method 2) use theFormDataBodyPart.

