Thursday, January 21, 2010

Character Encoding Hell

Over the years, one of the perennial hassles that web services programmers grapple with is undoubtedly character encoding. One of the biggest contributing factors goes all the way back to the fact that the original definition of URL-encoding failed to specify how to deal with UTF-8 or other non-ASCII encodings outside the reserved character set. The current spec states that if you want to send a UTF-8 string, for example "てすと" (te-su-to), then you should percent-encode each byte of its UTF-8 byte sequence (%E3%81%A6%E3%81%99%E3%81%A8; the first character alone is %E3%81%A6).
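As a quick sanity check of those bytes, here is a minimal Java sketch that percent-encodes the test string with the standard java.net.URLEncoder. The class and method names (PercentEncodeDemo, encode) are made up for illustration. Note that URLEncoder follows form-encoding rules, so a space would come out as "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PercentEncodeDemo {
    // Hypothetical helper: percent-encode each UTF-8 byte of the input.
    public static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u3066\u3059\u3068" is the test string te-su-to, written with escapes
        // so the source file encoding does not matter.
        System.out.println(encode("\u3066\u3059\u3068"));
        // prints %E3%81%A6%E3%81%99%E3%81%A8
    }
}
```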

Two excellent pages for test data, should you need to test some multi-byte UTF-8 characters are:

Sadly for AJAX applications, actually generating these percent-encoded byte sequences is non-trivial, and doesn't seem to be readily available "off the shelf". I've had to resort to JS source such as this. Without such scripts, various attempts to use "off the shelf" JS methods produce wacky results, such as Unicode escape representations of the strings (like \uXX\uXX), which are completely useless for transmission in a URL.

On the server, there are also problems in receiving and properly decoding the bytes. One of the biggest is that Java web servers don't do it in a standard way. For example, when I was working on Tomcat 5.x, the Servlet getParameter method would assume ISO-8859-1 (Latin 1) encoding, which would garble any properly percent-encoded UTF-8 bytes. There is a well-hidden setting to switch the default to UTF-8.


On the other hand, Jetty *does* assume the URL is UTF-8 percent-encoded (love Jetty!!). So without some config tweaks, don't expect servlet containers to *uniformly* deliver properly decoded UTF-8 strings.
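To see what the Latin 1 assumption does to the test string, here is a small Java sketch using only java.net.URLDecoder. It does not touch any servlet container; it only mimics decoding the same percent-encoded bytes under the two charsets. The class and method names (DecodeMismatchDemo, decode, repair) are invented for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeMismatchDemo {
    // Decode a percent-encoded string, interpreting the bytes in the given charset.
    public static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Undo a wrong Latin 1 decode: turn the chars back into the original bytes,
    // then decode those bytes as UTF-8. This works because ISO-8859-1 maps every
    // byte value to exactly one char.
    public static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = "%E3%81%A6%E3%81%99%E3%81%A8";
        String wrong = decode(encoded, "ISO-8859-1"); // 9 garbage chars, one per byte
        String right = decode(encoded, "UTF-8");      // the 3 intended characters
        System.out.println(wrong.length() + " chars vs " + right.length() + " chars");
        System.out.println(repair(wrong).equals(right));
    }
}
```

The repair step is only a workaround for when you can't change the container's configuration; fixing the decoding charset at the source is the cleaner option.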

My latest installment of UTF-8 Hell comes as I implement some more REST capabilities for Nextdb.net. Here is the content of a message I just posted to Paul Sandoz of Jersey fame (man, that guy must never sleep. he is *on the ball* on the Jersey mailing list). Mad props to Jersey. It is just killer. But I digress:

Hi Paul,

I got to the bottom of this by trying to unmarshal the string in three different ways. As I already mentioned, the first way was just to call FormDataBodyPart.getValue().toString(). This produced the improperly decoded String.

Then I tried two other ways, both of which correctly unmarshalled the bytes from the POST. As supporting information, here is the CURL line I was testing with, and an excerpt from the CURL trace, showing the proper bytes being posted.

curl --trace traced -F line=てすと http://localhost:8080/nextdb/rest/geoff/testchars/lines/rowid/PK/1

The 9 bytes e3 81 a6 e3 81 99 e3 81 a8 (offsets 0x5b through 0x63 below) are the three Japanese characters.

=> Send data, 148 bytes (0x94)
0000: 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d ----------------
0010: 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 63 61 --------------ca
0020: 33 32 31 65 31 30 39 66 36 37 0d 0a 43 6f 6e 74 321e109f67..Cont
0030: 65 6e 74 2d 44 69 73 70 6f 73 69 74 69 6f 6e 3a ent-Disposition:
0040: 20 66 6f 72 6d 2d 64 61 74 61 3b 20 6e 61 6d 65  form-data; name
0050: 3d 22 6c 69 6e 65 22 0d 0a 0d 0a e3 81 a6 e3 81 ="line".........
0060: 99 e3 81 a8 0d 0a 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d ......----------
0070: 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d ----------------
0080: 2d 2d 2d 2d 63 61 33 32 31 65 31 30 39 66 36 37 ----ca321e109f67
0090: 2d 2d 0d 0a          

Both of the following two methods will properly unmarshal the correct String:

Method 1: use an InputStream to get the raw bytes

    InputStream is = theFormDataBodyPart.getValueAs(InputStream.class);
    try {
        // read the raw bytes of the form part
        byte[] bytes = Util.readInputStream(is, 1024 * 1024, 1024 * 1024 * 1024);
        log.debug("this many bytes: " + bytes.length);
        // dump each byte in hex (masked to avoid sign extension)
        for (byte b : bytes) {
            log.debug(Integer.toHexString(0x00FF & b));
        }
        // decode explicitly as UTF-8
        String s = new String(bytes, "UTF-8");
        log.debug(s);
        return s;
    } catch (IOException ex) {
        throw new RuntimeException(ex.getMessage(), ex);
    }

Method 2: use theFormDataBodyPart.getValueAs(String.class)

Cheers,
geoff
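The first method in the message above depends on a Util.readInputStream helper that isn't shown. As a rough stand-in, here is a self-contained Java sketch of the same idea: read every byte from an InputStream, then decode explicitly as UTF-8. The names Utf8StreamDemo, readUtf8, and decodeBytes are hypothetical, and the size limits the original helper takes are not reproduced. The sample input is the 9 bytes from the curl trace.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8StreamDemo {
    // Drain the stream into memory and decode the bytes as UTF-8.
    public static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Convenience wrapper for in-memory data; rethrows IOException unchecked.
    public static String decodeBytes(byte[] body) {
        try {
            return readUtf8(new ByteArrayInputStream(body));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The nine body bytes from the trace: e3 81 a6 e3 81 99 e3 81 a8
        byte[] body = {
            (byte) 0xe3, (byte) 0x81, (byte) 0xa6,
            (byte) 0xe3, (byte) 0x81, (byte) 0x99,
            (byte) 0xe3, (byte) 0x81, (byte) 0xa8
        };
        String s = decodeBytes(body);
        System.out.println(s.length()); // prints 3
    }
}
```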

Monday, January 04, 2010

Well said....

The exchange between Churchill & Lady Astor: She said, "If you were my husband I'd give you poison." He said, "If you were my wife, I'd drink it."

A member of Parliament to Disraeli: "Sir, you will either die on the gallows or of some unspeakable disease." "That depends, Sir," said Disraeli, "whether I embrace your policies or your mistress."

"He had delusions of adequacy." - Walter Kerr

"He has all the virtues I dislike and none of the vices I admire." - Winston Churchill 


"I have never killed a man, but I have read many obituaries with great pleasure." - Clarence Darrow

"He has never been known to use a word that might send a reader to the dictionary." - William Faulkner (about Ernest Hemingway)


"Thank you for sending me a copy of your book; I'll waste no time reading it." - Moses Hadas

"I didn't attend the funeral, but I sent a nice letter saying I approved of it." - Mark Twain

"He has no enemies, but is intensely disliked by his friends." - Oscar Wilde

"I am enclosing two tickets to the first night of my new play; bring a friend.... if you have one." - George Bernard Shaw to Winston Churchill

"Cannot possibly attend first night, will attend second... if there is one." - Winston Churchill, in response.

"I feel so miserable without you; it's almost like having you here." - Stephen Bishop

"He is a self-made man and worships his creator." - John Bright

"I've just learned about his illness. Let's hope it's nothing trivial." - Irvin S. Cobb

"He is not only dull himself; he is the cause of dullness in others." - Samuel Johnson

"He is simply a shiver looking for a spine to run up." - Paul Keating 


"In order to avoid being called a flirt, she always yielded easily." - Charles, Count Talleyrand

"He loves nature in spite of what it did to him." - Forrest Tucker

"Why do you sit there looking like an envelope without any address on it?" - Mark Twain

"His mother should have thrown him away and kept the stork." - Mae West

"Some cause happiness wherever they go; others, whenever they go." - Oscar Wilde

"He uses statistics as a drunken man uses lamp-posts... for support rather than illumination." - Andrew Lang (1844-1912)

"He has Van Gogh's ear for music." - Billy Wilder

"I've had a perfectly wonderful evening. But this wasn't it." - Groucho Marx