UTF-8 encoding and BOM java bug


Adam Heath-2
The Unicode specification says that a file 'may' start with a BOM (U+FEFF).  The
reader of the bytes can then look at how the BOM is encoded and
pick the correct encoding: UTF-8, UTF-16 (LE/BE), or UTF-32 (LE/BE).  If the
file does start with a BOM, it must be removed.

A BOM anywhere else in the datastream is left alone.
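
As a rough illustration of the detection step described above, here is a minimal sketch that maps the leading bytes of a stream to an encoding name. The class name BomSniffer and its detect method are hypothetical, made up for this example. Note the assumption that FF FE 00 00 is treated as UTF-32LE, even though it could also be a UTF-16LE BOM followed by U+0000.

```java
// Hypothetical helper, not part of the JDK or of any project mentioned here.
final class BomSniffer {

    /**
     * Returns the encoding implied by a leading BOM, or null if the
     * bytes do not start with one. Only the first 4 bytes are examined.
     */
    static String detect(byte[] head) {
        // UTF-32 must be checked before UTF-16: the UTF-32LE BOM
        // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'h', (byte) 'i'};
        System.out.println(detect(sample));
    }
}
```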

However, lovely Java doesn't do this correctly.  The UTF-8 decoder does
*not* remove the BOM.  Only the others do.  The bug about this is at
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
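
A small sketch that should reproduce the difference, assuming a JDK that behaves as the bug report describes. The class and method names here (BomDecodeDemo, decodeUtf8, decodeUtf16) are made up for the example; only String and StandardCharsets are real JDK classes. The UTF-16 case uses the generic "UTF-16" charset, which reads the BOM to pick the byte order.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class illustrating the behaviour discussed above.
final class BomDecodeDemo {

    // Decodes with the JDK's UTF-8 decoder; a leading BOM is kept as U+FEFF.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes with the generic UTF-16 charset; the BOM selects the byte
    // order and is consumed by the decoder.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'h', (byte) 'i'};
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0x00, (byte) 'h', 0x00, (byte) 'i'};

        String fromUtf8 = decodeUtf8(utf8);
        String fromUtf16 = decodeUtf16(utf16);

        // The UTF-8 result carries an extra leading character.
        System.out.println("UTF-8 length:  " + fromUtf8.length());
        System.out.println("UTF-8 starts with U+FEFF: " + (fromUtf8.charAt(0) == '\uFEFF'));
        System.out.println("UTF-16 length: " + fromUtf16.length());
    }
}
```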

I'm sending this to the list, because UTF-8 is the only sensible
encoding to use nowadays, and this might crop up here.  I don't really
have a fix yet.

I'm going to have to deal with this in Webslinger, so I'll develop a
change there, and then alter the OFBiz code with the same kind of logic.
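
For what it's worth, one possible shape for such a change is sketched below. This is not the Webslinger or OFBiz fix, just a common workaround: peek at the first three bytes through a PushbackInputStream and push them back unless they are the UTF-8 BOM. The class Utf8BomSkipper and its methods are hypothetical names; the JDK classes used are real.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a sketch of one workaround, not an official fix.
final class Utf8BomSkipper {

    /** Wraps the stream in a UTF-8 reader, dropping a leading BOM if present. */
    static Reader openUtf8(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until
        // three bytes are available or the stream ends.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: put the bytes back so the decoder sees them.
            pushback.unread(head, 0, n);
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }

    /** Convenience for small inputs: decodes a whole byte array. */
    static String decode(byte[] bytes) {
        try (Reader reader = openUtf8(new ByteArrayInputStream(bytes))) {
            StringBuilder out = new StringBuilder();
            char[] buf = new char[1024];
            int len;
            while ((len = reader.read(buf)) != -1) {
                out.append(buf, 0, len);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'h', (byte) 'i'};
        System.out.println(decode(withBom).length());
    }
}
```

Only the first three bytes are inspected, so a U+FEFF later in the stream passes through untouched, which matches the rule that a BOM anywhere else is left alone.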