On Oct 9, 2006, at 4:52 PM, Kurt T Stam wrote:
> UTF-8 can consume up to 6 bytes per character. UCS2 is strictly 2
> bytes.
> However most people prefer the 'backwards compatible' utf-8 where
> ASCII range characters still only consume 1 byte, so it should NOT
> overflow using ASCII, but it might using Asian characters. BTW, on
> average it takes 3 bytes per character for Asian characters, so a
> rule of thumb is to increase your string lengths by 3 when doing
> i18n.
>
> Any db will have this 'problem'..
To some extent this is true, but many other databases seem to "hide"
this internally by treating a field size as a number of characters
rather than a number of bytes. In other words, if you are using a
multi-byte character set like UTF-8, the database reserves 3 bytes per
character, and you declare a column of 255 characters, then internally
it allocates 765 bytes to cover the 255 characters you asked for.
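
To make the arithmetic concrete, here is a rough Python sketch. The
function names utf8_len and reserved_bytes are made up for
illustration, and the 3-bytes-per-character figure is just the rule of
thumb from this thread, not a statement about how any particular
database works internally.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def reserved_bytes(max_chars: int, bytes_per_char: int = 3) -> int:
    """Bytes a character-sized column would need to reserve.

    bytes_per_char=3 is an assumption (the rule of thumb above).
    """
    return max_chars * bytes_per_char


if __name__ == "__main__":
    print(utf8_len("hello"))    # ASCII: 1 byte per character
    print(utf8_len("日本語"))    # these CJK characters: 3 bytes each
    print(reserved_bytes(255))  # 255 characters * 3 bytes
```

A column sized in bytes would truncate or reject the second string well
before 255 characters; a column sized in characters would not.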
In the 4 series, MySQL didn't do this, hence the Latin character set
default. I don't know whether this has changed in the 5 series, but it
sure would be nice!
-David