I'm assuming neither of those machines has MS Word installed (or at least not a 200x version), and therefore doesn't have the Arial Unicode MS font?
I stalled at Word 97. I am unlikely to upgrade from there as I have yet to see any useful additional functionality in newer versions. Especially since I use OO.o!
That font is indeed not installed on either of those PCs.
UTF-8 is capable of representing any Unicode character - well, "code point" to be precise. The "8" merely refers to the fact that it uses 8-bit blocks to represent a code point, where a code point may require between 1 and 4 blocks.
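A quick sketch of that point, using only the Python standard library: each of these characters encodes to a different number of 8-bit units, from 1 up to the maximum of 4.

```python
# Show that UTF-8 uses between 1 and 4 eight-bit code units per code point.
# The sample characters are chosen to hit each length: ASCII, Latin-1
# supplement, Basic Multilingual Plane, and beyond the BMP.
for ch in ("A", "é", "€", "𝄞"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):05X} {ch!r} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

Running it shows 1, 2, 3 and 4 bytes respectively, which is the variable-length behaviour described above.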
I see. I had no idea it could use an escape sequence to get in more characters. Although it does seem like a lot of effort just to avoid doing things properly.
The test page worked perfectly for me in IE8 and Firefox on Windows 7, but not in Chrome. Interestingly, looking at the page source shows the correct character too. Windows 7 looks like it comes with a handful of fonts that do contain Chinese characters.
"UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode."
UTF-8 is doing things properly, and it's no effort given that support is ubiquitous and generally more reliable than support for UTF-16 (or more likely UCS-2), which isn't much used other than as part of the internals of Windows (where it causes no end of problems).
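To illustrate the UTF-16/UCS-2 distinction mentioned there: UCS-2 is fixed at one 16-bit unit per character, so it simply cannot represent code points above U+FFFF. UTF-16 handles them with a "surrogate pair" of two 16-bit units, which is exactly the sort of thing Windows internals have to get right. A minimal demonstration in Python:

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual
# Plane, so UTF-16 needs two 16-bit code units (a surrogate pair) for it.
ch = "𝄞"  # U+1D11E
utf16 = ch.encode("utf-16-be")  # big-endian, no byte-order mark
print(len(utf16))       # 4 bytes = two 16-bit units
print(utf16.hex(" "))   # d8 34 dd 1e - the high and low surrogates
```

A UCS-2 system would either reject this character or mishandle the pair as two separate "characters".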
Your whole toolchain should be set up for UTF-8 by default:

- your authoring tool should open any encoding and output UTF-8 by default;
- your database should both accept and output UTF-8 by default (although it may use whatever encoding it likes internally, as long as that encoding doesn't result in data loss or corruption, and as long as data input in a non-UTF-8 encoding is correctly decoded);
- your web server should serve UTF-8 by default (correctly decoding and re-encoding static text files that have been saved in a different encoding as necessary);
- your web browser should accept UTF-8 by default (whilst allowing you to choose a different encoding if, for example, some imbecile has saved something as Windows-1252 and the file system is so useless as to include no information about the file's encoding, thus preventing the server re-encoding it as described above);
- your email client should both send and accept UTF-8 by default (again with the option to try other encodings for received messages in case the content encoding has been misdescribed by a poorly-configured server or sending email client).
That way you fit in with what the rest of the world is doing; and in fact, if you're using up-to-date software, that will be what you've already got. The days of ISO-8859-1 and its bastard cousins as acceptable encodings were over donkey's years ago; and you should only use UTF-16 on the web if you want to double the amount of data you send and receive for everything that could be represented in the 8-bit character sets.
I can also recommend the O'Reilly book Unicode: The Definitive Guide, which is actually thicker than, to take a couple of examples off my bookcase, Mastering Regular Expressions and HTTP: The Definitive Guide.
Oh, and for Unicode-related stuff from one of Microsoft's internationalization (I18N) experts, visit Michael Kaplan's blog, Sorting It All Out. Older entries in particular are amusing for the sake of the Sesame Street-style postscript "This post brought to you by {some obscure Unicode code point that happens to have some relevance to the post}", even on non-technical posts, such as "This post brought to you by ॐ (U+0950, DEVANAGARI OM)".
Raymond Chen has also posted various things over the years about some of the ways Windows deals with encodings and the problems relating thereto; "Some files come up strange in Notepad" is but one example that I remember, but it's well worth searching his blog for "UTF-16" and "Unicode" to find out an enormous amount more.
If one was, for example, creating a content management system then limiting oneself to the ISO Latin-1 alphabet would be extremely short-sighted. By using Unicode, one ensures that the product is still usable for users from cultures that use non-Latin alphabets, such as are found in Russia, India, Greece, Korea, Japan, China, the USA (Cherokee anyone?)... the list goes on.
Also, how else are you going to represent that Ancient Greek music other than in Ancient Greek Musical Notation, or transcribe those tablets written in Linear B that have been sitting in the loft for years? You can also report the matches at the pub's Monday night Dominoes Club.
Generally speaking, given that the first "W" in "WWW" stands for "World", it's usually best to do things in a way that will work for anybody in the world.