I'm assuming neither of those machines has MS Word installed (or at least not a 200x version), and therefore doesn't have the Arial Unicode MS font?
I stalled at Word 97. I am unlikely to upgrade from there as I have yet to see any useful additional functionality in newer versions. Especially since I use OO.o!
That font is indeed not installed on either of those PCs.
UTF-8 is capable of representing any Unicode character - well, "code point" to be precise. The "8" merely refers to the fact that it uses 8-bit blocks to represent a code point, where a code point may require between 1 and 4 blocks.
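A quick sketch of that point, using only the Python standard library: each of these characters encodes to a different number of 8-bit units, from 1 up to the maximum of 4.

```python
# Show that UTF-8 uses between 1 and 4 eight-bit code units per code point.
# The sample characters are chosen to hit each length: ASCII, Latin-1
# supplement, Basic Multilingual Plane, and beyond the BMP.
for ch in ("A", "é", "€", "𝄞"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):05X} {ch!r} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

Running it shows 1, 2, 3 and 4 bytes respectively, which is the variable-length behaviour described above.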
I see. I had no idea it could use an escape sequence to get in more characters. Although it does seem like a lot of effort just to avoid doing things properly.
The test page worked perfectly for me in IE8 and Firefox on Windows 7, but not in Chrome. Interestingly, looking at the page source shows the correct character too. Windows 7 looks like it comes with a handful of fonts that do contain Chinese characters.
"UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode."
UTF-8 is doing things properly, and it's no effort given that support is ubiquitous and generally more reliable than support for UTF-16 (or more likely UCS-2), which isn't much used other than as part of the internals of Windows (where it causes no end of problems).
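To illustrate the UTF-16/UCS-2 distinction mentioned there: UCS-2 is fixed at one 16-bit unit per character, so it simply cannot represent code points above U+FFFF. UTF-16 handles them with a "surrogate pair" of two 16-bit units, which is exactly the sort of thing Windows internals have to get right. A minimal demonstration in Python:

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual
# Plane, so UTF-16 needs two 16-bit code units (a surrogate pair) for it.
ch = "𝄞"  # U+1D11E
utf16 = ch.encode("utf-16-be")  # big-endian, no byte-order mark
print(len(utf16))       # 4 bytes = two 16-bit units
print(utf16.hex(" "))   # d8 34 dd 1e - the high and low surrogates
```

A UCS-2 system would either reject this character or mishandle the pair as two separate "characters".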
Your whole toolchain should be set up for UTF-8 by default:

- your authoring tool should open any encoding and output UTF-8 by default;
- your database should both accept and output UTF-8 by default (although it may use whatever encoding it likes internally, as long as that encoding doesn't result in data loss or corruption, and as long as data input in a non-UTF-8 encoding is correctly decoded);
- your web server should serve UTF-8 by default (correctly decoding and re-encoding static text files that have been saved in a different encoding as necessary);
- your web browser should accept UTF-8 by default (whilst allowing you to choose a different encoding if, for example, some imbecile has saved something as Windows-1252 and the file system is so useless as to include no information about the file's encoding, thus preventing the server re-encoding it as described above);
- your email client should both send and accept UTF-8 by default (again with the option to try other encodings for received messages in case the content encoding has been misdescribed by a poorly-configured server or sending email client).
That way you fit in with what the rest of the world is doing; and in fact, if you're using up-to-date software, that will be what you've already got. The days of ISO-8859-1 and its bastard cousins as acceptable encodings were over donkey's years ago; and you should only use UTF-16 on the web if you want to double the amount of data you send and receive for everything that could be represented in the 8-bit character sets.
I can also recommend the O'Reilly book Unicode: The Definitive Guide, which is actually thicker than, to take a couple of examples off my bookcase, Mastering Regular Expressions and HTTP: The Definitive Guide.
Oh, and for Unicode-related stuff from one of Microsoft's internationalization (I18N) experts, visit Michael Kaplan's blog, Sorting It All Out. Older entries in particular are amusing for the sake of the Sesame Street-style postscript "This post brought to you by {some obscure Unicode code point that happens to have some relevance to the post}", even on non-technical posts, such as "This post brought to you by ॐ (U+0950, DEVANAGARI OM)".
Raymond Chen has also posted various things over the years about some of the ways Windows deals with encodings and the problems relating thereto; "Some files come up strange in Notepad" is but one example that I remember, but it's well worth searching his blog for "UTF-16" and "Unicode" to find out an enormous amount more.
If one was, for example, creating a content management system then limiting oneself to the ISO Latin-1 alphabet would be extremely short-sighted. By using Unicode, one ensures that the product is still usable for users from cultures that use non-Latin alphabets, such as are found in Russia, India, Greece, Korea, Japan, China, the USA (Cherokee anyone?)... the list goes on.
Also, how else are you going to represent that Ancient Greek music other than in Ancient Greek Musical Notation, or transcribe those tablets written in Linear B that have been sitting in the loft for years? You can also report the matches at the pub's Monday night Dominoes Club.
Generally speaking, given that the first "W" in "WWW" stands for "World", it's usually best to do things in a way that will work for anybody in the world.