
Chinese and RTF



    I hate RTF, but it's one of those things that won't go away.

    I'm trying to copy and paste Chinese characters from Charmap on Windows 7 into my app using RTF. Some of them work, some don't. Using the Microsoft YaHei font, this is what I get for one that works:

    Code:
    {\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset134 Microsoft YaHei;}
    {\f1\fnil\fcharset0 MS Shell Dlg 2;}}
    {\*\generator Msftedit 5.41.21.2509;}\viewkind4\uc1\pard\f0\fs20\u15431?\f1\fs17\par
    }
    The important bit is \u15431? (the trailing ? is the fallback character for readers that don't understand \u). The \u tag sends any Unicode character as a decimal value, and 15431 is indeed the correct code, so I get the correct character out.
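
    For anyone parsing this by hand, here is a minimal Python sketch of pulling the \u values out of a fragment like the one above. The function name decode_u_escapes is made up for illustration, and it assumes the values are written as signed 16-bit decimals (so anything above 32767 shows up negative), which is how I read the RTF spec. It ignores the fallback characters that follow each \u.

```python
import re

def decode_u_escapes(rtf: str) -> list:
    """Return the characters named by \\uN control words in an RTF fragment.

    Hypothetical helper for illustration; it does not skip the fallback
    characters counted by \\ucN, it only reads the \\uN values themselves.
    """
    chars = []
    # \uc1 is not matched because a digit (or minus sign) must follow \u.
    for match in re.finditer(r"\\u(-?\d+)", rtf):
        n = int(match.group(1))
        if n < 0:
            # Assumption: negative values are signed 16-bit, so wrap them.
            n += 65536
        chars.append(chr(n))
    return chars

fragment = r"\viewkind4\uc1\pard\f0\fs20\u15431?\f1\fs17\par"
for ch in decode_u_escapes(fragment):
    print(hex(ord(ch)))  # 0x3c47
```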

    If I try a character a bit higher up, in this case 0x7100, this is what I get:

    Code:
    {\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset134 Microsoft YaHei;}{\f1\fnil\fcharset0 MS Shell Dlg 2;}}
    {\*\generator Msftedit 5.41.21.2509;}\viewkind4\uc1\pard\f0\fs20\'9f\'57\f1\fs17\par
    }
    This time, rather than send me a \u, it's sending \'9f\'57. A \' sends a two-digit hex value to cover the range 128-255, which you should then translate according to the code page. But that's two characters, not the one I was expecting, and the code page is 1252, which is normal ANSI, and the language is 2057, which is UK English.

    I don't understand how I'm meant to get from two characters, 0x9f and 0x57, to 0x7100, and it's not UTF-8 (which would be 3 bytes, and I don't think RTF uses UTF-8 anyway). The only other thing is the charset on the font (134 = Chinese), but I'm not sure how I get from that to a code page, and it would still give me two characters out, not the one I'm expecting.
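
    To rule out UTF-8, a quick check in Python (just a sanity test, nothing RTF-specific) shows what 0x7100 looks like as UTF-8 bytes:

```python
# Encode U+7100 as UTF-8 and compare with the two bytes seen in the RTF.
ch = chr(0x7100)
utf8 = ch.encode("utf-8")
print(utf8.hex(" "))        # e7 84 80
print(utf8 == b"\x9f\x57")  # False
```

    Three bytes, and none of them match 0x9f or 0x57, so whatever the \' pair is, it isn't UTF-8.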

    Does anybody understand all this? Does anybody speak Chinese?
    Will work inside IR35. Or for food.

    #2
    If it's any consolation I had a similar problem with a web site where I was transliterating the page title to construct URLs by changing ä to ae, ö to oe and ü to ue.

    On page creation it worked fine. On editing a page it would mysteriously change the UTF-8 umlauted characters passed to the URL transliteration module to UTF-16, which said module didn't understand.

    I'm afraid I never managed to get that one sorted. The proper solution would have been to extend the transliteration module to cope with UTF-16 as well, but it wasn't important enough at the time.
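
    For what it's worth, a rough sketch of how that could look in Python: decode the incoming bytes to text first, then transliterate, so the module never cares which encoding arrived. The names UMLAUTS and transliterate_title are invented for this example, and it assumes the caller knows (or is told) the encoding of the bytes.

```python
# Map each umlaut to its two-letter replacement (only the three mentioned above).
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

def transliterate_title(raw: bytes, encoding: str) -> str:
    """Decode the raw bytes with the given encoding, then replace umlauts."""
    return raw.decode(encoding).translate(UMLAUTS)

title = "Käse für Möwen"
print(transliterate_title(title.encode("utf-8"), "utf-8"))
print(transliterate_title(title.encode("utf-16"), "utf-16"))
```

    Both calls should give the same result, since the replacement runs on decoded text rather than on bytes.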
    Behold the warranty -- the bold print giveth and the fine print taketh away.



      #3
      It turns out it is a two-byte code for a single character, and you just have to know that for the Chinese charset you use code page 936 to translate to Unicode. I found a list, but I've no idea if it's complete:

      The Font Charset Property
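
      In case it helps anyone else, here is a small Python sketch of that translation. The table CHARSET_TO_CODEPAGE and the function decode_hex_run are my own names, the table is only a partial list of the charset values I'm confident about, and it only handles runs made up entirely of \'hh escapes.

```python
import re

# Partial mapping from \fcharsetN to a Python codec name (not a complete list).
CHARSET_TO_CODEPAGE = {
    0: "cp1252",   # ANSI
    128: "cp932",  # Japanese (Shift-JIS)
    129: "cp949",  # Korean
    134: "cp936",  # Simplified Chinese (GBK)
    136: "cp950",  # Traditional Chinese (Big5)
}

def decode_hex_run(rtf_run: str, fcharset: int) -> str:
    """Collect the \\'hh bytes in a run and decode them with the font's code page."""
    raw = bytes(int(h, 16) for h in re.findall(r"\\'([0-9a-fA-F]{2})", rtf_run))
    return raw.decode(CHARSET_TO_CODEPAGE[fcharset])

text = decode_hex_run(r"\'9f\'57", 134)
# Two bytes in, one character out; print its code point to compare with Charmap.
print(len(text), hex(ord(text)))
```

      The point is that the code page comes from the charset of the current font (\f0 here), not from \ansicpg in the header.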

      I don't understand why Charmap (and Write) would use Unicode for some characters and the old method for others. You'd think everything new would be using Unicode.



        #4
        Originally posted by VectraMan
        I don't understand why Charmap (and write) would use unicode for some characters, and use the old method for others. You'd think everything new would be using unicode.
        But are either of them new?

        It's a strange mishmash at the moment. Although Apple have been recommending the use of UTF-8 in their developer documentation for several years, until recently TextEdit's default was Mac OS Roman. And exporting to text from Apple's Pages appears to choose an encoding on the fly, depending on context.

        It's all very messy.
