Firefox 'funny' characters ? - Contractor UK Bulletin Board


NickFitz replied

23 November 2010, 13:11
Originally posted by Platypus

I think that ' does show ok in HTML, but ‘ and ’ (the curly varieties) do not.

It's not HTML, it's a separate issue. Browsers will display curly quotes in HTML perfectly well (whether as entities or just the raw characters like “”) - in fact, they can happily manage things like umbrellas ☂ and sunshine ☀ if you have a suitable font installed. The problem is that the ‘’ in the XML feed I'm grabbing are encoded as the single bytes 0x91 and 0x92 respectively. That is really the Windows-1252 encoding (in strict ISO-8859-1 those byte values are control characters), but the feed is treated as ISO-8859-1 and converted to UTF-8, which turns (e.g.) ‘ into the multibyte representation 0xc2 0x91, and that is what gets stored in the database. Then, when it's spat out by the forum software, the browser is told that it's receiving ISO-8859-1. Browsers read that label as Windows-1252, where 0xc2 is Â and 0x91 is the left single curly quote, so you see Â followed by the quote you were supposed to be getting all along.
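
The byte-level round trip described above can be reproduced in a few lines of Python. This is only a sketch: the variable names are illustrative, and it assumes the feed's quote bytes were read as ISO-8859-1 and that the browser treats an ISO-8859-1 label as Windows-1252, as browsers generally do.

```python
# Bytes as they arrive in the feed: left and right single curly quotes.
feed_bytes = b"\x91\x92"

# Step 1: the parser treats the bytes as ISO-8859-1, so each byte becomes
# the code point with the same number (U+0091 and U+0092, control characters).
parsed = feed_bytes.decode("iso-8859-1")

# Step 2: the text is stored as UTF-8, which needs two bytes per character.
stored = parsed.encode("utf-8")

# Step 3: the browser decodes the stored bytes as Windows-1252.
displayed = stored.decode("cp1252")

print(stored.hex(" "))  # c2 91 c2 92
```

After step 3, `displayed` holds four characters: Â, the left curly quote, Â, the right curly quote, which matches what shows up in the sidebar.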
Platypus replied

23 November 2010, 11:36
Originally posted by xoggoth

Before 8 IE was very forgiving of all sorts of things that were not in the "standards" (actually the way browsers should be in my opinion unless there's some important security consideration). You could even get away with .Width instead of .width in JScript. All the browsers can seem inconsistent, why does " in HTML show ok but ' doesn't?

I think that ' does show ok in HTML, but ‘ and ’ (the curly varieties) do not.
xoggoth replied

23 November 2010, 11:15
But I've been seeing this for years on FF.
And I just had a quick peek using IE8 - same thing!

Before 8 IE was very forgiving of all sorts of things that were not in the "standards" (actually the way browsers should be in my opinion unless there's some important security consideration). You could even get away with .Width instead of .width in JScript. All the browsers can seem inconsistent, why does " in HTML show ok but ' doesn't?
Sysman replied

23 November 2010, 10:48
Originally posted by Platypus

That's more than I could stomach

I've long been aware of what might happen to MySQL under Oracle's ownership, but to see it in OOo is still a shock.

I wish Apple would give their Numbers spreadsheet a serious boost. It's fine for the occasional user, but really doesn't cut the mustard for serious business style number crunching.
Sysman replied

23 November 2010, 10:43
Originally posted by OwlHoot

On many web sites that host news articles these will have trundled through several steps, being parsed and converted at each hop. So there's a fair chance some developer along the line will assume text is UTF-8 when it isn't, or vice versa. One often sees munged characters even on sites like the BBC and the Telegraph. (Well, no surprise with the last, as they've probably sacked most of their developers, but you'd expect the BBC to be a bit more savvy.)

Yep, saw some weirdness the other day on the Beeb's iPlayer "Play" page.
Platypus replied

23 November 2010, 10:39
Originally posted by Sysman

* I still haven't got used to seeing Oracle on the startup splash screen.

That's more than I could stomach
Sysman replied

23 November 2010, 10:37
Originally posted by bogeyman

The funny accented A's are just fancy curly opening/closing single or double quotes in this case.

It's CUK's content management editor at fault I think. It should translate non-standard characters into HTML entities.

A copy and paste from an OpenOffice document (yes, even a spreadsheet!) will do that. OpenOffice* will silently convert quotes and dashes to the fancy typographical versions by default. That might be OK in a word processing document but it's bloody criminal in a spreadsheet whose contents may be heading for a database.

I wouldn't be surprised if Word does the same, but I don't think Excel does.

* I still haven't got used to seeing Oracle on the startup splash screen.
NickFitz replied

23 November 2010, 00:47
Originally posted by bogeyman

Good on yer Nick, but shouldn't these characters be converted to HTML entities (&ldquo; etc.) at some point, before they hit the browser? The character encoding and code-page wouldn't matter then, would it?

Unfortunately, it's too late by the time it gets to the point where it makes sense to use HTML entities. The way it's set up at the moment is that the news is entered into the main site CMS, which saves a copy of the headlines as an XML file on the forum server (as well as shoving the stories into the main site database, of course). My vBulletin plugin checks that file's last modification date as and when, and if it's been updated it parses the XML and shoves the headlines into the forum database, ready to be displayed in the sidebar.
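
The check-then-parse step can be sketched as follows. The real plugin is a vBulletin plugin, so this Python version is purely illustrative: the file name, the `<headline>` element and the helper `read_headlines_if_changed` are all invented for the example, and the database write is left out. The point it tries to show is that handing the parser raw bytes lets it use the encoding named in the XML declaration rather than guessing.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_headlines_if_changed(path, last_seen_mtime):
    """Return (headlines, mtime); headlines is None if the file is unchanged."""
    mtime = os.path.getmtime(path)
    if mtime <= last_seen_mtime:
        return None, last_seen_mtime
    # Pass raw bytes so the parser applies the encoding declared in the file
    # instead of whatever the platform default happens to be.
    with open(path, "rb") as fh:
        root = ET.fromstring(fh.read())
    # <headline> is a hypothetical element name, not the real feed's schema.
    return [el.text for el in root.iter("headline")], mtime


# Demo with a throwaway file standing in for the CMS export.
with tempfile.TemporaryDirectory() as tmp:
    feed = os.path.join(tmp, "headlines.xml")
    with open(feed, "wb") as fh:
        fh.write(b'<?xml version="1.0" encoding="ISO-8859-1"?>'
                 b"<news><headline>Caf\xe9 tax</headline></news>")
    first, seen = read_headlines_if_changed(feed, 0.0)
    second, _ = read_headlines_if_changed(feed, seen)
```

The first call returns the parsed headlines; the second returns `None` because the modification time has not moved on.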

It's only at display time that it makes sense to replace oddball characters with entities, and by then it's too late, as the characters got screwed up either when the file was created, when it was parsed, or when the forum database was updated - my current best guess is the parsing, but I need to confirm that.

The good news is that the main site CMS is soon to be upgraded to a system that's UTF-8 from end to end, so that should make it easier to sort things out.
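
Until then, text that has already been through the ISO-8859-1 to UTF-8 mix-up can sometimes be repaired by running the steps backwards. A hedged Python sketch, assuming the damage is exactly the 0x91 to 0xc2 0x91 conversion discussed in this thread; `repair_stored_bytes` is a made-up helper name, and other kinds of damage would need a different approach.

```python
def repair_stored_bytes(stored):
    """Undo a wrong ISO-8859-1 -> UTF-8 conversion of Windows-1252 text."""
    # Undo the UTF-8 encoding, giving the code points the parser produced.
    as_parsed = stored.decode("utf-8")
    # Map each code point back to the single byte it came from.
    original_bytes = as_parsed.encode("iso-8859-1")
    # Decode those bytes with the encoding they were really written in.
    return original_bytes.decode("cp1252")


fixed = repair_stored_bytes(b"It\xc2\x92s fine")
```

Here `fixed` comes back with a proper right curly quote in place of the two stored bytes. The function raises an exception rather than guessing if the input contains anything that does not fit this pattern, which is probably the safer behaviour for a one-off database clean-up.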
OwlHoot replied

22 November 2010, 23:49
On many web sites that host news articles these will have trundled through several steps, being parsed and converted at each hop. So there's a fair chance some developer along the line will assume text is UTF-8 when it isn't, or vice versa. One often sees munged characters even on sites like the BBC and the Telegraph. (Well, no surprise with the last, as they've probably sacked most of their developers, but you'd expect the BBC to be a bit more savvy.)
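
Both wrong assumptions are easy to demonstrate. The Python sketch below uses Windows-1252 as the stand-in for "not UTF-8", which is an assumption made for the example; the variable names are illustrative only.

```python
curly = "\u2018quoted\u2019"  # text wrapped in single curly quotes

# Case 1: UTF-8 bytes wrongly read as Windows-1252. No error is raised;
# each curly quote silently turns into three junk characters.
utf8_read_as_cp1252 = curly.encode("utf-8").decode("cp1252")

# Case 2: Windows-1252 bytes wrongly read as UTF-8. 0x91 is not a valid
# UTF-8 start byte, so a strict decoder refuses...
try:
    curly.encode("cp1252").decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ...while a lenient decoder swaps in U+FFFD replacement characters.
cp1252_read_as_utf8 = curly.encode("cp1252").decode("utf-8", errors="replace")
```

Case 1 is the nastier of the two because nothing fails: the junk just travels on to the next hop and may get converted again.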
bogeyman replied

22 November 2010, 23:40
Originally posted by NickFitz

The headlines in the sidebar come from the content management system for the main site, but the character encoding is getting mucked up for things like curly quotes: I think it's coming from over there as ISO-8859-1 but with curly quotes thrown in, then being parsed as UTF-8, then being stuck in a database configured to use ISO-8859-1

I'll see about getting it fixed

Good on yer Nick, but shouldn't these characters be converted to HTML entities (&ldquo; etc.) at some point, before they hit the browser? The character encoding and code-page wouldn't matter then, would it?
NickFitz replied

22 November 2010, 22:46
The headlines in the sidebar come from the content management system for the main site, but the character encoding is getting mucked up for things like curly quotes: I think it's coming from over there as ISO-8859-1 but with curly quotes thrown in, then being parsed as UTF-8, then being stuck in a database configured to use ISO-8859-1

I'll see about getting it fixed
bogeyman replied

22 November 2010, 20:44
Originally posted by Platypus

I'm on Win XP SP3, native (not VM) running FF 3.6.12
But I've been seeing this for years on FF.
And I just had a quick peek using IE8 - same thing!

I tried to chase this down once, and read lots of forum posts about character sets, but the replying geeks were so busy trying to out-geek each other with what-ifs and wherefores that any useful information (i.e. a simple fix) was completely obscured

What it basically comes down to is that the text content has characters that are not part of the common character set.

The funny accented A's are just fancy curly opening/closing single or double quotes in this case.

It's CUK's content management editor at fault I think. It should translate non-standard characters into HTML entities.

That doesn't seem to be happening for some reason.

It's not a fault with your browser or anything.
Platypus replied

22 November 2010, 20:39
Originally posted by bogeyman

Could be because the page is declared as charset=ISO-8859-1 (ISO LATIN 1) instead of charset=UTF-8 (Unicode).

The main problem is that non-ASCII characters should be escaped or represented as entities (e.g. &rdquo; &lsquo; etc.).

... so does this mean that the webpage is in error?

EDIT: and furthermore, if it is, why don't the people who create such pages immediately see the error?

This very page is indeed ISO-8859-1

Last edited by Platypus; 22 November 2010, 20:42.
Platypus replied

22 November 2010, 20:35
Originally posted by bogeyman

You on a Mac Platypus?

I see the same thing on FF and Chrome (OS X 10.6.4).

See the same thing in FF on Win XP under VMWare too.

I'm on Win XP SP3, native (not VM) running FF 3.6.12
But I've been seeing this for years on FF.
And I just had a quick peek using IE8 - same thing!

I tried to chase this down once, and read lots of forum posts about character sets, but the replying geeks were so busy trying to out-geek each other with what-ifs and wherefores that any useful information (i.e. a simple fix) was completely obscured
bogeyman replied

22 November 2010, 20:02
You on a Mac Platypus?

I see the same thing on FF and Chrome (OS X 10.6.4).

I see the same thing in FF on Win XP under VMWare too.

Could be because the page is declared as charset=ISO-8859-1 (ISO LATIN 1) instead of charset=UTF-8 (Unicode).

The main problem is that non-ASCII characters should be escaped or represented as entities (e.g. &rdquo; &lsquo; etc.).
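
A minimal Python sketch of that kind of escaping, assuming the text has already been decoded correctly into a Unicode string. `to_ascii_html` is a hypothetical helper, not anything the site's CMS is known to use; it relies on the standard library's table of HTML 4 entity names and falls back to numeric references for everything else.

```python
import html
from html.entities import codepoint2name


def to_ascii_html(text):
    """Escape markup characters and replace non-ASCII characters with entities."""
    out = []
    for ch in html.escape(text, quote=False):
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")  # named, e.g. &lsquo;
        else:
            out.append("&#%d;" % code)  # numeric fallback
    return "".join(out)


escaped = to_ascii_html("\u2018Tax & spend\u2019 \u2602")
print(escaped)  # &lsquo;Tax &amp; spend&rsquo; &#9730;
```

The output is pure ASCII, so it displays the same whether the page is declared as ISO-8859-1 or UTF-8. It cannot rescue text that was decoded with the wrong encoding before it reached this step.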

Last edited by bogeyman; 22 November 2010, 20:36.