Most efficient way of stripping HTML tags from string (with exceptions)

**dang65** · 4 September 2010, 23:02

Originally posted by Durbs

As the title says, in .NET, what's my best bet for stripping all HTML markup bar a chosen few tags?

I'm currently using the regex "<[^>]*>" to strip the lot but now wish to retain the <p> and <br/> markup, nothing else. What's the speediest, most efficient way of doing that?

Ta.

One option would be to replace "<p>" and "<br/>" (and presumably "</p>") with, say, "*p*" and "*br/*". Then do your regex thing to get rid of all the other tags, and then re-replace "*p*" with "<p>" etc.
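The placeholder idea above can be sketched roughly as follows. This is a minimal illustration in Python rather than .NET, and it makes a few assumptions that are not in the thread: the function name `strip_except_p_br` is made up, the placeholders are delimited with NUL characters instead of asterisks (so that ordinary text such as "*p*" typed by a user is not mistaken for a tag), and only attribute-free forms of the allowed tags are kept.

```python
import re

# Assumed whitelist: bare <p>, </p>, <br>, <br/> and <br /> survive; everything else goes.
ALLOWED = re.compile(r"<(/?)(p|br)\s*(/?)>", re.IGNORECASE)
# The "strip the lot" pattern quoted in the question.
ANY_TAG = re.compile(r"<[^>]*>")
# Placeholders are wrapped in NUL characters, which the tag pattern cannot match.
PLACEHOLDER = re.compile(r"\x00(/?)(p|br)(/?)\x00")


def strip_except_p_br(markup):
    # Remove stray NULs so user input cannot forge a placeholder.
    markup = markup.replace("\x00", "")
    # 1. Swap the allowed tags for placeholders.
    protected = ALLOWED.sub(
        lambda m: f"\x00{m.group(1)}{m.group(2).lower()}{m.group(3)}\x00", markup
    )
    # 2. Strip every remaining tag.
    stripped = ANY_TAG.sub("", protected)
    # 3. Turn the placeholders back into real tags.
    return PLACEHOLDER.sub(r"<\1\2\3>", stripped)


print(strip_except_p_br("<p>Hello <b>world</b><br/>bye</p>"))
```

Note that an allowed tag carrying attributes, such as `<p class="x">`, does not match the whitelist pattern here and is stripped along with everything else.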

**NickFitz** · 5 September 2010, 03:13

What's the likelihood of the HTML being a bit dodgy? For example,

Code:

<a href="something" onclick="return a>b">blah</a>

is already going to fail with that expression. If you are dealing with HTML found in the wild or for any reason likely to be malformed-but-it-works-in-browsers-so-nobody-fixes-it (like that example), you're better off parsing it into a DOM, then doing a depth-first traversal of that DOM collapsing what you want to, then re-serialising the resultant DOM if you still need it as a string.
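To make the failure concrete, here is a small sketch in Python (not .NET) comparing the regex with a parser-based approach. It is an illustration under stated assumptions: it uses the event-based `html.parser.HTMLParser` from the Python standard library rather than building a full DOM as suggested above, and the class and function names (`KeepSomeTags`, `strip_tags`) are invented for the example.

```python
import html
import re
from html.parser import HTMLParser


class KeepSomeTags(HTMLParser):
    """Collects text plus a whitelist of tags; all other markup is dropped."""

    def __init__(self, allowed=("p", "br")):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.out.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        # <br> is a void element, so it never gets a closing tag.
        if tag in self.allowed and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives with entities decoded; re-escape it so that
        # "&lt;script&gt;" in the input cannot turn into a live tag.
        self.out.append(html.escape(data, quote=False))


def strip_tags(markup, allowed=("p", "br")):
    parser = KeepSomeTags(allowed)
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


dodgy = '<a href="something" onclick="return a>b">blah</a>'
print(re.sub(r"<[^>]*>", "", dodgy))  # the regex stops at the ">" inside the attribute
print(strip_tags(dodgy))              # the parser respects the quoted attribute value
```

The regex leaves `b">blah` behind because it ends the tag at the first `>`, whereas the parser treats the quoted attribute value as a unit and returns just `blah`.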

As a general rule, arbitrary HTML can't be parsed with regular expressions: they can only parse Type 3 Chomsky Grammars, and HTML's grammar is of Type 2. This point was made with slightly less rigour, but with the benefit of a glorious effusion of Unicode, by Bob Ince in a classic StackOverflow reply.

N.B. .NET's regular expression implementation may include extensions that allow it to parse languages having a Type 2 grammar (although Perl's can't), so a solution may still be possible. I don't know if that is the case.

**Durbs** · 5 September 2010, 11:08

Cheers chaps. This HTML is pretty much clean: it's input from a textbox on a classifieds site, so normally it should have no HTML (only the markup permitted by the few formatting buttons I provide). But you always get the savvy manually typing in <a href entries, and others Ctrl+C'ing ads from other sites, so you are right Nick, that regex won't help with duff markup. Went for a split/join function in the end.

Got round the <p> and <br/> tags by just converting them to Environment.NewLine before stripping the HTML. Not great, but seems to work.

Next problem is that this HTML was stripped and encoded for passing as an XML feed into an iPhone app, much of it going into a UITextView. That seems happy to automatically show the ampersands, apostrophes, brackety things etc. in their decoded form, but things like the pound symbol, encoded as '&#163;', remain as that.

Can't see any inbuilt blanket string decode (can see encode) functionality in the iPhone docs. Is there any other way than a hard-coded replacement?

Edited, actually UITextView doesn't do apostrophes either, must have dreamt that, balls, going to have to rethink this one or see if a Web View does £'s.
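For reference, a hand-rolled "blanket decode" does not need a hard-coded table of every symbol: numeric references such as '&#163;' carry their own code point. The following is a rough sketch in Python, not an iPhone API. The function name `decode_entities` is made up, and the sketch assumes only numeric references and the five entities predefined by XML need handling. Other named entities (for example `&nbsp;`) are left untouched.

```python
import re

# The five entities predefined by XML; any other named entity is left as-is.
NAMED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}
# Decimal (&#163;), hexadecimal (&#xA3;) or named (&amp;) references.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z]+);")


def decode_entities(text):
    def replace(match):
        body = match.group(1)
        if body.startswith(("#x", "#X")):
            code = int(body[2:], 16)
        elif body.startswith("#"):
            code = int(body[1:])
        else:
            return NAMED.get(body, match.group(0))
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: keep the original text.
            return match.group(0)

    # A single pass, so "&amp;#163;" decodes to "&#163;" and is not decoded twice.
    return ENTITY.sub(replace, text)


print(decode_entities("&#163;5 &amp; up"))
```

The same single-pass idea carries over to other languages: scan for `&...;`, convert numeric references arithmetically, and look up the handful of named ones.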

**Durbs** · 5 September 2010, 19:56

Originally posted by Durbs

Can't see any inbuilt blanket string decode (can see encode) functionality in the iPhone docs. Is there any other way than a hard-coded replacement?

Wohoo, cracked that one by using: Apple - Support - Discussions - [iPhone] Any built in way to convert ... for anyone who Googles across this.

Code:

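// Assumes a matching MREntitiesConverter.h (not shown in the post) declaring a
// retained NSMutableString *resultString property; manual reference counting.
#import "MREntitiesConverter.h"

@implementation MREntitiesConverter
@synthesize resultString;

- (id)init
{
	self = [super init];
	if (self) {
		resultString = [[NSMutableString alloc] init];
	}
	return self;
}

// NSXMLParser delegate callback: receives text with entities already decoded.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)s {
	[self.resultString appendString:s];
}

- (NSString *)convertEntiesInString:(NSString *)s {
	if (s == nil) {
		NSLog(@"ERROR : Parameter string is nil");
		return nil;
	}
	// Start from an empty buffer so repeated calls don't accumulate text.
	[self.resultString setString:@""];
	// Wrap the text in a dummy element so it parses as an XML document.
	NSString *xmlStr = [NSString stringWithFormat:@"<d>%@</d>", s];
	NSData *data = [xmlStr dataUsingEncoding:NSUTF8StringEncoding allowLossyConversion:YES];
	NSXMLParser *xmlParse = [[NSXMLParser alloc] initWithData:data];
	[xmlParse setDelegate:self];
	[xmlParse parse];
	[xmlParse release];
	// Return an autoreleased copy so the caller does not own (and leak) it.
	return [NSString stringWithString:self.resultString];
}

- (void)dealloc {
	[resultString release];
	[super dealloc];
}
@end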

If anyone knows a slicker inbuilt way, similar to .NET's 'Server.HtmlDecode(str)', let me know!

**NickFitz** · 6 September 2010, 02:38

Originally posted by Durbs

Wohoo, cracked that one by using: Apple - Support - Discussions - [iPhone] Any built in way to convert ... for anyone who Googles across this.

<snip>

If anyone knows a slicker inbuilt way, similar to .NET's 'Server.HtmlDecode(str)', let me know!

I came across that, but it strikes me as bizarre to create an instance of an XML parser for every single occurrence of an entity and ask it to parse it. Check it out in Instruments and it must inevitably show a hideous impact on performance. It gets the result, but it's a bit like... (/me searches for obligatory transportation-related analogy...) going to the shops for a pint of milk by hiring a helicopter at the helipad 0.4 miles from your home (but 0.5 miles from the shop) and flying to the helipad 0.4 miles from the shop (but 0.5 miles from your home) and getting a taxi from your home to the first helipad, getting a taxi from the second helipad to the shop and back, and then getting a taxi back home again. You end up with a pint of milk from the shop, but the waste of resources is phenomenal.

It would make more sense to use a WebView, given that it knows how to deal with entities. Even if it demands some basic markup around the content (which I haven't tested, but is unlikely) a couple of string concatenations to wrap things have still got to be cheaper in resource usage than firing up a parser for every &#9730; (☂) or suchlike.

**Durbs** · 6 September 2010, 08:49

Originally posted by NickFitz

I came across that, but it strikes me as bizarre to create an instance of an XML parser for every single occurrence of an entity and ask it to parse it. Check it out in Instruments and it must inevitably show a hideous impact on performance.

Yup, after testing in the simulator and now on my phone, it adds a noticeable delay when running didSelectRowAtIndexPath, as it parses a title and description string before pushing them to a new view. I assumed it would add a bit extra but didn't expect the overhead to be actually noticeable in the interface!

Originally posted by NickFitz

It would make more sense to use a WebView, given that it knows how to deal with entities. Even if it demands some basic markup around the content (which I haven't tested, but is unlikely)

Yep, it doesn't require any markup. Guess I'll have to go with the WebView, even though I'm loath to use such a thing for what is effectively a simple text string.
