
Sorting data by frequency in C#


    Hello fellow .NET boffins,

    Just asking for opinions on the most efficient and elegant way to achieve the following.

    I have a body of text. This text has been parsed into an array or collection of words. Stop words such as "in", "and", "or", "if" have been removed.

    I would like to know the top N most frequent words, as efficiently (on CPU not RAM) as possible.

    So to summarise: I bung in 100 or 10,000 words pulled from a body of text, say I want the top 5 words, and the routine in the blink of an eye says:

    "Contractor"
    "overpaid"
    "lazy"
    "tax"
    "computer"

    That would do for a start.

    It needs to use the sad old 1.1 framework, so no generics, anonymous delegates or yields please.

    How would you approach this?

    DP

    #2
    Originally posted by DimPrawn
    Hello fellow .NET boffins,

    Just asking for opinions on the most efficient and elegant way to achieve the following.

    I have a body of text. This text has been parsed into an array or collection of words. Stop words such as "in", "and", "or", "if" have been removed.

    I would like to know the top N most frequent words, as efficiently (on CPU not RAM) as possible.

    So to summarise: I bung in 100 or 10,000 words pulled from a body of text, say I want the top 5 words, and the routine in the blink of an eye says:

    "Contractor"
    "overpaid"
    "lazy"
    "tax"
    "computer"

    That would do for a start.

    It needs to use the sad old 1.1 framework, so no generics, anonymous delegates or yields please.

    How would you approach this?

    DP
    I'll write you a module. It'll be available in two days and I'll charge you £1500.



      #3
      Just return:-

      "the" "of" "to" "is" "it" and you'll be right quite a lot of the time.

      Personally I think I'd just load them into a hashtable and see what the performance was like. If it was OK I wouldn't worry about it.
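
      For what it's worth, a minimal sketch of the hashtable counting described above might look something like this on 1.1 (System.Collections rather than generics). It assumes the words arrive as a string[] with the stop words already stripped, and the WordCounter/CountWords names are just illustrative:

      using System;
      using System.Collections;

      public class WordCounter
      {
          // Build a Hashtable mapping each word to its occurrence count.
          public static Hashtable CountWords(string[] words)
          {
              Hashtable counts = new Hashtable();
              foreach (string word in words)
              {
                  object existing = counts[word];
                  // First sighting gets a count of 1; otherwise increment the boxed int.
                  counts[word] = (existing == null) ? 1 : (int)existing + 1;
              }
              return counts;
          }
      }

      Pulling the top N out of that table is a separate step, which comes up further down the thread.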



        #4
        Originally posted by ASB
        Just return:-

        "the" "of" "to" "is" "it" and you'll be right quite a lot of the time.

        Personally I think I'd just load them into a hashtable and see what the performance was like. If it was OK I wouldn't worry about it.
        Quantify "OK"...



          #5
          Originally posted by ASB
          Just return:-

          "the" "of" "to" "is" "it" and you'll be right quite a lot of the time.

          Personally I think I'd just load them into a hashtable and see what the performance was like. If it was OK I wouldn't worry about it.
          I said all the stop words have been removed beforehand.


          How does "loading them into a hashtable" somehow remove the duplicates and sort them by frequency?

          Great answer with insightful analysis, I can see why they pay you £200/day.




            #6
            Originally posted by Churchill
            Quantify "OK"...
            Can't. Depends on the usage pattern, doesn't it.

            Given that the text has to be obtained from somewhere and then split into words, it may well be that stuffing them into a hashtable is quick enough.

            If it takes, say, 500 ms to get the text and 100 ms to add them, then it's probably OK. If it's the inverse, then it probably isn't. I'm sure you could produce something clever that might be better.

            Also, I just did a search; here is a possible example:-

            http://blogs.vbcity.com/mcintyre/arc...2/11/7343.aspx

            OK, it's not 1.1 but it would be pretty easy to test the performance.



              #7
              That's a pretty long-winded and bog-standard solution, isn't it?

              I was asking if there were an elegant and perhaps clever algorithmic way.

              I mean, yeah, walk through the list, count them and then sort by frequency.

              Even Milan could have thought of that one.

              I was hoping someone would know of a cutting-edge algorithm that trades off memory for very little CPU use.



                #8
                Originally posted by DimPrawn
                I said all the stop words have been removed beforehand.


                How does "loading them into a hashtable" somehow remove the duplicates and sort them by frequency?

                Great answer with insightful analysis, I can see why they pay you £200/day.


                It was more helpful than yours normally are.

                I dug out an example which I put a link to (yes, I know it's .NET 2).

                I'm sure you can manage the sort it will require, though it would probably be quickest just to scan the counts, holding the top 5 in another array, then extract those elements.
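
                As a rough illustration only, that "top 5 in another array" scan might look something like this, assuming a Hashtable of word-to-count pairs as in the sketch further up the thread; the TopWords/Top5 names are made up. It's a single pass over the entries with no full sort:

                using System;
                using System.Collections;

                public class TopWords
                {
                    // Return the five most frequent words from a word -> count Hashtable.
                    public static string[] Top5(Hashtable counts)
                    {
                        string[] topWords = new string[5];
                        int[] topCounts = new int[5];

                        foreach (DictionaryEntry entry in counts)
                        {
                            int count = (int)entry.Value;
                            // Find the slot this count belongs in, shifting lower entries down.
                            for (int i = 0; i < 5; i++)
                            {
                                if (count > topCounts[i])
                                {
                                    for (int j = 4; j > i; j--)
                                    {
                                        topCounts[j] = topCounts[j - 1];
                                        topWords[j] = topWords[j - 1];
                                    }
                                    topCounts[i] = count;
                                    topWords[i] = (string)entry.Key;
                                    break;
                                }
                            }
                        }
                        return topWords;
                    }
                }

                For the sizes mentioned (100 to 10,000 words) that stays linear in the number of distinct words, so it should be comfortably inside "blink of an eye" territory.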



                  #9
                  Originally posted by DimPrawn
                  That's a pretty long-winded and bog-standard solution, isn't it?

                  I was asking if there were an elegant and perhaps clever algorithmic way.

                  I mean, yeah, walk through the list, count them and then sort by frequency.

                  Even Milan could have thought of that one.

                  I was hoping someone would know of a cutting-edge algorithm that trades off memory for very little CPU use.
                  Yes, it is bog standard. But does it work? Yes.

                  Now it may be that it doesn't work well enough, but I guess you'll never find out since it's rejected simply on the grounds of not being clever.



                    #10
                    This might get you on the road to a clever solution:-

                    http://www.awprofessional.com/articl...&seqNum=8&rl=1

