
Sorting data by frequency in C#


    Hello fellow .NET boffins,

    Just asking for opinions on the most efficient and elegant way to achieve the following.

    I have a body of text. This text has been parsed into an array or collection of words. Stop words such as "in", "and", "or", "if" have been removed.

    I would like to know the top N most frequent words, as efficiently (on CPU not RAM) as possible.

    So to summarise: I bung in 100 or 10,000 words pulled from a body of text, say I want the top 5 words, and the routine in the blink of an eye says:

    "Contractor"
    "overpaid"
    "lazy"
    "tax"
    "computer"

    That would do for a start.

    It needs to use the sad old 1.1 framework, so no generics, anonymous delegates or yields please.

    How would you approach this?

    DP

    #2
    Originally posted by DimPrawn
    Hello fellow .NET boffins,

    Just asking for opinions on the most efficient and elegant way to achieve the following.

    I have a body of text. This text has been parsed into an array or collection of words. Stop words such as "in", "and", "or", "if" have been removed.

    I would like to know the top N most frequent words, as efficiently (on CPU not RAM) as possible.

    So to summarise: I bung in 100 or 10,000 words pulled from a body of text, say I want the top 5 words, and the routine in the blink of an eye says:

    "Contractor"
    "overpaid"
    "lazy"
    "tax"
    "computer"

    That would do for a start.

    It needs to use the sad old 1.1 framework, so no generics, anonymous delegates or yields please.

    How would you approach this?

    DP
    I'll write you a module. It'll be available in two days and I'll charge you £1500.



      #3
      Just return:-

      "the" "of" "to" "is" "it" and you'll be right quite a lot of the time.

      Personally I think I'd just load them into a hashtable and see what the performance was like. If it was OK I wouldn't worry about it.
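
      For what it's worth, a minimal sketch of the hashtable counting described above might look something like this on 1.1 (System.Collections rather than generics). It assumes the words arrive as a string[] with the stop words already stripped, and the WordCounter/CountWords names are just illustrative:

      using System;
      using System.Collections;

      public class WordCounter
      {
          // Build a Hashtable mapping each word to its occurrence count.
          public static Hashtable CountWords(string[] words)
          {
              Hashtable counts = new Hashtable();
              foreach (string word in words)
              {
                  object existing = counts[word];
                  // First sighting gets a count of 1; otherwise increment the boxed int.
                  counts[word] = (existing == null) ? 1 : (int)existing + 1;
              }
              return counts;
          }
      }

      Pulling the top N out of that table is a separate step, which comes up further down the thread.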



        #4
        Originally posted by ASB
        Just return:-

        "the" "of" "to" "is" "it" and you'll be right quite a lot of the time.

        Personally I think I'd just load them into a hashtable and see what the performance was like. If it was OK I wouldn't worry about it.
        Quantify "OK"...



          #5
          Originally posted by ASB
          Just return:-

          "the" "of" "to" "is" "it" and you'll be right quite a lot of the time.

          Personally I think I'd just load them into a hashtable and see what the performance was like. If it was OK I wouldn't worry about it.
          I said all the stop words have been removed beforehand.


          How does "loading them into a hashtable" somehow remove the duplicates and sort them by frequency?

          Great answer with insightful analysis, I can see why they pay you £200/day.




            #6
            Originally posted by Churchill
            Quantify "OK"...
            Can't. Depends on the usage pattern, doesn't it.

            Given that the text has to be obtained from somewhere and then split into words, it may well be that stuffing them into a hashtable is quick enough.

            If it takes, say, 500 ms to get the text and 100 ms to add them, then it's probably OK. If it's the inverse, then it probably isn't. I'm sure you could produce something clever that might be better.

            Also, I just did a search; here is a possible example:-

            http://blogs.vbcity.com/mcintyre/arc...2/11/7343.aspx

            OK, it's not 1.1 but it would be pretty easy to test the performance.



              #7
              That's a pretty long-winded and bog-standard solution, isn't it?

              I was asking if there were an elegant and perhaps clever algorithmic way.

              I mean, yeah, walk through the list, count them and then sort by frequency.

              Even Milan could have thought of that one.

              I was hoping someone would know of a cutting-edge algorithm that trades off memory for very little CPU use.



                #8
                Originally posted by DimPrawn
                I said all the stop words have been removed beforehand.


                How does "loading them into a hashtable" somehow remove the duplicates and sort them by frequency?

                Great answer with insightful analysis, I can see why they pay you £200/day.


                It was more helpful than yours normally are.

                I dug out an example which I put a link to (yes, I know it's .NET 2).

                I'm sure you can manage the sort it will require, though it would probably be quickest just to scan the counts, holding the top 5 in another array, then extract those elements.
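
                As a rough illustration only, that "top 5 in another array" scan might look something like this, assuming a Hashtable of word-to-count pairs as in the sketch further up the thread; the TopWords/Top5 names are made up. It's a single pass over the entries with no full sort:

                using System;
                using System.Collections;

                public class TopWords
                {
                    // Return the five most frequent words from a word -> count Hashtable.
                    public static string[] Top5(Hashtable counts)
                    {
                        string[] topWords = new string[5];
                        int[] topCounts = new int[5];

                        foreach (DictionaryEntry entry in counts)
                        {
                            int count = (int)entry.Value;
                            // Find the slot this count belongs in, shifting lower entries down.
                            for (int i = 0; i < 5; i++)
                            {
                                if (count > topCounts[i])
                                {
                                    for (int j = 4; j > i; j--)
                                    {
                                        topCounts[j] = topCounts[j - 1];
                                        topWords[j] = topWords[j - 1];
                                    }
                                    topCounts[i] = count;
                                    topWords[i] = (string)entry.Key;
                                    break;
                                }
                            }
                        }
                        return topWords;
                    }
                }

                For the sizes mentioned (100 to 10,000 words) that stays linear in the number of distinct words, so it should be comfortably inside "blink of an eye" territory.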



                  #9
                  Originally posted by DimPrawn
                  That's a pretty long-winded and bog-standard solution, isn't it?

                  I was asking if there were an elegant and perhaps clever algorithmic way.

                  I mean, yeah, walk through the list, count them and then sort by frequency.

                  Even Milan could have thought of that one.

                  I was hoping someone would know of a cutting-edge algorithm that trades off memory for very little CPU use.
                  Yes, it is bog standard. But does it work? Yes.

                  Now it may be that it doesn't work well enough, but I guess you'll never find out since it's rejected simply on the grounds of not being clever.



                    #10
                    This might get you on the road to a clever solution:-

                    http://www.awprofessional.com/articl...&seqNum=8&rl=1

