
AtW - Info needed


    #11
    no

    not at runtime - running a sounds-like function on every word is a sure performance killer. You might want to do that _ONCE_ when generating the main index, which can be done offline, so that the actual searches, which are real-time, stay as fast as possible.

    IMHO all "sounds like" functions I've seen were pretty poor, but I find the spell check at Google indispensable.
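
    Just to illustrate the offline/online split, here is a rough Python sketch. The names are made up and phonetic_key is only a stand-in for a real sounds-like function such as Soundex (see the Soundex sketch further down the thread):

      from collections import defaultdict

      def phonetic_key(word):
          # Stand-in only: a real index would compute a Soundex/Metaphone-style code here
          return word[0].upper()

      def build_phonetic_index(vocabulary):
          # Offline step, run ONCE while generating the main index:
          # map each phonetic key to the words that share it.
          index = defaultdict(set)
          for word in vocabulary:
              index[phonetic_key(word)].add(word)
          return index

      phonetic_index = build_phonetic_index(["smith", "smyth", "sand", "jones"])

      def sounds_like(query, index):
          # Real-time step: one key computation for the query term, one lookup.
          return index.get(phonetic_key(query), set())

      print(sounds_like("smith", phonetic_index))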



      #12
      Re: Just a suggestion...

      "I just thought that the "SoundsLike" facility would be of use."

      I've never used Soundex myself, but have seen it specified for various applications, normally lookups on people or place names.

      To see what sort of results it gives, try putting your own name into

      resources.rootsweb.com/cg...xconverter

      "smith" gives this little lot, by way of example:

      SAINT | SAND | SANDY | SANTEE | SANTI | SCHMID | SCHMIDT | SCHMIT |
      SCHMITT | SHAND | SHUMATE | SINNOTT | SMITH | SMITHEY | SMOOT |
      SMOOTHY | SMYTH | SMYTHE | SNAITH | SNEAD | SNEATH | SNEED |
      SNODDY | SOUNDY | SUNDAY |

      Sunday sharing the same Soundex code as Smith ??
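
      For anyone wondering how Sunday and Smith can collide: a minimal Python sketch of the standard American Soundex rules (keep the first letter, drop vowels plus h and w, group the remaining consonants into six digit classes, collapse repeats) shows both names reducing to S530. This is just an illustration, not the converter linked above:

        def soundex(name):
            # Consonant groups: 1=bfpv, 2=cgjkqsxz, 3=dt, 4=l, 5=mn, 6=r
            groups = ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
            codes = {c: str(d) for d, letters in enumerate(groups, start=1) for c in letters}
            name = name.lower()
            first = name[0].upper()
            digits = [codes.get(name[0], "0")]      # first letter's code, for the adjacency rule
            for ch in name[1:]:
                if ch in "hw":
                    continue                        # h and w are transparent
                d = codes.get(ch, "0")              # vowels and y become "0" separators
                if d != digits[-1]:                 # collapse runs of the same code
                    digits.append(d)
            body = "".join(d for d in digits[1:] if d != "0")
            return (first + body + "000")[:4]

        print(soundex("Smith"), soundex("Smyth"), soundex("Sunday"))   # S530 S530 S530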



        #13
        re search engine

        Cheers Atw, SupremeSpod, PerlOfWisdom.

        I'll look into those techniques. I want to fit the entire index into memory (i.e. wordID, docID, locID) so response times are very quick. Unfortunately (fortunately?) I'm building my own database (a combination of a hashtable and random-access files) so I can't use special features such as intersect - I will just use a null-pointer test to see whether a term exists at all. Also, I am building this into a web server so the whole package is just one application. AtW, did you compress the data in your database? I've found that this can cause a few problems.
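
        A rough sketch of what such an in-memory (wordID, docID, locID) index could look like with a plain Python dictionary - the names are invented, and a missing key plays the part of the null-pointer test mentioned above:

          from collections import defaultdict

          # Hypothetical in-memory inverted index: word -> list of (docID, locID) postings.
          postings = defaultdict(list)

          def index_document(doc_id, text):
              for loc, word in enumerate(text.lower().split()):
                  postings[word].append((doc_id, loc))

          def search(term):
              hits = postings.get(term.lower())   # None plays the role of the null pointer:
              if hits is None:                    # the term simply is not in the index
                  return []
              return hits

          index_document(1, "the quick brown fox")
          index_document(2, "the lazy brown dog")
          print(search("brown"))   # [(1, 2), (2, 2)]
          print(search("zebra"))   # []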



          #14
          ffs

          No need to compress the data - just make sure you eliminate redundant words (easily 50% of the data) and turn the rest into numbers - this will make the tables and indices very compact. Intersect is not necessary; use that other query, which will work just fine. I suggest you download SQL Server and play around with Query Analyser - it displays very nice query plans which show how efficient your query is - and the whole game is about doing enough offline to be able to run very tight queries in real time.

          The key to achieving exceptional real time performance is to do as much as possible offline.

          And buy that book - even though I learnt most of this stuff on my own I still found the book very useful - it should be even more useful to you, since you have not gone through it all the hard way (trial and error).

          Hash tables are not the best choice here - you need a clustered index on the table.
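
          A sketch of the eliminate-redundant-words-and-turn-them-into-numbers step in Python (the stopword list and variable names here are just examples):

            STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}   # example list only

            word_ids = {}       # word   -> wordID  (the lookup table)
            words_by_id = {}    # wordID -> word    (to turn IDs back into words for display)

            def word_id(word):
                if word not in word_ids:
                    wid = len(word_ids) + 1
                    word_ids[word] = wid
                    words_by_id[wid] = word
                return word_ids[word]

            def tokenize(text):
                # Drop the redundant words, store compact integer IDs instead of strings.
                return [word_id(w) for w in text.lower().split() if w not in STOPWORDS]

            print(tokenize("the history of the search engine"))   # e.g. [1, 2, 3]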



            #15
            re

            "no need to compress data - just make sure you eliminate redundant words (50% of data easy) and turn them into numbers - this will make tables and indices very compact. "

            But if I turn words into wordIDs I will still need to refer back to that hashtable just to turn them back into words again. The space I'll save by converting words to numbers will be used up again by the number-to-word lookup table.



              #16
              re

              Apologies if this is a sh.it question. What is the smallest chunk of data that can be compressed?



                #17
                Re: re

                2 bits, where only 2 of the possible 4 values are ever used - that can then be re-encoded in 1 bit.



                  #18
                  re:re

                  In real-world cases you will find that any compression mechanism has an overhead, and you need the redundancy in the data to be greater than that overhead. Remember, of course, that not all applications require 100% preservation of the information.

                  If you think about it, there needs to be a way to encode, within the data, how its size has been reduced; that is itself information, so it requires space.

                  Have you read any Shannon?



                    #19
                    Re: re

                    > But if I turn words into wordIDs I will still need to refer back
                    > to that hashtable just to turn them back into words again

                    Yes, that's what Table 1 is for - you will have a unique index on WordID, and you only join to it after you select the X results you need, so that you won't be joining _ALL_ the found WordIDs, just those you will show on screen. A temporary table is handy for that sort of thing.
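
                    The same join-only-what-you-show idea, sketched with Python dictionaries rather than SQL (the table contents are invented):

                      words_by_id = {1: "history", 2: "search", 3: "engine"}   # WordID -> word (Table 1)

                      # Suppose the search produced thousands of matching (wordID, docID) pairs...
                      all_hits = [(wid, doc) for doc in range(1000) for wid in (1, 2, 3)]

                      # ...but only the first page is shown, so only those few IDs are looked up.
                      PAGE_SIZE = 10
                      page = all_hits[:PAGE_SIZE]
                      print([(words_by_id[wid], doc) for wid, doc in page])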


                    Edit: sorry Spod I was a wee bit inconsiderate today, hope I did not hurt your feelings



                      #20
                      re

                      "In real world cases you will find that any mechanism has an overhead and you need to see the redundancy of the data being greater than the overhead. Remembering of course that not all applications require 100% preservation of information.

                      If you think about it - there needs to be a way to encode information within the data to specify how the size has been reduced, that is information so requires space.

                      Have you read any of Shannon ?"

                      No, but I've noticed he's big in compression.

                      I did a test compression on a string and noticed that the string needs to be a certain size before you see any benefit from the compression (like you said, the encoding information is bundled in).
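
                      For what it's worth, that sort of test is easy to reproduce with, say, Python's zlib: a tiny string comes out longer than it went in because the format overhead outweighs any redundancy, while a long repetitive string shrinks to a small fraction of its size.

                        import zlib

                        tiny = b"hello"
                        repetitive = b"the quick brown fox jumps over the lazy dog " * 100

                        for label, data in (("tiny", tiny), ("repetitive", repetitive)):
                            packed = zlib.compress(data)
                            print(label, len(data), "->", len(packed))

                        # The 5-byte string typically "compresses" to more than 5 bytes (header,
                        # checksum and coding overhead), while the ~4.4 KB repetitive string
                        # shrinks to a small fraction of its original size.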
