• Visitors can check out the Forum FAQ by clicking this link. You have to register before you can post: click the REGISTER link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below. View our Forum Privacy Policy.
  • Want to receive the latest contracting news and advice straight to your inbox? Sign up to the ContractorUK newsletter here. Every sign up will also be entered into a draw to WIN £100 Amazon vouchers!

AtW - Info needed

Collapse
X
  •  
  • Filter
  • Time
  • Show
Clear All
new posts

    AtW - Info needed

    Have you got any more links to info retrieval & search engine design techniques? Like the google one. I've got a bit stumped at how you can search for many keywords in a database. For example searching for "football" is easy because it's a single token but "football club" is trickier - I'm trying to find a quick method of pulling up documents that contain that phrase.

    What I've done is have a database for each word retrieved and a ranking of documents inside these databases. What do you reckon?

    [WORD, [docid,docid,docid]] etc.

    #2
    Sounds Like...

    Comment


      #3
      search engine

      Well, its pretty easy actually - so long as you store and index data effectively. You skipped 1 commonly used stage which is to turn Word into WordIDs - this is more efficient not just in terms of space usage, but also speed of scanning - joining fixed ints are far better than varchars.

      You need to index all data as follows:

      1. Table 1: Keywords consisting of WordID, actual Word, some extra stats like frenquency of that keyword in indexed data - this will be useful later.

      2. Table 2: Documents list: DocID against document name or URL

      3. Table 3: MainIndex - KeyWordID, DocID

      Cluster it like that - this would allow you to identify ALL DocIDs by KeyWordID, make sure clustered index is unique on combination of KeyWordID, DocID - note that having clustered index will help a lot, dont short change yourself for non-clustered, this is where clustering _really_ matters.

      So how to actually query it? Lets say you have phrase "football club", you do it in stages:

      Stage 1. Split phrase into separate words and identify corresponding KeyWordIDs - note that its best to make search engine case insensitive, in which case you just lower case everything during indexing and the search phrase itself.

      At this stage you can have neat optimisation - if you can't identify keyword in your keywords list then this search won't yield results - so you can abort it at this point! Very efficient technique against some tw@ts who like to overload database by sending lots of requests to search engine with lots of non-existant words (many implementations need to scan table).

      Stage 2. Now you know KeyWordIDs, so you need to obtain unique list of DocIDs, many ways to do it, but few are effective - I found that using DB2's "intersect" is pretty efficient, ie:

      select DocID from MainIndex where KeyWordID=1
      intersect
      select DocID from MainIndex where KeyWordID=2

      OR you can avoid using DB2 dependency:

      select M1.DocID
      from MainIndex M1,MainIndex M2
      where M1.KeyWordID=1
      and M2.KeyWordID=2
      and M1.DocID=M2.DocID

      This implements "AND" Google like search. Ranking is a separate question, lots of approaches there, but in principle so long as you can identify DocIDs _fast_, you should be able to rank them pretty quickly too.

      You can also use temporary tables to do 2 table joines at a time, in this case trick below will be essential: since you will know frequencies of KeyWordIDs you can re-order joins in such a way so that LEAST frequent will be joined first - this means joined datasets will be smaller rather than bigger, so that with lots of joins it will be overall more efficient.

      Stage 3: select Documents data from relevant table using identified DocIDs.

      -----------

      Anyway, if you want more information on how to deal with lots of data then I recommend the following book (shame I bought it after I invented most of tricks on my own) - www.amazon.co.uk/exec/obi...74-9767024

      A lot more features can be added to engine, things like synonims, spell checking etc.

      When I architectured new search engine for my ex-employer (we had one of the most popular ecommerce sites in the UK), database load dropped 4 times overall as efficient searching put very little load on database, nice Google like searches of sub 0.20 sec for many concurrent users.

      P.S. Spod you not being funny.

      Comment


        #4
        relevancy

        Forgot to make it clear - discard all duplicate WordID per given DocID - you can keep count of these though.

        The biggest disadvantage of that architecture is that it is hard to run searches for words that are near to each other - in fact personally I think this is a must have feature because when you deal with lots of DocIDs (we had 500k) you will have troubles ordering them well - after spending good time reading Google's document I've come to conclusion that proximity of search terms is one of the keys to relevancy, and relevancy (and speed) is what people want from search engines.

        Comment


          #5
          Re: relevancy

          The biggest disadvantage of that architecture is that it is hard to run searches for words that are near to each other
          Not thought it through in detail, but what about a table containing word positions in each document:

          Table 4: WordPos - DocID, WordPosition, KeyWordID

          select W2.WordPosition - W1.WordPosition as Distance
          from wordpos W1, wordpos W2
          where W1.KeyWordID = 764582374
          and W2.KeyWordID = 875843279
          and W1.DocID = 28298
          and W2.DocID = 28298
          and W2.WordPosition > W1.WordPosition

          Comment


            #6
            AtW

            How does the 'soundslike' feature actually work?

            The 'soundslike' feature is used to search for words that sound the same. For example, you may need to search for something that you are not sure how it is spelt. Simply choose 'Find', then enter the word you wish to search for, then click the 'soundslike' box. All word that sound the same as the word you typed, will appear one by one.
            iirc, Oracle has this facility built in.

            Comment


              #7
              Re: AtW

              ..........
              where SOUNDEX(city) = SOUNDEX('sidney');

              Comment


                #8
                spod

                Spod, why do you think Google is not using Oracle and they are not using "sounds like" feature? They are using spell check which as far as I am concerned is just right.

                Talking of Oracle - their licenses are simply too expensive thats one, two is that all those fancy "built-in" technologies are not as near as tunable as custom build search engine. Main problem is not to find, but to rank correct 10-20 products out of potentially 100,000 - this is why Google is so successful because they (had) developed very good ranking algorithm.

                PerlOfWisdom: yes I agree - position is best to count from beginning of text, so that it will be possible to easily identify words that followed one after another. Myself I was toying with storing PrevWordID, but this would not be as flexible as actually calculating distances. For most implementations it should be okay, Google is using very fancy tight data constructs as they have to deal with many terabytes of storage.

                Now the interesting thing is how to make search engine work on cluster where each separate machine does not have _all_ index, I spent sometime last year working on it, think I solved it but then abandoned project as I could not see how to make money out of it :x

                Comment


                  #9
                  yeah right

                  > where SOUNDEX(city) = SOUNDEX('sidney');

                  dont make me laugh - usage of functions in that manner is 99.999% guarantee that indexes will not be used - table scans are the killer (I've seen how mistakenly dropped index killed website in a matter of seconds).

                  Comment


                    #10
                    Just a suggestion...

                    ...pardon me for breathing!

                    I just thought that the "SoundsLike" facility would be of use.

                    Comment

                    Working...
                    X