
AtW - Info needed


    #11
    no

    not at runtime - running a sounds-like function on every word is a sure performance killer. You might want to do that _ONCE_ when generating the main index, which can be done offline, so that the actual searches, which are real-time, stay as fast as possible.

    IMHO all "sounds like" functions I've seen were pretty poor, but I find the spell check at Google indispensable.
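
    Just to illustrate the offline/online split, here is a rough Python sketch. The names are made up and phonetic_key is only a stand-in for a real sounds-like function such as Soundex (see the Soundex sketch further down the thread):

      from collections import defaultdict

      def phonetic_key(word):
          # Stand-in only: a real index would compute a Soundex/Metaphone-style code here
          return word[0].upper()

      def build_phonetic_index(vocabulary):
          # Offline step, run ONCE while generating the main index:
          # map each phonetic key to the words that share it.
          index = defaultdict(set)
          for word in vocabulary:
              index[phonetic_key(word)].add(word)
          return index

      phonetic_index = build_phonetic_index(["smith", "smyth", "sand", "jones"])

      def sounds_like(query, index):
          # Real-time step: one key computation for the query term, one lookup.
          return index.get(phonetic_key(query), set())

      print(sounds_like("smith", phonetic_index))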



      #12
      Re: Just a suggestion...

      "I just thought that the "SoundsLike" facility would be of use."

      I've never used Soundex myself, but have seen it specified for various applications, normally lookups on people or place names.

      To see what sort of results it gives, try putting your own name into

      resources.rootsweb.com/cg...xconverter

      "smith" gives this little lot, by way of example:

      SAINT | SAND | SANDY | SANTEE | SANTI | SCHMID | SCHMIDT | SCHMIT |
      SCHMITT | SHAND | SHUMATE | SINNOTT | SMITH | SMITHEY | SMOOT |
      SMOOTHY | SMYTH | SMYTHE | SNAITH | SNEAD | SNEATH | SNEED |
      SNODDY | SOUNDY | SUNDAY |

      Sunday sharing the same Soundex code as Smith ??
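
      For anyone wondering how Sunday and Smith can collide: a minimal Python sketch of the standard American Soundex rules (keep the first letter, drop vowels plus h and w, group the remaining consonants into six digit classes, collapse repeats) shows both names reducing to S530. This is just an illustration, not the converter linked above:

        def soundex(name):
            # Consonant groups: 1=bfpv, 2=cgjkqsxz, 3=dt, 4=l, 5=mn, 6=r
            groups = ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
            codes = {c: str(d) for d, letters in enumerate(groups, start=1) for c in letters}
            name = name.lower()
            first = name[0].upper()
            digits = [codes.get(name[0], "0")]      # first letter's code, for the adjacency rule
            for ch in name[1:]:
                if ch in "hw":
                    continue                        # h and w are transparent
                d = codes.get(ch, "0")              # vowels and y become "0" separators
                if d != digits[-1]:                 # collapse runs of the same code
                    digits.append(d)
            body = "".join(d for d in digits[1:] if d != "0")
            return (first + body + "000")[:4]

        print(soundex("Smith"), soundex("Smyth"), soundex("Sunday"))   # S530 S530 S530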



        #13
        re search engine

        Cheers Atw, SupremeSpod, PerlOfWisdom.

        I'll look into those techniques. I want to fit the entire index into memory (i.e. wordID, docID, locID) so response times are very quick. Unfortunately (fortunately?) I'm building my own database (a combination of a hashtable and random-access files) so I can't use special features such as intersect - I will just use a null-pointer test to see whether a term exists at all. Also, I am building this into a web server so the whole package is just one application. AtW, did you compress the data in your database? I've found that this can cause a few problems.
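
        A rough sketch of what such an in-memory (wordID, docID, locID) index could look like with a plain Python dictionary - the names are invented, and a missing key plays the part of the null-pointer test mentioned above:

          from collections import defaultdict

          # Hypothetical in-memory inverted index: word -> list of (docID, locID) postings.
          postings = defaultdict(list)

          def index_document(doc_id, text):
              for loc, word in enumerate(text.lower().split()):
                  postings[word].append((doc_id, loc))

          def search(term):
              hits = postings.get(term.lower())   # None plays the role of the null pointer:
              if hits is None:                    # the term simply is not in the index
                  return []
              return hits

          index_document(1, "the quick brown fox")
          index_document(2, "the lazy brown dog")
          print(search("brown"))   # [(1, 2), (2, 2)]
          print(search("zebra"))   # []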



          #14
          ffs

          No need to compress the data - just make sure you eliminate redundant words (easily 50% of the data) and turn the rest into numbers - this will make the tables and indices very compact. Intersect is not necessary; use that other query, which will work just fine. I suggest you download SQL Server and play around with Query Analyser - it displays very nice query plans which show how efficient your query is - and the whole game is about doing enough offline to be able to run very tight queries in real time.

          The key to achieving exceptional real time performance is to do as much as possible offline.

          And buy that book - even though I learnt most of this stuff on my own I still found the book very useful - it should be even more useful to you, since you have not gone through it all the hard way (trial and error).

          Hash tables are not the best choice here - you need a clustered index on the table.
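
          A sketch of the eliminate-redundant-words-and-turn-them-into-numbers step in Python (the stopword list and variable names here are just examples):

            STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}   # example list only

            word_ids = {}       # word   -> wordID  (the lookup table)
            words_by_id = {}    # wordID -> word    (to turn IDs back into words for display)

            def word_id(word):
                if word not in word_ids:
                    wid = len(word_ids) + 1
                    word_ids[word] = wid
                    words_by_id[wid] = word
                return word_ids[word]

            def tokenize(text):
                # Drop the redundant words, store compact integer IDs instead of strings.
                return [word_id(w) for w in text.lower().split() if w not in STOPWORDS]

            print(tokenize("the history of the search engine"))   # e.g. [1, 2, 3]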



            #15
            re

            "no need to compress data - just make sure you eliminate redundant words (50% of data easy) and turn them into numbers - this will make tables and indices very compact. "

            But if I turn words into wordIDs I will still need to refer back to that hashtable just to turn them back into words again. The space I'll save by converting words to numbers will be used up again by the number-to-word lookup table.



              #16
              re

              Apologies if this is a sh.it question. What is the smallest chunk of data that can be compressed?



                #17
                Re: re

                2 bits, where only 2 of the possible 4 values are ever used - that can then be re-encoded in 1 bit.



                  #18
                  re:re

                  In real-world cases you will find that any compression mechanism has an overhead, and you need the redundancy in the data to be greater than that overhead. Remember, of course, that not all applications require 100% preservation of the information.

                  If you think about it, there needs to be a way to encode, within the data, how its size has been reduced; that is itself information, so it requires space.

                  Have you read any Shannon?



                    #19
                    Re: re

                    > But if I turn words into wordIDs I will still need to refer back
                    > to that hashtable just to turn them back into words again

                    Yes, that's what Table 1 is for - you will have a unique index on WordID, and you only join to it after you select the X results you need, so that you won't be joining _ALL_ the found WordIDs, just those you will show on screen. A temporary table is handy for that sort of thing.
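
                    The same join-only-what-you-show idea, sketched with Python dictionaries rather than SQL (the table contents are invented):

                      words_by_id = {1: "history", 2: "search", 3: "engine"}   # WordID -> word (Table 1)

                      # Suppose the search produced thousands of matching (wordID, docID) pairs...
                      all_hits = [(wid, doc) for doc in range(1000) for wid in (1, 2, 3)]

                      # ...but only the first page is shown, so only those few IDs are looked up.
                      PAGE_SIZE = 10
                      page = all_hits[:PAGE_SIZE]
                      print([(words_by_id[wid], doc) for wid, doc in page])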


                    Edit: sorry Spod I was a wee bit inconsiderate today, hope I did not hurt your feelings



                      #20
                      re

                      "In real world cases you will find that any mechanism has an overhead and you need to see the redundancy of the data being greater than the overhead. Remembering of course that not all applications require 100% preservation of information.

                      If you think about it - there needs to be a way to encode information within the data to specify how the size has been reduced, that is information so requires space.

                      Have you read any of Shannon ?"

                      No, but I've noticed he's big in compression.

                      I did a test compression on a string and noticed that the string needs to be a certain size before you see any benefit from the compression (like you said, the encoding information is bundled in).
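
                      For what it's worth, that sort of test is easy to reproduce with, say, Python's zlib: a tiny string comes out longer than it went in because the format overhead outweighs any redundancy, while a long repetitive string shrinks to a small fraction of its size.

                        import zlib

                        tiny = b"hello"
                        repetitive = b"the quick brown fox jumps over the lazy dog " * 100

                        for label, data in (("tiny", tiny), ("repetitive", repetitive)):
                            packed = zlib.compress(data)
                            print(label, len(data), "->", len(packed))

                        # The 5-byte string typically "compresses" to more than 5 bytes (header,
                        # checksum and coding overhead), while the ~4.4 KB repetitive string
                        # shrinks to a small fraction of its original size.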
