
Deleting records SQL


    #11
    > I never said that my SQL would be faster than your non-
    > standard messy code.

    :rollin

    3 lines of Transact-SQL is messy and non-standard? It's not like I am suggesting 30-50 lines - there is no reason not to do it. Anyhow, maybe if you did enough optimisation you'd learn that optimised code is non-standard by definition - it's _optimised_.

    > I recently reduced the running time of a 4 hour query to 20
    > minutes.

    Great, and I've recently improved a query from 3 sec to 0.1 sec - tables with 5 million rows involved. Care to offer your solution for effective substring searches? i.e. optimise the following:

    create table Products (SKU int, Keywords varchar(255));

    select * from Products where Keywords like '%keyword1%' and Keywords like '%keyword2%';

    Table scans are pretty expensive for a table with 100k's of products.

    Going to present this to the local Perl Mongers group, btw.

    > I could have spent weeks improving it to 5 minutes, but it
    > wouldn't give any further benefit to an 8 hour process.

    Agree in principle - but in this case the quick optimisation was not big work at all. The main difference is in understanding what exactly the database is doing in order to execute a "not in" query; it's that knowledge that allows you to optimise code, rather than just slapping an index on in the hope it will improve performance - it will, but only up to a point.
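
    For what it's worth, the sort of rewrite I mean looks roughly like this - a sketch only, with made-up table and column names:

    -- "not in": simple, but the subquery may be fully materialised, and it
    -- silently matches nothing if Orders.clientid ever contains a NULL
    delete from Clients
    where clientid not in (select clientid from Orders);

    -- "not exists": usually planned as an anti-join, and NULL-safe
    delete from Clients
    where not exists (select 1 from Orders
                      where Orders.clientid = Clients.clientid);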

    Comment


      #12
      This should speed it up a bit.

      create table WordLookup (Keyword varchar(31), SKU int);
      /*
      ** Populate it with every word in Products and its SKU, using some code, depending on the DBMS.
      ** This can be done using triggers and stored procedures to keep it up-to-date.
      */


      /* if you need both SKU and Keywords */

      select t1.SKU, Keywords
      from WordLookup t1, WordLookup t2, Products
      where t1.SKU = t2.SKU
      and Products.SKU = t1.SKU
      and t1.Keyword = 'keyword1'
      and t2.Keyword = 'keyword2';


      /* if you just need SKU */

      select t1.SKU
      from WordLookup t1, WordLookup t2
      where t1.SKU = t2.SKU
      and t1.Keyword = 'keyword1'
      and t2.Keyword = 'keyword2';
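
      For this to actually beat a table scan, WordLookup would need indexing on the columns being matched - something along these lines (exact syntax depends on the DBMS):

      /*
      ** Assumed, not part of the post above: without these indexes the
      ** self-join on WordLookup would still end up scanning it.
      */
      create index ix_WordLookup on WordLookup (Keyword, SKU);
      create index ix_Products_SKU on Products (SKU);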



      Is there a better way, or do I win the £5 record token?

      Comment


        #13
        nice try

        have you read my text?

        > care to offer your solution to effective substring
        > searches?


        Substring means that when I search for "Look" I want to find not just "Look" but also "Lookup" and "WordLookup". Your code does not cater for that at all - you restrict the engine to exact words, which is easy to code but does not solve the problem.

        another try maybe?

        I do not delude myself about the level of my English, so if you misunderstood or still do not understand the task then please feel free to ask questions.

        You also seem to want as many WordLookup tables as you have keywords. For product searches it is not unusual to have 5-7-10 keywords - do you intend to have 10 tables with the same data in them?

        In my previous job, where I designed a search engine, we had 150k products with 15-20 keywords each. If we split them into individual keywords then we'd have a WordLookup table with 3 million rows. Now, if you use the "like '%keyword%'" approach then you force the database to table scan, which kills performance. This whole task is about eliminating the table scan. Before you rush to suggest "full text index", please note that some big databases would still not use it if you put a % sign before the keyword - using it after, i.e. keyword%, is easy since existing indexes can work.
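
        To illustrate the difference (just a sketch, assuming an index exists on Products.Keywords):

        /* leading % - the optimiser cannot seek into the index, so it
           typically scans the whole table */
        select SKU from Products where Keywords like '%keyword%';

        /* trailing % only - this can be answered with an index range seek */
        select SKU from Products where Keywords like 'keyword%';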

        My approach was not the most simplistic - it's not just one query, and it requires pre-processing of the data, which I did using Perl, but it gives really good results - after its implementation, database load dropped fourfold on our website, which was pretty well visited.

        > Is there a better way, or do I win the £5 record token?

        You will have to try harder - you did not even solve the problem. My way was pretty unique (I have not seen it anywhere else), and best of all it was much more scalable than anything else. I am putting an article together on how to do it; I think this time I might implement it using .NET + SQL.

        Comment


          #14
          Why would it need more than one lookup table? Have you ever heard of self-joins?

          I know you said substring searches, but in practice, you would probably be looking for whole words - you wouldn't search for 'omputer' for example.

          God knows what kind of solution you came up with, but I bet it was non-standard and only worked on your specific system. I remember one scheme you came up with where you encoded postcodes into binary to save 4 bytes or something.

          Anyway, keep up the good work - you'll soon have a job for life as no-one else will be able to understand any of your systems.

          Comment


            #15
            > Why would it need more than one lookup table? Have you
            > ever heard of self-joins.

            Oops, your aliases threw me off a bit - apologies for that. Now I see you used the same table - no questions about that.

            > I know you said substring searches, but in practice, you
            > would probably be looking for whole words - you wouldn't
            > search for 'omputer' for example.

            The task was very specifically defined as providing substring search. You can't do that effectively, so you try to modify the original task to get away with a simple query that any student can come up with.

            In practical terms, substring searches are necessary to improve the search experience on consumer-facing websites - I know, I analysed data for what used to be a top-10 UK ecommerce website. People do input crap data, and if they fail to find stuff quickly they bog off elsewhere. People are so used to Google these days that they have very high expectations of on-site search engines, and trust me - I looked at customer conversions at a very low level - improvement in search does increase sales considerably. Hence the requirement for substring searches. Add to this that data quality for 100k's of products is often low - even from expensive content providers - and that necessitates substring searching even more.

            > you wouldn't search for 'omputer' for example.

            I would search for CD-RW and expect it to find CDRW, CD RW and CD-RW. I would search for "computer" and expect it to find "computer" as well as "computers" etc.
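
            For the CD-RW/CDRW/CD RW case, one straightforward option (a sketch, not necessarily what we actually did) is to normalise separators out of both the stored keywords and the search term before matching:

            /* sketch only: apply the same normalisation to stored keywords
               and to the user's input, so 'CD-RW', 'CD RW' and 'CDRW' all
               compare as 'CDRW' */
            select replace(replace('CD-RW', '-', ''), ' ', '') as normalised;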

            > God knows what kind of solution you came up with, but I
            > bet it was non-standard and only worked on your specific
            > system.

            It worked on Sybase, SQL Server (MSFT) and DB2 (ported within days). Specifically, our overall database load dropped fourfold - this put us in a position to scale down on hardware resources and cut costs from £500k pa to £50k - a big gain in my book.

            > I remember one scheme you came up with where
            > you encoded postcodes into binary to save 4 bytes or
            > something.

            I have got very good experience with optimisation - I used to program in assembly and I know the value of each byte. My suggestion would have worked perfectly for huge datasets with 100s of millions of rows - these are enterprise-level tasks, not joke Access-like "databases". I had to deal with 1GB of data collected daily, and I did analyses over months of data using very non-trivial queries - a 4-byte saving like the one above cut query times down by many hours.

            > Anyway, keep up the good work - you'll soon have a job for
            > life as no-one else will be able to understand any of your
            > systems.

            Thanks, even though I sense sarcasm there. I can assure you that my work was understood by some bright individuals.


            P.S. Now it would only be fair to offer a solution to the problem; feel free to show me sources where they did it the same way:

            The key idea is to avoid queries with a % sign before the keyword - this forces the db to ignore the index. So, we split all words in the following manner:

            computer
            omputer
            mputer
            puter
            uter
            ter

            These partial words point to the actual SKU you are looking for, so the query is to identify all matching split words and retrieve the unique SKUs - easier said than done, but the key principle that lets you avoid table scans yet still search for substrings is as above. Add to this that the approach enables easy automatic spell-checking a la Google without getting humans involved.
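
            In its most stripped-down form the idea looks something like this (a sketch only, not the production code):

            /* one row per word suffix, pointing at the SKU it came from */
            create table SuffixLookup (Suffix varchar(31), SKU int);
            create index ix_SuffixLookup on SuffixLookup (Suffix, SKU);

            /* for SKU 12345 with keyword 'computer' you insert 'computer',
               'omputer', 'mputer', 'puter', ... all against 12345 */

            /* the substring search then becomes an index-friendly prefix match */
            select distinct SKU
            from SuffixLookup
            where Suffix like 'puter%';  -- finds SKUs whose keywords contain 'puter'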

            Comment


              #16
              I must admit I wouldn't have thought of that scheme.

              Did you

              a) split each value in the Keywords field into (up to) 254 substrings and put in a lookup table,
              or
              b) split the keyword1 and keyword2 into a number of substrings and run several queries?


              On second thoughts, b) wouldn't work as it could only find the first word in the Keywords field without the leading %.

              Comment


                #17
                Yeah, it's pretty unique - I have not seen it elsewhere, and after I looked at the techniques people use to build "full text indexes" I found that they still don't allow true substring search, i.e. they let you append % at the end of a word but not put it in front.

                > a) split each value in the Keywords field into (up to) 254
                > substrings and put in a lookup table,

                Statistics is a powerful thing - have you considered that the total size of a good English dictionary is less than 60k words? Then think of average word length and realise that it is reasonable to expect people not to search for individual letters, or even 2-3 char words unless they are FULL words, so you only split anything bigger than 3 chars. Create a table with the unique split parts and their frequencies - this would be very small; we had less than 250k unique word parts from a big data set. More to the point, words in the English language are pretty short - 4-5 chars on average - so you will only have 3-4-7 parts for each word, and that's not a lot.
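
                The splitting itself is trivial - a sketch in T-SQL (I actually did the pre-processing in Perl), keeping only parts of 4+ characters:

                declare @word varchar(31), @i int
                declare @parts table (Part varchar(31))
                set @word = 'computer'
                set @i = 1

                while @i <= len(@word) - 3      -- keep parts of 4+ chars
                begin
                    insert into @parts values (substring(@word, @i, len(@word)))
                    set @i = @i + 1
                end

                select Part from @parts   -- computer, omputer, mputer, puter, uter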

                > b) split the keyword1 and keyword2 into a number of
                > substrings and run several queries?

                Once you split the whole lot you will have a keywords table that you can query for keys using normal notation, like 'keyword%' - note that you only split words from the FRONT; there is no need to do it from the back, because using % at the end of a word is fine for indexes - it's still fast. So, you query the one keywords table to identify the unique number assigned to each part using normal query methods - very fast - then you just query the big Stock Codes table using the identified numbers as an exact match; use a clustered index and it's a breeze. Remember my 4-byte saving? Well, it helps a lot in this sort of situation, especially when you can fit things into tight 64-bit data types when you run on 64-bit boxes.
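
                Roughly, the two-step shape of it (a sketch with made-up names, not the real schema):

                /* small table of unique split parts, big table of part -> product */
                create table WordParts (PartID int, Part varchar(31));
                create table ProductParts (PartID int, SKU int);
                create index ix_WordParts on WordParts (Part, PartID);
                create clustered index ix_ProductParts on ProductParts (PartID, SKU);

                /* step 1: prefix match on the small parts table (no leading %);
                   step 2: exact integer join into the big table on the clustered index */
                select distinct pp.SKU
                from WordParts wp, ProductParts pp
                where wp.Part like 'puter%'
                  and pp.PartID = wp.PartID;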

                There is sure a lot more to it to make it really good - too long to cover here - but you get the idea. This algorithm, albeit not well known, generates really good savings - as I said, our overall database load dropped from 60% to 15%, and only the SEARCH engine of the website had been optimised.

                Comment


                  #18
                  My apologies for not putting it in plain comprehensible English - I've been working far too much on a big project at work recently.

                  Comment


                    #19
                    surely....

                    delete tblThis
                    from tblThis left outer join tblThat
                    on tblThis.clientid = tblThat.clientid
                    where tblThat.clientid is null

                    or something similar?

                    Comment


                      #20
                      Re: surely....

                      Well, that should delete the rows where tblThat.clientid is null. Too tired to try it out at home, gotta try it at work - not particularly into outer joins, perhaps I am missing something???

                      Comment
