
Journey to SQL Authority


    #21
    Originally posted by Freamon View Post
    In a real ETL tool you could also partition your data by the first half of the postcode to improve the join performance.
    Eh??
    SSIS is a "real ETL Tool" and splitting data is easy as pie
    and
    Splitting data doesn't help join performance anyway given that joins have to be done on ordered data streams
    Coffee's for closers

    Comment


      #22
      Originally posted by Spacecadet View Post
      Eh??
      SSIS is a "real ETL Tool" and splitting data is easy as pie
      and
      Splitting data doesn't help join performance anyway given that joins have to be done on ordered data streams
      I think it would help in this case. If you were to partition the data on each side of the join by the first half of the postcode then your sorts would only have to sort a small amount of data each before joining. And they could run in parallel. Not sure if the parallel bit is possible in SSIS or not.
      "A life, Jimmy, you know what that is? It’s the s*** that happens while you’re waiting for moments that never come." -- Lester Freamon

      Comment


        #23
        Originally posted by Freamon View Post
        I think it would help in this case. If you were to partition the data on each side of the join by the first half of the postcode then your sorts would only have to sort a small amount of data each before joining. And they could run in parallel. Not sure if the parallel bit is possible in SSIS or not.
        One of the great things about SSIS is the fact that you can run things in parallel and really make good use of all the CPU and RAM available.
        ETL World Record! - SQL Server Performance - Site Home - MSDN Blogs

        However, if you're sorting data using the SSIS component prior to joining then you've got bigger problems on your hands.

        Anyway - joining by postcode isn't the problem; in fact that's done very early on in my package as part of filtering out the well formed address data, and takes about a second or two... the issue is working out the more complex joins between the badly formed addresses and the rest of the PAF data. These joins are relatively straightforward to write in SQL but impossible in SSIS, which expects a nice clean join.
        Coffee's for closers

        Comment


          #24
          Originally posted by Spacecadet View Post
          One of the great things about SSIS is the fact that you can run things in parallel and really make good use of all the CPU and RAM available.
          ETL World Record! - SQL Server Performance - Site Home - MSDN Blogs

          However, if you're sorting data using the SSIS component prior to joining then you've got bigger problems on your hands.
          But do you see what I'm saying about partitioning? If your initial component is reading data row-by-row and then firing it out to parallel streams, one for each postcode area, i.e. a stream for AB1, AB2, AB3 etc, and you then only have to sort each group of data within the stream, your sort will be more efficient than trying to sort the entire dataset in one go and then join it. Because you already know from the data that only a row with an AB1 postcode has a chance of joining to another row with an AB1 postcode, etc.
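[Editor's note: the partitioning idea above, sketched in Python under simplified assumptions. Rows are bucketed by postcode area (the "outward" code, e.g. AB1), each bucket is sorted and matched independently; buckets share no keys, so in a real pipeline they could run in parallel. The data and helper names are illustrative, not from the thread.]

```python
from collections import defaultdict

def outward(postcode):
    # First half of the postcode, e.g. 'AB1 2CD' -> 'AB1'
    return postcode.split()[0]

def partitioned_join(left, right, key=outward):
    # Bucket each side by postcode area, then sort and match only
    # within the same bucket -- each sort handles a fraction of the
    # data, and buckets are independent (parallelisable).
    buckets_l, buckets_r = defaultdict(list), defaultdict(list)
    for pc, payload in left:
        buckets_l[key(pc)].append((pc, payload))
    for pc, payload in right:
        buckets_r[key(pc)].append((pc, payload))
    joined = []
    for area in buckets_l:
        l = sorted(buckets_l[area])
        r = sorted(buckets_r.get(area, []))
        # simple nested match within the (small) bucket
        for pc, a in l:
            for pc2, b in r:
                if pc == pc2:
                    joined.append((pc, a, b))
    return joined

left = [("AB1 2CD", "house"), ("AB2 3EF", "flat")]
right = [("AB1 2CD", 100), ("ZZ9 9ZZ", 200)]
print(partitioned_join(left, right))  # → [('AB1 2CD', 'house', 100)]
```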

          Originally posted by Spacecadet View Post
          Anyway - joining by postcode isn't the problem; in fact that's done very early on in my package as part of filtering out the well formed address data, and takes about a second or two... the issue is working out the more complex joins between the badly formed addresses and the rest of the PAF data. These joins are relatively straightforward to write in SQL but impossible in SSIS, which expects a nice clean join.
          I'd suggest you need a specialist address cleansing tool to get the job done. There are tools which will cope with a myriad of different forms of address data, and have all the intelligence around what notation is equivalent (e.g. "Rd" vs. "Road", house number and road name appearing on different address lines, house name included or excluded, etc etc) built into the tool, and can generate a cleansed address with a unique address key which would then join into PAF easily. Dataflux, Trillium etc will all do this, at a price. There are probably cheaper ones that are just as good when specialising only in address data.
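[Editor's note: a toy illustration of the equivalence rules such cleansing tools encode. Real products like Dataflux or Trillium handle vastly more cases; this sketch just shows the principle that "Rd" and "Road" normalise to the same join key. The abbreviation table is invented for the example.]

```python
import re

# A tiny, hypothetical subset of the notation-equivalence rules a
# commercial address cleansing tool would ship with.
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}

def normalise(address):
    # Lowercase, strip punctuation, expand known abbreviations.
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

# Two differently written addresses reduce to the same key,
# which would then join into a reference table easily.
print(normalise("12 High Rd"))     # → '12 high road'
print(normalise("12, High Road"))  # → '12 high road'
```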
          "A life, Jimmy, you know what that is? It’s the s*** that happens while you’re waiting for moments that never come." -- Lester Freamon

          Comment


            #25
            Originally posted by Freamon View Post
            your sort will be more efficient than trying to sort the entire dataset in one go and then join it.
            Like I said, you shouldn't be using the SSIS sort component prior to joining
            Coffee's for closers

            Comment


              #26
              Originally posted by Spacecadet View Post
              Like I said, you shouldn't be using the SSIS sort component prior to joining
              Why not? What if my input is an unsorted text file, and my join component is set up to assume the input is sorted?
              "A life, Jimmy, you know what that is? It’s the s*** that happens while you’re waiting for moments that never come." -- Lester Freamon

              Comment


                #27
                Originally posted by Freamon View Post
                Why not? What if my input is an unsorted text file, and my join component is set up to assume the input is sorted?
                Because the sort component takes way too long for large data sets.

                If you bulk load the incoming data into a database first, then pull it out in the correct order, you'll find it's a lot quicker (as in orders of magnitude quicker) and you've got the raw data in the database ready for audit reporting on the ETL process.
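[Editor's note: the staging pattern described above, sketched with an in-memory SQLite database. Table and column names are illustrative; the point is to bulk load first and let the database engine do the sort via ORDER BY, rather than sorting in the data flow.]

```python
import sqlite3

# Stage the raw (unsorted) feed in a database table first...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (postcode TEXT, address TEXT)")
rows = [("ZZ9 9ZZ", "c"), ("AB1 2CD", "a"), ("CD3 4EF", "b")]
conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)

# ...then pull it out already ordered on the join key, so the
# downstream join never needs an in-pipeline sort component.
ordered = conn.execute(
    "SELECT postcode, address FROM staging ORDER BY postcode"
).fetchall()
print(ordered)  # → [('AB1 2CD', 'a'), ('CD3 4EF', 'b'), ('ZZ9 9ZZ', 'c')]
```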
                Coffee's for closers

                Comment


                  #28
                  Originally posted by Spacecadet View Post
                  Because the sort component takes way too long for large data sets.

                  If you bulk load the incoming data into a database first, then pull it out in the correct order, you'll find it's a lot quicker (as in orders of magnitude quicker) and you've got the raw data in the database ready for audit reporting on the ETL process.
                  Bollocks. Use a self-balancing binary tree!

                  Churchill - In "You didn't expect a serious answer, did you?" mode.

                  Comment


                    #29
                    Originally posted by Churchill View Post
                    Bollocks. Use a self-balancing binary tree!

                    Churchill - In "You didn't expect a serious answer, did you?" mode.
                    Much as I hate agreeing with spacecadet, using SSIS to sort even vaguely large amounts of data is painfully slow.

                    Given a large amount of unsorted data outside a database, a temporary staging area is an unavoidable overhead.
                    merely at clientco for the entertainment

                    Comment


                      #30
                      Originally posted by eek View Post
                      Much as I hate agreeing with spacecadet
                      Coffee's for closers

                      Comment
