
Previously on "Journey to SQL Authority"


  • Spacecadet
    replied
    Originally posted by Freamon
    Basically what this tells you is that Microsoft have implemented two sort algorithms (one in SSIS and one in SQL Server) and the performance of one is way better than the other. If only they would update SSIS to use the same algorithm as SQL Server...
    Yup, the cack sort is one of the few aspects of SSIS which I hate.
    The fact that you can load the data to disk, index it and then pull it back off the disk in order faster than SSIS can sort it is fairly shocking!
    And it's not as if they don't sell other products which can do in-memory sorting in a reasonably performant manner.



  • Freamon
    replied
    Originally posted by Spacecadet
    Because the sort component takes way too long for large data sets.

    If you bulk load the incoming data into a database first, then pull it out in the correct order, you'll find it's a lot quicker (as in orders of magnitude quicker) and you've got the raw data in the database ready for audit reporting on the ETL process.
    Basically what this tells you is that Microsoft have implemented two sort algorithms (one in SSIS and one in SQL Server) and the performance of one is way better than the other. If only they would update SSIS to use the same algorithm as SQL Server...



  • Spacecadet
    replied
    Originally posted by eek
    Much as I hate agreeing with Spacecadet



  • eek
    replied
    Originally posted by Churchill
    Bollocks. Use a self-balancing binary tree!

    Churchill - In "You didn't expect a serious answer, did you?" mode.
    Much as I hate agreeing with Spacecadet, using SSIS to sort even vaguely large amounts of data is painfully slow.

    Given a large amount of unsorted data outside a database, a temporary staging area is an unavoidable overhead.



  • Churchill
    replied
    Originally posted by Spacecadet
    Because the sort component takes way too long for large data sets.

    If you bulk load the incoming data into a database first, then pull it out in the correct order, you'll find it's a lot quicker (as in orders of magnitude quicker) and you've got the raw data in the database ready for audit reporting on the ETL process.
    Bollocks. Use a self-balancing binary tree!

    Churchill - In "You didn't expect a serious answer, did you?" mode.



  • Spacecadet
    replied
    Originally posted by Freamon
    Why not? What if my input is an unsorted text file, and my join component is set up to assume the input is sorted?
    Because the sort component takes way too long for large data sets.

    If you bulk load the incoming data into a database first, then pull it out in the correct order, you'll find it's a lot quicker (as in orders of magnitude quicker) and you've got the raw data in the database ready for audit reporting on the ETL process.
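
A minimal T-SQL sketch of the staging approach described in the reply above: bulk load the raw file, build a clustered index on the sort key, then read the rows back in order and let the database engine do the sorting. The file path, schema, table and column names are purely illustrative.

    -- Staging table for the raw feed (assumes a stg schema already exists).
    CREATE TABLE stg.RawAddresses
    (
        AddressId    int          NOT NULL,
        Postcode     varchar(8)   NULL,
        AddressLine1 varchar(100) NULL
    );

    -- Bulk load the incoming file straight into the staging table.
    BULK INSERT stg.RawAddresses
    FROM 'C:\etl\raw_addresses.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

    -- Index on the sort key so the rows can be read back in order.
    CREATE CLUSTERED INDEX CIX_RawAddresses_Postcode
        ON stg.RawAddresses (Postcode, AddressId);

    -- Feed this ordered result into the data flow in place of the SSIS Sort component.
    SELECT AddressId, Postcode, AddressLine1
    FROM stg.RawAddresses
    ORDER BY Postcode, AddressId;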



  • Freamon
    replied
    Originally posted by Spacecadet
    Like I said, you shouldn't be using the SSIS sort component prior to joining.
    Why not? What if my input is an unsorted text file, and my join component is set up to assume the input is sorted?



  • Spacecadet
    replied
    Originally posted by Freamon
    your sort will be more efficient than trying to sort the entire dataset in one go and then join it.
    Like I said, you shouldn't be using the SSIS sort component prior to joining.



  • Freamon
    replied
    Originally posted by Spacecadet
    One of the great things about SSIS is the fact that you can run things in parallel and really make good use of all the CPU and RAM available.
    ETL World Record! - SQL Server Performance - Site Home - MSDN Blogs

    However, if you're sorting data using the SSIS component prior to joining then you've got bigger problems on your hands.
    But do you see what I'm saying about partitioning? If your initial component is reading data row by row and then firing it out to parallel streams, one for each postcode area (i.e. a stream for AB1, AB2, AB3, etc.), and you then only have to sort each group of data within the stream, your sort will be more efficient than trying to sort the entire dataset in one go and then join it, because you already know from the data that only a row with an AB1 postcode has a chance of joining to another row with an AB1 postcode, and so on.

    Originally posted by Spacecadet
    Anyway - joining by postcode isn't the problem; in fact that's done very early on in my package as part of filtering out the well-formed address data and takes about a second or two... the issue is working out the more complex joins between the badly formed addresses and the rest of the PAF data. These joins are relatively straightforward to write in SQL but impossible in SSIS, which expects a nice clean join.
    I'd suggest you need a specialist address cleansing tool to get the job done. There are tools which will cope with a myriad of different forms of address data, have all the intelligence around what notation is equivalent (e.g. "Rd" vs. "Road", house number and road name appearing on different address lines, house name included or excluded, etc.) built into the tool, and can generate a cleansed address with a unique address key which would then join into PAF easily. Dataflux, Trillium, etc. will all do this, at a price. There are probably cheaper ones that are just as good when specialising only in address data.
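
A deliberately tiny, hypothetical illustration of the notation equivalence mentioned above; a real cleansing tool handles far more than a single abbreviation. The stg.BadAddresses table and its columns are invented for the example.

    -- Normalise one common abbreviation ("Rd" to "Road") before attempting a match;
    -- the trailing space trick makes the pattern hit an abbreviation at the end of the line too.
    SELECT AddressId,
           RTRIM(REPLACE(UPPER(AddressLine1) + ' ', ' RD ', ' ROAD ')) AS NormalisedLine1
    FROM stg.BadAddresses;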



  • Spacecadet
    replied
    Originally posted by Freamon
    I think it would help in this case. If you were to partition the data on each side of the join by the first half of the postcode then your sorts would only have to sort a small amount of data each before joining. And they could run in parallel. Not sure if the parallel bit is possible in SSIS or not.
    One of the great things about SSIS is the fact that you can run things in parallel and really make good use of all the CPU and RAM available.
    ETL World Record! - SQL Server Performance - Site Home - MSDN Blogs

    However, if you're sorting data using the SSIS component prior to joining then you've got bigger problems on your hands.

    Anyway - joining by postcode isn't the problem; in fact that's done very early on in my package as part of filtering out the well-formed address data and takes about a second or two... the issue is working out the more complex joins between the badly formed addresses and the rest of the PAF data. These joins are relatively straightforward to write in SQL but impossible in SSIS, which expects a nice clean join.
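
The actual joins aren't shown in the thread, but a hypothetical sketch of the kind of "less clean" match that is easy to write in T-SQL yet awkward to express as an SSIS merge join might look like this; stg.BadAddresses, ref.Paf and their columns are invented for illustration.

    -- Match badly formed addresses to PAF on postcode plus a building number
    -- pulled out of a free-text address line.
    SELECT a.AddressId, p.Udprn
    FROM stg.BadAddresses AS a
    CROSS APPLY (SELECT CASE WHEN a.AddressLine1 LIKE '[0-9]%'
                             THEN CAST(LEFT(a.AddressLine1,
                                  PATINDEX('%[^0-9]%', a.AddressLine1 + ' ') - 1) AS int)
                        END AS BuildingNumber) AS x   -- leading digits, if any
    JOIN ref.Paf AS p
      ON  p.Postcode       = a.Postcode
      AND p.BuildingNumber = x.BuildingNumber;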



  • Freamon
    replied
    Originally posted by Spacecadet
    Eh??
    SSIS is a "real ETL Tool" and splitting data is easy as pie
    and
    Splitting data doesn't help join performance anyway, given that joins have to be done on ordered data streams
    I think it would help in this case. If you were to partition the data on each side of the join by the first half of the postcode then your sorts would only have to sort a small amount of data each before joining. And they could run in parallel. Not sure if the parallel bit is possible in SSIS or not.
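
A rough T-SQL analogue of the partitioning idea in the reply above, with invented stg.Addresses and ref.Paf tables: materialise the outward code (the first half of the postcode) on each side so the sort and join only ever operate within one postcode area.

    -- Persisted outward-code column plus a supporting index; ref.Paf is assumed
    -- to carry an equivalent OutwardCode column.
    ALTER TABLE stg.Addresses
        ADD OutwardCode AS LEFT(Postcode, CHARINDEX(' ', Postcode + ' ') - 1) PERSISTED;

    CREATE INDEX IX_Addresses_Outward ON stg.Addresses (OutwardCode, Postcode);

    -- A row can only ever match another row in the same outward code (AB1 to AB1,
    -- AB2 to AB2, and so on), so each partition is sorted and joined independently.
    SELECT a.AddressId, p.Udprn
    FROM stg.Addresses AS a
    JOIN ref.Paf       AS p
      ON  p.OutwardCode = a.OutwardCode
      AND p.Postcode    = a.Postcode;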



  • Spacecadet
    replied
    Originally posted by Freamon
    In a real ETL tool you could also partition your data by the first half of the postcode to improve the join performance.
    Eh??
    SSIS is a "real ETL Tool" and splitting data is easy as pie
    and
    Splitting data doesn't help join performance anyway, given that joins have to be done on ordered data streams



  • Freamon
    replied
    Originally posted by Spacecadet
    it's not and no... although I might give the join another go tonight, just thought of another way of doing it.
    Thanks
    A lookup on a 30-million-row table is definitely not a good idea. The Lookup component is designed to read the whole table into memory when the job starts up, and then access the data in memory for each row that passes through. If you read 30 million rows into memory, your job will swap to disk a lot. I'd suggest a join is definitely the way to go. In a real ETL tool you could also partition your data by the first half of the postcode to improve the join performance.

    Originally posted by Spacecadet
    bingo

    problem with address data (at least in this project) is that the incoming format is crap and doesn't join nicely to the PAF data
    Yep, I think most people have been through this pain at some point, especially when it's addresses that have been manually keyed in with no sense-checking at the point of entry. There are specialist data cleansing tools for address data which may save you a lot of pain. Dataflux does a good job of it (amongst other things).
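
A sketch of the set-based alternative to the 30-million-row Lookup discussed in the reply above, with invented table and column names: stage the incoming rows and LEFT JOIN them to the PAF reference table, so unmatched addresses fall through to the cleansing step instead of being probed row by row against an in-memory copy.

    SELECT s.AddressId,
           p.Udprn                      -- NULL where no clean PAF match was found
    FROM stg.IncomingAddresses AS s
    LEFT JOIN ref.Paf          AS p
           ON  p.Postcode       = s.Postcode
           AND p.BuildingNumber = s.BuildingNumber;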



  • Spacecadet
    replied
    Originally posted by eek
    You'll have fun with that. Last time I did that I ended up with a multi-stage process based on the accuracy of the individual records.

    10% of it turned out to be total and utter crap, which was surprising as they were all customers who had purchased goods via mail order.
    That's pretty much the conclusion I'd got to as well: filter out the good, the not so good and the passable, and then ignore the rest!



  • EternalOptimist
    replied
    Originally posted by eek
    Fuzzy logic doesn't cover the example of the postcode given for Main Street but the address given as High Row (which has a different postcode).
    Yes it does.

