
Journey to SQL Authority


    #21
    Originally posted by Freamon View Post
    In a real ETL tool you could also partition your data by the first half of the postcode to improve the join performance.
    Eh??
    SSIS is a "real ETL Tool" and splitting data is easy as pie
    and
    Splitting data doesn't help join performance anyway given that joins have to be done on ordered data streams
    Coffee's for closers

    Comment


      #22
      Originally posted by Spacecadet View Post
      Eh??
      SSIS is a "real ETL Tool" and splitting data is easy as pie
      and
      Splitting data doesn't help join performance anyway given that joins have to be done on ordered data streams
      I think it would help in this case. If you were to partition the data on each side of the join by the first half of the postcode then your sorts would only have to sort a small amount of data each before joining. And they could run in parallel. Not sure if the parallel bit is possible in SSIS or not.
      "A life, Jimmy, you know what that is? It’s the s*** that happens while you’re waiting for moments that never come." -- Lester Freamon

      Comment


        #23
        Originally posted by Freamon View Post
        I think it would help in this case. If you were to partition the data on each side of the join by the first half of the postcode then your sorts would only have to sort a small amount of data each before joining. And they could run in parallel. Not sure if the parallel bit is possible in SSIS or not.
        One of the great things about SSIS is the fact that you can run things in parallel and really make good use of all the CPU and RAM available.
        ETL World Record! - SQL Server Performance - Site Home - MSDN Blogs

        However, if you're sorting data using the SSIS component prior to joining then you've got bigger problems on your hands.

        Anyway - joining by postcode isn't the problem; in fact that's done very early on in my package as part of filtering out the well formed address data, and takes about a second or two... the issue is working out the more complex joins between the badly formed addresses and the rest of the PAF data. These joins are relatively straightforward to write in SQL but impossible in SSIS, which expects a nice clean join.
        Coffee's for closers

        Comment


          #24
          Originally posted by Spacecadet View Post
          One of the great things about SSIS is the fact that you can run things in parallel and really make good use of all the CPU and RAM available.
          ETL World Record! - SQL Server Performance - Site Home - MSDN Blogs

          However, if you're sorting data using the SSIS component prior to joining then you've got bigger problems on your hands.
          But do you see what I'm saying about partitioning? If your initial component is reading data row-by-row and then firing it out to parallel streams, one for each postcode area, i.e. a stream for AB1, AB2, AB3 etc, and you then only have to sort each group of data within the stream, your sort will be more efficient than trying to sort the entire dataset in one go and then join it. Because you already know from the data that only a row with an AB1 postcode has a chance of joining to another row with an AB1 postcode, etc.
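[Editor's note: the partitioning idea above, sketched in Python under simplified assumptions. Rows are bucketed by postcode area (the "outward" code, e.g. AB1), each bucket is sorted and matched independently; buckets share no keys, so in a real pipeline they could run in parallel. The data and helper names are illustrative, not from the thread.]

```python
from collections import defaultdict

def outward(postcode):
    # First half of the postcode, e.g. 'AB1 2CD' -> 'AB1'
    return postcode.split()[0]

def partitioned_join(left, right, key=outward):
    # Bucket each side by postcode area, then sort and match only
    # within the same bucket -- each sort handles a fraction of the
    # data, and buckets are independent (parallelisable).
    buckets_l, buckets_r = defaultdict(list), defaultdict(list)
    for pc, payload in left:
        buckets_l[key(pc)].append((pc, payload))
    for pc, payload in right:
        buckets_r[key(pc)].append((pc, payload))
    joined = []
    for area in buckets_l:
        l = sorted(buckets_l[area])
        r = sorted(buckets_r.get(area, []))
        # simple nested match within the (small) bucket
        for pc, a in l:
            for pc2, b in r:
                if pc == pc2:
                    joined.append((pc, a, b))
    return joined

left = [("AB1 2CD", "house"), ("AB2 3EF", "flat")]
right = [("AB1 2CD", 100), ("ZZ9 9ZZ", 200)]
print(partitioned_join(left, right))  # → [('AB1 2CD', 'house', 100)]
```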

          Originally posted by Spacecadet View Post
          Anyway - joining by postcode isn't the problem; in fact that's done very early on in my package as part of filtering out the well formed address data, and takes about a second or two... the issue is working out the more complex joins between the badly formed addresses and the rest of the PAF data. These joins are relatively straightforward to write in SQL but impossible in SSIS, which expects a nice clean join.
          I'd suggest you need a specialist address cleansing tool to get the job done. There are tools which will cope with a myriad of different forms of address data, and have all the intelligence around what notation is equivalent (e.g. "Rd" vs. "Road", house number and road name appearing on different address lines, house name included or excluded, etc etc) built into the tool, and can generate a cleansed address with a unique address key which would then join into PAF easily. Dataflux, Trillium etc will all do this, at a price. There are probably cheaper ones that are just as good when specialising only in address data.
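[Editor's note: a toy illustration of the equivalence rules such cleansing tools encode. Real products like Dataflux or Trillium handle vastly more cases; this sketch just shows the principle that "Rd" and "Road" normalise to the same join key. The abbreviation table is invented for the example.]

```python
import re

# A tiny, hypothetical subset of the notation-equivalence rules a
# commercial address cleansing tool would ship with.
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}

def normalise(address):
    # Lowercase, strip punctuation, expand known abbreviations.
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

# Two differently written addresses reduce to the same key,
# which would then join into a reference table easily.
print(normalise("12 High Rd"))     # → '12 high road'
print(normalise("12, High Road"))  # → '12 high road'
```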
          "A life, Jimmy, you know what that is? It’s the s*** that happens while you’re waiting for moments that never come." -- Lester Freamon

          Comment


            #25
            Originally posted by Freamon View Post
            your sort will be more efficient than trying to sort the entire dataset in one go and then join it.
            Like I said, you shouldn't be using the SSIS sort component prior to joining
            Coffee's for closers

            Comment


              #26
              Originally posted by Spacecadet View Post
              Like I said, you shouldn't be using the SSIS sort component prior to joining
              Why not? What if my input is an unsorted text file, and my join component is set up to assume the input is sorted?
              "A life, Jimmy, you know what that is? It’s the s*** that happens while you’re waiting for moments that never come." -- Lester Freamon

              Comment


                #27
                Originally posted by Freamon View Post
                Why not? What if my input is an unsorted text file, and my join component is set up to assume the input is sorted?
                Because the sort component takes way too long for large data sets.

                If you bulk load the incoming data into a database first, then pull it out in the correct order, you'll find it's a lot quicker (as in orders of magnitude quicker) and you've got the raw data in the database ready for audit reporting on the ETL process.
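[Editor's note: the staging pattern described above, sketched with an in-memory SQLite database. Table and column names are illustrative; the point is to bulk load first and let the database engine do the sort via ORDER BY, rather than sorting in the data flow.]

```python
import sqlite3

# Stage the raw (unsorted) feed in a database table first...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (postcode TEXT, address TEXT)")
rows = [("ZZ9 9ZZ", "c"), ("AB1 2CD", "a"), ("CD3 4EF", "b")]
conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)

# ...then pull it out already ordered on the join key, so the
# downstream join never needs an in-pipeline sort component.
ordered = conn.execute(
    "SELECT postcode, address FROM staging ORDER BY postcode"
).fetchall()
print(ordered)  # → [('AB1 2CD', 'a'), ('CD3 4EF', 'b'), ('ZZ9 9ZZ', 'c')]
```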
                Coffee's for closers

                Comment


                  #28
                  Originally posted by Spacecadet View Post
                  Because the sort component takes way too long for large data sets.

                  If you bulk load the incoming data into a database first, then pull it out in the correct order, you'll find it's a lot quicker (as in orders of magnitude quicker) and you've got the raw data in the database ready for audit reporting on the ETL process.
                  Bollocks. Use a self-balancing binary tree!

                  Churchill - In "You didn't expect a serious answer, did you?" mode.

                  Comment


                    #29
                    Originally posted by Churchill View Post
                    Bollocks. Use a self-balancing binary tree!

                    Churchill - In "You didn't expect a serious answer, did you?" mode.
                    Much as I hate agreeing with spacecadet, using SSIS to sort even vaguely large amounts of data is painfully slow.

                    Given a large amount of unsorted data outside a database, a temporary staging area is an unavoidable overhead.
                    merely at clientco for the entertainment

                    Comment


                      #30
                      Originally posted by eek View Post
                      Much as I hate agreeing with spacecadet
                      Coffee's for closers

                      Comment
