• Visitors can check out the Forum FAQ by clicking this link. You have to register before you can post: click the REGISTER link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below. View our Forum Privacy Policy.
  • Want to receive the latest contracting news and advice straight to your inbox? Sign up to the ContractorUK newsletter here. Every sign up will also be entered into a draw to WIN £100 Amazon vouchers!

Journey to SQL Authority

Collapse
X
  •  
  • Filter
  • Time
  • Show
Clear All
new posts

    #11
    Originally posted by Spacecadet View Post
    You've not pushed it hard enough then
    I hit a problem this morning whilst adding some lookups. I wanted to compare several columns in order to the lookup table - which itself is about 30 million rows.
    SSIS fell over trying to cache the data and not caching took way to long.
    Ended up having to move that functionality into SQL.
    You had a lookup table that is 30 million rows? Sounds like a design issue. Couldn't you use a join instead?
    "A life, Jimmy, you know what that is? It’s the s*** that happens while you’re waiting for moments that never come." -- Lester Freamon

    Comment


      #12
      Originally posted by Freamon View Post
      You had a lookup table that is 30 million rows? Sounds like a design issue. Couldn't you use a join instead?
      Sounds like a UK address table.

      Comment


        #13
        Originally posted by Freamon View Post
        You had a lookup table that is 30 million rows? Sounds like a design issue. Couldn't you use a join instead?
        it's not and no... although i might give the join another go tonight, just thought of another way of doing it.
        Thanks

        Originally posted by k2p2 View Post
        Sounds like a UK address table.
        bingo

        problem with address data (at
        least in this project) is that the incoming format is crap and doesn't join nicely to the PAF data
        Coffee's for closers

        Comment


          #14
          Originally posted by Spacecadet View Post
          problem with address data (at
          least in this project) is that the incoming format is crap and doesn't join nicely to the PAF data
          I had this in a previous life. I built a neat pattern-matching tool to find likely dupes

          ah, halcyon days


          (\__/)
          (>'.'<)
          ("")("") Born to Drink. Forced to Work

          Comment


            #15
            Originally posted by Spacecadet View Post

            bingo

            problem with address data (at
            least in this project) is that the incoming format is crap and doesn't join nicely to the PAF data
            You'll have fun with that. Last time I did that I ended with a multi-stage process based on the accuracy of the individual records.

            10% of it turned out to be total and utter crap which was surprising as they were all customers who had purchased goods via mail order.
            merely at clientco for the entertainment

            Comment


              #16
              Originally posted by eek View Post
              You'll have fun with that. Last time I did that I ended with a multi-stage process based on the accuracy of the individual records.

              10% of it turned out to be total and utter crap which was surprising as they were all customers who had purchased goods via mail order.
              fuzzy logic. thats the answer, fuzzy logic





              (\__/)
              (>'.'<)
              ("")("") Born to Drink. Forced to Work

              Comment


                #17
                Originally posted by EternalOptimist View Post
                fuzzy logic. thats the answer, fuzzy logic





                Fuzzy Logic doesn't cover the example of postcode given for Main Street, address given as High Row (which has a different postcode).
                merely at clientco for the entertainment

                Comment


                  #18
                  Originally posted by eek View Post
                  Fuzzy Logic doesn't cover the example of postcode given for Main Street, address given as High Row (which has a different postcode).
                  yes it does
                  (\__/)
                  (>'.'<)
                  ("")("") Born to Drink. Forced to Work

                  Comment


                    #19
                    Originally posted by eek View Post
                    You'll have fun with that. Last time I did that I ended with a multi-stage process based on the accuracy of the individual records.

                    10% of it turned out to be total and utter crap which was surprising as they were all customers who had purchased goods via mail order.
                    That's pretty much the conclusion I'd got to as well, filter out the good, the not so good, the passable and then ignore the rest!
                    Coffee's for closers

                    Comment


                      #20
                      Originally posted by Spacecadet View Post
                      it's not and no... although i might give the join another go tonight, just thought of another way of doing it.
                      Thanks
                      A lookup on a 30 million row table is definitely not a good idea. Lookup is designed to read the whole table into memory when the job starts up, and then access the data in memory for each row that passes through. If you read 30 million rows into memory your job will swap to disk a lot. I'd suggest using a join is definitely the way to go. In a real ETL tool you could also partition your data by the first half of the postcode to improve the join performance.

                      Originally posted by Spacecadet View Post
                      bingo

                      problem with address data (at
                      least in this project) is that the incoming format is crap and doesn't join nicely to the PAF data
                      Yep, I think most people have been through this pain at some point. Especially when it's addresses that have been manually keyed in with no sense-checking at the point of entry. There's specialist data cleansing tools for address data which may save you a lot of pain. Dataflux does a good job of it (amongst other things).
                      "A life, Jimmy, you know what that is? It’s the s*** that happens while you’re waiting for moments that never come." -- Lester Freamon

                      Comment

                      Working...
                      X