
Amazon S3 Bucket retrieval

    #1

    Anyone here got experience with Amazon S3 buckets?

    I'm writing an integration workflow that will pull objects out of S3 on a polled basis. The buckets are being filled independently, and I just need to get all new objects I haven't yet retrieved. It looks like the argument to use is withStartAfter from ListObjectsV2Request. That's not a problem in itself, but am I guaranteed that S3 returns keys in creation order? If not, I can't see the point of withStartAfter, and I'll probably just have to retrieve all keys every time, which is inefficient! I'm concerned because in a test the retrieval was clearly not in creation order, but by folder and then by name.
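    For what it's worth, ListObjectsV2 returns keys in ascending UTF-8 binary (i.e. lexicographic) order of key name, not creation order, so withStartAfter only helps if the key names themselves sort by creation time (e.g. a timestamp prefix). A minimal sketch of the paginated listing loop with the AWS SDK for Java 1.x; the bucket name and start-after key are made up:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ListObjectsV2Request;
    import com.amazonaws.services.s3.model.ListObjectsV2Result;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class ListAfter {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            // Bucket name and start-after key are hypothetical.
            ListObjectsV2Request req = new ListObjectsV2Request()
                    .withBucketName("my-bucket")
                    .withStartAfter("2020/12/21/last-key-seen");
            ListObjectsV2Result result;
            do {
                result = s3.listObjectsV2(req);
                for (S3ObjectSummary s : result.getObjectSummaries()) {
                    // Keys arrive in ascending key-name order, not creation order.
                    System.out.println(s.getKey() + "  " + s.getLastModified());
                }
                // Responses are paged at up to 1,000 keys; follow the token.
                req.setContinuationToken(result.getNextContinuationToken());
            } while (result.isTruncated());
        }
    }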

    #2
    That won’t work; see "C# AWS S3 - List objects created before or after a certain time" on Stack Overflow.
    merely at clientco for the entertainment



      #3
      Yes, it doesn't, so I'm having to pull the full object list back each time, but I've optimised identification of what's already been processed so I only retrieve the data for new objects. Not ideal, but it works. I just hope I find a more efficient approach before the volumes get too big: there are over 100K files in there already for half a year's worth of data, so it'll only grow over time. Unfortunately I've no option to get at the data earlier in the process, before it's put into S3.
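      A minimal sketch of that shape with the AWS SDK for Java 1.x; the bucket name is made up, and the processed-key set is in-memory here but would need persisting between polls:

      import com.amazonaws.services.s3.AmazonS3;
      import com.amazonaws.services.s3.AmazonS3ClientBuilder;
      import com.amazonaws.services.s3.model.ListObjectsV2Request;
      import com.amazonaws.services.s3.model.ListObjectsV2Result;
      import com.amazonaws.services.s3.model.S3Object;
      import com.amazonaws.services.s3.model.S3ObjectSummary;
      import java.util.HashSet;
      import java.util.Set;

      public class PollNewObjects {
          public static void main(String[] args) throws Exception {
              AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
              String bucket = "my-bucket";              // hypothetical
              Set<String> processed = new HashSet<>();  // persist between polls in real use

              ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(bucket);
              ListObjectsV2Result result;
              do {
                  result = s3.listObjectsV2(req);
                  for (S3ObjectSummary s : result.getObjectSummaries()) {
                      if (processed.add(s.getKey())) {  // add() is false for keys already seen
                          try (S3Object obj = s3.getObject(bucket, s.getKey())) {
                              // ... process obj.getObjectContent() here ...
                          }
                      }
                  }
                  req.setContinuationToken(result.getNextContinuationToken());
              } while (result.isTruncated());
          }
      }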



        #4
        Originally posted by tazdevil
        Yes, it doesn't, so I'm having to pull the full object list back each time, but I've optimised identification of what's already been processed so I only retrieve the data for new objects. Not ideal, but it works. I just hope I find a more efficient approach before the volumes get too big: there are over 100K files in there already for half a year's worth of data, so it'll only grow over time. Unfortunately I've no option to get at the data earlier in the process, before it's put into S3.
        Do they need to be stored where they are after you've pulled them, or could you move them to, say, a 'processed' bucket?
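        S3 has no native move, so 'moving' would be a copy followed by a delete. A sketch, with made-up bucket names:

        import com.amazonaws.services.s3.AmazonS3;

        class MoveToProcessed {
            // Shift a processed object out of the polled bucket so the next
            // listing only returns unprocessed keys. S3 has no atomic rename,
            // hence copy to the destination, then delete the source.
            static void move(AmazonS3 s3, String key) {
                s3.copyObject("incoming-bucket", key, "processed-bucket", key);
                s3.deleteObject("incoming-bucket", key);
            }
        }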
        merely at clientco for the entertainment



          #5
          If you have access to the AWS account, would setting up an S3 event notification that triggers a Lambda or pushes to an SQS queue be an option for you?

          Configuring Amazon S3 event notifications - Amazon Simple Storage Service
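          If that's an option, the polling could be replaced by a handler along these lines; a minimal sketch using the aws-lambda-java-events library, with a made-up class name:

          import com.amazonaws.services.lambda.runtime.Context;
          import com.amazonaws.services.lambda.runtime.RequestHandler;
          import com.amazonaws.services.lambda.runtime.events.S3Event;
          import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;

          // Invoked by an s3:ObjectCreated:* notification on the bucket.
          public class NewObjectHandler implements RequestHandler<S3Event, Void> {
              @Override
              public Void handleRequest(S3Event event, Context context) {
                  for (S3EventNotificationRecord record : event.getRecords()) {
                      String bucket = record.getS3().getBucket().getName();
                      String key = record.getS3().getObject().getUrlDecodedKey();
                      context.getLogger().log("New object: s3://" + bucket + "/" + key);
                      // ... fetch and process the object here ...
                  }
                  return null;
              }
          }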



            #6
            Life will be much easier if you can make a simple design change:

            1. Write new objects into a separate staging location (e.g. temp-bucket).
            2. When the copy into temp-bucket is complete, write a dummy marker file (e.g. 'success') to indicate a successful copy.
            3. Set up an S3 event that fires on creation of the 'success' file (event notifications can be filtered by key name prefix and suffix).
            4. Write a Lambda that is triggered by the event from step 3 (sketched below).
            5. The Lambda function does whatever processing is required and finally deletes the contents of temp-bucket.

            This is useful for batch loads. If you are receiving a continuous stream of data, you'll need a different approach.

            Hope this helps.
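            A sketch of steps 4 and 5, assuming the event notification is filtered to the 'success' key; the class name is made up and error handling is omitted:

            import com.amazonaws.services.lambda.runtime.Context;
            import com.amazonaws.services.lambda.runtime.RequestHandler;
            import com.amazonaws.services.lambda.runtime.events.S3Event;
            import com.amazonaws.services.s3.AmazonS3;
            import com.amazonaws.services.s3.AmazonS3ClientBuilder;
            import com.amazonaws.services.s3.model.ListObjectsV2Request;
            import com.amazonaws.services.s3.model.ListObjectsV2Result;
            import com.amazonaws.services.s3.model.S3ObjectSummary;

            public class BatchLoadHandler implements RequestHandler<S3Event, Void> {
                private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

                @Override
                public Void handleRequest(S3Event event, Context context) {
                    String bucket = "temp-bucket";  // staging location from step 1
                    ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(bucket);
                    ListObjectsV2Result result;
                    do {
                        result = s3.listObjectsV2(req);
                        for (S3ObjectSummary s : result.getObjectSummaries()) {
                            if (!s.getKey().equals("success")) {
                                // ... step 5: process the object here ...
                            }
                            s3.deleteObject(bucket, s.getKey());  // then empty temp-bucket
                        }
                        req.setContinuationToken(result.getNextContinuationToken());
                    } while (result.isTruncated());
                    return null;
                }
            }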
            Last edited by BigDataPro; 21 December 2020, 13:33.
