
Storing 4M+ json snippets



    Hi all,

    I have a script that generates 4 million+ snippets of JSON.
    What would you agree is the best way to store them?

    I have considered:
    - JSON files: far too many, even if split into subfolders
    - MongoDB
    - MySQL with a JSON-type field

    The aim is to be able to retrieve them easily for processing later.
    The language is Python.

    Bonus question: each of these 4M+ JSON snippets will have 10-15 related images. Where and how do you store those?
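For what it's worth, plain files can be made workable by sharding into subfolders by hash prefix, so no single directory gets huge. A rough sketch (the `snippet_path` helper and key names are made up for illustration):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def snippet_path(root: Path, key: str) -> Path:
    """Shard into ab/cd/<key>.json using the first 4 hex chars of a SHA-1 of the key."""
    h = hashlib.sha1(key.encode()).hexdigest()
    return root / h[:2] / h[2:4] / f"{key}.json"

root = Path(tempfile.mkdtemp())  # stand-in for the real storage root
path = snippet_path(root, "item-000123")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps({"id": "item-000123"}))

loaded = json.loads(path.read_text())
```

With two-level sharding you get 256 x 256 = 65,536 buckets, so 4M files works out to roughly 60 per directory.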

    #2
    Your bonus question points to the only sane answer to your question, but you haven't got there yet.
    merely at clientco for the entertainment

    Comment


      #3
      I would vote for MongoDB for this use case. Because BSON is MongoDB's native document format, you can parse each snippet and store it as a queryable object.

      Of course, if you just want to store them as strings, you can choose pretty much any SQL database you like. One option, if you don't want a server-based DB, is to insert them into a local SQLite database.

      SQLite Home Page
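A minimal sketch of the SQLite route, assuming a single `snippets` table (the schema and names are illustrative, not from the thread):

```python
import json
import sqlite3

# One TEXT column holding each raw JSON snippet
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snippets (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

snippets = [{"n": i} for i in range(3)]  # stand-ins for the generated JSON
conn.executemany(
    "INSERT INTO snippets (body) VALUES (?)",
    [(json.dumps(s),) for s in snippets],
)
conn.commit()

# If your SQLite build bundles the JSON1 functions (most modern ones do),
# you can query inside the stored text rather than treating it as opaque.
row = conn.execute(
    "SELECT json_extract(body, '$.n') FROM snippets WHERE id = 2"
).fetchone()
```

A single .db file also sidesteps the "millions of tiny files" problem entirely.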
      I design idiot proof software. Trouble is, they keep making better idiots.

      Comment


        #4
        Hi SAS

        Comment


          #5
          I've done this exact thing with cloud storage, on both Azure and S3.

          About 600k folders with about 10 images and 15 json files in each.

          Comment


            #6
            Postgres + JSONB

            Comment


              #7
              Azure Table Storage for the Json snippets.
              Azure Blob container for the images.

              Link the json to the images using columns in the Azure table (one column for each image blob key).

              Use Python via Azure Functions to manipulate the data if you want 'serverless', i.e. M$ handles the infrastructure, availability and backup. Though regular syncing to a local or alternative cloud backup is a good idea.

              Not sure how much it may cost, so use the Azure pricing calculator based on your estimates for an idea.

              Sorted.
              Maybe tomorrow, I'll want to settle down. Until tomorrow, I'll just keep moving on.

              Comment


                #8
                Step out of the old school and consider using

                - Azure Data Lake Storage (Gen2)
                - AWS S3
                - GCP Cloud Storage / Filestore.

                If you want a copy kept without your permission and free of charge, then go for Alibaba Cloud
                Last edited by BigDataPro; 23 September 2020, 15:59.

                Comment


                  #9
                  Originally posted by anim View Post
                  If you want to pack them away in the database pronto, but aren't too bothered about retrieval speed, then Cassandra would be a good choice.

                  Edit: It's free (open source) and by now fairly mature
                  Work in the public sector? Read the IR35 FAQ here

                  Comment


                    #10
                    Originally posted by anim View Post
                    We don't know how your app works.
                    If you access the image first and then go looking for the related JSON, you could embed the JSON in the image and save the subsequent lookup.
                    Google "steganography".
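To make the steganography idea concrete, here's a toy least-significant-bit embed/extract over a fake pixel buffer (pure Python; a real version would read and write actual image files, e.g. via Pillow):

```python
import json

def embed(pixels: bytearray, payload: bytes) -> bytearray:
    """Hide payload bits in the least-significant bit of successive pixel bytes."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
    out = bytearray(pixels)
    for idx, bit in enumerate(bits):
        out[idx] = (out[idx] & 0xFE) | bit
    return out

def extract(pixels: bytearray, n_bytes: int) -> bytes:
    """Recover n_bytes previously hidden by embed()."""
    out = bytearray()
    for b in range(n_bytes):
        byte = 0
        for i in range(8):
            byte |= (pixels[b * 8 + i] & 1) << i
        out.append(byte)
    return bytes(out)

payload = json.dumps({"id": 42}).encode()
pixels = bytearray(range(256)) * 2   # fake 512-byte pixel buffer
stego = embed(pixels, payload)
recovered = json.loads(extract(stego, len(payload)))
```

Flipping only the low bit leaves the image visually unchanged, which is the whole trick; you do need to store the payload length (or a terminator) somewhere to extract it later.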
                    Don't believe it, until you see it!

                    Comment
