
SKA news


    Originally posted by PAH View Post
    That sounds a bit daunting having to keep on top of that lot.
    It's fairly easy so long as you have a scalable crawler; the real problem is that a handful of very large sites have hundreds of millions of URLs on them.
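
    A minimal sketch of the sharding idea behind a scalable crawler (my own illustration, not a description of AtW's actual system, and the worker count is made up): hash each URL's host onto a fixed pool of worker queues, so the URLs of one enormous site land on the same few workers instead of swamping the whole fleet.

# Illustrative only: shard a crawl frontier across workers by hashing the host,
# so a site with hundreds of millions of URLs stays on "its" worker.
import hashlib
from collections import deque
from urllib.parse import urlparse

NUM_WORKERS = 100  # assumed fleet size, purely for illustration

def worker_for(url: str) -> int:
    """All URLs from the same host map to the same worker, which also makes
    per-host politeness delays straightforward to enforce."""
    host = urlparse(url).netloc.lower()
    return int(hashlib.md5(host.encode("utf-8")).hexdigest(), 16) % NUM_WORKERS

frontiers = [deque() for _ in range(NUM_WORKERS)]  # a real crawler would persist these

def enqueue(url: str) -> None:
    frontiers[worker_for(url)].append(url)

for u in ("http://example.com/a", "http://example.com/b", "http://huge-site.example.org/item/1"):
    enqueue(u)

print(sorted(len(q) for q in frontiers if q))  # e.g. [1, 2] - example.com URLs share one worker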



      Originally posted by AtW View Post
      Front end stuff is very easy to run in parallel very cheaply; it's the large-scale DB that is a problem for companies like Twitter, Facebook, Google et al.

      The real-time nature of Twitter certainly made it harder to implement than the usual batch processing; however, the inherent advantages of small text size and a write-once, read-many approach make their problem fairly trivial to solve.

      It's all really a matter of perspective: when you spend your own £50k on stuff like this you have to be smart, but when you want to raise hundreds of millions, making the problem easily solvable will backfire.
      Twitter and their ilk can also relax ACID constraints in certain ways compared to a more conventional DB, i.e. it doesn't matter if every user sees the latest tweets from every other user at the same time; no one will notice if they are delayed by 500ms or if they don't get the exact same ordering twice in a row. I think it's called BASE (as opposed to ACID).
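
      A toy sketch of that relaxation (entirely illustrative, nothing to do with Twitter's real architecture): writes land on a primary copy straight away, while the copy readers see catches up slightly later, which is exactly the few-hundred-millisecond lag nobody notices.

# Toy illustration of eventual consistency: reads hit a replica that lags
# slightly behind the primary, which is fine when nobody notices ~500ms delay.
class Timeline:
    def __init__(self):
        self.primary = []   # authoritative copy, updated on every write
        self.replica = []   # what readers see, updated lazily
        self.pending = []   # writes not yet applied to the replica

    def post(self, tweet: str) -> None:
        self.primary.append(tweet)
        self.pending.append(tweet)

    def replicate(self) -> None:
        """Background step: drain pending writes into the read replica."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def read(self) -> list:
        return list(self.replica)   # possibly stale, and that's acceptable

t = Timeline()
t.post("first tweet")
print(t.read())   # [] - the write isn't visible yet (a stale read)
t.replicate()     # in practice this lag is milliseconds, not a manual call
print(t.read())   # ['first tweet'] - the system converges
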
      While you're waiting, read the free novel we sent you. It's a Spanish story about a guy named 'Manual.'



        Originally posted by minestrone View Post
        You seem to think running websites is purely down to DB datasize.
        I think the problem is that you end up looking at all systems like your own, and the solution to your issues becomes the solution to their issues.

        Mind you, most issues seem to boil down to one of two areas:-

        getting data into the database
        getting data out of the database. The latter is more interesting: once you decide that 100%, right-this-millisecond accuracy isn't important, you can take a lot of shortcuts to speed up handling the data.
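
        A rough sketch of the first of those two (my own illustration, using SQLite and a made-up events table and batch size): buffer incoming rows and flush them in batches rather than inserting one row at a time.

# Illustrative sketch: buffer incoming rows and flush them to the database in
# batches, which is usually far cheaper than one INSERT per row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

BATCH_SIZE = 1000   # assumed threshold, tune to taste
buffer = []

def record(payload: str) -> None:
    buffer.append((payload,))
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush() -> None:
    """Write the whole buffer in one transaction instead of row-by-row."""
    if buffer:
        with conn:  # a single transaction for the batch
            conn.executemany("INSERT INTO events (payload) VALUES (?)", buffer)
        buffer.clear()

for i in range(2500):
    record(f"event {i}")
flush()  # push whatever is left over

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 2500
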
        merely at clientco for the entertainment



          Originally posted by doodab View Post
          Twitter and their ilk can also relax ACID constraints in certain ways compared to a more conventional DB
          Aye.

          Twitter does not even need to work 100% of the time!!!



            Originally posted by AtW View Post
            the real problem is that a handful of very large sites have hundreds of millions of URLs on them.
            Offer a cloud service where they can host their sites; then you don't need to go crawling them and you'll always be bang up to date.

            I wonder if Google or M$ have thought of that yet.
            Feist - 1234. One camera, one take, no editing. Superb. How they did it
            Feist - I Feel It All
            Feist - The Bad In Each Other (Later With Jools Holland)



              Originally posted by AtW View Post
              Front end stuff is very easy to run in parallel very cheaply.
              Bollocks.

              If I have 1 table with one text field of 140 chars, if that gets accessed 1 million times in 1 second, that is easier to run than 1 person accessing 140 million chars in one second.

              You talk the biggest pile of crap; truly you seem to know jack tulip, my simple, mathematically challenged friend.



                Originally posted by minestrone View Post
                If I have 1 table with one text field of 140 chars, if that gets accessed 1 million times in 1 second, that is easier to run than 1 person accessing 140 million chars in one second.
                Accessing 140 mln chars in one second would require about 1 Gbit of connectivity, and it's done trivially if you have the required bandwidth (and low enough latency).

                1 mln accesses to 140 chars over TCP/IP might actually be a more difficult problem if lots of separate IPs are involved, but in such a scenario having 100 cheap boxes would reduce the problem to 10k accesses per second each, which is doable.
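
                Spelling out the arithmetic behind those figures (assuming one byte per character and ignoring protocol overhead):

# Back-of-envelope check of the figures above (1 byte per char, no overhead).
chars_per_second = 140_000_000
print(chars_per_second * 8 / 1e9)   # 1.12 -> roughly a saturated 1 Gbit/s link

requests_per_second = 1_000_000     # 1 mln accesses to a 140-char record
boxes = 100                         # the "100 cheap boxes" above
print(requests_per_second / boxes)  # 10000.0 requests per second per box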



                  Amazon CTO Werner Vogels on eventual consistency

                  BASE: An Acid Alternative - ACM Queue

                  Scaling in games and virtual worlds - ACM Queue
                  While you're waiting, read the free novel we sent you. It's a Spanish story about a guy named 'Manual.'



                    Originally posted by AtW View Post
                    Accessing 140 mln chars in one second would require about 1 Gbit of connectivity, and it's done trivially if you have the required bandwidth (and low enough latency).

                    1 mln accesses to 140 chars over TCP/IP might actually be a more difficult problem if lots of separate IPs are involved, but in such a scenario having 100 cheap boxes would reduce the problem to 10k accesses per second each, which is doable.
                    Can I just ask what you think is more problematic for a web server?

                    "1 table with one text field of 140 chars if that gets accessed 1 million times in 1 second"

                    "1 person accessing 140 million chars in one second"



                      Originally posted by minestrone View Post
                      Can I just ask what you think is more problematic for a web server?

                      "1 table with one text field of 140 chars if that gets accessed 1 million times in 1 second"

                      "1 person accessing 140 million chars in one second"
                      You forget. AtW has solved his problem by using a hammer, so his immediate solution to all problems is now that hammer. And if it doesn't work, buy a bigger hammer.

                      To be honest that statement is true of most people. They will take a working solution and try and apply it to the next problem that comes along.

                      Edit to answer the question.

                      The second one could be a problem based on the size of the network connection.
                      The first one is a problem for the database but less so if the database keeps recent statements in memory.

                      Memcache solves the first issue very well. Stackoverflow halved the number of machines they require by caching database results for 3 seconds. I'm sure most popular sites would do the same.
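
                      A minimal sketch of that sort of short-lived cache (illustrative only; the query function is a placeholder and the 3-second TTL just echoes the Stack Overflow figure): keep the last result around and serve it until it expires, so the database only sees one query every few seconds.

# Illustrative 3-second result cache in front of a database query.
import time

TTL_SECONDS = 3.0
_cache = {}  # key -> (expiry_time, result)

def expensive_query(key: str):
    """Stand-in for a real database call."""
    time.sleep(0.05)  # pretend this is slow
    return f"result for {key} at {time.time():.0f}"

def cached_query(key: str):
    now = time.monotonic()
    hit = _cache.get(key)
    if hit and hit[0] > now:
        return hit[1]                  # serve the slightly stale answer
    result = expensive_query(key)
    _cache[key] = (now + TTL_SECONDS, result)
    return result

print(cached_query("front page"))  # misses, hits the database
print(cached_query("front page"))  # served from cache for the next 3 seconds
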
                      Last edited by eek; 9 September 2011, 15:14.
                      merely at clientco for the entertainment

