SKA news

**AtW** · 9 September 2011, 14:33

Originally posted by PAH View Post

That sounds a bit daunting having to keep on top of that lot.

It's fairly easy so long as you have scalable crawler, the real problem is that handful of very large sites have got hundreds of millions of URLs on them.

**doodab** · 9 September 2011, 14:42

Originally posted by AtW View Post

Front end stuff is very easy to run in parallel very cheaply, it's the large scale DB that is a problem for companies like Twitter, Facebook, Google et al.

Real time nature of Twitter certainly made it harder to implement than usual batch processing however inherent advantage in terms of small text size and write once read many times approach make their problem fairly trivial to solve.

It's all really matter of perspective - when you spend your own £50k on stuff like this you'd have to be smart, but when you want to raise hundreds of millions making problem easily solveable will backfire.

Twitter and their ilk can also relax ACID constraints in certain ways compared to a more conventional DB i.e. it doesn't matter if every user sees the latest tweets from every other user at the same time, no one will notice if they are delayed by 500ms or they don't get the exact same ordering twice in a row. I think it's called BASE (as opposed to ACID).

**eek** · 9 September 2011, 14:44

Originally posted by minestrone View Post

You seem to think running websites is purely down to DB datasize.

I think the problem is that you end up looking at all systems like your own and the solution to your issues become the solution to their issues.

Mind you most issues seem to boil down to one of two areas:-

getting data into database
getting data out of database. The latter is more interesting as once you decide that 100% this millisecond accuracy isn't important you can take a lot of short cuts to speed up handling the data.

**AtW** · 9 September 2011, 14:45

Originally posted by doodab View Post

Twitter and their ilk can also relax ACID constraints in certain ways compared to a more conventional DB

Aye.

Twitter does not even need to work 100% of the time!!!

**PAH** · 9 September 2011, 14:46

Originally posted by AtW View Post

the real problem is that handful of very large sites have got hundreds of millions of URLs on them.

Offer a cloud service where they can host their sites then you don't need to go crawling them, you'll always be bang up to date.

I wonder if Google or M$ have thought of that yet.

**minestrone** · 9 September 2011, 14:46

Originally posted by AtW View Post

Front end stuff is very easy to run in parallel very cheaply.

Bollocks.

If I have 1 table with one text field of 140 chars if that gets accessed 1 million times in 1 second that is easier to run than 1 person accessing 140 million chars in one second.

You talk the biggest pile of crap, truly you seem to know jack tulip my simple mathematically challenged friend.

**AtW** · 9 September 2011, 14:54

Originally posted by minestrone View Post

If I have 1 table with one text field of 140 chars if that gets accessed 1 million times in 1 second that is easier to run than 1 person accessing 140 million chars in one second.

Accessing 140 mln chars in one second would require 1 Gbit connectivity and it's done trivially if you have required bandwidth (and low enough latency).

1 mln accesses to 140 charts over TCP/IP might actually be more difficult problem if lots of separate IPs are involved but in such scenarios having 100 cheap boxes would reduce the problem to 10k accesses each per second which is doable.

**doodab** · 9 September 2011, 14:58

amazon CTO Werner Vogels on eventual consistency

BASE: An Acid Alternative - ACM Queue

Scaling in games and virtual worlds - ACM Queue

**minestrone** · 9 September 2011, 15:02

Originally posted by AtW View Post

Accessing 140 mln chars in one second would require 1 Gbit connectivity and it's done trivially if you have required bandwidth (and low enough latency).

1 mln accesses to 140 charts over TCP/IP might actually be more difficult problem if lots of separate IPs are involved but in such scenarios having 100 cheap boxes would reduce the problem to 10k accesses each per second which is doable.

Can I just ask what you think is more problematic for a web server.

"1 table with one text field of 140 chars if that gets accessed 1 million times in 1 second"

"1 person accessing 140 million chars in one second"

**eek** · 9 September 2011, 15:09

Originally posted by minestrone View Post

Can I just ask what you think is more problematic for a web server.

"1 table with one text field of 140 chars if that gets accessed 1 million times in 1 second"

"1 person accessing 140 million chars in one second"

You forget. ATW has solved his problem by using a hammer so his immediate solution to all problems is now that hammer. And if it doesn't work to buy a bigger hammer.

To be honest that statement is true of most people. They will take a working solution and try and apply it to the next problem that comes along.

Edit to answer the question.

The second one could be a problem based on the size of the network connection.
The first one is a problem for the database but less so if the database keeps recent statements in memory.

Memcache solves the first issue very well. Stackoverflow halved the number of machines they require by caching database results for 3 seconds. I'm sure most popular sites would do the same.

SKA news

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Partners

Advertisers

Contractor Services

CUK News

SKA news

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Partners

Advertisers

Contractor Services

CUK News

Tag Cloud