
Snail mode for simulating vast databases when using a small test database


    #1

    It suddenly occurred to me that it would be very useful to be able to configure a database connection in which the server deliberately slows statement execution, depending on the key structure and the complexity of the query. Apps could then be tested on a modest-sized test database as if they were running against a production database of millions of records, and defective or sub-optimal table keys or SQL statements could be identified that way. I've never come across anything like that, though.
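    Something along those lines could even be lashed up client-side. A minimal sketch, using Python with the built-in sqlite3 module purely for illustration (SnailConnection and the per-scan penalty are made-up names and numbers): it asks the planner for its query plan and sleeps once per full table scan, so unindexed access feels expensive even on a tiny table.

```python
import sqlite3
import time

PENALTY_PER_SCAN = 0.5  # seconds added per full table scan (made-up value)

class SnailConnection:
    """Wrap a SQLite connection and penalise unindexed access."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)

    def execute(self, sql, params=()):
        # Ask the planner how it intends to run the statement.
        plan = self.conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
        # The last column of each plan row is a detail string such as
        # "SCAN t" (full table scan) or "SEARCH t USING INDEX ..." (indexed).
        scans = sum(1 for row in plan if row[-1].startswith("SCAN"))
        time.sleep(scans * PENALTY_PER_SCAN)  # the deliberate go-slow
        return self.conn.execute(sql, params)

conn = SnailConnection(":memory:")
conn.conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
conn.execute("SELECT * FROM t WHERE val = 'x'")  # unindexed, so penalised
```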
    Work in the public sector? Read the IR35 FAQ here

    #2
    Originally posted by OwlHoot View Post
    It suddenly occurred to me that it would be very useful to be able to configure a database connection in which the server deliberately slows statement execution, depending on the key structure and the complexity of the query. Apps could then be tested on a modest-sized test database as if they were running against a production database of millions of records, and defective or sub-optimal table keys or SQL statements could be identified that way. I've never come across anything like that, though.
    If such a thing were possible, wouldn't it be easier just to flag up the sub-optimal SQL in the first place, without the fake slow execution times?
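    That alternative is just as easy to sketch: the same planner query, but a warning instead of a fake delay (again sqlite3 purely for illustration; check_sql is a hypothetical helper):

```python
import sqlite3

def check_sql(conn, sql, params=()):
    """Warn about full-table scans in a statement's query plan."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    for row in plan:
        detail = row[-1]
        if detail.startswith("SCAN") and "INDEX" not in detail:
            print(f"WARNING: full scan in plan step: {detail!r}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer INTEGER)")
check_sql(conn, "SELECT * FROM orders WHERE customer = ?", (42,))
# prints a warning, since there is no index on the customer column
```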

      #3
      You can buy my old ZX81 if you like. 500 quid.
      bloggoth

      If everything isn't black and white, I say, 'Why the hell not?'
      John Wayne (My guru, not to be confused with my beloved prophet Jeremy Clarkson)

        #4
        Originally posted by SpontaneousOrder View Post
        If such a thing were possible, wouldn't it be easier just to flag up the sub-optimal SQL in the first place, without the fake slow execution times?
        In many cases, yes, but the SQL isn't always at fault, or even sub-optimal relative to a given table and key structure, if it runs without a syntax error and does what its creator intended.

        A simple SELECT is fine, and will run like lightning on a table with a hundred records and no keys. But try running the same SQL on a table with a hundred million records. That is a simplistic example, but one could envisage a complicated statement involving joins across several tables, which again could be perfectly sound SQL, and optimal for the database structure as it stands, and yet it could be hard, if not impossible, to anticipate how it would scale to huge tables.
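        The effect is easy to demonstrate with a toy benchmark (sqlite3 purely for illustration; the row count is arbitrary): the same SELECT is near-instant with an index and visibly slower without one once the table grows.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 ((i, i % 1000) for i in range(1_000_000)))

def timed(sql):
    start = time.perf_counter()
    conn.execute(sql).fetchall()
    return time.perf_counter() - start

before = timed("SELECT * FROM t WHERE val = 42")   # full table scan
conn.execute("CREATE INDEX idx_val ON t(val)")
after = timed("SELECT * FROM t WHERE val = 42")    # indexed search
print(f"no index: {before:.4f}s  with index: {after:.4f}s")
```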

        So the proposed snail mode would just be, as it were, a useful way of prototyping the database structure and/or the SQL, to check that statements would scale well.

        But it was just a passing idea, which I'm not interested in defending to the last ditch if nobody sees any potential use for it.
        Last edited by OwlHoot; 18 March 2015, 18:54.
        Work in the public sector? Read the IR35 FAQ here

          #5
          Originally posted by OwlHoot View Post
          In many cases, yes, but the SQL isn't always at fault, or even sub-optimal relative to a given table and key structure, if it runs without a syntax error and does what its creator intended.

          A simple SELECT is fine, and will run like lightning on a table with a hundred records and no keys. But try running the same SQL on a table with a hundred million records. That is a simplistic example, but one could envisage a complicated statement involving joins across several tables, which again could be perfectly sound SQL, and optimal for the database structure as it stands, and yet it could be hard, if not impossible, to anticipate how it would scale to huge tables.

          So the proposed snail mode would just be, as it were, a useful way of prototyping the database structure and/or the SQL, to check that statements would scale well.

          But it was just a passing idea, which I'm not interested in defending to the last ditch if nobody sees any potential use for it.
          I might be missing the point, but I was wondering: how would it know whether to slow down the query or not?
          I.e., if the increase in execution time as the table sizes scale isn't a linear progression, then, as you say, whether the SQL is good or not depends on the size and distribution of the data. And if you could predict how it would respond under different table sizes and data distributions (in order to reflect that), then why would you bother actually running the queries for real?

          If it's a case of just slowing everything down, so that the impact of complicated queries is exaggerated enough to be noticeable with small data sets, then you don't get the non-linear performance degradation.
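          One rough answer to that: don't predict, measure. Run the same query at two test sizes, fit a growth exponent through the two timings, and extrapolate. A minimal sketch, assuming the cost is roughly polynomial (t ≈ c·n^k; all the numbers below are made up):

```python
import math

def extrapolate(t_small, n_small, t_big, n_big, n_prod):
    """Fit t = c * n**k through two measured points and extrapolate."""
    k = math.log(t_big / t_small) / math.log(n_big / n_small)
    c = t_small / n_small**k
    return c * n_prod**k, k

# e.g. 0.02s at 10k rows and 0.05s at 20k rows gives exponent ~1.32,
# which extrapolates to roughly an hour at 100M rows.
est, k = extrapolate(0.02, 10_000, 0.05, 20_000, 100_000_000)
print(f"exponent ~{k:.2f}, estimated time at 100M rows: {est:.0f}s")
```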

            #6
            Surely it'd only cost a few bob to run up an Amazon RDS server (or equivalent if you use another provider) of a suitable size, fill it with a few hundred million randomly-generated records (pre-generated, of course - no point running the DB until you have the data), and run your queries against that. The biggest DB server they offer (db.r3.8xlarge) costs $3.78/hour. IIRC intra-datacentre data transfer is free. So generate your however-many-GB of data using a small-and-cheap EC2 instance (1.3 cents per hour, much cheaper if you use spot instances) which writes it to a file in S3 (3 cents per GB per month at top price), fire up the DB server and import the data from S3, run your queries, then shut down the database server and either delete the data or (maybe better) move it to Glacier, where it's dirt-cheap to store (1 cent per GB per month) but needs a few hours' notice to access again.

            You could probably achieve the entire objective for about £20, and a couple of quid a month if you wanted to keep the random data around. It might be cheaper to regenerate a random dataset each time; you'd know after the first time.
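            For what it's worth, the back-of-envelope arithmetic, using only the prices quoted above (the 8 hours of DB time, 40 hours of generation and 200 GB of data are made-up assumptions):

```python
# Prices taken from the figures quoted above; the workload (8 hours of
# DB time, 40 hours of generation, 200 GB of data) is a made-up assumption.
db_hours, db_rate = 8, 3.78          # db.r3.8xlarge at $3.78/hour
gen_hours, gen_rate = 40, 0.013      # small EC2 instance at 1.3 cents/hour
data_gb = 200
s3_month = data_gb * 0.03            # S3 at 3 cents/GB/month
glacier_month = data_gb * 0.01       # Glacier at 1 cent/GB/month

one_off = db_hours * db_rate + gen_hours * gen_rate
print(f"one-off: ${one_off:.2f}, S3: ${s3_month:.2f}/mo, "
      f"Glacier: ${glacier_month:.2f}/mo")
# one-off: $30.76, S3: $6.00/mo, Glacier: $2.00/mo, which lines up with
# the "about £20" and "couple of quid a month" above at 2015 exchange rates.
```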

              #7
              Originally posted by NickFitz View Post
              Surely it'd only cost a few bob to run up an Amazon RDS server (or equivalent if you use another provider) of a suitable size, fill it with a few hundred million randomly-generated records (pre-generated, of course - no point running the DB until you have the data), and run your queries against that. The biggest DB server they offer (db.r3.8xlarge) costs $3.78/hour. IIRC intra-datacentre data transfer is free. So generate your however-many-GB of data using a small-and-cheap EC2 instance (1.3 cents per hour, much cheaper if you use spot instances) which writes it to a file in S3 (3 cents per GB per month at top price), fire up the DB server and import the data from S3, run your queries, then shut down the database server and either delete the data or (maybe better) move it to Glacier, where it's dirt-cheap to store (1 cent per GB per month) but needs a few hours' notice to access again.

              You could probably achieve the entire objective for about £20, and a couple of quid a month if you wanted to keep the random data around. It might be cheaper to regenerate a random dataset each time; you'd know after the first time.
              ^ this

              Combined with generatedata.com or Red Gate's SQL Data Generator, a data generator for MS SQL Server databases.

              Sorted.
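              Or, if you'd rather roll your own, a few lines will stream random rows to a CSV ready for bulk import (a Python sketch; the column layout is a made-up example, and the row count is scaled down from the "few hundred million" above):

```python
import csv
import random
import string

def random_name(rng, length=8):
    return "".join(rng.choices(string.ascii_lowercase, k=length))

rng = random.Random(42)  # seeded so the dataset is reproducible
with open("customers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(1_000_000):  # scale up as far as your budget allows
        writer.writerow([i, random_name(rng), rng.randint(18, 90),
                         rng.choice(["UK", "US", "DE", "FR"])])
```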

                #8
                - Reduce the DB buffer cache.
                - Reduce DB server O/S memory.
                - Reduce DB I/O performance by running some other crappy program at the same time on the DB server.
                - Use NFS for DB files.
                - Copy the table stats from a really big version of the same DB, making the statement optimiser of your small DB choose sub-optimal access paths (a rough sketch of this one follows below).

                Just some options I could think of that I've seen in the past.
                Especially a poorly configured NFS setup. Classic.
                Oh yes, and antivirus scanning the DB files.
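                A rough sketch of that last trick on PostgreSQL (assumes the psycopg2 driver, superuser rights and a made-up DSN; rather than importing real production stats, it simply inflates the row and page counts the planner reads from pg_class, which is cruder but gives the same "plan as if huge" effect until the next ANALYZE or autovacuum overwrites it):

```python
import psycopg2

conn = psycopg2.connect("dbname=testdb user=postgres")  # made-up DSN
cur = conn.cursor()
# Lie to the planner: claim the table has 100M rows over 2M pages.
cur.execute("""UPDATE pg_class
               SET reltuples = 100000000, relpages = 2000000
               WHERE relname = 'orders'""")
conn.commit()
# EXPLAIN now shows plans chosen as if 'orders' were production-sized.
cur.execute("EXPLAIN SELECT * FROM orders WHERE customer_id = 42")
for (line,) in cur.fetchall():
    print(line)
```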
                Don't believe it, until you see it!

                  #9
                  Originally posted by DimPrawn View Post
                  The Red Gate product looks interesting, thanks for the link. It goes without saying, so I'm going to say it: you need to understand your data before generating test records. If your data is too generic then your queries may not be selective enough to use indexes, and you may see performance issues you would not get in production. Or if your test data is too random then indexes may be used that would not be used in production.
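                  To make the distribution point concrete, compare how selective the same predicate is against uniform versus skewed data (a toy illustration; the 80/20 skew is a made-up assumption):

```python
import random

rng = random.Random(1)
N = 1_000_000

# Uniform: every customer id is equally likely.
uniform = [rng.randrange(10_000) for _ in range(N)]
# Skewed: 80% of rows belong to the 20 busiest customers.
skewed = [rng.randrange(20) if rng.random() < 0.8 else rng.randrange(10_000)
          for _ in range(N)]

for name, col in [("uniform", uniform), ("skewed", skewed)]:
    hits = sum(1 for v in col if v == 5)
    print(f"{name}: id=5 matches {hits} rows ({100 * hits / N:.2f}%)")
# Uniform matches ~0.01% of rows; skewed matches ~4%. An index on the
# skewed column is useless for the hot values, so a test DB filled with
# uniform data can hide real production table scans.
```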

                  A slight side-track: recently I've been using SQL Monitor from Red Gate to monitor the production DB. You can compare performance against a baseline and identify anything newly introduced that causes a spike in performance. It might be a bit after the fact, but if you're working in a team, getting everyone to performance-test their code/queries against generated data while simulating production usage is very difficult.

                    #10
                    Sticking data on the cloud won't fly for most organisations that have large databases.

                    Also many companies have test and dev databases that are considerably smaller than production.

                    I can see the relevance of what OwlHoot is suggesting.

                    "But this page worked so much faster when we tested it before the release" is not an uncommon complaint.

                    Run the query on prod, stick that time 'estimate' into the DEV and UAT phases, and give the users a real feel for how it will perform on prod.
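                    A minimal sketch of that idea (the timing table and the names are made up; each query in DEV/UAT is padded with a sleep until its wall-clock time matches the figure measured on prod):

```python
import time

# Made-up table of wall-clock times measured once on production.
PROD_TIMINGS = {"report_monthly_sales": 12.0, "customer_lookup": 0.3}

def run_like_prod(name, query_fn):
    """Run query_fn, then sleep so total time matches the prod estimate."""
    start = time.perf_counter()
    result = query_fn()
    elapsed = time.perf_counter() - start
    time.sleep(max(0.0, PROD_TIMINGS.get(name, 0.0) - elapsed))
    return result

# e.g. in UAT: rows = run_like_prod("report_monthly_sales",
#                                   lambda: db.run(sql))  # hypothetical db
```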
