Website: http://www.ryanpark.org
Original page: http://www.ryanpark.org/2008/04/top-10-avoid-the-simpledb-hype.html
That's exactly right. If you are creating a toy app, you don't need to scale. If you are creating a toy app, you should use a toy db, i.e., an rdbms.
Google is not the only one that needs a scalable db. /ANYONE/ who ever hopes to have > 100,000 users is going to start running into scalability problems and eventually face the reality of the SHARDING NIGHTMARE if they use an RDBMs.
And if you are making something for < 100,000 users, you really probably ought to just stop now, shouldn't you?
Sorry, you can't get around it. MUCH BETTER to go in with the assumption that you'll have these problems (because, hey, you are creating something that's going to be successful, right?) and plan from day 1 to deal with them.
That's one of the beauties of couch/simple/bigtable -- you can't hide behind some empty promise from some big RDBM vendor. You have to face the truth from the start.
And you know what? The truth isn't so bad. It's quite elegant, actually.
Yahoo!, eBay, Facebook, etc scale their RDBMSs by doing the same thing that SimpleDB or BigTable do internally: by sharding the data down to finer and finer levels of keyspace as the number of machines grows. Except, with an RDBMS, this is a manual process. They also use read-only slaves to distribute read load, something also implicit in BigTable/SimpleDB/etc with their use of simple block replication without the concurrent use of erasure coding (e.g. RAID). RAID is also external to the RDBMS, requiring you to manage both disparately.
Also, Oracle can scale up to 64 nodes at max with a clustered filesystem (this is somewhat old, it might be 128 by now). Google has ~650,000 machines in their clusters. This is 4 orders of magnitude difference. No one has enough money to pay Oracle to scale to this level. Yahoo! and Facebook have gotten MySQL to run on more boxes than this, but not in a cluster, so they (like you and everyone else) are stuck with the manual process of sharding and shard management.
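For readers who have not done it by hand, here is a minimal sketch of what "sharding the keyspace down to finer and finer levels" might look like in application code. It is an illustration only: `RangeShardedStore`, its methods, and the use of in-memory dicts as stand-ins for database servers are all assumptions made for this example, not how SimpleDB, BigTable, or any particular MySQL deployment actually works.

```python
import bisect


class RangeShardedStore:
    """Toy range-sharded key-value store (hypothetical, in-memory only).

    Shard i holds keys k with split_points[i-1] <= k < split_points[i].
    Each dict stands in for a separate database server.
    """

    def __init__(self, split_points):
        self.split_points = sorted(split_points)
        self.shards = [dict() for _ in range(len(self.split_points) + 1)]

    def shard_index(self, key):
        # bisect_right sends a key equal to a split point to the shard on its right.
        return bisect.bisect_right(self.split_points, key)

    def put(self, key, value):
        self.shards[self.shard_index(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_index(key)].get(key)

    def split(self, new_point):
        """Split one shard in two at new_point and move the affected keys."""
        if new_point in self.split_points:
            return  # already a boundary, nothing to do
        idx = bisect.bisect_right(self.split_points, new_point)
        old = self.shards[idx]
        left = {k: v for k, v in old.items() if k < new_point}
        right = {k: v for k, v in old.items() if k >= new_point}
        self.split_points.insert(idx, new_point)
        self.shards[idx:idx + 1] = [left, right]


store = RangeShardedStore(["m"])       # two shards: keys < "m" and keys >= "m"
for name in ("alice", "greg", "zed"):
    store.put(name, {"name": name})
store.split("f")                       # the first shard got too big, so split it
```

With an RDBMS, the `split` step and the routing in `shard_index` are the parts you end up writing and operating yourself, which is the "manual process" being complained about above.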
If you don't expect to grow, by all means, continue to live in the RDBMS past. However, if your app may be subject to rapid growth (e.g. a Facebook app, Salesforce app, GAE app, pretty much any Web-facing app) you should definitely be thinking about how to leverage SimpleDB/HBase/Hypertable/CouchDB/etc in your design. I see a lot of posts of this nature lately and they all seem to be coming from the initial shock of what you *can't* do with these systems. Give them a shot and see what you *CAN* do with some semi-clever design and you might be surprised.
Here's why you /should/ be scared. Google (and to a lesser degree Amazon) /already/ run on this new breed of DB. And their apps /work/. And their apps /scale/, massively. No stupid sharding or rdbms babysitting required.
Sorry, but if you think traditional rdbms scale without problems, you either have 0 experience with large systems or you are being disingenuous.
Surely you can't do a group by if you use shards?
That said, Oracle does have an in-memory, key-value pair based system which is highly (5000-node clusters in production) and linearly scalable, can do aggregations over the entire grid, and works on objects.
Oracle Coherence.
Costs an arm and a leg, but solves all the issues raised in this article (yes you can do aggregations and SQL-like queries, and they are automatically run "in parallel" across the entire grid).
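On the group-by question: a grouped aggregate over shards is usually done as scatter/gather, where each shard computes a partial result and the partials are merged. The sketch below shows that idea for a grouped count. It is not Coherence's API; `partial_group_count`, `group_count`, and the list-of-dicts "shards" are names and data shapes invented for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_group_count(rows, key):
    """Count rows per group on a single shard."""
    return Counter(row[key] for row in rows)


def group_count(shards, key):
    """Scatter the count to every shard in parallel, then merge the partials."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda rows: partial_group_count(rows, key), shards)
        total = Counter()
        for partial in partials:
            total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Each inner list stands in for the rows held by one shard.
shards = [
    [{"country": "US"}, {"country": "DE"}],
    [{"country": "US"}, {"country": "FR"}],
]
counts = group_count(shards, "country")
```

Counts and sums merge trivially this way; an average needs each shard to return both a sum and a count so the merge step can divide at the end.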
You don't miss much, do you, slick?
"We all expect Oracle to scale if we pay them enough money..."
I also strongly disagree with this. Yes, programmers do like solving problems, but only new problems and challenges, not those problems for which easy solutions already exist.
That said, I would've just stated one reason not to use this (#10: You almost surely don't need it).
The polished turds from Amazon and Google are still turds, though shiny.
I find it most interesting seeing all of the cheerleading for SimpleDB and the like by people who are quite evidently clueless about databases, so they embrace and flaunt their ignorance, using Google and Amazon as a "Big Daddy" of sorts, always ready to reference.
Guess what, kids - you aren't Google, Amazon, or Facebook. The chance that your web toy will ever be a fraction as popular as those sites is so vanishingly small that it is creating an underflow condition.
Google has a very specialized database, and their needs are absolutely nothing like almost anyone else. Amazon likewise. Until the day that you build your own specialized database, an RDBMS is often a suitable choice.
And the scalability ruse....extraordinary. Judging by the numbers I've seen, these "scalable" database technologies need to be scalable because they're such incredibly poor performers.
Alas, everything old is new again. Here we have cheerleaders heralding the arrival of basically exactly what people did before real databases were invented. Hurrah for the past!
+And if you are making something for < 100,000 users, you really probably ought to just stop now, shouldn’t you?
Ho ho ho. Awesome stuff.
Yeah, I guess making systems managing billions in funds just doesn't cut it in the realm of the awesome systems that you make.
You are simply delusional.
+And if you are making something for < 100,000 users, you really probably ought to just stop now, shouldn’t you?
[...] >100,000 user sites do you have, jackson? Care to point a couple out?
Now I presume you must mean 100,000 simultaneous users, because there are quite a few >100K user sites easily running on some shitty RDBMS (e.g. MySQL) on a low-end desktop PC. Slashdot, for instance, which was pretty much a worst case because they were caching nothing, and generating every request live from the database.
Clearly you have needs far beyond /. in their heyday.
Not good enough for jackson's imaginary success story, though.
the answer you troll for is "Nope". in fact, i'd say just the opposite. if you know before you start that your app will need upwards of a hundred thousand users to be useful to its audience, "you really probably [sic] ought to just stop now."
One obvious one that caught my eye: #7: "SimpleDB isn’t that fast" -- Todd specifically pointed out (right in there with the performance numbers he was quoting...) that tools like SimpleDB are NOT fast. That's not the point; they exist to address scaling issues.
Some other lines you apparently considered fawning or possibly satiric:
"If you have a complex OLAP style database SimpleDB is not for you. But, if you have a simple structure, you want ease of use, and you want it to scale without your ever lifting a finger ever again, then SimpleDB makes sense. The cost is everything you currently know about using databases is useless and all the cool things we take for granted that a database does, SimpleDB does not do."
That sounds an awful lot like what you're saying in #10. But you start that off with "Everyone’s assuming that SimpleDB was designed to be a general-purpose replacement for OLTP database servers."
Sorry for the rant; I guess I'm just saying you clearly have some useful input to add to the discussion -- just leave the straw man nonsense at home, please.
PS - do you know that your comment filter rejects valid email addresses such as root@localhost.localdomain? (at least, that's where all my cron jobs send it ... :-)
@toby has it exactly right. RDBMS _can_ scale, but at *significant* costs in both money and developer/sysadmin/DBA time.
Regards
D
1. Data integrity is not guaranteed.
This could be the case with SimpleDB, but overall nothing prevents document databases from managing data integrity very well.
Regarding the constraints, there is nothing that prevents defining validations in a document or its related “meta” document (this is pretty much how StrokeDB works — you can define your validations within meta document and they will let your document stay validated)
More interesting are the concerns about conflicts. I’d say that this problem is hardly addressed in the common RDBMS approach. All you usually get is either user A’s or user B’s most recent update; there seems to be no easy way to resolve conflicts gracefully. By contrast, since the document database approach is rather novel, there is certainly enough room to adopt ways of dealing with conflicts, for example with different, configurable algorithms such as a slot-by-slot three-way merge, or even special programmer-defined algorithms. I can hardly imagine how to do this sort of thing with a traditional RDBMS in a relatively easy manner.
2. Inconsistency will provide a terrible user experience.
First of all, it should be noted that the described inconsistencies are also quite possible with distributed RDBMS setups: they too are constrained by a certain lag before the data is propagated to the replicas.
The actual problem is not with lag — it is more about leaving documents in a consistent state.
This problem could be easily addressed in any kind of database, either relational or document-based.
3. Aggregate operations will require more coding.
Again, while this seems to be true for SimpleDB, other document-based databases address this problem pretty well with the Views approach (CouchDB, StrokeDB [Views is WIP]), so you can define any kind of aggregation, even ones that are simply not supported by an RDBMS.
More at http://rashkovskii.com/articles/2008/4/26/top-1...
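To make the slot-by-slot three-way merge from point 1 concrete, here is a rough sketch over plain dicts. It is not StrokeDB's or CouchDB's actual API: `merge_three_way`, the `resolve` callback, and the choice to keep "our" value for unresolved conflicts are all assumptions made for illustration.

```python
_MISSING = object()  # marks a slot that is absent from a version


def merge_three_way(base, ours, theirs, resolve=None):
    """Merge two edited copies of a document against their common ancestor.

    Returns (merged, conflicts). A slot conflicts when both sides changed it
    to different values. If `resolve` is given it is called as
    resolve(slot, base_value, our_value, their_value) to pick a value (absent
    slots are passed as the _MISSING sentinel); otherwise our value is kept
    provisionally and the slot is reported in `conflicts`.
    """
    merged, conflicts = {}, []
    for slot in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(slot, _MISSING)
        o = ours.get(slot, _MISSING)
        t = theirs.get(slot, _MISSING)
        if o == t:            # both sides agree (or neither changed it)
            value = o
        elif o == b:          # only they changed it
            value = t
        elif t == b:          # only we changed it
            value = o
        elif resolve is not None:
            value = resolve(slot, b, o, t)
        else:
            conflicts.append(slot)
            value = o
        if value is not _MISSING:
            merged[slot] = value
    return merged, conflicts


base = {"title": "Draft", "body": "hello"}
ours = {"title": "Final", "body": "hello"}
theirs = {"title": "Draft", "body": "hello world"}
merged, conflicts = merge_three_way(base, ours, theirs)
```

Here the two users edited different slots, so the merge keeps both changes and reports no conflicts; the "last writer wins" behaviour described above would have silently dropped one of them.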
Databases != RDBMS. RDBMS is but "one" kind of database. Then you have hierarchical, object-based, document-based, etc.
SimpleDB is but one kind of non-RDBMS database. There are use cases that fit RDBMS, and there are use cases that make RDBMS cry. That's where SimpleDB or other alternatives get into the game.
Just as simple as that. When all you know is a hammer, all your problems are nails.
This article just wants more traffic by spreading FUD among newbies.
This is actually a big database limitation. My application has a lot of places where people can leave comments.. and 1k is too small. Think about a long email message.. it could easily go over 1k.
That means you have to split a field into multiple chunks.... :-(
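For what it's worth, the chunking workaround is not much code. The sketch below splits a long string into numbered attributes of at most 1024 UTF-8 bytes each and reassembles them. The function names and the `name_0000` attribute naming scheme are made up for this example, and it only builds plain dicts; it does not call SimpleDB.

```python
def split_into_chunks(text, limit=1024):
    """Split text into pieces of at most `limit` UTF-8 bytes, never cutting a character."""
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if current and size + n > limit:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


def to_attributes(name, text, limit=1024):
    """Map a long value onto numbered attributes, e.g. comment_0000, comment_0001."""
    return {
        f"{name}_{i:04d}": chunk
        for i, chunk in enumerate(split_into_chunks(text, limit))
    }


def from_attributes(name, attrs):
    """Reassemble the value; zero-padded suffixes make string sort order match chunk order."""
    prefix = name + "_"
    keys = sorted(k for k in attrs if k.startswith(prefix))
    return "".join(attrs[k] for k in keys)


long_comment = "x" * 2500
attrs = to_attributes("comment", long_comment)
restored = from_attributes("comment", attrs)
```

It works, but every read and write of that field now goes through this extra layer, which is the inconvenience being pointed out.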
You didn't spend too much time researching SimpleDB then. You can store pointers to larger data objects stored in S3 if you need more than 1k.
Ummm, yeah, let's also add the S3 goodness for something that's easily handled by an RDBMS. Do you work for Amazon?