When Data is Worthless

Ladislav Thon brought up a good point in response to my article on Mongo:

You know, this all starts to make sense once you realize the original design goal of Mongo: a database for that kind of data where losing one record or two isn't a problem, but speed is crucial (i.e., web analytics)...

Now Mongo is attacking the "system of the record" use-cases, and some of the original design tradeoffs are manifesting themselves.

He's absolutely right: if all of your data has the property that you could lose some of it and still compute something meaningful, MongoDB is a harmless choice.

But my endorsement has some subtle caveats that developers need to keep in mind:

  1. Mongo made its tradeoffs to gain speed. But it's not actually all that fast. There are faster systems out there that provide equal or better fault-tolerance guarantees. Wouldn't it make more sense to use one of those?

  2. There is no upper bound on how many records Mongo will lose at a time, so let's not be cavalier and claim that it'll only lose "one record or two." Your data may consist of discardable low-value records, but clearly those records are valuable in aggregate, so you'll want some kind of bound on how much data Mongo can drop for you. The precise amount of data Mongo can lose depends intimately on how you set up Mongo and how you write your application.

So, how much data can be lost due to a single fault? I don't know (it's your code and setup, after all), but if you also don't know precisely what you can lose in one strike, it's probably time to look into alternative systems whose responses to failures are better characterized. Either that, or take the devops team out for a carefree night on the town, pagers off. For if you are running systems without bounds on data loss, it's not like there is a standard of excellence they're trying to meet.

  3. For Mongo to be an appropriate database, all your data has to be equally unimportant. Mongo is going to be indiscriminate about which records it loses. In web analytics, it could lose that high-value click on "mesothelioma" or "annuity." In warehousing, it could drop that high-value order from your biggest account. In applications where the data store keeps a mix of data, say, bitcoin transaction records for small purchases as well as wallets, it could lose the wallets just as easily as it could lose the transaction records.

You might be using that data store just to track the CEO's Pokémon collection at the moment, but it can easily grow into the personnel database tomorrow. Could you live with yourself if you lost Pikachu?

And it's not good engineering to pick a database that meets an application's needs by the skin of its teeth. Civil and mechanical engineers design their structures with a safety factor, and we know how software requirements change and systems evolve. So it would be wise to pick a substrate that can handle anticipated feature growth.

So let us give unto Mongo what is clearly its due: it's mature software with a large install base. If it loses data, it'll likely do so because of a deliberate design decision, rather than a bug. It's easy to find Mongo hosting, and it's relatively easy to find people who are experienced with it. So if all your data is really of equal and low value, and you can afford to lose some of it, and your app's needs are unlikely to grow, then MongoDB can be a fine pick for your application.
