Flexcoin was a Bitcoin exchange that shut down on March 3rd, 2014, when someone allegedly hacked in and made off with 896 BTC in the hot wallet. Because the half-million dollar heist from the hot wallet was too large for the company to bear, it folded.
I'll resist the urge to ask why they did not have deposit insurance for their hot wallet, because the technical story of what happened is even more colorful and fascinating.
In their own words:
The attacker successfully exploited a flaw in the code which allows transfers between flexcoin users. By sending thousands of simultaneous requests, the attacker was able to "move" coins from one user account to another until the sending account was overdrawn, before balances were updated.
It's not every day when one's professional interests in NoSQL databases collide with one's interest in cryptocurrencies, especially in such a monumental train wreck. So, before I go on, allow me to link to appropriate background music for this occasion.
What happened here is a standard problem covered in every undergrad computer science curriculum. It's known as concurrency. Let's look carefully into what must have happened here, before we examine why it happened. The what-part is technical, and frankly, simple, with many well-known solutions. But it's the why-part that is as fascinating as it is disheartening. It points to a social failure: a failure of distributed systems academics to educate developers and to equip them with clear-thinking frameworks. There is far too much noise in the valley around how to build distributed systems, much of it being generated by people who stand to profit from selling broken-by-design software. And it all has consequences and ends with not just broken websites, but with stolen cash and broken dreams.
But first, let me illustrate the problem. Here's the simplest code one might write to dispense cash from an ATM (I'll illustrate with an ATM example because Flexcoin is a trusted Bitcoin wallet and exchange, which is really a glorified bank. Their withdrawal code is multithreaded, but for those who don't know what that means, it's simplest to think of it as ATM witdrawals. Real code will also check to see if there are sufficient funds, as well as a ton of other things, but they are not germane to the bug so let's leave them out for now):
mybalance = database.read("account-number")
newbalance = mybalance - amount
database.write("account-number", newbalance)
dispense_cash(amount) // or send bitcoins to customer
Now, consider what would happen if I duplicated my debit card, gave it to my best friend, synchronized our watches, and performed withdrawals at two different ATMs at the same time.
What's that I hear you say? Absolutely nothing bizarre would happen. My account would be deducted the right amount. That's because banks employ systems that guard against this kind of elementary error. They are based on transactions with ACID guarantees.
But if our bank was implemented using a first-generation NoSQL datastore, say MongoDB, we'd see something very different. Specifically, if multiple people simultaneously execute the code above, they might just go through those operations in lockstep. They would simultaneously read the amount of money I have in my bank account, say $1000. They would simultaneously subtract the amount to be dispensed, say $100, and end up with $900 in the variable newbalance. And they would simultaneously write back the new balance to the database, while dispensing $100 in crisp green bills at two different locations. I would end up with a bank account that says $900, and two fresh $100 bills in hand, for a sum total of $1100! Manna from heaven!
Any computer scientist worth her salt would immediately repeat this process all day, at web scale, until she emptied out all the cash at the exchange. And that's exactly what the attackers did.
The problem here stemmed from the broken-by-design interface and semantics offered by MongoDB. And the situation would not have been any different if we had used Cassandra or Riak. All of these first-generation NoSQL datastores were early because they are easy to build. When the datastore does not provide any tangible guarantees besides "best effort," building it is simple. Any masters student in a top school can build an eventually consistent datastore over a weekend, and students in our courses at Cornell routinely do. What they don't do is go from door to door in the valley, peddling the resulting code as if it could or should be deployed. Yes, yes, the broken-by-design apologists will trot out the usual refrain that goes "there is nothing wrong with MongoDB as long as you always deploy it knowing that it can give you back bogus answers." Yeah well, there is nothing wrong with flammable mattresses either, just make sure there is no source of ignition nearby. It just turns out that we then get charred family tragedies, because people are fallible. Little websites that start out as a pokemon collection or Magic the Gathering trading cards suddenly turn into world's largest Bitcoin exchange handling half a billion dollars, and oops.
Bitcoin coincided with a particularly dark time in distributed systems when people, armed with an incorrect interpretation of the CAP Theorem, thought that they just had to give up on consistency in their databases, that no one could build distributed data stores that provided strong guarantees. Marketers went from door to door in the valley, peddling weak data stores that could not uphold the simple guarantee that a READ should return the result of the latest successful WRITE. Even now, after next-generation NoSQL data stores, such as HyperDex and Google's Spanner, showed that the tradeoffs in first-generation NoSQL systems are neither necessary nor desirable, there are still people who are trying to beat the dead horse of eventual consistency and weak APIs.
Well, tell all that to the Flexcoin folks. These are honest people who put in many hours of work to build a product that they believed in, using the latest technology available to them, and they fell prey to one of the best documented problems in the book.
One might claim that the Flexcoin folks were particularly bad at their craft, that they should never have deployed a bank without concurrency controls, that they should have known better. I don't know these devs, but as a techie, I can detect when I'm dealing with other genuine, well-meaning, hard-working techies, and the Flexcoin online presence pushes all these buttons. They did what anyone would do after reading one too many astroturf articles on Hacker News. Sure, their system failed, but in a sense, the overall system failed them.
And they are far from alone. Another exchange, Poloniex, suffered from the exact same bug. Here are the gory details, which are remarkable in how similar they are to the Flexcoin bug. It's a well-known result in software engineering that even when you have N different teams independently developing software that has nothing in common, they will run into the same issues around the same pain points.
Historically, Bitcoin exchanges that suffered significant losses turned into fractional reserve banks, only to fold later. Luckily, Poloniex did not go under and is currently back online.
This problem is so wide-spread, so embarassingly endemic that there have even been public discussions and possibly a third affected site. It's a dirty little secret that everyone knows: Bitcoin exchanges built on top of first-generation NoSQL infrastructure lack even the most basic measures to guarantee the integrity of their accounts.
And typical security audits may not uncover these flaws, for it's not the case that the hackers gained unauthorized access through some cross-site scripting vulnerability, or some other flaw, well within the arsenal of security auditing firms. It wasn't a fault of the authentication scheme; they were using state-of-the-art 2-factor authentication. It wasn't a fault of their authorization scheme, either; the hackers did not do anything they were not allowed to do. The problems lie with the fundamentally weak semantics offered by the data stores behind these websites. The site was itself broken from the ground up. The hackers simply got it to do what it was programmed to do, a lot faster than normal.
What happened at Flexcoin, or Poloniex, or any of the other Bitcoin exchanges beset by technical problems (and I'm looking at you Coinbase!), was not an outlier or unexpected or unforeseable. The infrastructure is broken. And it is broken by design.
There are many ways of avoiding these kinds of problems. They all start by switching to better infrastructure.
Here's how one would write the same code in HyperDex:
hd = hyperdex.client.Client('127.0.0.1', 1982)
success = False
while not success:
mybalance = hd.get('accounts', 'account-number')['balance']
if mybalance < amount:
raise InsufficientFunds()
newbalance = mybalance - amount
success = hd.cond_put('accounts', "account-number",
{balance: mybalance}, # require balance == old balance
{balance: newbalance}) # set balance = new balance
dispense_cash(amount) # or send bitcoins to customer
This code is not only correct, but HyperDex is faster than MongoDB by a factor of 2 to 10X. And it provides a fault-tolerance guarantee that your data will be protected even through multiple simultaneous failures. And you can take online, instantenous backups that are consistent across a cluster. You cannot do any of this with MongoDB.
Atomic Operations
In domains outside banking, there are scenarios where atomicity is required, but where the operations can be reordered at will. A next-generation NoSQL store like HyperDex provides even faster mechanisms for providing the necessary atomicity when operations are commutative. Suppose, for instance, one might want to keep track of up and downvotes in a reddit-like website. HyperDex provides atomic addition (as well as atomic subtraction, multiplication, division, string prepend, string append, list prepend, list append, etc) operations that greatly simplify the task:
# upboat!
hd = hyperdex.client.Client('127.0.0.1', 1982)
hd.atomic_add('threads', "thread-key", {upvotes: 1})
Mongo does not support such atomic operations. Replicas may diverge, and merging the sums correctly would be non-trivial.
Full Blown Transactions
Here's another trick, possible with next-generation NoSQL data stores like HyperDex (with the Warp add-on), that is impossible with Mongo, Cassandra, and the like.
Suppose we want to move some funds between two accounts, "egs" and "robert", atomically.
hd = hyperdex.client.Client('127.0.0.1',1982)
success = False
while not success:
xaction = hd.begin_transaction()
mybalance = xaction.get('accounts', 'egs')['balance']
hisbalance = xaction.get('accounts', 'robert')['balance']
if mybalance < amount:
xaction.abort()
raise InsufficientFunds()
mynewbalance = mybalance - amount
hisnewbalance = hisbalance + amount
xaction.put('accounts', 'egs', {balance: mynewbalance})
xaction.put('accounts', 'robert', {balance: hisnewbalance})
success = xaction.commit()
In this case, the code is modifying two separate objects, one that holds EGS's balance and another one that holds Robert's balance. Since the data store is horizontally scalable, these two balances are quite likely stored on separate servers. Ensuring that these two transactions are atomic, consistent, isolated and fault-tolerant is a difficult thing to do. But it's not impossible. We now have the technology to do it correctly, and to do so with higher performance than the NoSQL systems of yesterday.