In a recent, very nice blog post, Frank McSherry describes in detail how to implement database operations, more specifically the Weaver API comprising graph transactions and traversals, using the differential dataflow model implemented by Naiad. In this post, we analyze this technique of implementing database operations using differential dataflow and contrast it with traditional techniques of data management.
The key idea behind Naiad's differential dataflow-based approach is to view every database operation as a combination of the input and computation on that input. For example, in order to implement a batch of updates on a graph, the differential dataflow approach would take as input both the current state of the graph as well as the batch of update operations. Naiad applies the batch of updates in order, and outputs the result. Differential dataflow, as the Naiad paper describes, can perform iterative processing on a stream of data. In this case, the stream comprises both the input graph as well as incoming database operations.
This approach is, of course, very different from the way traditional data stores such as Weaver manage the system state. Weaver stores the data as a distributed adjacency list, and Weaver transactions directly modify and read from the adjacency list. As the system processes more and more updates, Weaver stores all the different versions of the graph in-memory, so as to ensure that only the correct answer is observable to an external entity.
Speaking of which, McSherry's post is righteously snarky about various "different flavors of consistency" that seem to be in vogue in the distributed systems community. We share with McSherry his concern about "weird-a** consistency properties" that some systems exhibit. Indeed, we are convinced that regular programmers find it very difficult to reason about any model other than strict serializability or linearizability. First-gen NoSQL companies have used the reasonable-sounding claim that the application ought to drive the properties required of the underlying data store, in order to justify selling databases with incredibly weak consistency properties. It is an open question whether practitioners actually are able to understand custom consistency properties, whether they can formally enunciate the required properties of their applications, and whether they can match them to the interfaces provided by these weak systems. And even in cases where the practitioners can match a data store to an application at a given point in time, application requirements can evolve and change post-deployment, leading to disruptive changes. These concerns do not apply to Weaver, because it guarantees that all transactions are strictly serializable, a very strong and well-understood consistency property.
Both differential dataflow, and Weaver-style direct data management, provide similar consistency guarantees. One may expect that database operations implemented in an interative dataflow model may have higher latency due to batching; the exact latency numbers depends on the workload characteristics and the batch size. If the iteration speed is fast enough, the two techniques may have similar performance.
A key difference between the two techniques is in terms of fault tolerance. In Weaver, once an operation commits, the system replicates and durably stores the updates to guard against crash failures. In dataflow world, the system can recover from crash failures by re-executing the operations on the input data. It would be prohibitively expensive to replay all operations since the beginning of time, so Frank also alludes to a way of committing the partial computation as the system progresses through the stream of updates. Since this is covered in the "speculative" part of his post, it will be interesting to see future work which develops this idea.
As a side note, differential dataflow is orthogonal to refinable timestamps, Weaver's core technical contribution. Refinable timestamps is a novel distributed ordering technique that Weaver uses to order graph transactions. One could envisage a scenario where we augment Weaver's query model with differential dataflow, which would result in a strictly stronger system with a more expressive query interface.
State management is at the heart of nearly all distributed systems applications. Naiad and Weaver represent two interesting points in the design space. Naiad's approach is based on differential dataflow, a scheme that implements state management as operations on a named input and computation on that input. Weaver follows a more direct route -- the graph data is stored as an adjacency list, and the operations manipulate that state directly.
We thank Frank for his enlightening description of an implementation of the Weaver API on the differential dataflow computation model. The "dataflow as database" approach is a novel one, and we look forward to reading more about the systems challenges in implementing a fully-fledged, fault-tolerant database using this technique.