A very brief history of the NoSQL development

26.06.2014 Christian Mennerich

A very brief history of the NoSQL development – From Codd to Brewer and beyond

I am still new to the movement that is now called NoSQL, and therefore curiously following all the discussions around the CAP theorem, consistency levels like BASE, the immolation of several letters in the ACID paradigm and the ‘demonization’ of the relational join operation. I wondered why long established techniques and paradigms might no longer be valid and attended some of the NoSQL matters conferences. These conferences are still small and very communicative, and I enjoyed them a lot! Last year in Barcelona a great inspiring talk on the development of NoSQL was given by Doug Turnbull (@softwaredoug), who is a historian as well as a computer scientist. He discussed a lot of interesting points, and to some of these I will refer here, too. What is better suited to understand a new topic than writing a review on its history? As this would be too time consuming a task, I will write down a very brief history of the events (as far as I know about them) related to the NoSQL development, as well as some of my very own thoughts and impressions on this topic. There are still a lot of questions troubling my mind…

A short story of three famous papers

There is a lot of research going on in the field of NoSQL. Three papers are frequently cited in this context: Codds works on large shared databanks [1], the fundamental work that describes the foundation of relational database systems and the relational algebra as its query language. The proof of Brewer’s conjecture by Gilbert and Lynch [2], who show that, within a reasonable network model and given some premises, CAP is a pick two out of three choice. And the criticism on a one size fits all philosophy by Stonebraker and Çetintemel [3], a paper stating, more or less, the end of relational databases as the jack of all trades devices for storage related problems.

Codds criticism

To understand Codds criticism on database systems we have to get back to the 1960s, the time when data in database systems like IBM’s IMS or CODASYL systems was hierarchically ordered (i.e. tree structured) or arranged in networks (i.e. graph structured). Three major issues directly affect the applications that depended on the persisted data in these systems:

Ordering dependence: The order of the records on the storage system is identical to the presentation of the records to the application. Hence, a change in this order can break the application.
Indexing dependence: If index structures are used, queries have to make explicit use of these indexes. Hence, dropping an index can invalidate a query and break the application.
Access path dependence: Data and relations between data are modelled as a tree or a network. If an applications relies on these structures to access and retrieve data it can break if the structure is unknown or changed.

The relational data model: Separation of concerns and data consistency

Codds basic idea to address these issues was the separation of the data model from the data representation on the storage system. This can be achieved by modeling data and the relationships between data in terms of mathematical relations, structures that are, basically, sets of (ordered) tuples. A database state, in the simplest sense, is a time-varying collection of relations. To avoid anomalies, the database schema is normalized, which leads to a distribution of the elements that constitute a single object of the domain onto several relations. The relational algebra, a set of operations defined on relations, is a suitable query language, and as powerful (or expressive) as a first order calculus. It produces new relations from given relations (and the algebra is in that sense closed). This query language solely depends on the relational data model and is completely independent from the underlying storage of records on disks and from any defined indexing structures: All information the model bears can be retrieved and deduced by the query language. Codd and Stonebraker were among the first who implemented such query languages. Today, SQL, a language based on the relational algebra, is the standard query language for relational database systems.
Relational database systems offer the concept of referential integrity, mechanisms to add semantics to the model by using keys and foreign key relationships to ensure data consistency. Concurrent access to a database is administered by using transactions. The famous ACID paradigm is part of (almost) all relational database systems and guarantees transaction to be

Atomic (all operations of a transaction are executed, or none is)
Consistent (a transaction begins with a consistent state, and leaves the database in a consistent state)
Isolated (a transaction is executed as if it were the only transaction in the system)
Durable (all changes of a transaction will be persisted to the storage system)

These features are tunable to some extend, leading to different isolation and consistency concepts. This is an important point, not only in distributed systems.

Performance issues

There are some issues to be discussed when it comes to performance tuning of relational database systems: Not all mathematical properties of the relational algebra (or equivalent calculi) hold in the implementation of the query languages, and despite being declarative the order of the operations in a SQL statement can significantly affect the performance. Equivalent reformulations of complex queries can immensely decrease response times. The relational join can be quite expensive, because, due to the normalization process that distributes components of one business object to several rows or even tables, intensive I/O traffic can be produced to gather the object’s components back from the storage. And even the use of indexes does not always lead to acceptable response times for ad-hoc queries, so denormalization is applied. This leads to data redundancy and brings back the carefully avoided anomalies. Often these issues are tried to be addressed by vertical scaling, i.e. by increasing memory, power or storage of the database server.

There is no such thing as a free lunch

Starting with increasing popularity of the web and web businesses, the world became, naturally, more and more distributed. As discussed above, there can be performance issues with relational databases, and these become more significant when scaling is horizontal. Guaranteeing promises like ACID compliance becomes hard on scaling out, which is very popular nowadays: Use more computers with less performance rather than fewer computers with higher performance. Commodity hardware is cheap and widely achievable. This distribution of systems comes unavoidably with the characteristics of the CAP theorem as stated as a conjecture by Brewer at PODC 2000 [4]: A distributed systems cannot be (at the same time!!) consistent with all its data, available to serve all requests and tolerant to network partitioning. The conjecture has been formally proven by Gilbert and Lynch in 2002 within a reasonable network model [3]. Hence: There is no free lunch with distributed data. [5] In CAP, the term consistency of a distributed system refers to a consistent view of all nodes in the system on the data that is stored. Availability means that any request will get a response as long as there is no total network failure. And partition tolerance is about dealing with failures or the unavailability of parts of the system. As in highly distributed systems network failures are not avoidable, the P is often a premise, and the system can be chosen to be AP or CP. Data consistency becomes an issue, and often BASE (Basically Available, soft State, Eventually Consistent) is the chosen consistency paradigm in distributed environments: The system is guaranteed to become consistent at some time in the future, but is not guaranteed to provide a consistent view on the data at all times. That a distributed datastore can be tuned within CAP bounds (and one somehow has a continuous choice) was pointed out by Brewer [6] as well as by others.

One size does not fit all

Over time, relational database systems with SQL as query language became the de facto standard wherever a storage system was needed, ranging from a simple data storage backing up a small web application, to giant companies data management systems, integration databases managing data exchange between lots of applications and data warehouses: One technology integrating all these totally different purposes. The task of choosing a database is often treated as a non-functional requirement, answered by taking the relational database system that was always used. As pointed out by Stonebraker and Çetintemel [3], there are limits to what applications a relational database system can be used for. The consequence: One size does not fit all. A giant hammer is not a universal tool, and choosing the correct tool for a task becomes an option again in these days. Stonebraker and Çetintemel take stream processing as an example to show how a simple system fitting the application’s needs can outperform a giant universal purpose relational database system by orders of magnitude, and state that one size fits all can only be maintained as a marketing illusion [3]. The paper and its sequel give several examples.

The rise of NoSQL

At this point NoSQL comes into play, by offering datastores optimized for special purposes. The usual categorization classifies the NoSQL stores into one of four categories (document store, key/value store, wide column store or graph database), and places it at a specific side of the CAP triangle [3], according to the properties they offer.

But: What exactly is NoSQL?

It is agreed on that NoSQL stands for not only SQL. But there is still no common understanding about the concepts included into the term NoSQL. NoSQL stores are often schema-less, open source, non-relational, horizontally scalable, and use BASE as consistency mode. The term elasticity is used for stores that are scalable, schema-free, and allow for rapid replication and rapid changes. But how can these features be achieved? If there is no schema, the application has to take care of the integrity of the data, as the datastore often cannot support decisions without knowledge about the structure. The design of many NoSQL datastores is bottom-up, optimized for horizontal scalability. They often provide only simple low-level APIs (like simple set, get and put operations, sometimes realized atomically). Modelling with NoSQL datastores feels totally different than modelling in the the relational world, and follows a different philosophy.

Common misunderstandings

The NoSQL world is still lacking a commonly accepted terminology, and there are frequent misunderstandings: Relational databases gained their name from mathematical relations, not from relationships between data tables implemented by foreign keys and referential integrity. The meaning of C, i.e. the idea of data consistency, differs in ACID and CAP by refering to referential integrity or data being the same on different nodes. A BASE system will be consistent at some time in the future, this is a guaranty when the system has enough time and no more updates are altering the systems state. ACID and NoSQL are not inherently mutual exclusive (see e.g. the graph database Neo4j [8], or the approach FoundationDB [9] has taken). And the terms sharding and replication are sometimes used synonymously, confusing data distribution with data redundancy. Not all of these misunderstandings mentioned here are exclusive problems in NoSQL, but produce confusion in all discussions about distributed systems.

A repetition of history?

Does the situation we are facing today resemble the one Codd faced when he wrote about the relational data model? There are clearly some similarities. But is it fair to compare the situation nowadays to the one of the 1960s? Nonetheless, I will put two (arbitrarily chosen) examples for discussion.

The graph database Neo4j

Neo4j, developed and maintained by Neotechnology [8], is the most famous graph database to date. Neo4j is on the CA side of the CAP triangle [7]. The data model is well-suited for a wide range of applications, ranging from recommendation systems to underground train time tables, and is used by many big companies. Neo4j is, undoubtedly, a great NoSQL database: Easy to set up and to use, lots of impressive application examples, and completely open source with a helpful community. I myself like working with that database a lot. Neo4j brings a query language called Cypher that can be intuitively used and seems to be quite powerful. Let’s have a look at a typical Cypher statement before Neo4j version 2.0:

START movie = node:Movies(“title:Matrix”)
MATCH movie<-[:ACTS_IN]-actor
RETURN actor.name;

The first line of the statement reveals a direct dependency on an index called Movies (compare Codds issue no. 2.). Since version 2.0 of Neo4j this issue has been resolved. But how powerful and expressive is Cypher, say, compared to a query language like the relational algebra (or some version of datalog, or whatever language you always used and know to its bones)? Can you express everything you always wanted to know about your graph model using Cypher? As far as I know there is no obvious algebra underlying Cypher (some concept comparable to a path algebra as proposed by Neubauer and Rodriguez [10]) that would make Cypher easily accessible for a formal analysis. And the last resort to querying a graph in Neo4j, the core API, is inherently imperative, so it depends on access paths and graph traversal strategies.

The wide column store Cassandra

Apache Cassandra [11] is a famous wide column store, suited to hold tons of data and resides on the AP side of the CAP triangle [7]. Cassandra can be easily distributed, is highly available and provides no single point of failure with tunable consistency. Cassandra is widely used in big companies for data analysis. Data modelling in Cassandra follows different objectives than data modelling against a relational database: Data does not need to be flat, which is actually a very nice property. The data that is required to answer a query against a Cassandra data model must reside in a single column family, and hence, referential integrity is considered a non-issue here. And the data modelling methodology is equally driven by queries and data, data duplication is wanted here, whereas data duplication in relational database systems leads to unwanted anomalies. In opposition to relational database systems, transactions are not supported by Cassandra. So, we deal with completely different approaches, and in Cassandra some issues that are carefully treated and avoided for the purpose of data consistency in relational systems are purposely ignored to achieve a different goal. In addition, Cassandra is shipped with the Cassandra Query Language CQL, that is in many ways similar to SQL:

SELECT name FROM employees WHERE department = 'Marketing';

You can have SELECT, FROM and WHERE clauses in a query, and you can think in tables, rows and columns again. But to match for an attribute in a WHERE clause, you need an index on that column (compare Codds issue no. 2.). And, unlike in SQL, you can not have subqueries. Again the question about the expressiveness of CQL remains unanswered, at least for me.

Attempting a conclusion

I do not want, at any point and in any way, to offend any of the NoSQL datastore vendors or communities. The datastores all suit their purpose. As one can equivalently model a domain within each of the four pillars of NoSQL (at different gains and costs, obviously), one should carefully choose the database that suits the desired purpose. Data is the new oil! [12] And care should be taken when it comes to the further evolution of a datastore. It would be a pity if a datastore sacrificed good properties to achieve a general purposefulness, running the risk of repeating the history of relational databases in several ways. Fast and highly optimized, highly specialized data stores open the doors to highly sophisticated polyglot solutions, and here the relational databases are included as part of the database landscape. But there are lessons learned from the pre-relational days, and some ideas that led to relational database systems were not so bad at all, but revolutionary in the circumstances they were invented in. Looking back and learning seems to be the key to self-healing data and CRDTs as described by Shapiro et al [13], using long known ideas from Lamports works [14] and mathematical lattice theory (to formally describe automatic convergence towards a common supremum for differing data states). Maybe evolution of datastores in the NoSQL world can be sped-up by increasing a datastores fitness by looking back and learning, by understanding why certain decisions were made in earlier times, by understanding their consequences and hence avoiding mistakes. Temporary faulty states (like direct index dependencies or access path dependencies) are not always avoidable, but to know of them is important and necessary. And often differing concepts are used on purpose. Pointing out specializations and weaknesses leads to more honest solutions (as nicely pointed out in a FoundationDB blogpost [15]), and that would drastically simplify the choice of the correct tool from the more than 150 existing NoSQL datastores [16].
In conclusion: I am really not sure if any uttered criticism is even fair. Going back to access, indexing and order dependencies can be a good choice in recent developments, and maybe NoSQL does not have to evolve out of this again, for the sake of query performance. But is the separation of data and its representation on the storage not a good idea in the distributed world NoSQL stores reside in today? What makes the situation different from the pre-relational time? Supposedly, we will experience, at least to some extent, a reinvention of the wheel at certain points, and doing this knowingly or unknowingly could be the question to ask! Honest database solutions are needed. If you know your drawbacks, do not hide them. And point out specializations and motivate them. I really cannot predict future developments in NoSQL. But maybe someone wants to share their experience here? Every comment and discussion is very welcome!

References

[1] E.F. Codd: A Relational Model of Data for Large Shared Data Banks, Communications of the ACM, Vol. 13:6, 1970.
[2] N. Lynch and S. Gilbert: “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services”, ACM SIGACT News, Volume 33 Issue 2 (2002), pg. 51-59.
[3] M. Stonebraker and U. Çetintemel: “One Size Fits All”: An Idea Whose Time Has Come and Gone, Proceedings of the 21st International Conference on Data Engineering, 2005.
[4] E. Brewer: Towards Robust Distributed Systems, Keynote at PODC, 2000.
[5] HP white paper: There is no free lunch with distributed data white paper Consistency, availability, and partition-tolerance trade-offs on distributed data access systems.
[6] E. Brewer: CAP Twelve Years Later: How the “Rules” Have Changed, Computer, IEEE Computer Society, 2012.
[7] Nathan Hurst’s blog
[8] Neotechnology
[9] FoundationDB
[10] M. A. Rodriguez and P. Neubauer: A Path Algebra for Multi-Relational Graphs. CoRR, 2010.
[11] Apache Cassandra
[12] E. Redmond and J.R. Wilson: Seven databases in seven weeks: A Guide to Modern Databases and the NoSQL Movement, Pragmatic Bookshelf, 2012.
[13] Shapiro et al: A comprehensive study of Convergent and Commutative Replicated Data Types. INRIA, RR-7506, 2011.
[14] L. Lamport: Time, clocks, and the ordering of events in a distributed system. Comm ACM 21, 7, pp 558-565, 1978.
[15] FoundationDB blog
[16] nosql-database.org