“NoSQL is really the way to go”. That’s what I recently read somewhere on the web.
“NoSQL is really the way to go”. Erm. No. It is if you are the NSA, have billions to spend, and don’t actually know yet what you are going to do with your data and just want to start collecting every scrap you come across. NoSQL is basically unstructured and inconsistent; SQL is structured and consistent, and therefore the epitome of data management. And there are myriad shades of gray between them. Generally speaking, consistency requirements drain your performance. That’s why you want SQL databases to run on machines that can keep the active part of the data completely in memory. The overall theme, IMHO, is this: at the heart sits a single point of failure that is required to coordinate consistency. HDFS sidesteps that problem by simply dropping support for managing huge numbers of small data chunks and funneling all metadata through one central coordinator, the ‘name node’ (and managing billions of small data chunks consistently is exactly where SQL is huge). Look at MVCC. Compare that with ObjectDB, but keep in mind that ObjectDB does not provide anything like MVCC; its locking mechanism is much more NoSQL-like, i.e. object-based, nothing even remotely resembling MVCC.
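To make the MVCC point concrete, here is a minimal sketch of the idea, loosely modelled on PostgreSQL’s xmin/xmax scheme: every row version records which transaction created it and which deleted it, and a reader’s snapshot decides which version it sees. All names here are illustrative; this is not how any real engine is actually laid out.

```python
# Hypothetical MVCC visibility sketch (names and layout are illustrative,
# loosely inspired by PostgreSQL's xmin/xmax). Not a real storage engine.

class RowVersion:
    def __init__(self, value, xmin, xmax=None):
        self.value = value
        self.xmin = xmin   # ID of the transaction that created this version
        self.xmax = xmax   # ID of the transaction that deleted it (None = live)

def visible(row, snapshot_xid, committed):
    """A version is visible to a snapshot if its creator committed before
    the snapshot was taken, and no such transaction has deleted it yet."""
    created = row.xmin in committed and row.xmin <= snapshot_xid
    deleted = (row.xmax is not None
               and row.xmax in committed
               and row.xmax <= snapshot_xid)
    return created and not deleted

# One logical row, two versions: txn 1 wrote 'v1', txn 3 replaced it with 'v2'.
versions = [RowVersion("v1", xmin=1, xmax=3), RowVersion("v2", xmin=3)]
committed = {1, 3}

def read(snapshot_xid):
    return [r.value for r in versions if visible(r, snapshot_xid, committed)]

# A reader whose snapshot predates txn 3 still sees the old version,
# a later reader sees the new one -- and the writer never blocked either.
assert read(2) == ["v1"]
assert read(4) == ["v2"]
```

The payoff of all that per-version bookkeeping is exactly the part NoSQL-style object locking gives up: readers and writers never wait on each other.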
Then look at Cassandra and how it does not even run data integrity checks on largely unused data: only data that actually gets read by a client is verified. No sane RAID implementor would do that, because it is a recipe for disaster if your data is worth more than trash. Or look at XFS, a POSIX-compliant file system: per TB of data you need about one GB of RAM just to run a filesystem check.
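The difference between the two policies is easy to show in miniature. The following is a hypothetical sketch, not Cassandra’s or any RAID controller’s actual code: verify-on-read only notices corruption in blocks somebody happens to request, while a background scrub walks everything.

```python
# Hypothetical sketch: verify-on-read vs. background scrubbing.
import zlib

class Block:
    def __init__(self, data):
        self.data = data
        self.checksum = zlib.crc32(data)   # checksum taken at write time

def read(block):
    # Verify-on-read: only blocks a client asks for ever get checked.
    if zlib.crc32(block.data) != block.checksum:
        raise IOError("checksum mismatch on read")
    return block.data

def scrub(blocks):
    # Background scrub: walk *all* blocks, read or unread alike.
    return [i for i, b in enumerate(blocks)
            if zlib.crc32(b.data) != b.checksum]

blocks = [Block(b"hot data"), Block(b"cold archive")]
blocks[1].data = b"cold archivX"   # silent bit rot in data nobody reads

# The client only ever reads block 0, so verify-on-read misses the damage...
assert read(blocks[0]) == b"hot data"
# ...while a scrub finds the rotten block before the last good copy dies.
assert scrub(blocks) == [1]
```

That is the whole argument: with verify-on-read alone, corruption in cold data sits undetected until the day you finally need it.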
In the end, SQL is highly complex (the standard alone defines four different transaction isolation levels) and delivers an immense amount of value. NoSQL basically gets rid of both the SQL complexity and the SQL value and, if you are lucky, replaces them with some bullshit map-reduce value. I mean bullshit in the sense that it is more or less braindead. Everyone can write NoSQL databases and data clusters; hardly anyone can write SQL optimizers. Why else do you think Oracle SQL databases are expensive as hell, while Cassandra and Hadoop are available for free? And don’t even get me started on MySQL until you can show me that it handles the same complex SQL queries as well as Oracle does.
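If “everyone can write map-reduce” sounds like an exaggeration, here is the canonical example, word count, in a handful of lines. This sketches the programming model only, not a distributed runtime; the function names are mine.

```python
# The canonical map-reduce example: word count.
# A sketch of the programming model, not of a distributed framework.
from itertools import groupby
from operator import itemgetter

def map_phase(doc):
    # Emit a (key, 1) pair per word -- this is the entire "map" logic.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Group by key and sum -- standing in for the framework's shuffle/sort.
    pairs.sort(key=itemgetter(0))
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

docs = ["to be or not to be", "be quick"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(pairs)
assert counts == {"be": 3, "not": 1, "or": 1, "quick": 2 - 1, "to": 2}
```

Compare that afternoon’s work with writing a cost-based SQL optimizer that has to pick join orders, indexes, and access paths across arbitrary queries, and the price difference between Oracle and Hadoop starts to look less mysterious.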
In the end, everything fits its purpose. Small data, structured, highly consistent: go for SQL (bank accounts). Big data, maybe unstructured, low consistency requirements: go for NoSQL (Twitter-like stuff). Big data, highly structured and highly consistent: keep looking for a solution. The need for a central coordinator will likely never allow one, because it prevents the application of the divide-and-conquer principle. But of course there are shades of gray here too: it looks like GFS2 tries to defuse the central-coordinator issue by decentralizing it a bit. Even though I haven’t read about it in depth, I’m willing to bet this depends on some sort of locality restriction for consistency, i.e. it is not much more than setting up an uber-cluster of separate, more or less independent clusters with somewhat tighter integration between them.
If you want to dive deeper into these consistency issues, I recommend starting with the two-phase commit protocol, and notice that there is no way to handle every failure fully automatically once a transaction spans more than one member/node.
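A toy version of the protocol makes the failure window visible. This is a hypothetical single-process sketch (class and function names are mine): after every participant has voted yes in phase 1, a coordinator crash leaves them all stuck in the prepared state, holding their locks, unable to either commit or abort on their own.

```python
# Toy two-phase commit, to show the blocking window. Hypothetical sketch;
# real implementations add logging, timeouts, and recovery protocols.

class Participant:
    def __init__(self, name):
        self.name = name
        self.state = "idle"        # idle -> prepared -> committed/aborted

    def prepare(self):
        # Phase 1: promise to commit and hold locks until the verdict arrives.
        self.state = "prepared"
        return True                # vote yes

    def finish(self, commit):
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants, coordinator_crashes_after_prepare=False):
    votes = [p.prepare() for p in participants]          # phase 1
    if coordinator_crashes_after_prepare:
        # Everyone voted yes and now sits in 'prepared'. No participant can
        # unilaterally commit (another node might have voted no) nor abort
        # (the coordinator may already have logged 'commit'). They block.
        return None
    decision = all(votes)
    for p in participants:                               # phase 2
        p.finish(decision)
    return decision

nodes = [Participant("a"), Participant("b")]
two_phase_commit(nodes, coordinator_crashes_after_prepare=True)
assert [n.state for n in nodes] == ["prepared", "prepared"]
```

That stuck-in-prepared state is exactly the “no fully automatic handling” problem: resolving it needs either the coordinator coming back or outside intervention, which is why the central coordinator keeps haunting every distributed consistency design mentioned above.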