Why NoSQL, Principles, Overview
Lecture 1 of NoSQL Databases (PA195)
David Novak & Vlastislav Dohnal
Faculty of Informatics, Masaryk University, Brno

Agenda
● Current trends in data management & computing
● Big Data
● Relational vs. NoSQL databases
  ○ the value of relational databases
  ○ new requirements
  ○ NoSQL features, strengths and challenges
● Types of NoSQL databases
  ○ key-value stores, document databases, column-family databases, graph databases
  ○ principles and examples

Current Trends: Big Data
● Volume, Velocity and Variety of data

Current Trends: Big Users
● It is common to launch a Web-based system and reach millions of users within a few months

Current Trends: Cloud Computing
● Everything is in the cloud
  ○ flexibility and distributed nature of the systems

Big Data
“Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” (Gartner, 2012)
● veracity – precision vs.
  uncertainty of data
● value – information extraction needed to get a value

Sources of Big Data
● Social networks
  ○ this data is huge, but volumes can be relatively limited
● Logs of various web/email servers or routers
  ○ growing beyond limits
● Sensor networks
  ○ this sector is expected to grow even faster
● Internet of things (IoT)
● Computer-driven machines, like airplanes:
  ○ one overseas flight of a Boeing generates 640 TB of data
● etc.

Processing (Traditional) Data
● OLTP: Online Transaction Processing
  ○ Standard databases (DBMSs) and database applications
  ○ Storing, querying, multi-user access
● OLAP: Online Analytical Processing (Warehousing)
  ○ Answer multi-dimensional analytical queries
  ○ Financial/marketing reporting, budgeting, forecasting, …
● RTAP: Real-Time Analytic Processing (Big Data Architecture & Technology)
  ○ Data gathered & processed in real-time (streaming)
  ○ Real-time and history data combined

Technologies for Big Data
● Distributed file systems (GFS, HDFS, etc.)
● MapReduce
  ○ and other models for distributed programming
● NoSQL databases
● Data Warehouses
● Grid computing, cloud computing
● Large-scale machine learning

Relational Database Management Systems
● RDBMSs are the predominant database technology
  ○ relational model first defined in 1970 by Edgar Codd of IBM's Research Lab
● Data modeled as relations (tables)
  ○ object = tuple of attribute values
    ■ each attribute has a certain domain
  ○ a table is a set of objects (tuples, rows) of the same type
    ■ a relation is a subset of the Cartesian product of the attribute domains
  ○ each tuple is identified by a key
    ■ field (or a set of fields) that uniquely identifies a row
  ○ tables and objects “interconnected” via foreign keys
● Relational calculus, SQL query language

RDBMS Example
SELECT Name FROM Students NATURAL JOIN Takes_Course WHERE ClassID = 1001
sources: https://github.com/talhafazal/DataBase/wiki/Home-Work-%23-3-Relational-Data-vs-Non-Relational-Database...

The Value of Relational Databases
● A (mostly) standard data model
● Many well developed technologies
  ○ physical organization of the data, search indexes, query optimization, search operator implementations
● Good concurrency control (ACID)
  ○ transactions: atomicity, consistency, isolation, durability
● Many reliable integration mechanisms
  ○ “shared database integration” of applications
● Well-established: familiar, mature, supported, ...

Data Management: Trends & Requirements
Trends → Requirements
● Volume of data, Cloud computing (IaaS) → Real database scalability
  ○ massive database distribution
  ○ dynamic resource management
  ○ horizontally scaling systems
● Velocity of data →
  Frequent update operations
● Many users → Massive read throughput
● Variety of data → Flexible database schema
  ○ semi-structured data

RDBMS for Big Data
● relational schema
  ○ data in tuples
  ○ a priori known schema
  → but current data are naturally flexible
● schema normalization
  ○ data split into tables (3NF)
  ○ queries merge the data
  → inefficient for large data
  → slow in a distributed environment
● transaction support
  ○ transaction management with ACID
  ○ Atomicity, Consistency, Isolation, Durability
  ○ safety first
  → full transactions are very inefficient in a distributed environment

NoSQL Databases
● What is “NoSQL”?
  ○ term used in the late 90s for a different type of technology: Carlo Strozzi: http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/
  ○ “Not Only SQL”?
    ■ but many RDBMS are also “not just SQL”
  ○ “NoSQL is an accidental term with no precise definition” [Sadalage & Fowler: NoSQL Distilled, 2012]
  ○ in today's sense, first used at an informal meetup in 2009 in San Francisco (presentations from Voldemort, Cassandra, Dynomite, HBase, Hypertable, CouchDB, and MongoDB)

NoSQL Databases (cont.)
● NoSQL: Database technologies that are (mostly):
  ○ Not using the relational model (nor the SQL language)
  ○ Designed to run on large clusters (horizontally scalable)
  ○ No schema: fields can be freely added to any record
  ○ Open source
  ○ Based on the needs of 21st century web estates
  [Sadalage & Fowler: NoSQL Distilled, 2012]
● Other characteristics (often true):
  ○ easy replication support (fault-tolerance, query efficiency)
  ○ simple API
  ○ eventually consistent (not ACID)

Just Another Temporary Trend?
● There have been other trends here before
  ○ object databases, XML databases, etc.
● But NoSQL databases:
  ○ are an answer to real practical problems that big companies have
  ○ are often developed by the biggest players
  ○ are developed outside academia but based on solid theoretical results
    ■ e.g., old results on distributed processing
  ○ are widely used

NoSQL Properties in Detail
1.
Good scalability
   ○ horizontal scalability instead of vertical
2. Dynamic schema of data
   ○ different levels of flexibility for different types of DB
3. Efficient reading
   ○ spend more time to store the data, but read fast
   ○ keep relevant information together
4. Cost saving
   ○ designed to run on commodity hardware
   ○ typically open-source (with support from a company)

Challenges of NoSQL Databases
1. Maturity of the technology
   ○ it’s getting better, but RDBMS had a lot of time
2. User support
   ○ professional support of the kind provided by, e.g., Oracle is rare
3. Administration
   ○ massive distribution requires advanced administration
4. Standards for data access
   ○ RDBMS have SQL, but the NoSQL world is wilder
5. Lack of experts
   ○ not enough DB experts on NoSQL technologies

...but
More and more companies accept the weak points and choose NoSQL databases for their strengths.
NoSQL technologies are also often used as secondary databases for specific data processing.
http://basho.com/about/customers/
https://www.mongodb.com/who-uses-mongodb
http://planetcassandra.org/companies/
http://neo4j.com/customers/

The End of Relational Databases?
● Relational databases are not going away
  ○ they are ideal for a lot of structured data, reliable, mature, etc.
● RDBMS became one option for data storage
Polyglot persistence – using different data stores under different circumstances [Sadalage & Fowler: NoSQL Distilled, 2012]
Two trends:
1. NoSQL databases implement standard RDBMS features
2. RDBMS are adopting NoSQL principles

NoSQL Technologies
● MapReduce programming model
  ○ running over a distributed file system
● Key-value stores
● Document databases
● Column-family stores
● Graph databases
source: http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases

MapReduce: Principles
source: Dean, J. & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Features
● MapReduce is a generic approach for distributed processing of large data collections
● Requires a way to distribute the data
  ○ and to collect the results back after the processing
● The user must only specify two functions: map & reduce

MapReduce: Implementation
Amazon Elastic MapReduce

Key-value Stores: Basics
● A simple hash table (map), primarily used when all accesses to the database are via primary key
  ○ key-value mapping
● In RDBMS world: A table with two columns:
  ○ ID column (primary key)
  ○ DATA column storing the value (unstructured BLOB)
● Basic operations:
  ○ Put a value for a key: put(key, value)
  ○ Get the value for the key: value := get(key)
  ○ Delete a key-value pair: delete(key)

Key-value Stores: Architecture
1. Embedded systems
   ○ the system is a library and the DB runs within your system
2. Large-scale distributed stores
   ○ architecture often as a distributed hash table (DHT)
   source: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
Features: it is simple
● great performance, easily scaled

Key-value Stores: Representatives
Project Voldemort
Ranked list: http://db-engines.com/en/ranking/key-value+store

Document Databases: Basics
● Basic concept of data: Document
● Documents are self-describing pieces of data
  ○ Hierarchical tree data structures
  ○ Nested associative arrays (maps), collections, scalars
  ○ XML, JSON (JavaScript Object Notation), BSON, …
● Documents in a collection should be “similar”
  ○ Their schema can differ
● Documents stored in the value part of key-value
  ○ Key-value stores where the values are examinable
  ○ Building search indexes on various keys/fields

Document Databases: Data Example
key=3 -> { "personID": 3,
           "firstname": "Martin",
           "likes": [ "Biking", "Photography" ],
           "lastcity": "Boston",
           "visited": [ "NYC", "Paris" ] }
key=5 -> { "personID": 5,
           "firstname": "Pramod",
           "citiesvisited": [ "Chicago", "London", "NYC" ],
           "addresses": [
             { "state": "AK", "city": "DILLINGHAM" },
             { "state": "MH", "city": "PUNE" } ],
           "lastcity": "Chicago" }
source: Sadalage & Fowler: NoSQL Distilled, 2012

Document Databases: Queries
Example in MongoDB syntax
● Query language expressed via JSON
● clauses: where, sort, count, sum, etc.
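The idea that a document database is a key-value store with examinable values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `DocumentStore` and its methods `put`, `get`, `delete` and `find` are hypothetical names chosen for this sketch, not the API of MongoDB or any other product, and `find` supports only equality matching on top-level fields plus a simple field projection. The two documents are the ones from the data example above.

```python
from typing import Any, Optional


class DocumentStore:
    """Toy in-memory document store layered on a key-value map (illustrative only)."""

    def __init__(self) -> None:
        self._data = {}  # key -> document (a plain dict)

    # --- key-value operations -------------------------------------------
    def put(self, key: Any, doc: dict) -> None:
        self._data[key] = doc

    def get(self, key: Any) -> Optional[dict]:
        return self._data.get(key)

    def delete(self, key: Any) -> None:
        self._data.pop(key, None)

    # --- document-style query: the values are examinable -----------------
    def find(self, criteria: Optional[dict] = None,
             fields: Optional[list] = None) -> list:
        """Return documents whose top-level fields equal all given criteria.

        If `fields` is given, only those fields are copied into each result.
        """
        criteria = criteria or {}
        results = []
        for doc in self._data.values():
            if all(doc.get(k) == v for k, v in criteria.items()):
                if fields is None:
                    results.append(dict(doc))
                else:
                    results.append({f: doc[f] for f in fields if f in doc})
        return results


users = DocumentStore()
users.put(3, {"personID": 3, "firstname": "Martin",
              "likes": ["Biking", "Photography"],
              "lastcity": "Boston", "visited": ["NYC", "Paris"]})
users.put(5, {"personID": 5, "firstname": "Pramod",
              "citiesvisited": ["Chicago", "London", "NYC"],
              "addresses": [{"state": "AK", "city": "DILLINGHAM"},
                            {"state": "MH", "city": "PUNE"}],
              "lastcity": "Chicago"})

everyone = users.find()                       # no filter: all documents
martin = users.find({"personID": 3})          # equality filter on one field
pramod_summary = users.find({"personID": 5},  # filter plus projection
                            ["firstname", "lastcity"])
```

Note that the two documents have different fields ("visited" vs. "citiesvisited", "addresses"), and the store accepts both without any schema declaration. A real document database would add indexes on selected fields so that `find` does not have to scan every document.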
SQL:     SELECT * FROM users
MongoDB: db.users.find()

SQL:     SELECT * FROM users WHERE personID = 3
MongoDB: db.users.find( { "personID": 3 } )

SQL:     SELECT firstname, lastcity FROM users WHERE personID = 5
MongoDB: db.users.find( { "personID": 5 }, { firstname: 1, lastcity: 1 } )

Document Databases: Representatives
Ranked list: http://db-engines.com/en/ranking/document+store
MS Azure DocumentDB

Column-family Stores: Basics
● AKA: wide-column, columnar
● Data model: rows, each identified by a row key; each row can have many columns
● Column families are groups of related data (columns) that are often accessed together
  ○ e.g., for a customer we typically access all profile information at the same time, but not the customer’s orders

Column-family Stores: Example
source: Sadalage & Fowler: NoSQL Distilled, 2012

Column-family Stores: BigTable
● 2008: Google publishes the Bigtable paper (journal version; first presented at OSDI 2006)
● “BigTable = sparse, distributed, persistent, multi-dimensional sorted map indexed by (row_key, column_key, timestamp)”
[Figure: one Bigtable row with row key “com.cnn.www”. Column “contents:html” holds three versions (t2, t6, t8); “param:lang” = EN (t2); “param:enc” = UTF-8 (t2); “a:cnnsi.com” = CNN.com (t3); “a:ihned.cz” = CNN (t7). The part of each column name before the colon is the column family.]
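The definition of BigTable as a sparse sorted map indexed by (row_key, column_key, timestamp) can be made concrete with a small in-memory model. This is only a sketch under stated assumptions: the class `SortedCellMap` and its methods are hypothetical names, not the API of Bigtable or HBase; timestamps are plain integers; the HTML strings are made up; the row key and the remaining cell values echo the slide figure. Distribution and persistence are left out entirely.

```python
import bisect


class SortedCellMap:
    """Toy model of a Bigtable-like map: (row_key, column_key, timestamp) -> value."""

    def __init__(self):
        self._keys = []    # sorted list of (row_key, column_key, -timestamp)
        self._values = {}  # the same tuple -> cell value

    def put(self, row_key, column_key, timestamp, value):
        # Negating the timestamp makes the newest version sort first.
        key = (row_key, column_key, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row_key, column_key, timestamp=None):
        """Return the newest value with version <= timestamp (newest overall if None)."""
        start = bisect.bisect_left(self._keys, (row_key, column_key, float("-inf")))
        for key in self._keys[start:]:
            if key[0] != row_key or key[1] != column_key:
                break
            if timestamp is None or -key[2] <= timestamp:
                return self._values[key]
        return None  # sparse: a cell that was never written simply does not exist

    def scan_row(self, row_key, family=None):
        """Yield (column_key, timestamp, value) for one row, optionally one column family."""
        start = bisect.bisect_left(self._keys, (row_key,))
        for key in self._keys[start:]:
            if key[0] != row_key:
                break
            if family is None or key[1].startswith(family + ":"):
                yield key[1], -key[2], self._values[key]


table = SortedCellMap()
row = "com.cnn.www"
table.put(row, "contents:html", 2, "<html>v2")   # made-up page contents
table.put(row, "contents:html", 6, "<html>v6")
table.put(row, "contents:html", 8, "<html>v8")
table.put(row, "param:lang", 2, "EN")
table.put(row, "param:enc", 2, "UTF-8")
table.put(row, "a:cnnsi.com", 3, "CNN.com")
table.put(row, "a:ihned.cz", 7, "CNN")

latest = table.get(row, "contents:html")       # newest stored version
as_of_7 = table.get(row, "contents:html", 7)   # newest version at or before t7
params = list(table.scan_row(row, family="param"))
```

Because the keys are kept sorted, all cells of one row, and within it all columns of one family, sit next to each other. That is what makes reading a whole column family cheap.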
Column-family Stores: Representatives
Ranked list: http://db-engines.com/en/ranking/wide+column+store

Graph Databases: Example
source: Sadalage & Fowler: NoSQL Distilled, 2012

Graph Databases: Mission
● To store entities and relationships between them
  ○ Nodes are instances of objects
  ○ Nodes have properties, e.g., name
  ○ Edges have directional significance
  ○ Edges have types, e.g., likes, friend, …
● Nodes are organized by relationships
  ○ This allows finding interesting patterns
  ○ example: Get all nodes that are “employee” of “Big Company” and that “likes” “NoSQL Distilled”

Graph Databases: Graphs in RDBMS
● When we store a graph-like structure in an RDBMS, it is for a single type of relationship
  ○ “Who is my manager”
● Adding another relationship usually means a lot of schema changes
● In an RDBMS, we model the graph beforehand based on the traversal we want
  ○ If the traversal changes, the data will have to change
  ○ Graph DBs: the relationship is not calculated but persisted

Graph Databases: Representatives
Ranked list: http://db-engines.com/en/ranking/graph+dbms

One Example: Facebook
Facebook statistics (2016)
  ○ 1.86 billion monthly active users
  ○ 4 million 'likes' per minute
  ○ 250 billion stored photos (350 million uploaded daily)
  ○ 300 PB of user data stored (2014)
2009: 10,000 servers
2010: 30,000 servers
2012: 180,000 servers (estimated)
source: http://expandedramblings.com/index.php/by-the-numbers-17-amazing-facebook-stats/
        https://www.brandwatch.com/blog/47-facebook-statistics-2016/

Facebook: Database Tech. Behind
Apache Hadoop http://hadoop.apache.org/
  ○ Hadoop Distributed File System (HDFS)
    ■ over 100 PB in a single HDFS cluster
  ○ an open source implementation of MapReduce:
    ■ enables efficient calculations on massive amounts of data
Apache Hive http://hive.apache.org/
  ○ SQL-like access to Hadoop-stored data
  ○ integration of MapReduce query evaluation
sources: http://goo.gl/SZ6jia
         http://royal.pingdom.com/2010/06/18/the-software-behind-facebook/

Facebook: Database Tech. Behind (2)
Apache HBase http://hbase.apache.org/
  ○ a Hadoop column-family database
  ○ used for e-mails, instant messaging and SMS
  ○ replacement for MySQL and Cassandra
Memcached http://memcached.org/
  ○ distributed key-value store
  ○ used as a cache between web servers and MySQL servers since the early days of FB
sources: http://goo.gl/SZ6jia
         http://royal.pingdom.com/2010/06/18/the-software-behind-facebook/

Facebook: Database Tech. Behind (3)
Apache Giraph http://giraph.apache.org/
  ○ graph processing system
  ○ Facebook users and their connections form one very large graph
  ○ used since 2013 for various analytic tasks (trillion edges)
RocksDB http://rocksdb.org/
  ○ high-performance key-value store
  ○ developed internally at FB, now open-source
sources: https://code.facebook.com/posts/509727595776839/scaling-apache-giraph-to-a-trillion-edges/
         http://goo.gl/XNtG6p

Questions?
Please, any questions? A good question is a gift...

References
● I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
● Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley Professional, 192 p.
● RNDr. Irena Holubová, Ph.D. MFF UK course NDBI040: Big Data Management and NoSQL Databases
● Why NoSQL. White paper. http://www.couchbase.com/
● http://db-engines.com/en/ranking
● http://nosql-database.org/
● Chang, F. et al. (2008). Bigtable: A Distributed Storage System for Structured Data. ACM TOCS, 26(2), pp. 1–26.