Graph Databases
Lecture 8 of NoSQL Databases (PA195)
David Novak & Vlastislav Dohnal
Faculty of Informatics, Masaryk University, Brno
Agenda
● Graph Databases: Mission, Data, Example
● A Bit of Graph Theory
○ Graph Representations
○ Algorithms: Improving Data Locality (efficient storage)
○ Graph Partitioning and Traversal Algorithms
● Graph Databases
○ Transactional databases
○ Non-transactional databases
● Neo4j
○ Basics, Native Java API, Cypher, Behind the Scene
Graph Databases: Example
source: Sadalage & Fowler: NoSQL Distilled, 2012
Graph Databases: Mission
● To store entities and relationships between them
○ Nodes are instances of objects
○ Nodes have properties, e.g., name
○ Edges connect nodes and have directional significance
○ Edges have types, e.g., likes, friend, …
● Nodes are organized by relationships
○ Allows finding interesting patterns
○ Example: Get all nodes that are “employee” of “Big
Company” and that “likes” “NoSQL Distilled”
Graph Databases: Representatives
Ranked list: http://db-engines.com/en/ranking/graph+dbms
A Bit of Theory
Basics and graph representations
Basic Terminology
● Data: a set of entities and their relationships
○ => we need to efficiently represent graphs
● Graph G = (V, E) is usually modelled as
○ set of nodes (vertices) V, |V| = n
○ set of (directed) edges E ⊆ V ⨉ V, each edge being a pair of nodes (v1, v2), |E| = m
● Basic operations:
○ finding the neighbors of a node,
○ checking if two nodes are connected by an edge,
○ updating the graph structure, …
○ => we need efficient graph operations
● Which data structure to use?
Data Structure: Adjacency Matrix
● Two-dimensional array A of
n ⨉ n Boolean values
○ Indexes of the array = node
identifiers of the graph
○ Boolean value Aij indicates
whether nodes i, j are connected
● Variants:
○ (Un)directed graphs
○ Weighted graphs…
Adjacency Matrix: Properties
● Pros:
○ Adding/removing edges
○ Checking if 2 nodes are
connected
● Cons:
○ Quadratic space: O(n²)
○ We usually have sparse graphs
○ Adding nodes is expensive
○ Retrieval of all the neighboring
nodes takes linear time: O(n)
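The trade-offs listed above can be illustrated with a minimal sketch. The class name AdjacencyMatrixSketch and its methods are hypothetical and chosen only for this example: edge operations are constant-time array accesses, while listing neighbors scans a whole row.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (hypothetical class): a directed graph as an n x n Boolean matrix.
final class AdjacencyMatrixSketch {
    private final boolean[][] a;   // a[i][j] == true  <=>  edge i -> j exists

    AdjacencyMatrixSketch(int n) {
        a = new boolean[n][n];     // quadratic space, even when the graph is sparse
    }

    void addEdge(int i, int j)    { a[i][j] = true; }    // constant time
    void removeEdge(int i, int j) { a[i][j] = false; }   // constant time
    boolean hasEdge(int i, int j) { return a[i][j]; }    // constant time

    // Listing the neighbors of node i requires scanning the whole row: O(n).
    List<Integer> neighbors(int i) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < a.length; j++) {
            if (a[i][j]) {
                result.add(j);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AdjacencyMatrixSketch g = new AdjacencyMatrixSketch(4);
        g.addEdge(0, 1);
        g.addEdge(0, 3);
        g.addEdge(2, 0);
        System.out.println(g.hasEdge(0, 3));   // true
        System.out.println(g.hasEdge(3, 0));   // false, the edge is directed
        System.out.println(g.neighbors(0));    // [1, 3]
    }
}
```

Adding a node is not shown: with this layout it would mean allocating a larger matrix and copying the old one, which is why the slide lists it as expensive.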
Data Structure: Adjacency List
● A set of lists, each enumerating
neighbors of one node
○ Vector of n pointers to adjacency lists
● Undirected graph:
○ An edge connects nodes i and j
○ => the adjacency list of i contains
node j and vice versa
● Often compressed
○ Exploiting regularities in graphs
Adjacency List: Properties
● Pros:
○ Getting the neighbors of a node
○ Cheap addition of nodes
○ More compact representation of
sparse graphs
● Cons:
○ Checking if there is an edge
between two nodes
■ Optimization: sorted lists => logarithmic lookup (binary search),
but insertion then also costs at least logarithmic time
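A minimal sketch of adjacency lists with the sorted-list optimization follows. The class and method names are hypothetical. Note one assumption of this particular sketch: it stores each list in an ArrayList, so finding the insertion point is logarithmic but the insertion itself still shifts elements; a balanced search tree per node would avoid the shift.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch (hypothetical class): adjacency lists that are kept sorted.
final class AdjacencyListSketch {
    private final List<List<Integer>> adj = new ArrayList<>();

    // Adding a node is cheap: it only appends an empty list.
    int addNode() {
        adj.add(new ArrayList<>());
        return adj.size() - 1;
    }

    // Directed edge i -> j; binary search finds the insertion point.
    void addEdge(int i, int j) {
        List<Integer> list = adj.get(i);
        int pos = Collections.binarySearch(list, j);
        if (pos < 0) {
            list.add(-pos - 1, j);   // keeps the list sorted; duplicates are ignored
        }
    }

    // Undirected edge: each endpoint appears in the other's list.
    void addUndirectedEdge(int i, int j) {
        addEdge(i, j);
        addEdge(j, i);
    }

    // Edge check by binary search: logarithmic in the degree of i.
    boolean hasEdge(int i, int j) {
        return Collections.binarySearch(adj.get(i), j) >= 0;
    }

    // Getting the neighbors is a direct list access.
    List<Integer> neighbors(int i) {
        return Collections.unmodifiableList(adj.get(i));
    }

    public static void main(String[] args) {
        AdjacencyListSketch g = new AdjacencyListSketch();
        int a = g.addNode(), b = g.addNode(), c = g.addNode();
        g.addUndirectedEdge(a, c);
        g.addUndirectedEdge(a, b);
        System.out.println(g.neighbors(a));    // [1, 2]
        System.out.println(g.hasEdge(b, c));   // false
    }
}
```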
Data Structure: Incidence Matrix
● Two-dimensional Boolean
matrix of n rows and m columns
○ Each row represents a node
■ All edges that are connected to the node
○ Each column represents an edge
■ Nodes that are connected by a certain edge
Incidence Matrix: Properties
● Pros:
○ Representation of hypergraphs
■ where one edge connects an
arbitrary number of nodes
● Cons:
○ Requires n ⨉ m bits (for most
graphs m ⋙ n)
○ Listing neighborhood is slow
Data Structure: Laplacian Matrix
● Two-dimensional array of
n ⨉ n integers
○ Similar structure as adjacency matrix
○ Diagonal of the Laplacian matrix
indicates the degree of the node
○ The rest of positions are set to -1 if
the two vertices are connected, 0
otherwise
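The definition above can be turned into a short sketch that derives the Laplacian matrix from an adjacency matrix. The names are hypothetical, and the sketch assumes an undirected graph without self-loops, given as a symmetric Boolean matrix.

```java
import java.util.Arrays;

// Illustrative sketch (hypothetical class): Laplacian matrix of an undirected graph.
final class LaplacianSketch {
    // Assumption: 'adjacency' is symmetric and its diagonal is ignored (no self-loops).
    static int[][] laplacian(boolean[][] adjacency) {
        int n = adjacency.length;
        int[][] l = new int[n][n];
        for (int i = 0; i < n; i++) {
            int degree = 0;
            for (int j = 0; j < n; j++) {
                if (i != j && adjacency[i][j]) {
                    l[i][j] = -1;     // connected vertices
                    degree++;
                }
            }
            l[i][i] = degree;         // diagonal holds the degree of node i
        }
        return l;
    }

    public static void main(String[] args) {
        // Path graph 0 - 1 - 2
        boolean[][] a = {
            {false, true,  false},
            {true,  false, true},
            {false, true,  false}
        };
        System.out.println(Arrays.deepToString(laplacian(a)));
        // [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
    }
}
```

Every row of the result sums to zero, which is a quick sanity check for such a matrix.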
Laplacian Matrix: Properties
All features of adjacency matrix
● Pros:
○ Analyzing the graph structure by
means of spectral analysis
■ Calculating the number of spanning trees
■ Approximation of the sparsest cut of the
graph
■ Calculate eigenvalues of the matrix
○ A good summary: Wikipedia
A Bit of Theory
Selected graph algorithms
Basic Graph Algorithms
● Access all nodes reachable from a given source:
○ Breadth-first Search (BFS)
○ Depth-first Search (DFS)
● Shortest path between two nodes
● Single-source shortest path problem
○ BFS (unweighted),
○ Dijkstra (nonnegative weights),
○ Bellman-Ford algorithm
● All-pairs shortest path problem
○ Floyd-Warshall algorithm
http://en.wikipedia.org/wiki/Shortest_path_problem
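As an illustration of the single-source shortest path problem with nonnegative weights, here is a minimal sketch of Dijkstra's algorithm using a priority queue with lazy deletion of stale entries. The class name and the edge encoding (each edge as an int pair {target, weight}) are assumptions made for this example; weights are assumed small enough that sums do not overflow an int.

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch (hypothetical class): Dijkstra's algorithm on adjacency lists.
final class DijkstraSketch {
    // adj.get(u) holds edges as {target, weight}; weights must be nonnegative.
    static int[] dijkstra(List<List<int[]>> adj, int source) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, Integer.MAX_VALUE);      // MAX_VALUE marks "unreachable"
        dist[source] = 0;
        // Queue entries are {node, distance}, ordered by distance.
        PriorityQueue<int[]> pq = new PriorityQueue<>((x, y) -> Integer.compare(x[1], y[1]));
        pq.add(new int[] {source, 0});
        while (!pq.isEmpty()) {
            int[] cur = pq.poll();
            int u = cur[0];
            if (cur[1] > dist[u]) {
                continue;                          // stale entry, a shorter path was found
            }
            for (int[] edge : adj.get(u)) {
                int v = edge[0];
                int w = edge[1];
                if (dist[u] + w < dist[v]) {
                    dist[v] = dist[u] + w;
                    pq.add(new int[] {v, dist[v]});
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        List<List<int[]>> adj = List.of(
            List.of(new int[] {1, 4}, new int[] {2, 1}),   // 0 -> 1 (4), 0 -> 2 (1)
            List.of(new int[] {3, 5}),                     // 1 -> 3 (5)
            List.of(new int[] {1, 2}),                     // 2 -> 1 (2)
            List.<int[]>of());                             // 3 has no outgoing edges
        System.out.println(Arrays.toString(dijkstra(adj, 0)));   // [0, 3, 1, 8]
    }
}
```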
Improving Data Locality
● Performance of the read/write operations
○ Depends also on physical organization of the data
○ Objective: Achieve the best “data locality”
● Spatial locality:
○ if a data item has been accessed, the nearby data items
are likely to be accessed in the following computations
■ e.g., during graph traversal
● Strategy:
○ in graph adjacency matrix representation, exchange rows
and columns to improve the disk cache hit ratio
○ Specific methods: BFSL, Bandwidth of a Matrix, ...
Data Locality: Example
This matrix has better data
locality, more efficient traversal
Breadth First Search Layout (BFSL)
● Input: vertices of a graph
● Output: a permutation of the vertices
○ with better cache performance for graph traversals
● BFSL algorithm:
1. Select a node (at random, the origin of the traversal)
2. Traverse the graph using the BFS alg.
■ generating a list of vertex identifiers in the order they are visited
3. Take the generated list as the new vertices permutation
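The three BFSL steps can be sketched as follows. The names are hypothetical. Two choices are assumptions of this sketch rather than part of the method as stated: the origin is passed in as a parameter instead of being drawn at random, and vertices unreachable from the origin are appended in their original order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch (hypothetical class): Breadth First Search Layout.
final class BfsLayoutSketch {
    // Returns the vertex identifiers in BFS visiting order; position k of the
    // result names the old vertex that becomes vertex k in the new layout.
    static List<Integer> bfsLayout(List<List<Integer>> adj, int origin) {
        boolean[] visited = new boolean[adj.size()];
        List<Integer> order = new ArrayList<>();
        Deque<Integer> queue = new ArrayDeque<>();   // FIFO queue of frontier vertices
        visited[origin] = true;
        queue.add(origin);
        while (!queue.isEmpty()) {
            int v = queue.remove();
            order.add(v);
            for (int w : adj.get(v)) {
                if (!visited[w]) {
                    visited[w] = true;
                    queue.add(w);
                }
            }
        }
        // Assumption: vertices not reachable from the origin keep their relative order.
        for (int v = 0; v < adj.size(); v++) {
            if (!visited[v]) {
                order.add(v);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Undirected edges: 0-3, 0-4, 3-1, 4-2
        List<List<Integer>> adj = List.of(
            List.of(3, 4), List.of(3), List.of(4), List.of(0, 1), List.of(0, 2));
        System.out.println(bfsLayout(adj, 0));   // [0, 3, 4, 1, 2]
    }
}
```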
Breadth First Search Layout (2)
● Let us recall:
Breadth First Search (BFS)
○ FIFO queue of frontier vertices
● Pros: optimal locality for traversal from the root
● Cons: starting traversal from other nodes
○ The further, the worse
Matrix Bandwidth: Motivation
● Graph represented by adjacency matrix
Matrix Bandwidth: Formalization
● The minimum bandwidth problem
○ Bandwidth of a row in a matrix = the maximum distance
between nonzero elements, where one is left of the
diagonal and the other is right of the diagonal
○ Bandwidth of a matrix = maximum bandwidth of its rows
● Low bandwidth matrices are more cache friendly
○ Non zero elements (edges) clustered about the diagonal
● Bandwidth minimization problem: NP hard
○ For large matrices the solutions are only approximated
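The sketch below computes the bandwidth of an adjacency matrix following one reading of the definition above: for each row, the distance between the leftmost nonzero left of the diagonal and the rightmost nonzero right of it. Treating the diagonal position as the boundary when one side has no nonzero element is an assumption of this sketch, as are all names. It only measures a given vertex ordering; it does not attempt the NP-hard minimization.

```java
// Illustrative sketch (hypothetical class): bandwidth of an adjacency matrix.
final class BandwidthSketch {
    // Builds the symmetric adjacency matrix of an undirected graph with n vertices.
    static boolean[][] fromEdges(int n, int[][] edges) {
        boolean[][] a = new boolean[n][n];
        for (int[] e : edges) {
            a[e[0]][e[1]] = true;
            a[e[1]][e[0]] = true;
        }
        return a;
    }

    static int bandwidth(boolean[][] a) {
        int max = 0;
        for (int i = 0; i < a.length; i++) {
            int left = i;    // assumption: the diagonal bounds a side without nonzeros
            int right = i;
            for (int j = 0; j < i; j++) {
                if (a[i][j]) { left = j; break; }      // leftmost nonzero in row i
            }
            for (int j = a.length - 1; j > i; j--) {
                if (a[i][j]) { right = j; break; }     // rightmost nonzero in row i
            }
            max = Math.max(max, right - left);         // bandwidth of row i
        }
        return max;
    }

    public static void main(String[] args) {
        int[][] ordered   = {{0, 1}, {1, 2}, {2, 3}};  // path labelled along its order
        int[][] scattered = {{0, 3}, {3, 1}, {1, 2}};  // same path, scattered labels
        System.out.println(bandwidth(fromEdges(4, ordered)));     // 2
        System.out.println(bandwidth(fromEdges(4, scattered)));   // 3
    }
}
```

Both inputs describe a path on four vertices; only the vertex numbering differs, which is exactly what a bandwidth-reducing permutation changes.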
A Bit of Theory
Graph partitioning
Graph Partitioning
● Some graphs are too large to be fully loaded into
the main memory of a single computer
○ Usage of secondary storage degrades the performance
○ Scalable solution: distribute the graph on multiple nodes
● We need to partition the graph reasonably
○ Usually for a particular (set of) operation(s)
■ The shortest path, finding frequent patterns, BFS, spanning tree search
Example: 1-Dimensional Partitioning
● Aim: Partition the graph to solve BFS efficiently
○ Distributed into shared-nothing parallel system
○ Partitioning of the adjacency matrix
● 1D partitioning of Adjacency Matrix:
○ Matrix rows are randomly assigned to the P nodes
(processors) in the system
○ Each vertex (and its edges) are owned by one processor
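A small sketch can make the cost of 1D partitioning visible. It assigns each vertex (matrix row) to one of P processors at random and counts the edges whose endpoints have different owners, since following such an edge during BFS requires a message between processors. All names are hypothetical; the seed parameter exists only to make runs repeatable.

```java
import java.util.Random;

// Illustrative sketch (hypothetical class): random 1D partitioning of a graph.
final class OneDimPartitionSketch {
    // owner[v] = index of the processor that owns vertex v (and its row of edges).
    static int[] randomOwners(int n, int p, long seed) {
        Random rnd = new Random(seed);
        int[] owner = new int[n];
        for (int v = 0; v < n; v++) {
            owner[v] = rnd.nextInt(p);
        }
        return owner;
    }

    // Counts edges {u, v} whose endpoints live on different processors.
    static int crossPartitionEdges(int[][] edges, int[] owner) {
        int count = 0;
        for (int[] e : edges) {
            if (owner[e[0]] != owner[e[1]]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[][] cycle = {{0, 1}, {1, 2}, {2, 3}, {3, 0}};
        // A hand-picked assignment: vertices 0,1 on processor 0 and 2,3 on processor 1.
        System.out.println(crossPartitionEdges(cycle, new int[] {0, 0, 1, 1}));   // 2
        // An alternating assignment cuts every edge of the cycle.
        System.out.println(crossPartitionEdges(cycle, new int[] {0, 1, 0, 1}));   // 4
        // A random assignment over 2 processors; the count depends on the seed.
        System.out.println(crossPartitionEdges(cycle, randomOwners(4, 2, 42L)));
    }
}
```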
Starting BFS traversal at node 1:
1. (at black) 1 -> 10, 11
visit green server
2. (at green) 10, 11 ->
a. 1, back to black
b. 6, visit red
c. 7,9, visit blue
d. 10, 11, myself
3. (at red) 6 -> 7
visit blue
3. (at blue) 7,9 ->
a. 3, back to black …
b. 6, back to red
c. 8 -> 2,3, back to black
d. 10,11,12, back to green
Traversing Graph
● Traversing with 1D partitioning (e.g., BFS)
1. Each processor keeps information about frontier vertices
2. ...and also list of neighboring vertices in other processors
3. Messages are sent to other processors…
● 1D partitioning leads to high messaging
○ => 2D-partitioning of adjacency matrix
○ … lower messaging but still very demanding
Efficient sharding of a graph is very difficult
● and thus graph DBs are often centralized
Graph Databases
Types of Graphs
● Single-relational graphs
○ Edges are homogeneous in meaning
■ e.g., all edges represent friendship
● Multi-relational (property) graphs
○ Edges are typed or labeled
■ e.g., friendship, business, communication
○ Vertices and edges maintain a set of key/value pairs
■ Representation of non-graphical data (properties)
■ e.g., name of a vertex, the weight of an edge
Graph Databases
● A graph database = a set of graphs
● Types of graph databases:
○ Transactional = a large set of small graphs
■ e.g., chemical compounds, biological pathways, …
■ Searching for graphs that match the query
○ Non-transactional = a small number of very large graphs
■ or one huge (not necessarily connected) graph
■ e.g., Web graph, social networks, …
Transactional DBs: Queries
● Types of Queries
○ Subgraph queries
■ Searches for a specific pattern in the graph database
■ Query = a small graph
● or a graph, where some parts are uncertain, e.g., vertices with wildcard labels
■ More general type: allow sub-graph isomorphism
○ Super-graph queries
■ Search for graphs whose whole structure is contained in the query graph
Transactional DBs: Queries (2)
○ Similarity (approximate matching) queries
■ Finds graphs which are similar to a given query graph
● but not necessarily isomorphic
■ Key question: how to measure the similarity
Indexing & Query Evaluation
● Extract certain characteristics from each graph
○ And index these characteristics for each G1,..., Gn
● Query evaluation in transactional graph DB
1. Extraction of the characteristics from query graph q
2. Filter the database (index) and identify a candidate set
■ Subset of the G1,..., Gn graphs that should contain the answer
3. Refinement - check all candidate graphs
Subgraph Query Processing
1. Mining-based Graph Indexing Techniques
○ Idea: if some features of query graph q do not exist in data
graph G, then G cannot contain q as its subgraph
○ Apply graph-mining methods to extract some features
(sub-structures) from the graph database members
■ e.g., frequent sub-trees, frequent sub-graphs
○ An inverted index is created for each feature
2. Non Mining-Based Graph Indexing Techniques
○ Indexing of the whole constructs of the graph database
■ Instead of indexing only some selected features
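The filtering idea behind mining-based indexing can be sketched with an inverted index from features to graph identifiers. This is a simplified illustration, not the implementation of any particular method: features are plain strings with made-up names, every feature of every graph is assumed to be indexed, and the refinement step (the actual subgraph isomorphism test on each candidate) is left out.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch (hypothetical class): feature-based filtering of data graphs.
final class FeatureIndexSketch {
    // Inverted index: feature -> identifiers of the graphs that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Set<Integer> allGraphs = new TreeSet<>();

    void addGraph(int graphId, Collection<String> features) {
        allGraphs.add(graphId);
        for (String f : features) {
            postings.computeIfAbsent(f, k -> new TreeSet<>()).add(graphId);
        }
    }

    // Filter step: a graph lacking any feature of the query cannot contain the query.
    Set<Integer> candidates(Collection<String> queryFeatures) {
        Set<Integer> result = new TreeSet<>(allGraphs);
        for (String f : queryFeatures) {
            result.retainAll(postings.getOrDefault(f, Collections.emptySet()));
        }
        return result;   // candidates still need the refinement (verification) step
    }

    public static void main(String[] args) {
        FeatureIndexSketch index = new FeatureIndexSketch();
        index.addGraph(1, List.of("C-C", "C-O", "ring"));   // feature names are made up
        index.addGraph(2, List.of("C-C", "C-N"));
        index.addGraph(3, List.of("C-C", "C-O"));
        System.out.println(index.candidates(List.of("C-C", "C-O")));   // [1, 3]
        System.out.println(index.candidates(List.of("C-N", "C-O")));   // []
    }
}
```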
Mining-based Technique
● Example method: GIndex [2004]
○ Indexing “frequent discriminative graphs”
○ Build inverted index for selected discriminative subgraphs
Non Mining-based Techniques
● Example: GString (2007)
○ Model the graphs in the context of organic chemistry
using basic structures
■ Line = series of vertices connected end to end
■ Cycle = series of vertices that form a close loop
■ Star = core vertex directly connects to several vertices
Non Mining-based Techniques
● GDIndex (2007)
○ all connected and induced subgraphs of a given graph are
enumerated (at most 2^n)
○ due to isomorphisms, there are far fewer distinct subgraphs
■ if all labels are identical, a complete graph of size n is decomposed into
just n+1 subgraphs.
Graph Databases
Non-transactional Databases
Non-transactional Databases
● A few very large graphs
○ e.g., Web graph, social networks, …
● Queries:
○ Nodes/edges with properties
○ Neighboring nodes/edges
○ Paths (all, shortest, etc.)
● Our example: Neo4j
Basic Characteristics
● Different types of relationships between nodes
○ To represent relationships between domain entities
○ Or to model any kind of secondary relationships
■ Category, path, time-trees, spatial relationships, …
● No limit to the number and kind of relationships
● Relationships have properties
○ E.g., since when did they become friends?
Relationship Properties: Example
source: Sadalage & Fowler: NoSQL Distilled, 2012
Graph DB vs. RDBMS
● RDBMS designed for a single type of relationship
○ “Who is my manager”
● Adding another relationship usually means a lot of
schema changes
● In RDBMS we model the graph beforehand based
on the traversal we want
○ If the traversal changes, the data will have to change
○ Graph DBs: the relationship is not calculated but persisted
Neo4j: Basics & Concepts
Neo4j: Basic Info
● Open source graph database
○ The most popular
● Initial release: 2007
● Written in: Java
● OS: cross-platform
● Stores data as nodes connected
by directed, typed relationships
○ With properties on both
nodes and relationships
Neo4j: Basic Features
● reliable – with full ACID transactions
● durable and fast – disk-based, native storage engine
● scalable – up to several billion nodes/relationships/properties
● highly-available – when distributed (replicated)
● expressive – powerful, human readable graph query language
● fast – powerful traversal framework
● embeddable – in a Java program
● accessible – simple REST interface & Java API
Data Model: Nodes
http://db-engines.com/en/system/Neo4j
● Fundamental unit: node
● Nodes have properties
○ Key-value pairs
○ null is not a valid property value
■ nulls can be modelled by the absence of a key
● Nodes have labels
○ labels typically express "type of node"
Data Model: Properties
Type | Description
boolean | true/false
byte | 8-bit integer
short | 16-bit integer
int | 32-bit integer
long | 64-bit integer
float | 32-bit IEEE 754 floating-point number
double | 64-bit IEEE 754 floating-point number
char | 16-bit unsigned integer representing a Unicode character
String | sequence of Unicode characters
DateTime | temporal types…
Data Model: Relationships
● Directed relationships (edges)
○ Incoming and outgoing edge
■ Equally efficient traversal in both directions
■ Direction can be ignored
if not needed by the application
○ Always a start
and an end node
■ Can be recursive
What | How
get who a person follows | outgoing follows relationships, depth one
get the followers of a person | incoming follows relationships, depth one
get who a person blocks | outgoing blocks relationships, depth one
What | How
get the full path of a file | incoming file relationships
get all paths for a file | incoming file and symbolic link relationships
get all files in a directory | outgoing file and symbolic link relationships, depth one
get all files in a directory, excluding symbolic links | outgoing file relationships, depth one
get all files in a directory, recursively | outgoing file and symbolic link relationships
Access to Neo4j
● Embedded database in Java system
● Language-specific connectors
○ Libraries to connect to a running Neo4j server
● Cypher query language
○ Standard language to query graph data
● HTTP REST API
● Gremlin graph traversal language (plugin)
● etc.
Neo4j: Native Java API & Graph Traversal
Native Java Interface: Example
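// Sketch for the embedded Java API of Neo4j 2.x/3.x; graphDb is a GraphDatabaseService
// and FRIEND is a user-defined RelationshipType (both assumed to exist).
try (Transaction tx = graphDb.beginTx()) {   // all writes must run inside a transaction
    Node irena = graphDb.createNode();
    irena.setProperty("name", "Irena");
    Node jirka = graphDb.createNode();
    jirka.setProperty("name", "Jirka");
    // one relationship per direction models an undirected FRIEND edge
    Relationship i2j = irena.createRelationshipTo(jirka, FRIEND);
    Relationship j2i = jirka.createRelationshipTo(irena, FRIEND);
    i2j.setProperty("quality", "a good one");
    j2i.setProperty("since", 2003);
    tx.success();                            // mark for commit when the block closes
}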
● Undirected edge:
○ Relationship between the nodes in both directions
○ INCOMING and OUTGOING relationships from a node
Data Model: Path & Traversal
● Path = specific nodes + connecting relationships
○ Path can be a result of a query or a traversal
● Traversing a graph = visiting its nodes, following
relationships according to some rules
○ Typically, a subgraph is visited
○ Neo4j: Traversal framework in Java API, Cypher, Gremlin
Traversal Framework
● A traversal is influenced by
○ Starting node(s) where the traversal begins
○ Expanders – define what to traverse
■ i.e., relationship direction and type
○ Order – depth-first / breadth-first
○ Uniqueness – visit nodes (relationships, paths) only once
○ Evaluator – what to return
and whether to stop or continue beyond current position
Traversal = TraversalDescription + starting node(s)
Traversal Framework – Java API
● org.neo4j...TraversalDescription
○ The main interface for defining traversals
■ Can specify branch ordering breadthFirst() / depthFirst()
● .relationships()
○ Specify the relationship types to traverse
■ e.g., traverse only edge types: FRIEND, RELATIVE
■ Empty (default) = traverse all relationships
○ Can also specify direction
■ Direction.BOTH
■ Direction.INCOMING
■ Direction.OUTGOING
Traversal Framework – Java API (2)
● org.neo4j...Evaluator
○ Used for deciding at each node: should the traversal
continue, and should the node be included in the result
■ INCLUDE_AND_CONTINUE: Include this node in the result and
continue the traversal
■ INCLUDE_AND_PRUNE: Include this node, do not continue traversal
■ EXCLUDE_AND_CONTINUE: Exclude this node, but continue traversal
■ EXCLUDE_AND_PRUNE: Exclude this node and do not continue
○ Pre-defined evaluators:
■ Evaluators.toDepth(int depth) /
Evaluators.fromDepth(int depth),
■ Evaluators.excludeStartPosition()
■ …
Traversal Framework – Java API (3)
● org.neo4j...Uniqueness
○ Indicates under what circumstances a traversal may
revisit the same position in the graph
● Traverser
○ Starts actual traversal given a TraversalDescription and
starting node(s)
○ Returns an iterator over “steps” in the traversal
■ Steps can be: Path (default), Node, Relationship
○ The graph is actually traversed “lazily” (on request)
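To show how an evaluator steers a traversal, the following library-free sketch mimics the four evaluation outcomes on a plain adjacency map. It is not Neo4j's implementation and does not use the Neo4j API: the class, the enum and the traverse method are all hypothetical, the order is fixed to breadth-first, and each node is visited at most once. Unlike the lazy Traverser, it computes the whole result eagerly.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

// Illustrative sketch (hypothetical class, not the Neo4j API): evaluator-driven traversal.
final class EvaluatorSketch {
    enum Evaluation {
        INCLUDE_AND_CONTINUE(true, true),
        INCLUDE_AND_PRUNE(true, false),
        EXCLUDE_AND_CONTINUE(false, true),
        EXCLUDE_AND_PRUNE(false, false);

        final boolean include;   // should the node be part of the result?
        final boolean proceed;   // should the traversal continue beyond the node?

        Evaluation(boolean include, boolean proceed) {
            this.include = include;
            this.proceed = proceed;
        }
    }

    // Breadth-first traversal; the evaluator receives (node, depth) at each position.
    static List<Integer> traverse(Map<Integer, List<Integer>> adj, int start,
                                  BiFunction<Integer, Integer, Evaluation> evaluator) {
        List<Integer> result = new ArrayList<>();
        Set<Integer> visited = new HashSet<>(List.of(start));
        Deque<int[]> queue = new ArrayDeque<>();          // entries are {node, depth}
        queue.add(new int[] {start, 0});
        while (!queue.isEmpty()) {
            int[] cur = queue.remove();
            Evaluation e = evaluator.apply(cur[0], cur[1]);
            if (e.include) {
                result.add(cur[0]);
            }
            if (!e.proceed) {
                continue;                                 // prune: do not expand this node
            }
            for (int next : adj.getOrDefault(cur[0], List.of())) {
                if (visited.add(next)) {
                    queue.add(new int[] {next, cur[1] + 1});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Directed edges: 0->1, 0->4, 1->2, 2->3
        Map<Integer, List<Integer>> adj = Map.of(0, List.of(1, 4), 1, List.of(2), 2, List.of(3));
        // Exclude the start position and stop expanding at depth 2.
        List<Integer> result = traverse(adj, 0, (node, depth) ->
            depth == 0 ? Evaluation.EXCLUDE_AND_CONTINUE
            : depth < 2 ? Evaluation.INCLUDE_AND_CONTINUE
            : Evaluation.INCLUDE_AND_PRUNE);
        System.out.println(result);   // [1, 4, 2]
    }
}
```

The evaluator in main combines the roles that Evaluators.excludeStartPosition() and Evaluators.toDepth(2) play in the real framework; node 3 is never reached because node 2 is pruned.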
Example of Traversal
// Depth-first traversal over KNOWS relationships (either direction), at most 3 hops deep
TraversalDescription desc =
    db.traversalDescription()
      .depthFirst()
      .relationships( Rels.KNOWS, Direction.BOTH )
      .evaluator( Evaluators.toDepth(3) );
// node is 'Ed' (Node[2])
String output = "";
for (Node n : desc.traverse(node).nodes())
{
    output += n.getProperty("name") + ", ";
}
http://neo4j.com/docs/stable/tutorial-traversal-java-api.html
Output: Ed, Lars, Lisa, Dirk, Peter
Access to Nodes
● How to get to the starting node(s) before traversal
1. Using internal identifiers (generated IDs)
■ not recommended: Neo4j generates these IDs for its internal records and may reuse them (e.g., after deletions)
2. Using properties of nodes
■ one of the properties is typically “ID” (user-specified ID)
■ recommended, properties can be indexed
● automatic indexes
3. Using “labels”
■ group nodes into “subsets” (named graph)
■ a node can have more than one label
● belong to more subsets
Neo4j: Cypher Language
Cypher Language
● Neo4j graph query language
○ For querying and updating
● Declarative – we say what we want
○ Not how to get it
○ Not necessary to express traversals
● Human-readable
● Inspired by SQL and SPARQL
● Still evolving: the syntax changes often
http://neo4j.com/docs/stable/cypher-query-lang.html
Cypher: Clauses
● MATCH: The graph pattern to match
● WHERE: Filtering criteria
● RETURN: What to return
● WITH: Divides a query into multiple parts
○ can define starting points in the graph
● CREATE: Creates nodes and relationships.
● DELETE: Remove nodes, relationships, properties
● SET: Set property values
Cypher: Creating Nodes (Examples)
CREATE (n);
(create a node, assign to var n)
Created 1 node, returned 0 rows
CREATE (a: Person {name : 'David'})
RETURN a;
(create a node with label ‘Person’ and
‘name’ property ‘David’)
Created 1 node, set 1 property, returned
1 row
Cypher: Creating Relationships
MATCH (a {name: 'John'}), (b {name: 'Jack'})
CREATE (a)-[r:Friend]->(b)
RETURN r ;
(create a relation Friend between John and Jack)
Created 1 relationship, returned 1 row
MATCH (a {name: 'John'}), (b {name: 'Jack'})
CREATE p = (a)-[:Friend {name: a.name + '->' + b.name }]->(b)
RETURN p
(set property ‘name’ of the relationship)
Created 0 nodes, set 1 property, returned 1 row
Cypher: Queries
MATCH (user: Person {name: 'Andres'})-[:Friend]->(follower)
RETURN user.name, follower.name
(find all ‘Friends’ of 'Andres')
MATCH (p: Person)
WHERE p.age >= 18 AND p.age < 30
RETURN p.name
(return names of all adult people under 30)
Cypher: Queries (2)
MATCH (andres: Person {name: 'Andres'})-[*1..3]-(node)
RETURN andres, node ;
(find all ‘nodes’ within three hops from ‘Andres’)
MATCH p=shortestPath(
(andres:Person {name: 'Andres'})-[*]-(david {name:'David'})
)
RETURN p ;
(find the shortest connection between ‘Andres’ and ‘David’)
Neo4j: Behind the Scene
Neo4j Internals: Indexes
● Since Neo4j v. 2, indexes are used automatically
○ Can be specified explicitly (which index to use)
MATCH (n:Person)
USING INDEX n:Person(surname)
WHERE n.surname = 'Taylor'
RETURN n
CREATE INDEX ON :Person(name);
(Create index on property name of nodes with label Person)
Indexes added: 1
Neo4j Internals: Transactions
● Transactions in Neo4j
○ Support for ACID properties
○ All write operations must be performed in a transaction
○ Default transaction isolation level: Read committed
■ Operation can see the last committed value
■ Reads do not block or take any locks
■ If the same row is retrieved twice within a transaction, the values in the
row CAN differ
○ Higher level of isolation can be achieved
■ By explicitly acquiring read locks
Neo4j Internals: High Availability
● Master-slave replication
○ Several Neo4j slave databases can be configured to be
exact replicas of a single Neo4j master database
● Speed-up of read operations
○ Makes it possible to handle more read load than a single node can
● Fault-tolerance
○ In case a node becomes unavailable
● Transactions are still atomic, consistent and
durable, but are propagated to the slaves only eventually
Graph Databases: When (not) to Use
Graph DBs: Suitable Use Cases
● Connected Data
○ Social networks
○ Any link-rich domain is well suited for graph databases
● Routing, Dispatch, and Location-Based Services
○ Node = location or address that has a delivery
○ Graph = nodes where a delivery has to be made
○ Relationships = distance
● Recommendation Engines
○ “your friends also bought this product”
○ “when buying this item, these others are usually bought”
Graph DBs: Modeling Issues
● Node modeling:
○ tradeoff between placing all attributes and properties in
a single node
○ and separating each attribute into an individual node
● Relationship modeling:
○ generic (“unlabeled”) relationships for everything,
■ e.g., person connected_to person/address/product
○ versus labels that encode the semantic meaning
■ e.g., person peters_work_colleague person,
person peters_home_address address
Graph DBs: When Not to Use
● If we want to update all or a subset of entities
○ Changing a property on many nodes is not straightforward
■ e.g., analytics solution where all entities may need to be updated with a
changed property
● If we need to store BLOBs (large binary objects)
○ Keeping them in byte-array properties is not a good fit
● Some graph databases may be unable to handle
lots of data
○ Distribution of a graph is difficult
Questions?
References
● I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a
NoSQL databáze. Praha: Grada Publishing, 2015. 288 p.
● RNDr. Irena Holubova, Ph.D. MMF UK course NDBI040:
Big Data Management and NoSQL Databases
● Sherif Sakr - Eric Pardede: Graph Data Management:
Techniques and Applications
● Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A
Brief Guide to the Emerging World of Polyglot
Persistence. Addison-Wesley Professional, 192 p.
● http://neo4j.com/docs/stable/