Big Data: A general approach to process external multimedia datasets
David Mera
Laboratory of Data Intensive Systems and Applications (DISA)
Masaryk University, Brno, Czech Republic
7/10/2014

Introduction: Big Data
- Huge new datasets are constantly being created: "90% of the data in the world today has been created in the last two years" (2013) [1]
- Organizations have potential access to a wealth of information, but they do not know how to get value out of it [1]
[1] Source: SINTEF, "Big Data - for better or worse"

Introduction: The Big Data phenomenon
- Volume refers to the vast amount of data generated every second
- Variety refers to the different forms of data
- Velocity refers to the speed at which new data are generated
- Veracity refers to the reliability of the data
[Diagram: the Vs of Big Data - Volume, Variety, Velocity, Veracity, and the Value at their intersection]

Introduction: Multimedia Big Data
- 100 hours of video are uploaded to YouTube every minute
- 350 million photos are uploaded to Facebook every day (2012)
- Each day, 60 million photos are uploaded to Instagram
- ...
- Non-structured data: 70%; share of Internet traffic: 60% [2]
[2] Source: IBM 2013 Global Technology Outlook report

Introduction: Getting information from large volumes of multimedia data
- Content-based retrieval techniques address the findability problem
- They require the extraction of suitable features, which is a time-consuming task
- Feature extraction approaches:
  - Sequential approach: not affordable at this scale
  - Classic distributed computing (cluster computing, grid computing): demands strong distributed-computing skills, and 'ad-hoc' approaches suffer from low reusability and a lack of failure handling
  - Big Data approaches to distributed computing:
    - Batch data: the Map-Reduce paradigm (Apache Hadoop)
    - Stream data: S4, Apache Storm

Big Data processing frameworks: Apache Hadoop
Apache Hadoop characteristics (Map-Reduce paradigm):
- Batch data processing system
- Commodity computing
- No specialized distributed-computing skills are required; the framework takes care of:
  - Machine communication
  - Task scheduling
  - Scalability
  - Handling failures
  - Automatic partitioning of the input data

Big Data processing frameworks: the Hadoop Map-Reduce paradigm
[Diagram: the input data is divided into splits (Split 0 ... Split n); a Map task turns each split's input pairs into intermediate tuples, (key, value) → (I-key, I-value); the intermediate pairs are grouped by key and passed to Reduce tasks, which emit the output pairs (O-key, O-value)]
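The paradigm is easiest to see in code. Below is a minimal sketch of the canonical word-count job written against the Hadoop Map-Reduce API (an illustration only, not part of the prototype): the map phase turns each input split into intermediate (word, 1) tuples, and the reduce phase aggregates them by key into the output pairs, mirroring the diagram above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    /** Map: one line of an input split -> (word, 1) intermediate pairs. */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Reduce: (word, [1, 1, ...]) -> (word, count) output pairs. */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output pairs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```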
Big Data processing frameworks: Apache Hadoop - weaknesses and limitations
- Optimized for large files
- Batch data processing only, with long response times
- Hard configuration process (iterative optimization)
- Lack of real-time processing
- The parallelization level cannot be altered at run time

Big Data processing frameworks: Apache Storm
Apache Storm characteristics:
- Real-time processing system
- Commodity computing
- No specialized distributed-computing skills are required; the framework provides a set of generic tools to build distributed graphs of computation and takes care of:
  - Machine communication
  - Task scheduling
  - Scalability
  - Handling failures
- The parallelization level can be adapted at run time

Big Data processing frameworks: Apache Storm topologies
Storm runs topologies:
- Streams: unbounded sequences of tuples
- Spouts: sources of streams
- Bolts: consume input streams, apply some processing, and emit new streams
[Diagram: spouts feed a stream of data into a pipeline of bolts A-E; each logical bolt runs as several parallel instances (e.g., Bolt A1 ... Bolt An) and emits a transformed stream of data' to the next bolt]

Big Data processing frameworks: Apache Storm - weaknesses and limitations
- Lack of support for processing batch data
- Low-level framework
- Pull mode
- Scenario-specific configurations

Prototype: General overview
Prototype goals:
- Efficient processing of huge external datasets
- Heterogeneous data management
- Processing of arbitrary functions
- Infrastructure flexibility
- Handling failures
[Architecture diagram: an external data source feeds a stream of data into a Storm topology running on a server cluster; the user submits a job definition and Jar files through the Job Interface to the Storm Topology Manager, whose Parser and Topology creator components build the topology; the job output is stored in a distributed file system]

Prototype: Job definition
A job is defined in an XML document [XML excerpt omitted] that specifies:
- The topology name
- The spout that brings the stream of data into the topology
- One or more processing chains over the stream, each composed of stream-processing operations and ending in a Save Bolt
- For each operation, the name of a class (inside the Jar file) exposing a method with the signature public byte[] methodName(byte[])

Prototype: Job definition - Spouts
- Socket
- Apache Kafka (a distributed messaging system)
- The spout emits the incoming data as raw byte[] tuples

Prototype: Job definition - Bolts
- SaveBolts: store data into HDFS; tuples are buffered and then written as Hadoop SequenceFiles
- WorkerBolts: process tuples by invoking the user-supplied public byte[] methodName(byte[]) operation

Prototype: Topology deployment
[Diagram: the Storm Topology Manager parses the job definition and Jar files, creates the topology, and deploys it on the cluster; the topology monitor is activated, tuples are consumed from Kafka, and results are written to the Hadoop File System]

Prototype: Monitoring system (internal)
The internal monitoring system tunes the 'max pending tuples' parameter:
- The topology starts with a low parameter value.
- Every 'X' seconds the monitor checks the number of 'acked' tuples.
- In the first iteration, the monitor increases the parameter value.
- In the next iterations:
  - current 'acked' tuples > previous 'acked' tuples → increase the parameter value
  - current 'acked' tuples < previous 'acked' tuples → decrease the parameter value
  - current 'acked' tuples == previous 'acked' tuples → do nothing, unless this scenario has repeated 'X' times, in which case increase the parameter value

Prototype: Monitoring system (external)
- The administrator can add rules, where a rule = (metric, operator, value, action).
- The monitor gets the topology metrics every 'X' seconds; each bolt produces a set of metrics.
- The monitor evaluates each rule against the bolt metrics and applies the rule's action to every bolt that has triggered it.
Illustrative code sketches of the user operations, the SaveBolts, and both monitors follow.
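To make the extension point concrete, here is what a user-supplied operation shipped in the job's Jar file could look like. Only the public byte[] methodName(byte[]) shape is prescribed by the job definition; the class name, method name, and the toy "processing" below are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical user operation packaged in the job's Jar file and referenced
 * by class name in the job definition. The prototype only requires the
 * 'public byte[] methodName(byte[])' shape; everything else is illustrative.
 */
public class ImageFeatureExtractor {

    /** Receives one tuple payload and returns the transformed payload. */
    public byte[] extractFeatures(byte[] input) {
        // A real job would decode the image and compute a feature vector here;
        // this sketch merely tags the payload so the data flow is observable.
        byte[] tag = "features:".getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[tag.length + input.length];
        System.arraycopy(tag, 0, out, 0, tag.length);
        System.arraycopy(input, 0, out, tag.length, input.length);
        return out;
    }
}
```

A WorkerBolt can load such a class from the submitted Jar file and invoke the method reflectively for every tuple it receives; that wiring is omitted here.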
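The SaveBolts' buffer-then-flush behaviour might look like the following sketch against the Hadoop SequenceFile API; the flush threshold, the output file naming, and the (LongWritable, BytesWritable) record layout are assumptions of the sketch, not the prototype's actual choices.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

/**
 * Buffer-and-flush persistence into HDFS, as done by the SaveBolts: tuple
 * payloads are collected in memory and periodically written out as a Hadoop
 * SequenceFile. All concrete values here are assumptions of this sketch.
 */
public class SequenceFileBuffer {
    private final Configuration conf = new Configuration(); // picks up HDFS settings
    private final List<byte[]> buffer = new ArrayList<>();
    private final int flushThreshold;
    private long counter = 0;

    public SequenceFileBuffer(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Called once per received tuple payload. */
    public synchronized void add(byte[] payload) throws IOException {
        buffer.add(payload);
        if (buffer.size() >= flushThreshold) {
            flush();
        }
    }

    /** Writes the buffered payloads as (sequence-number, bytes) records. */
    public synchronized void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        // Hypothetical output location; one file per flush.
        Path file = new Path("/prototype/output/part-" + System.currentTimeMillis() + ".seq");
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(file),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            for (byte[] payload : buffer) {
                writer.append(new LongWritable(counter++), new BytesWritable(payload));
            }
        } finally {
            writer.close();
        }
        buffer.clear();
    }
}
```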
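The internal monitor's adaptive loop can be written down almost verbatim. In this sketch, MetricsClient, the step size of one, and the starting value are hypothetical stand-ins for the prototype's internals; in Storm, the tuned parameter corresponds to the topology.max.spout.pending setting.

```java
/** Hypothetical accessor for topology metrics and configuration. */
interface MetricsClient {
    long ackedTuplesSinceLastCall();      // 'acked' tuples in the last period
    void setMaxSpoutPending(long value);  // adjust the 'max pending tuples' value
}

/**
 * Adaptive tuning of the 'max pending tuples' parameter, as described above:
 * start low, then increase while throughput grows, back off when it drops,
 * and probe upwards after a run of unchanged periods.
 */
public class InternalMonitor implements Runnable {
    private final MetricsClient metrics;
    private final long periodMillis;  // check every 'X' seconds
    private final int holdLimit;      // unchanged periods tolerated before probing upwards

    private long maxPending = 10;     // the topology starts with a low value (assumed)
    private long previousAcked = -1;  // -1 marks the first iteration
    private int unchangedPeriods = 0;

    public InternalMonitor(MetricsClient metrics, long periodMillis, int holdLimit) {
        this.metrics = metrics;
        this.periodMillis = periodMillis;
        this.holdLimit = holdLimit;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                return;
            }
            long acked = metrics.ackedTuplesSinceLastCall();
            if (previousAcked < 0 || acked > previousAcked) {
                maxPending++;                              // first iteration, or throughput grew
                unchangedPeriods = 0;
            } else if (acked < previousAcked) {
                maxPending = Math.max(1, maxPending - 1);  // throughput dropped: back off
                unchangedPeriods = 0;
            } else if (++unchangedPeriods >= holdLimit) {
                maxPending++;                              // stable for too long: probe upwards
                unchangedPeriods = 0;
            }
            metrics.setMaxSpoutPending(maxPending);
            previousAcked = acked;
        }
    }
}
```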
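Finally, a rule of the external monitor maps naturally onto a small class; the names below and the double-valued metrics map are assumptions of this sketch.

```java
import java.util.Map;

/**
 * A monitoring rule (metric, operator, value, action), e.g.
 * ("capacity", ">", 0.8, +2), evaluated against each bolt's metrics.
 */
public class Rule {
    private final String metric;
    private final String operator;   // "<" or ">"
    private final double threshold;
    private final int action;        // signed adjustment applied to a triggering bolt

    public Rule(String metric, String operator, double threshold, int action) {
        this.metric = metric;
        this.operator = operator;
        this.threshold = threshold;
        this.action = action;
    }

    /** Returns true when this rule is triggered by the given bolt metrics. */
    public boolean matches(Map<String, Double> boltMetrics) {
        Double value = boltMetrics.get(metric);
        if (value == null) {
            return false;            // this bolt does not expose the metric
        }
        switch (operator) {
            case "<": return value < threshold;
            case ">": return value > threshold;
            default:  throw new IllegalArgumentException("Unknown operator: " + operator);
        }
    }

    public int action() {
        return action;
    }
}
```

Encoding the action as a signed integer fits rules such as (capacity, <, 0.4, -1), where the sign plausibly indicates whether executors are removed from or added to the triggering bolt, as in the example that follows.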
Prototype: Monitoring system - example
Rule1 = (capacity, <, 0.4, -1); Rule2 = (capacity, >, 0.8, +2)
[Diagram: a topology with two chains, WorkerBolt-A → SaveBolt-A and WorkerBolt-B → SaveBolt-B. WorkerBolt-A (capacity 0.25) and SaveBolt-A (capacity 0.3) both fall below 0.4 and trigger Rule1, so the action -1 is applied to each; WorkerBolt-B (capacity 0.86) exceeds 0.8 and triggers Rule2, so the action +2 is applied; SaveBolt-B (capacity 0.75) triggers no rule]

Prototype: General overview - goals
- Efficient processing of huge external datasets
- Heterogeneous data management
- Processing of arbitrary functions
- Infrastructure flexibility
- Handling failures
- Data relations management
- Efficient processing of huge internal datasets

Big Data: A general approach to process external multimedia datasets
Thank you for your attention!