Big Data: A general approach to process external multimedia datasets
David Mera
Laboratory of Data Intensive Systems and Applications (DISA)
Masaryk University, Brno, Czech Republic
7/10/2014

Introduction: Big Data
- Huge new datasets are constantly being created: "90% of the data in the world today has been created in the last two years" (2013) [1]
- Organizations have potential access to a wealth of information, but they do not know how to get value out of it [1]
[1] Source: SINTEF, "Big Data - for better or worse"

Introduction: The Big Data phenomenon
- Volume refers to the vast amount of data generated every second
- Variety refers to the different forms of data
- Velocity refers to the speed at which new data are generated
- Veracity refers to the reliability of the data
[Diagram: the Vs of Big Data - Volume, Variety, Velocity, Veracity, and the Value at their intersection]

Introduction: Multimedia Big Data
- 100 hours of video are uploaded to YouTube every minute
- 350 million photos are uploaded to Facebook every day (2012)
- Each day, 60 million photos are uploaded to Instagram
- ...
- Non-structured data: 70%; share of Internet traffic: 60% [2]
[2] Source: IBM 2013 Global Technology Outlook report

Introduction: Getting information from large volumes of multimedia data
- Content-based retrieval techniques address the findability problem
- They require the extraction of suitable features, which is a time-consuming task
- Feature extraction approaches:
  - Sequential approach: not affordable at this scale
  - Classic distributed computing (cluster computing, grid computing): demands strong distributed-computing skills, and 'ad-hoc' approaches suffer from low reusability and a lack of failure handling
  - Big Data approaches to distributed computing:
    - Batch data: the Map-Reduce paradigm (Apache Hadoop)
    - Stream data: S4, Apache Storm

Big Data processing frameworks: Apache Hadoop
Apache Hadoop characteristics (Map-Reduce paradigm):
- Batch data processing system
- Commodity computing
- No specialized distributed-computing skills are required; the framework takes care of:
  - Machine communication
  - Task scheduling
  - Scalability
  - Handling failures
  - Automatic partitioning of the input data

Big Data processing frameworks: the Hadoop Map-Reduce paradigm
[Diagram: the input data is divided into splits (Split 0 ... Split n); a Map task turns each split's input pairs into intermediate tuples, (key, value) → (I-key, I-value); the intermediate pairs are grouped by key and passed to Reduce tasks, which emit the output pairs (O-key, O-value)]
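The paradigm is easiest to see in code. Below is a minimal sketch of the canonical word-count job written against the Hadoop Map-Reduce API (an illustration only, not part of the prototype): the map phase turns each input split into intermediate (word, 1) tuples, and the reduce phase aggregates them by key into the output pairs, mirroring the diagram above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    /** Map: one line of an input split -> (word, 1) intermediate pairs. */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Reduce: (word, [1, 1, ...]) -> (word, count) output pairs. */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output pairs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```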
Big Data processing frameworks: Apache Hadoop - weaknesses and limitations
- Optimized for large files
- Batch data processing only, with long response times
- Hard configuration process (iterative optimization)
- Lack of real-time processing
- The parallelization level cannot be altered at run time

Big Data processing frameworks: Apache Storm
Apache Storm characteristics:
- Real-time processing system
- Commodity computing
- No specialized distributed-computing skills are required; the framework provides a set of generic tools to build distributed graphs of computation and takes care of:
  - Machine communication
  - Task scheduling
  - Scalability
  - Handling failures
- The parallelization level can be adapted at run time

Big Data processing frameworks: Apache Storm topologies
Storm runs topologies:
- Streams: unbounded sequences of tuples
- Spouts: sources of streams
- Bolts: consume input streams, apply some processing, and emit new streams
[Diagram: spouts feed a stream of data into a pipeline of bolts A-E; each logical bolt runs as several parallel instances (e.g., Bolt A1 ... Bolt An) and emits a transformed stream of data' to the next bolt]

Big Data processing frameworks: Apache Storm - weaknesses and limitations
- Lack of support for processing batch data
- Low-level framework
- Pull mode
- Scenario-specific configurations

Prototype: General overview
Prototype goals:
- Efficient processing of huge external datasets
- Heterogeneous data management
- Processing of arbitrary functions
- Infrastructure flexibility
- Handling failures
[Architecture diagram: an external data source feeds a stream of data into a Storm topology running on a server cluster; the user submits a job definition and Jar files through the Job Interface to the Storm Topology Manager, whose Parser and Topology creator components build the topology; the job output is stored in a distributed file system]

Prototype: Job definition
A job is defined in an XML document [XML excerpt omitted] that specifies:
- The topology name
- The spout that brings the stream of data into the topology
- One or more processing chains over the stream, each composed of stream-processing operations and ending in a Save Bolt
- For each operation, the name of a class (inside the Jar file) exposing a method with the signature public byte[] methodName(byte[])

Prototype: Job definition - Spouts
- Socket
- Apache Kafka (a distributed messaging system)
- The spout emits the incoming data as raw byte[] tuples

Prototype: Job definition - Bolts
- SaveBolts: store data into HDFS; tuples are buffered and then written as Hadoop SequenceFiles
- WorkerBolts: process tuples by invoking the user-supplied public byte[] methodName(byte[]) operation

Prototype: Topology deployment
[Diagram: the Storm Topology Manager parses the job definition and Jar files, creates the topology, and deploys it on the cluster; the topology monitor is activated, tuples are consumed from Kafka, and results are written to the Hadoop File System]

Prototype: Monitoring system (internal)
The internal monitoring system tunes the 'max pending tuples' parameter:
- The topology starts with a low parameter value.
- Every 'X' seconds the monitor checks the number of 'acked' tuples.
- In the first iteration, the monitor increases the parameter value.
- In the next iterations:
  - current 'acked' tuples > previous 'acked' tuples → increase the parameter value
  - current 'acked' tuples < previous 'acked' tuples → decrease the parameter value
  - current 'acked' tuples == previous 'acked' tuples → do nothing, unless this scenario has repeated 'X' times, in which case increase the parameter value

Prototype: Monitoring system (external)
- The administrator can add rules, where a rule = (metric, operator, value, action).
- The monitor gets the topology metrics every 'X' seconds; each bolt produces a set of metrics.
- The monitor evaluates each rule against the bolt metrics and applies the rule's action to every bolt that has triggered it.
Illustrative code sketches of the user operations, the SaveBolts, and both monitors follow.
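To make the extension point concrete, here is what a user-supplied operation shipped in the job's Jar file could look like. Only the public byte[] methodName(byte[]) shape is prescribed by the job definition; the class name, method name, and the toy "processing" below are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical user operation packaged in the job's Jar file and referenced
 * by class name in the job definition. The prototype only requires the
 * 'public byte[] methodName(byte[])' shape; everything else is illustrative.
 */
public class ImageFeatureExtractor {

    /** Receives one tuple payload and returns the transformed payload. */
    public byte[] extractFeatures(byte[] input) {
        // A real job would decode the image and compute a feature vector here;
        // this sketch merely tags the payload so the data flow is observable.
        byte[] tag = "features:".getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[tag.length + input.length];
        System.arraycopy(tag, 0, out, 0, tag.length);
        System.arraycopy(input, 0, out, tag.length, input.length);
        return out;
    }
}
```

A WorkerBolt can load such a class from the submitted Jar file and invoke the method reflectively for every tuple it receives; that wiring is omitted here.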
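The SaveBolts' buffer-then-flush behaviour might look like the following sketch against the Hadoop SequenceFile API; the flush threshold, the output file naming, and the (LongWritable, BytesWritable) record layout are assumptions of the sketch, not the prototype's actual choices.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

/**
 * Buffer-and-flush persistence into HDFS, as done by the SaveBolts: tuple
 * payloads are collected in memory and periodically written out as a Hadoop
 * SequenceFile. All concrete values here are assumptions of this sketch.
 */
public class SequenceFileBuffer {
    private final Configuration conf = new Configuration(); // picks up HDFS settings
    private final List<byte[]> buffer = new ArrayList<>();
    private final int flushThreshold;
    private long counter = 0;

    public SequenceFileBuffer(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Called once per received tuple payload. */
    public synchronized void add(byte[] payload) throws IOException {
        buffer.add(payload);
        if (buffer.size() >= flushThreshold) {
            flush();
        }
    }

    /** Writes the buffered payloads as (sequence-number, bytes) records. */
    public synchronized void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        // Hypothetical output location; one file per flush.
        Path file = new Path("/prototype/output/part-" + System.currentTimeMillis() + ".seq");
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(file),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            for (byte[] payload : buffer) {
                writer.append(new LongWritable(counter++), new BytesWritable(payload));
            }
        } finally {
            writer.close();
        }
        buffer.clear();
    }
}
```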
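The internal monitor's adaptive loop can be written down almost verbatim. In this sketch, MetricsClient, the step size of one, and the starting value are hypothetical stand-ins for the prototype's internals; in Storm, the tuned parameter corresponds to the topology.max.spout.pending setting.

```java
/** Hypothetical accessor for topology metrics and configuration. */
interface MetricsClient {
    long ackedTuplesSinceLastCall();      // 'acked' tuples in the last period
    void setMaxSpoutPending(long value);  // adjust the 'max pending tuples' value
}

/**
 * Adaptive tuning of the 'max pending tuples' parameter, as described above:
 * start low, then increase while throughput grows, back off when it drops,
 * and probe upwards after a run of unchanged periods.
 */
public class InternalMonitor implements Runnable {
    private final MetricsClient metrics;
    private final long periodMillis;  // check every 'X' seconds
    private final int holdLimit;      // unchanged periods tolerated before probing upwards

    private long maxPending = 10;     // the topology starts with a low value (assumed)
    private long previousAcked = -1;  // -1 marks the first iteration
    private int unchangedPeriods = 0;

    public InternalMonitor(MetricsClient metrics, long periodMillis, int holdLimit) {
        this.metrics = metrics;
        this.periodMillis = periodMillis;
        this.holdLimit = holdLimit;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                return;
            }
            long acked = metrics.ackedTuplesSinceLastCall();
            if (previousAcked < 0 || acked > previousAcked) {
                maxPending++;                              // first iteration, or throughput grew
                unchangedPeriods = 0;
            } else if (acked < previousAcked) {
                maxPending = Math.max(1, maxPending - 1);  // throughput dropped: back off
                unchangedPeriods = 0;
            } else if (++unchangedPeriods >= holdLimit) {
                maxPending++;                              // stable for too long: probe upwards
                unchangedPeriods = 0;
            }
            metrics.setMaxSpoutPending(maxPending);
            previousAcked = acked;
        }
    }
}
```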
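Finally, a rule of the external monitor maps naturally onto a small class; the names below and the double-valued metrics map are assumptions of this sketch.

```java
import java.util.Map;

/**
 * A monitoring rule (metric, operator, value, action), e.g.
 * ("capacity", ">", 0.8, +2), evaluated against each bolt's metrics.
 */
public class Rule {
    private final String metric;
    private final String operator;   // "<" or ">"
    private final double threshold;
    private final int action;        // signed adjustment applied to a triggering bolt

    public Rule(String metric, String operator, double threshold, int action) {
        this.metric = metric;
        this.operator = operator;
        this.threshold = threshold;
        this.action = action;
    }

    /** Returns true when this rule is triggered by the given bolt metrics. */
    public boolean matches(Map<String, Double> boltMetrics) {
        Double value = boltMetrics.get(metric);
        if (value == null) {
            return false;            // this bolt does not expose the metric
        }
        switch (operator) {
            case "<": return value < threshold;
            case ">": return value > threshold;
            default:  throw new IllegalArgumentException("Unknown operator: " + operator);
        }
    }

    public int action() {
        return action;
    }
}
```

Encoding the action as a signed integer fits rules such as (capacity, <, 0.4, -1), where the sign plausibly indicates whether executors are removed from or added to the triggering bolt, as in the example that follows.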
Prototype: Monitoring system - example
Rule1 = (capacity, <, 0.4, -1); Rule2 = (capacity, >, 0.8, +2)
[Diagram: a topology with two chains, WorkerBolt-A → SaveBolt-A and WorkerBolt-B → SaveBolt-B. WorkerBolt-A (capacity 0.25) and SaveBolt-A (capacity 0.3) both fall below 0.4 and trigger Rule1, so the action -1 is applied to each; WorkerBolt-B (capacity 0.86) exceeds 0.8 and triggers Rule2, so the action +2 is applied; SaveBolt-B (capacity 0.75) triggers no rule]

Prototype: General overview - goals
- Efficient processing of huge external datasets
- Heterogeneous data management
- Processing of arbitrary functions
- Infrastructure flexibility
- Handling failures
- Data relations management
- Efficient processing of huge internal datasets

Big Data: A general approach to process external multimedia datasets
Thank you for your attention!