Introduction, Business Intelligence, Data Warehouse PA220: Database systems for data analytics Overview of data warehousing • motivation • business intelligence / data analytics • data warehouse • architecture • oltp vs. olap • big data 230.09.2020 PA220 DB for Analytics Motivation • Data production • Information systems • Monitoring services • Sensors, GPS tracking • Social networks • Data processing • Storage & archiving • Summarization • Reporting • Visualization • Insights • Predictions 3 Business Intelligence pyramid 30.09.2020 PA220 DB for Analytics Business Intelligence • a process of analyzing data and presenting results to business managers • to make informed decisions • tools and applications • to collect data • to prepare it for storage and analysis • to develop and run queries • to create reports and dashboards • to visualize data • evolved from decision support systems • business analytics / (advanced) data analytics • prescriptive analytics 430.09.2020 PA220 DB for Analytics Business Intelligence vs. Artificial Intelligence • BI is the “opposite” of Artificial Intelligence (AI) • AI systems make decisions for the users • BI systems help users make the right decisions, based on the available data • Many BI techniques have roots in AI, though. 30.09.2020 PA220 DB for Analytics 5 Business Intelligence: Key Problems 1. Complex and unusable models • Many DB models are difficult to understand • DB models do not focus on a single clear business purpose 2. Same data found in many different systems • Example: customer data in many different systems • The same concept is defined differently 3. Data is suited for operational systems • Accounting, billing, etc. • Do not support analysis across business functions 4. Data quality is bad • Missing data, imprecise data, different use of systems 5. Data are ”volatile” • Data deleted in operational systems (6 months) • Data change over time – no historical information 30.09.2020 PA220 DB for Analytics 6 Business Intelligence Architecture Data Warehouse Reporting Visualization Analysis Data Mart Sales Marketing Transaction data Stock Management E-shop ERP Services CRM Transforming Raw data Meta data Data dimensions Summary data Cleaning Integrating Web Logs Files Data Staging Staging data Aggregate Precalc. Storage 730.09.2020 PA220 DB for Analytics Data Warehouse • Typically: • One large repository for entire company • Dedicated hard- and software • Enterprise-grade DBMS • Often: database appliances (e.g., Teradata, Oracle Exadata, IBM Netezza, …) • Goal: • Single source of truth for analysis and reporting • Requires data cleansing and conflict resolution 30.09.2020 PA220 DB for Analytics 8 Data Warehouse Users • Business analysts: • Explore data to discover information • Use for decision making -> “Decision Support System (DSS)” • Consequences: • Workloads and access patterns not known in advance • For exploration, data representation must be easy to understand (even by business analysts) • Design and usage driven by data, not applications 30.09.2020 PA220 DB for Analytics 9 Definition of Data Warehouse • Barry Devlin, IBM Consultant • A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context. • Ralph Kimball, The Data Warehouse Toolkit • A copy of transaction data specifically structured for query and analysis. • W. H. Inmon, Building the Data Warehouse • A data warehouse is a subject-oriented, integrated, time-varying, nonvolatile collection of data that is used primarily in organizational decision making. 30.09.2020 PA220 DB for Analytics 10 Definition of Data Warehouse (1) • Subject-oriented • The data in the DW is organized in such a way that all the data elements relating to the same real-world event or object are linked together • Typical subject areas in DWs are Customer, Product, Order, Claim, Account,… 30.09.2020 PA220 DB for Analytics 11 Base tables hold original data and can be different in schema. Customer activity represents summary (derived) information. Definition of Data Warehouse (2) • Integrated • The DW contains data from most or all the organization's operational systems and this data is made consistent • E.g. gender, measurement, conflicting keys, consistency,… 30.09.2020 PA220 DB for Analytics 12 Definition of Data Warehouse (3) • Non-volatile • Data in the DW is never over-written or deleted-once committed, the data is static, read-only, and retained for future reporting • Data is loaded, but not updated • When subsequent changes occur, a new version or snapshot record is written 30.09.2020 PA220 DB for Analytics 13 Definition of Data Warehouse (4) • Time-varying • The changes to the data in the DW are tracked and recorded so that reports show changes over time • Different environments have different time horizons associated • While for operational systems a 60-to-90 daytime horizon is normal, DWs have a 5-to-10year horizon 30.09.2020 PA220 DB for Analytics 14 General DW Definition • A large repository of some organization’s electronically stored data • Specifically designed to facilitate reporting and analysis • Requirements: • information easily accessible • consistent information • present information timely • protect the information • adapt to change • accepted by users 30.09.2020 PA220 DB for Analytics 15 Example: Oracle Exadata X8-2 • Up to 912 CPU cores and 28.5TB memory per rack for database processing • Up to 576 CPU cores per rack dedicated to SQL processing in storage • From 2 to 19 Database Servers per rack; From 3 to 18 Storage Servers per rack • Up to 3.0 PB of disk capacity (raw) per rack • Up to 920 TB of flash capacity (raw) per rack • Ability to perform up to 4.8M 8K database read I/O operations, • or 4.3M 8K flash write I/O operations per second per full rack • Hybrid Columnar Compression (10-15x compression ratio) • 40 Gb/second (QDR) InfiniBand Network • Uncompressed I/O bandwidth of up to 560 GB/second per full rack from SQL • Complete redundancy for high availability • Scale by connecting by InfiniBand up to 18 racks 30.09.2020 PA220 DB for Analytics 16 30.09.2020 PA220 DB for Analytics 17 30.09.2020 PA220 DB for Analytics 18 30.09.2020 PA220 DB for Analytics 19 30.09.2020 PA220 DB for Analytics 20 30.09.2020 PA220 DB for Analytics 21 (Enterprise) Data Warehouse • integrates data from more sources • subject-oriented • stores current and historical data • answers multidimensional queries • create aggregated data reports • is it a relational DBMS? • on-line transaction processing (OLTP) • on-line analytical processing (OLAP) • MOLAP (Multidimensional OLAP) • ROLAP (Relational OLAP) OLTP DW few Indexes many many Joins some normalized Schema denormalized rare Derived data and aggregates common update/select Workload select predefined Typical operations ad hoc few days Historical data years indexing / hashing Data access method full scans GB Data volume TB Thousands Users Hundreds2230.09.2020 PA220 DB for Analytics OLTP (Online Transaction Processing) • Day-to-day business operations • Mix of insert, update, delete, and read operations • e.g., enter orders, maintain customer data, etc. • System sometimes called operational data store (ODS) • Up-to-date state of the data • From a database perspective: • Short-running operations • Most queries known in advance • Often point access, usually through indexes • write access⇝ ACID principles 30.09.2020 PA220 DB for Analytics 23 OLAP (Online Analytical Processing) • Provide data for reporting and decision making • Mostly read-only access • e.g., resource planning, marketing initiatives • Need archive data; (slightly) outdated information might be okay • Report changes over time • Can use separate data store (non-ODS) • From a database perspective: • Long-running operations, mostly read-only • Queries not known in advance, often complex (⇝indexing?) • Might need to scan through large amounts of data • Data is (almost) append-only. 30.09.2020 PA220 DB for Analytics 24 Transactional vs. Analytical Workloads 30.09.2020 PA220 DB for Analytics 25 OLTP/OLAP and DW 30.09.2020 PA220 DB for Analytics 26 OLTP/OLAP and DW (2) 30.09.2020 PA220 DB for Analytics 27 Architecture Alternatives • Variants of the full data warehouse architecture: 1. Independent data marts (no central warehouse) • Populate data marts directly from sources • Like several “mini warehouses” • Redundancy, no “single source of truth” 2. Logical data marts (no explicit, physical data marts) • Data mart just a logical view on full warehouse • Easier to provide integrated, consistent view across the • enterprise • Data marts (and warehouse) might also reside at different geographic locations. 30.09.2020 PA220 DB for Analytics 28 Building DW: Top-down Approach • Analyze global business needs, plan how to develop a data warehouse, design it, and implement it as a whole • This procedure is promising: it is based on a global picture of the goal to achieve, and in principle it ensures consistent, well integrated data warehouses. • High-cost estimates with long-term implementations discourage company managers from embarking on these kind of projects. • Analyzing and integrating all relevant sources at the same time is a very difficult task, even because it is not very likely that they are all available and stable at the same time. • It is extremely difficult to forecast the specific needs of every department involved in a project, which can result in the analysis process coming to a standstill. • Since no working system is going to be delivered in the short term, users cannot check for this project to be useful, so they lose trust and interest in it. 30.09.2020 PA220 DB for Analytics 29 Building DW: Bottom-up Approach • DWs are incrementally built and several data marts are iteratively created. Each data mart is based on a set of facts that are linked to a specific department and that can be interesting for a user group. • Leads to concrete results in a short time • Does not require huge investments • Enables designers to investigate one area at a time • Gives managers a quick feedback about the actual benefits of the system being built • Keeps the interest for the project constantly high • May determine a partial vision of the business domain 30.09.2020 PA220 DB for Analytics 30 Building DW: Comparison of Approaches 30.09.2020 PA220 DB for Analytics 31 Data Warehouse Preparation • Data is periodically brought from the ODS to the data warehouse. • This is also referred to as ETL Process. 30.09.2020 PA220 DB for Analytics 32 Data Warehouse Preparation - Transformation • Query-Driven Data Integration • Data is integrated on demand (lazy) • PROS • Access to most up-to-date data (all source data directly available) • No duplication of data • CONS • Delay in query processing • Slow (or currently unavailable) information sources • Complex filtering and integration • Inefficient and expensive for frequent queries • Competes with local processing at sources • Data loss at the sources (e.g., historical data) cannot be recovered • Has not caught on in industry 30.09.2020 PA220 DB for Analytics 33 Data Warehouse Preparation - Transformation • Warehouse-Driven Data Integration • Data is integrated in advance (eager) • Data is stored in DW for querying and analysis • PROS • High query performance • Does not interfere with local processing at sources • Assumes that data warehouse update is possible during downtime of local processing • Complex queries are run at the data warehouse • OLTP queries are run at the source systems • CONS • Duplication of data • The most current source data is not available • Has caught on in industry 30.09.2020 PA220 DB for Analytics 34 Data Warehouse Lifecycle 30.09.2020 PA220 DB for Analytics 35 Data Warehouse – Lifecycle (1) • gather system goals, borders, and size & prioritize them • estimate costs and benefits • analyze risks and expectations • examine the skills of the working team 30.09.2020 PA220 DB for Analytics 36 Data Warehouse – Lifecycle (2) • analyze and compare the possible architectural solutions • assess the available technologies and tools • select an approach for design and implementation • create a preliminary plan of the whole system 30.09.2020 PA220 DB for Analytics 37 Data Warehouse – Lifecycle (3) • Every iteration causes a new data mart and new applications to be created and progressively added to the DW system. 30.09.2020 PA220 DB for Analytics 38 First Data Mart • is the one playing the most strategic role for the enterprise • should be a backbone for the whole DW • should lean on available and consistent data sources 30.09.2020 PA220 DB for Analytics 40 Modelling and Data Warehouse • Operational stores – normalized database • E.g. e-shop 30.09.2020 PA220 DB for Analytics 41 Modelling and Data Warehouse • Data Warehouse is subject-oriented -> sales are important 30.09.2020 PA220 DB for Analytics 42 Data Warehouse – Data Model • star schema – fact table and dimension tables • one purpose-focus analytics • not very good for complex analytics • fact table • data measurements/metrics • numeric values and foreign keys to dimensions • atomic level of detail • dimension table • description of fact • many attributes Fact_sales id_time id_store id_product quantity price Dim_store id (PK) id_store name street city state country region manager Dim_product id (PK) id_product name brand category Dim_time id date day month year day_of_week4330.09.2020 PA220 DB for Analytics Nature of Current Data and Processing • Volume • the amount of data increases tenfold every five years • Velocity • continuous data flow from sensors, social networks, … • Variety • data structure can change, text, multimedia, … • Veracity • with different data sources, it is getting more difficult to maintain data certainty • Real-time processing 4430.09.2020 PA220 DB for Analytics Distributed Data Warehouse • new platforms • columnar DBMSs (e.g. C-Store) • HDFS & MapReduce (e.g. Hadoop) • NoSQL platforms (e.g. HBase) • in-memory DBMSs (e.g. VoltDB) • horizontal scaling instead of vertical scaling • ETL in Hadoop • Storing raw data in HDFS • SQL-based analytics in columnar DBMSs 4530.09.2020 PA220 DB for Analytics Distributed Data Warehouse 46 Distributed Data Warehouse Reporting / Visualization Analytics R / Apache Mahout Hive & Pig Data Warehouse Hadoop Sources Batch ETL Spark Hive Impala Kylin Stream Staging Hadoop Strom / Spark & Impala / VoltDB 30.09.2020 PA220 DB for Analytics Hadoop Platform • SW library for distributed processing of large data sets • across clusters of computers • High-availability achieved on application layer by replication • tasks run / data stored on unreliable HW • MapReduce – programming model for large scale data processing • Map() – filtering and sorting, outputs “key,value” pairs • Reduce() – summarizing Map() results by their keys • HDFS – distributed high-throughput file system • designed for mostly immutable files • concurrent write not supported • cooperation with MapReduce – data & computation locality Map(k1,v1) → list(k2,v2) Reduce(k2, list (v2)) → list(v3) 4730.09.2020 PA220 DB for Analytics Distributed Data Warehouse • Apache Hive – data warehouse for large datasets • projects structure onto the data in HDFS • manages and queries data using HiveQL • converts them to Map-Reduce jobs • supports indexing • DML operations • UPDATE & DELETE at row level (new since end of 2014) 4830.09.2020 PA220 DB for Analytics Apache Hive CREATE TABLE dim_product ( id INT, id_product INT, name STRING, brand STRING, category STRING ); LOAD DATA LOCAL INPATH ‘…/file.txt' OVERWRITE INTO TABLE dim_product PARTITION (ds='2014-12-15'); INSERT OVERWRITE TABLE dim_product SELECT TRANSFORM (id, id_product, name, brand, category) USING 'python cleaning_mapper.py' AS (id, id_product, name, brand, category) FROM tmp_product; SELECT p.name, SUM(s.quantity * price) AS income FROM dim_product p LEFT JOIN fact_sales s ON (p.id = s.id_product) GROUP BY p.name; Fact_sales id_time id_store id_product quantity price Dim_product id (PK) id_product name brand category 4930.09.2020 PA220 DB for Analytics Apache Impala • similar to Apache Hive • native analytical database for Hadoop • low-latency queries • no translation to Map-Reduce jobs • in-memory data in Parquet format • preference to numeric values than strings • joins done in memory 5030.09.2020 PA220 DB for Analytics Apache Pig • platform for analyzing large data set • scripting language for data processing and transformations • similar to procedural languages • creates Map-Reduce jobs 51 raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query); clean = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query); hour_freq_tmp = GROUP clean BY (query, time); hour_freq_tmp2 = FOREACH hour_freq_tmp GENERATE flatten($0), COUNT($1) as count; hour_frequency = FOREACH hour_freq_tmp GENERATE $0 as query, $1 as hour, $2 as count; STORE hour_frequency INTO '/tmp/hour_freq' USING PigStorage();30.09.2020 PA220 DB for Analytics Apache Kylin • framework for OLAP in Hadoop • uses Hive • precalculates aggregations • query engine translation • exploit prepared aggregations • integrate with your favorite BI tools like Tableau and Power BI 5230.09.2020 PA220 DB for Analytics Summary • BI is well-recognized and is a combination of a number of techniques to support decision making. • DW is at the core of BI that • provides a complete, consistent, subject-oriented and time-varying collection of the data; • allows to separate OLTP from OLAP. • Applications that use the DW include OLAP, data mining, visualization • BI can provide many advantages to an organization • Creates added value by transforming data into information • Provides comprehensive knowledge about your business • A good DW is a prerequisite for BI • But, a DW is a means rather than a goal … it is only a success if it is heavily used • Following a clear design methodology is important. 30.09.2020 PA220 DB for Analytics 53 What next? • Modeling for Data Warehouse 30.09.2020 PA220 DB for Analytics 54