Introduction, Business
Intelligence, Data
Warehouse
PA220: Database systems for data analytics
Overview of data warehousing
• motivation
• business intelligence / data analytics
• data warehouse
• architecture
• oltp vs. olap
• big data
230.09.2020 PA220 DB for Analytics
Motivation
• Data production
• Information systems
• Monitoring services
• Sensors, GPS tracking
• Social networks
• Data processing
• Storage & archiving
• Summarization
• Reporting
• Visualization
• Insights
• Predictions
3
Business Intelligence pyramid
30.09.2020 PA220 DB for Analytics
Business Intelligence
• a process of analyzing data and presenting results to business managers
• to make informed decisions
• tools and applications
• to collect data
• to prepare it for storage and analysis
• to develop and run queries
• to create reports and dashboards
• to visualize data
• evolved from decision support systems
• business analytics / (advanced) data analytics
• prescriptive analytics
430.09.2020 PA220 DB for Analytics
Business Intelligence vs. Artificial Intelligence
• BI is the “opposite” of Artificial Intelligence (AI)
• AI systems make decisions for the users
• BI systems help users make the right decisions, based on the available data
• Many BI techniques have roots in AI, though.
30.09.2020 PA220 DB for Analytics 5
Business Intelligence: Key Problems
1. Complex and unusable models
• Many DB models are difficult to understand
• DB models do not focus on a single clear business purpose
2. Same data found in many different systems
• Example: customer data in many different systems
• The same concept is defined differently
3. Data is suited for operational systems
• Accounting, billing, etc.
• Do not support analysis across business functions
4. Data quality is bad
• Missing data, imprecise data, different use of systems
5. Data are ”volatile”
• Data deleted in operational systems (6 months)
• Data change over time – no historical information
30.09.2020 PA220 DB for Analytics 6
Business Intelligence Architecture
Data Warehouse
Reporting
Visualization
Analysis
Data Mart
Sales
Marketing
Transaction data
Stock
Management
E-shop
ERP
Services
CRM
Transforming
Raw data
Meta data
Data
dimensions
Summary
data
Cleaning
Integrating
Web
Logs
Files
Data Staging
Staging data
Aggregate Precalc.
Storage
730.09.2020 PA220 DB for Analytics
Data Warehouse
• Typically:
• One large repository for entire company
• Dedicated hard- and software
• Enterprise-grade DBMS
• Often: database appliances (e.g., Teradata, Oracle Exadata, IBM Netezza, …)
• Goal:
• Single source of truth for analysis and reporting
• Requires data cleansing and conflict resolution
30.09.2020 PA220 DB for Analytics 8
Data Warehouse Users
• Business analysts:
• Explore data to discover information
• Use for decision making -> “Decision Support System (DSS)”
• Consequences:
• Workloads and access patterns not known in advance
• For exploration, data representation must be easy to understand (even by
business analysts)
• Design and usage driven by data, not applications
30.09.2020 PA220 DB for Analytics 9
Definition of Data Warehouse
• Barry Devlin, IBM Consultant
• A data warehouse is simply a single, complete, and consistent store of data
obtained from a variety of sources and made available to end users in a way
they can understand and use it in a business context.
• Ralph Kimball, The Data Warehouse Toolkit
• A copy of transaction data specifically structured for query and analysis.
• W. H. Inmon, Building the Data Warehouse
• A data warehouse is a subject-oriented, integrated, time-varying, nonvolatile
collection of data that is used primarily in organizational decision
making.
30.09.2020 PA220 DB for Analytics 10
Definition of Data Warehouse (1)
• Subject-oriented
• The data in the DW is organized in such a way that all the data elements
relating to the same real-world event or object are linked together
• Typical subject areas in DWs are Customer, Product, Order, Claim, Account,…
30.09.2020 PA220 DB for Analytics 11
Base tables hold original data
and can be different in
schema.
Customer activity represents
summary (derived)
information.
Definition of Data Warehouse (2)
• Integrated
• The DW contains data from most or all the organization's operational systems
and this data is made consistent
• E.g. gender, measurement, conflicting keys, consistency,…
30.09.2020 PA220 DB for Analytics 12
Definition of Data Warehouse (3)
• Non-volatile
• Data in the DW is never over-written or deleted-once committed, the data is
static, read-only, and retained for future reporting
• Data is loaded, but not updated
• When subsequent changes occur, a new version or snapshot record is written
30.09.2020 PA220 DB for Analytics 13
Definition of Data Warehouse (4)
• Time-varying
• The changes to the data in the DW are tracked and recorded so that reports
show changes over time
• Different environments have different time horizons associated
• While for operational systems a 60-to-90 daytime horizon is normal, DWs have a 5-to-10year
horizon
30.09.2020 PA220 DB for Analytics 14
General DW Definition
• A large repository of some organization’s electronically stored data
• Specifically designed to facilitate reporting and analysis
• Requirements:
• information easily accessible
• consistent information
• present information timely
• protect the information
• adapt to change
• accepted by users
30.09.2020 PA220 DB for Analytics 15
Example: Oracle Exadata X8-2
• Up to 912 CPU cores and 28.5TB memory per rack for database processing
• Up to 576 CPU cores per rack dedicated to SQL processing in storage
• From 2 to 19 Database Servers per rack; From 3 to 18 Storage Servers per rack
• Up to 3.0 PB of disk capacity (raw) per rack
• Up to 920 TB of flash capacity (raw) per rack
• Ability to perform up to 4.8M 8K database read I/O operations,
• or 4.3M 8K flash write I/O operations per second per full rack
• Hybrid Columnar Compression (10-15x compression ratio)
• 40 Gb/second (QDR) InfiniBand Network
• Uncompressed I/O bandwidth of up to 560 GB/second per full rack from SQL
• Complete redundancy for high availability
• Scale by connecting by InfiniBand up to 18 racks
30.09.2020 PA220 DB for Analytics 16
30.09.2020 PA220 DB for Analytics 17
30.09.2020 PA220 DB for Analytics 18
30.09.2020 PA220 DB for Analytics 19
30.09.2020 PA220 DB for Analytics 20
30.09.2020 PA220 DB for Analytics 21
(Enterprise) Data Warehouse
• integrates data from more sources
• subject-oriented
• stores current and historical data
• answers multidimensional queries
• create aggregated data reports
• is it a relational DBMS?
• on-line transaction processing (OLTP)
• on-line analytical processing (OLAP)
• MOLAP (Multidimensional OLAP)
• ROLAP (Relational OLAP)
OLTP DW
few Indexes many
many Joins some
normalized Schema denormalized
rare
Derived data
and aggregates
common
update/select Workload select
predefined
Typical
operations
ad hoc
few days Historical data years
indexing /
hashing
Data access
method
full scans
GB Data volume TB
Thousands Users Hundreds2230.09.2020 PA220 DB for Analytics
OLTP (Online Transaction Processing)
• Day-to-day business operations
• Mix of insert, update, delete, and read operations
• e.g., enter orders, maintain customer data, etc.
• System sometimes called operational data store (ODS)
• Up-to-date state of the data
• From a database perspective:
• Short-running operations
• Most queries known in advance
• Often point access, usually through indexes
• write access⇝ ACID principles
30.09.2020 PA220 DB for Analytics 23
OLAP (Online Analytical Processing)
• Provide data for reporting and decision making
• Mostly read-only access
• e.g., resource planning, marketing initiatives
• Need archive data; (slightly) outdated information might be okay
• Report changes over time
• Can use separate data store (non-ODS)
• From a database perspective:
• Long-running operations, mostly read-only
• Queries not known in advance, often complex (⇝indexing?)
• Might need to scan through large amounts of data
• Data is (almost) append-only.
30.09.2020 PA220 DB for Analytics 24
Transactional vs. Analytical Workloads
30.09.2020 PA220 DB for Analytics 25
OLTP/OLAP and DW
30.09.2020 PA220 DB for Analytics 26
OLTP/OLAP and DW (2)
30.09.2020 PA220 DB for Analytics 27
Architecture Alternatives
• Variants of the full data warehouse architecture:
1. Independent data marts (no central warehouse)
• Populate data marts directly from sources
• Like several “mini warehouses”
• Redundancy, no “single source of truth”
2. Logical data marts (no explicit, physical data marts)
• Data mart just a logical view on full warehouse
• Easier to provide integrated, consistent view across the
• enterprise
• Data marts (and warehouse) might also reside at different geographic
locations.
30.09.2020 PA220 DB for Analytics 28
Building DW: Top-down Approach
• Analyze global business needs, plan how to develop a data warehouse,
design it, and implement it as a whole
• This procedure is promising: it is based on a global picture of the goal to achieve, and
in principle it ensures consistent, well integrated data warehouses.
• High-cost estimates with long-term implementations discourage company managers
from embarking on these kind of projects.
• Analyzing and integrating all relevant sources at the same time is a very difficult task,
even because it is not very likely that they are all available and stable at the same
time.
• It is extremely difficult to forecast the specific needs of every department involved in
a project, which can result in the analysis process coming to a standstill.
• Since no working system is going to be delivered in the short term, users cannot
check for this project to be useful, so they lose trust and interest in it.
30.09.2020 PA220 DB for Analytics 29
Building DW: Bottom-up Approach
• DWs are incrementally built and several data marts are iteratively
created. Each data mart is based on a set of facts that are linked to a
specific department and that can be interesting for a user group.
• Leads to concrete results in a short time
• Does not require huge investments
• Enables designers to investigate one area at a time
• Gives managers a quick feedback about the actual benefits of the system
being built
• Keeps the interest for the project constantly high
• May determine a partial vision of the business domain
30.09.2020 PA220 DB for Analytics 30
Building DW: Comparison of Approaches
30.09.2020 PA220 DB for Analytics 31
Data Warehouse Preparation
• Data is periodically brought from the ODS to the data warehouse.
• This is also referred to as ETL Process.
30.09.2020 PA220 DB for Analytics 32
Data Warehouse Preparation - Transformation
• Query-Driven Data Integration
• Data is integrated on demand (lazy)
• PROS
• Access to most up-to-date data (all source data directly available)
• No duplication of data
• CONS
• Delay in query processing
• Slow (or currently unavailable) information sources
• Complex filtering and integration
• Inefficient and expensive for frequent queries
• Competes with local processing at sources
• Data loss at the sources (e.g., historical data) cannot be recovered
• Has not caught on in industry
30.09.2020 PA220 DB for Analytics 33
Data Warehouse Preparation - Transformation
• Warehouse-Driven Data Integration
• Data is integrated in advance (eager)
• Data is stored in DW for querying and analysis
• PROS
• High query performance
• Does not interfere with local processing at sources
• Assumes that data warehouse update is possible during downtime of local processing
• Complex queries are run at the data warehouse
• OLTP queries are run at the source systems
• CONS
• Duplication of data
• The most current source data is not available
• Has caught on in industry
30.09.2020 PA220 DB for Analytics 34
Data Warehouse Lifecycle
30.09.2020 PA220 DB for Analytics 35
Data Warehouse – Lifecycle (1)
• gather system goals, borders, and
size & prioritize them
• estimate costs and benefits
• analyze risks and expectations
• examine the skills of the working
team
30.09.2020 PA220 DB for Analytics 36
Data Warehouse – Lifecycle (2)
• analyze and compare the
possible architectural solutions
• assess the available technologies
and tools
• select an approach for design
and implementation
• create a preliminary plan of the
whole system
30.09.2020 PA220 DB for Analytics 37
Data Warehouse – Lifecycle (3)
• Every iteration causes a new data
mart and new applications to be
created and progressively added
to the DW system.
30.09.2020 PA220 DB for Analytics 38
First Data Mart
• is the one playing the most
strategic role for the
enterprise
• should be a backbone for
the whole DW
• should lean on available
and consistent data
sources
30.09.2020 PA220 DB for Analytics 40
Modelling and Data Warehouse
• Operational stores – normalized database
• E.g. e-shop
30.09.2020 PA220 DB for Analytics 41
Modelling and Data Warehouse
• Data Warehouse is subject-oriented -> sales are important
30.09.2020 PA220 DB for Analytics 42
Data Warehouse – Data Model
• star schema – fact table and dimension tables
• one purpose-focus analytics
• not very good for complex analytics
• fact table
• data measurements/metrics
• numeric values and foreign keys to dimensions
• atomic level of detail
• dimension table
• description of fact
• many attributes
Fact_sales
id_time
id_store
id_product
quantity
price
Dim_store
id (PK)
id_store
name
street
city
state
country
region
manager
Dim_product
id (PK)
id_product
name
brand
category
Dim_time
id
date
day
month
year
day_of_week4330.09.2020 PA220 DB for Analytics
Nature of Current Data and Processing
• Volume
• the amount of data increases tenfold every five years
• Velocity
• continuous data flow from sensors, social networks, …
• Variety
• data structure can change, text, multimedia, …
• Veracity
• with different data sources, it is getting more difficult to maintain data certainty
• Real-time processing
4430.09.2020 PA220 DB for Analytics
Distributed Data Warehouse
• new platforms
• columnar DBMSs (e.g. C-Store)
• HDFS & MapReduce (e.g. Hadoop)
• NoSQL platforms (e.g. HBase)
• in-memory DBMSs (e.g. VoltDB)
• horizontal scaling instead of vertical scaling
• ETL in Hadoop
• Storing raw data in HDFS
• SQL-based analytics in columnar DBMSs
4530.09.2020 PA220 DB for Analytics
Distributed Data Warehouse
46
Distributed Data Warehouse
Reporting /
Visualization
Analytics
R / Apache
Mahout
Hive & Pig
Data
Warehouse
Hadoop
Sources
Batch
ETL
Spark
Hive
Impala
Kylin
Stream
Staging
Hadoop
Strom / Spark &
Impala / VoltDB
30.09.2020 PA220 DB for Analytics
Hadoop Platform
• SW library for distributed processing of large data sets
• across clusters of computers
• High-availability achieved on application layer by replication
• tasks run / data stored on unreliable HW
• MapReduce – programming model for large scale data processing
• Map() – filtering and sorting, outputs “key,value” pairs
• Reduce() – summarizing Map() results by their keys
• HDFS – distributed high-throughput file system
• designed for mostly immutable files
• concurrent write not supported
• cooperation with MapReduce – data & computation locality
Map(k1,v1) → list(k2,v2)
Reduce(k2, list (v2)) → list(v3)
4730.09.2020 PA220 DB for Analytics
Distributed Data Warehouse
• Apache Hive – data warehouse for large datasets
• projects structure onto the data in HDFS
• manages and queries data using HiveQL
• converts them to Map-Reduce jobs
• supports indexing
• DML operations
• UPDATE & DELETE at row level (new since end of 2014)
4830.09.2020 PA220 DB for Analytics
Apache Hive
CREATE TABLE dim_product ( id INT,
id_product INT,
name STRING,
brand STRING,
category STRING );
LOAD DATA LOCAL INPATH ‘…/file.txt' OVERWRITE INTO TABLE dim_product PARTITION (ds='2014-12-15');
INSERT OVERWRITE TABLE dim_product
SELECT TRANSFORM (id, id_product, name, brand, category) USING 'python cleaning_mapper.py'
AS (id, id_product, name, brand, category)
FROM tmp_product;
SELECT p.name, SUM(s.quantity * price) AS income
FROM dim_product p LEFT JOIN fact_sales s ON (p.id = s.id_product)
GROUP BY p.name;
Fact_sales
id_time
id_store
id_product
quantity
price
Dim_product
id (PK)
id_product
name
brand
category
4930.09.2020 PA220 DB for Analytics
Apache Impala
• similar to Apache Hive
• native analytical database for Hadoop
• low-latency queries
• no translation to Map-Reduce jobs
• in-memory data in Parquet format
• preference to numeric values than strings
• joins done in memory
5030.09.2020 PA220 DB for Analytics
Apache Pig
• platform for analyzing large data set
• scripting language for data processing and transformations
• similar to procedural languages
• creates Map-Reduce jobs
51
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
clean = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
hour_freq_tmp = GROUP clean BY (query, time);
hour_freq_tmp2 = FOREACH hour_freq_tmp GENERATE flatten($0), COUNT($1) as count;
hour_frequency = FOREACH hour_freq_tmp GENERATE $0 as query, $1 as hour, $2 as count;
STORE hour_frequency INTO '/tmp/hour_freq' USING PigStorage();30.09.2020 PA220 DB for Analytics
Apache Kylin
• framework for OLAP in Hadoop
• uses Hive
• precalculates aggregations
• query engine translation
• exploit prepared aggregations
• integrate with your favorite BI tools like Tableau and Power BI
5230.09.2020 PA220 DB for Analytics
Summary
• BI is well-recognized and is a combination of a number of techniques to
support decision making.
• DW is at the core of BI that
• provides a complete, consistent, subject-oriented and time-varying collection of the
data;
• allows to separate OLTP from OLAP.
• Applications that use the DW include OLAP, data mining, visualization
• BI can provide many advantages to an organization
• Creates added value by transforming data into information
• Provides comprehensive knowledge about your business
• A good DW is a prerequisite for BI
• But, a DW is a means rather than a goal … it is only a success if it is heavily used
• Following a clear design methodology is important.
30.09.2020 PA220 DB for Analytics 53
What next?
• Modeling for Data Warehouse
30.09.2020 PA220 DB for Analytics 54