Data Warehouse
Implementation & Querying
PA220: Database systems for data analytics
Contents
• Implementation of Dimensional modelling
• Querying by dimensions
• Grouping possibilities
• Aggregate functions
• Window functions
• Case Study: Grocery Store
October 10, 2023 PA220 DB for Analytics 2
ER Model vs. Multidimensional Model
• The multidimensional model
• Its only purpose: data analysis
• It is not suitable for OLTP systems
• More built in “meaning”
• What is important
• What describes the important
• What we want to optimize
• Easy for query operations
• Recognized by OLAP/BI tools
• Tools offer powerful query facilities based on MD design
October 10, 2023 PA220 DB for Analytics 3
The multidimensional model
• Data is divided into:
• Facts
• Dimensions
• Facts are the important entity: a sale
• Facts have measures that can be aggregated: sales amount
• Dimensions describe facts
• A sale has the dimensions Product, Store and Time
• Facts “live” in a multidimensional cube (dice)
• Think of an array from programming languages
• Goal for dimensional modeling:
• Surround facts with as much context (dimensions) as possible
• Hint: redundancy may be ok (in well-chosen places)
• But you should not try to model all relationships in the data (unlike ER and OO modeling!)
October 10, 2023 PA220 DB for Analytics 4
Cube Example
October 10, 2023 PA220 DB for Analytics 5
Cubes
• A “cube” may have many dimensions!
• More than 3 - the term ”hypercube” is sometimes used
• Theoretically no limit for the number of dimensions
• Typical cubes have 4-12 dimensions
• But only 2-4 dimensions can be viewed at a time
• Dimensionality reduced by queries via projection/aggregation
• A cube consists of cells
• A given combination of dimension values
• A cell can be empty (no data for this combination)
• A sparse cube has few non-empty cells
• A dense cube has many non-empty cells
• Cubes become sparser for many/large dimensions
October 10, 2023 PA220 DB for Analytics 6
Dimensions
• Dimensions are used for
• Selection of data
• Grouping of data at the right level of detail
• Dimensions consist of dimension values
• Product dimension have values ”milk”, ”cream”, …
• Time dimension have values ”1/1/2001”, ”2/1/2001”,…
• Dimension values may have an ordering
• Used for comparing cube data across values
• Example: ”percent sales increase compared with last month”
• Especially used for Time dimension
• Dimensions are the core of multidimensional databases
• Other types of databases do not support dimensions
October 10, 2023 PA220 DB for Analytics 7
Schema Documentation
• No well-defined standard
• T level corresponds to ALL
• Record the measures
• You could also use a UML like notation
• Modeling and OLAP tools may have their own notation
October 10, 2023 PA220 DB for Analytics 8
OLAP Systems
• A key concept of OLAP systems is multidimensional analysis:
• Examining data from many dimensions.
• Show total sales across all products at increasing aggregation levels for a geography
dimension, from state to country to geographical region, for 1999 and 2000.
• Create a cross-tabular analysis of our operations showing expenses by territory in South
America for 1999 and 2000. Include all possible subtotals.
• List the top 10 sales representatives in Asia according to 2000 sales revenue for food
products and rank their commissions.
• Organization of cubes to efficiently answer the requests
• Response time of seconds / few minutes
October 10, 2023 PA220 DB for Analytics 9
Relational OLAP (ROLAP)
• Store data in relational databases and simulate multidimensionality with special
schemas
• Data stored in relational tables
• Star (or snowflake) schemas used for modeling
• SQL used for querying
• Pros
• Leverages investments in relational technology
• Scalable (billions of facts)
• Flexible, designs easier to change
• New, performance enhancing, techniques adapted from MOLAP
• Indices, materialized views
• Cons
• Storage use (often 3-4 times MOLAP)
• Response times
October 10, 2023 PA220 DB for Analytics 10
Multidimensional OLAP (MOLAP)
• Physically stages the processed multidimensional information to deliver
consistent and rapid response times to end users
• Data stored in special multidimensional data structures
• E.g., multidimensional array on hard disk
• Pros
• Less storage use (“foreign keys” not stored)
• Faster query response times
• Cons
• Up till now not so good scalability
• Less flexible, e.g., cube must be re-computed when design changes
• Does not reuse an existing investment (but often bundled with RDBMS)
• Not as open technology
October 10, 2023 PA220 DB for Analytics 11
Hybrid OLAP (HOLAP)
• Detail data stored in relational tables (ROLAP)
• Aggregates stored in multidimensional structures (MOLAP)
• Pros
• Scalable (as ROLAP)
• Fast (as MOLAP)
• Cons
• High complexity
October 10, 2023 PA220 DB for Analytics 12
Question time
• Suppose that we want to replace the original Store hierarchy A by a
new hierarchy B
• How do we modify the schema to reflect it in ROLAP / MOLAP?
October 10, 2023 PA220 DB for Analytics 13
Question time (2)
• New store is being open and fact table needs to be populated
• How do we modify the fact table in ROLAP / MOLAP ?
October 10, 2023 PA220 DB for Analytics 14
Relational OLAP Cubes
• Two kinds of queries
• Navigation queries examine one dimension
• SELECT DISTINCT l FROM d [WHERE p]
• Aggregation queries summarize fact data
• SELECT d1.l1, d2.l2, SUM(f.m)
FROM d1, d2, f
WHERE f.dk1 = d1.dk1 AND f.dk2 = d2.dk2 [AND p]
GROUP BY d1.l1,d2.l2
• Fast, interactive analysis of large amounts of data
October 10, 2023 PA220 DB for Analytics 15
OLAP Queries
October 10, 2023 PA220 DB for Analytics 16
OLAP Cube in MS Analysis Services Project
October 10, 2023 PA220 DB for Analytics 17
Multidimensional database implementation
• Microsoft SQL Server Analysis Services (SSAS) in MS SQL Server
• Oracle Database OLAP Option within Oracle database
October 10, 2023 PA220 DB for Analytics 18
Microsoft SQL Server Analysis Services (SSAS)
• Introduced in SQL Server 2008
• Nice features built-in
• Analysis Services
• Integration Services
• Reporting Services
• Easy to use
• Graphical “Management Studio” and “BI Developer Studio”
• Allows all flavors of MOLAP, ROLAP and HOLAP to be used within the
same model
• Intelligent pre-aggregation (for improving query performance)
• Uses the query language MDX (MultiDimensional eXpressions)
October 10, 2023 PA220 DB for Analytics 19
OLAP Operations & Queries
• DW queries are big queries
• Imply a large portion of the data
• Mostly read queries
• Redundancy is necessity
• Materialized views, special purpose indexes, denormalized schemas
• Data is refreshed periodically
• Daily or weekly
• Their purpose is to analyze data
• OLAP (OnLine Analytical Processing)
October 10, 2023 PA220 DB for Analytics 20
OLAP Operations
• Typical OLAP operations
• Roll up
• Drill down
• Slice and dice
• Pivot (rotate) (aka crosstab)
• Other operations
• Aggregate functions
• Ranking and comparing
• Drill across
• Drill through
• Data densification (partitioned outer join)
October 10, 2023 PA220 DB for Analytics 21
Roll Up
• Roll up (drill up)
• Taking the current aggregation level of fact values and doing a further
aggregation
• Summarize data by
• Climbing up hierarchy (hierarchical roll up)
• By dimensional reduction (dimensional roll up)
• Or by a mix of these 2 techniques
• Used for obtaining an increased generalization
• E.g., from Time.Week to Time.Year
October 10, 2023 PA220 DB for Analytics 22
Roll Up
• Hierarchical roll ups
• Performed on the fact table and some dimension tables by climbing up the
attribute hierarchies
• E.g., climbed the Time hierarchy to Quarter and Article hierarchy to Prod. group
October 10, 2023 PA220 DB for Analytics 23
Roll Up
• Dimensional roll ups
• Are done solely on the fact table by dropping one or more dimensions
• E.g., drop the Client dimension
October 10, 2023 PA220 DB for Analytics 24
Roll Up
• Climbing above the top in hierarchical roll up
• In an ultimate case, hierarchical roll up above the top level of an attribute
hierarchy (attribute “ALL”) can be viewed as converting to a dimensional roll
up
October 10, 2023 PA220 DB for Analytics 25
Drill Down
• Drill down (roll down)
• Reverse of roll up
• Represents a de-aggregate operation
• From higher level of summary to lower level of summary detailed data
• Introducing new dimensions
• Requires the existence of materialized finer grained data
• You can not drill if you do not have the data
October 10, 2023 PA220 DB for Analytics 26
Roll Up & Drill Down Example
October 10, 2023 PA220 DB for Analytics 27
Slice
• Slice: a subset of the multi-dimensional array corresponding to a
single value of one or more dimensions and projection on the rest of
dimensions
• E.g., project on Geo (store) and Time from values corresponding to Laptops
(cat. of products) in the product dimension
𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑> 𝑆𝑎𝑙𝑒𝑠
October 10, 2023 PA220 DB for Analytics 28
Slice
• Amounts to equality select condition on one dimension
• WHERE clause in SQL
• E.g., slice Laptops
October 10, 2023 PA220 DB for Analytics 29
Dice
• Dice: amounts to range select condition on one dimension, or to
equality select condition on more than one dimension
• E.g., range SELECT
𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑>∨𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐶𝑒𝑙𝑙𝑃ℎ𝑜𝑛𝑒𝐼𝑑> 𝑆𝑎𝑙𝑒𝑠
October 10, 2023 PA220 DB for Analytics 30
Dice
• E.g., equality SELECT on 2 dimensions Product and Time
• 𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑>∧𝑀𝑜𝑛𝑡ℎ𝐼𝑑=<𝐷𝑒𝑐𝑒𝑚𝑏𝑒𝑟> 𝑆𝑎𝑙𝑒𝑠
October 10, 2023 PA220 DB for Analytics 31
Pivoting
• Pivot (rotate): re-arranging data for viewing purposes
• The simplest view of pivoting is that it selects two dimensions to aggregate
the measure
• The aggregated values are often displayed in a grid where each point in the (x, y)
coordinate system corresponds to an aggregated value of the measure
• The x and y coordinate values are the values of the selected two dimensions
• The result of pivoting is also called cross tabulation
• This is space efficient for dense data only (thus, few dimensions)
• Shows data at different “granularities”
October 10, 2023 PA220 DB for Analytics 32
Pivoting in ROLAP
• Tabular representation for the cross-tabular
report with totals.
• ALL is a dummy value and stands for all or multiple
values.
• Probably not as nice to read as the crosstab.
• Information content is the same as in the
crosstab.
• Is more space efficient than crosstab if the data is
sparse.
October 10, 2023 PA220 DB for Analytics 33
SQL & OLAP
• The idea is to
• Select by Attributes of Dimensions
• E.g., region = „Europe“
• Group by Attributes of Dimensions
• E.g., region, month, quarter
• Aggregate on measures
• E.g., sum(price * volume)
• OLAP queries in SQL
October 10, 2023 PA220 DB for Analytics 34
SQL & OLAP
• No standard query language for OLAP
• SQL99 for ROLAP
• SQL:2003 OLAP extensions
• New SQL SELECT clauses
• GROUPING SETS
• ROLLUP
• CUBE
• Queries of type “top k”
• New aggregate functions
October 10, 2023 PA220 DB for Analytics 35
SQL & OLAP
• Shortcomings of SQL/92 with regard to OLAP queries
• Hard or impossible to express in SQL
• Multiple aggregations
• Comparisons (with aggregation)
• Reporting features
• Performance penalty
• Poor execution of queries with many AND and OR conditions
• Lack of support for advanced statistical functions
October 10, 2023 PA220 DB for Analytics 36
SQL92
• Multiple aggregations in SQL/92
• Create a 2D spreadsheet that shows sum of sales by maker as well as car
model
• Each subtotal requires a separate aggregation query
October 10, 2023 PA220 DB for Analytics 37
SELECT model, maker, sum(amt) FROM sales GROUP BY model, maker
union
SELECT model, sum(amt) FROM sales GROUP BY model
union
SELECT maker, sum(amt) FROM sales GROUP BY maker
union
SELECT sum(amt) FROM sales
SQL92
• Comparisons in SQL/92
• This year’s sales vs. last year’s sales for each product
• Requires a self join
• CREATE VIEW v_sales AS
SELECT prod_id , year, sum(qty) AS sale_sum
FROM sales GROUP BY prod_id, year;
• SELECT cur.prod_id , cur.year , cur.sale_sum , last.year, last.sale_sum
FROM v_sales cur, v_sales last
WHERE cur.year = (last.year+1) AND cur.prod_id = last.prod_id
October 10, 2023 PA220 DB for Analytics 38
SQL92
• Reporting features in SQL/92
• Too complex to express
• RANK (top k) and NTILE (“top X%” of all products)
• Median
• Running/rolling/moving total, moving average, cumulative totals
• E.g., moving average over a 3-day window of total sales for each product
• CREATE OR REPLACE VIEW v_sales AS SELECT prod_id , time_id , sum(qty) AS sale_sum
FROM sales GROUP BY prod_id , time_id
• SELECT end.time , avg( start.sale_sum ) FROM v_sales start, v_sales end
WHERE end.time >= start.time AND end.time <=start.time + 2 GROUP BY end.time
October 10, 2023 PA220 DB for Analytics 39
SQL99: Grouping Operators
• GROUP BY ROLLUP(gcols)
• Roll-up hierarchically
• GROUP BY CUBE (gcols)
• Roll-up to all possible combinations
• GROUP BY gcols1, CUBE(gcols2)
• Partial roll-up
• GROUP BY GROUPING SETS (gcols1, …, gcolsN)
• Explicit specification of roll-ups
• GROUP BY groupings1, groupings2, …
• Cross-product of groupings
• SELECT … GROUPING_ID(gcols)…
• Identification of roll-up level
October 10, 2023 PA220 DB for Analytics 40
Roll Up
• ROLLUP creates subtotals at n+1 levels, where n is the number of
grouping columns
• Rows that would be produced by GROUP BY without ROLLUP
• First-level subtotals
• Second-level subtotals
• …
• A grand total row
• It is very helpful for subtotaling along a hierarchical dimensions such
as time or geography
• ROLLUP(y, m, day) or ROLLUP(country, state, city)
• Order of attributes is significant!
October 10, 2023 PA220 DB for Analytics 41
Roll Up
• Roll up operation, e.g.:
• SELECT year, brand, SUM(qty) FROM sales GROUP BY ROLLUP(year, brand);
October 10, 2023 PA220 DB for Analytics 42
Cube
• CUBE creates 2n combinations of subtotals, where n is the number of
grouping columns
• Includes all the rows produced by ROLLUP
• CUBE is typically most suitable in queries that use columns from
multiple dimensions rather than columns representing different levels
of a single dimension
• e.g., subtotals for all combinations of month, state, and product
• Partial CUBE similar to partial ROLLUP
October 10, 2023 PA220 DB for Analytics 43
Cube
October 10, 2023 PA220 DB for Analytics 44
Cube
• Example
• SELECT year, brand, SUM(qty) FROM sales GROUP BY CUBE (year, brand);
October 10, 2023 PA220 DB for Analytics 45
Grouping Sets
• Grouping sets produce just the specified groupings.
• No (automatic) rollup is performed.
• E.g., GROUPING SETS ( (A,B), (D), (C, E, F) )
• Collection of columns in paratheses → composite column
• Efficiently replaces the series of UNIONed queries
• SELECT dept_name , CAST(NULL AS CHAR(10)) AS job_title , COUNT(*) FROM personnel
GROUP BY dept_name
UNION ALL
SELECT CAST(NULL AS CHAR(8)) AS dept_name , job_title , COUNT(*) FROM personnel
GROUP BY job_title;
• Can be rewritten as:
• SELECT dept_name , job_title , COUNT(*) FROM Personnel GROUP BY GROUPING SETS
( (dept_name), (job_title) );
October 10, 2023 PA220 DB for Analytics 46
Grouping Sets
• The issue of NULL values
• The new grouping functions generate NULL values at the subtotal levels
• How do we tell the difference between “generated NULLs” and “real NULLs” from the
data itself?
• The GROUPING function call returns 0 for NULL in the data and 1 for generated NULL
October 10, 2023 PA220 DB for Analytics 47
Grouping Operators: Equivalences
• CUBE(a,b) ≡ GROUPING SETS ((a,b), (a), (b), ())
• ROLLUP(a,b,c) ≡ GROUPING SETS ((a,b,c), (a,b), (a), ())
• GROUP BY GROUPING SETS (a,b,c) ≡
GROUP BY a UNION ALL GROUP BY b UNION ALL GROUP BY c
• GROUP BY GROUPING SETS ((a,b,c)) ≡ GROUP BY a, b, c
• GROUP BY GROUPING SETS (a,b,(b,c)) ≡
GROUP BY a UNION ALL GROUP BY b UNION ALL GROUP BY b, c
• GROUP BY GROUPING SETS (a,ROLLUP(b,c)) ≡
GROUP BY a UNION ALL GROUP BY ROLLUP(b, c)
October 10, 2023 PA220 DB for Analytics 48
Identification of Groupings
• With rollup and cube, we must provide a possibility to
programmatically determine the rollup level.
• The GROUPING_ID function is designed for this.
• GROUPING_ID takes a list of grouping columns as an argument.
• For each column it returns 1 if its value is NULL because of a rollup, and 0
otherwise.
• The list of binary digits is interpreted as a binary number and returned as a
base-10 number.
• Example: GROUPING_ID(a,b) for CUBE(a,b)
October 10, 2023 PA220 DB for Analytics 49
Concatenated Groupings
• A concatenated grouping is specified by listing multiple grouping sets,
cubes, and rollups, and produces the cross-product of groupings from
each grouping set
• Example:
• GROUP BY GROUPING SETS (a,b), GROUPING SETS (c,d)
produces (a,c), (a,d), (b,c), (b,d)
• A concise and easy way to generate useful combinations of groupings
• A small number of concatenated groupings can generate a large number of
final groups
• One of the most important uses for concatenated groupings is to generate the
aggregates for a hierarchical cube
October 10, 2023 PA220 DB for Analytics 50
Hierarchical Cubes
• A hierarchical cube is a data set where the data is aggregated along
the rollup hierarchy of each of its dimensions.
• The aggregations are combined across dimensions
• Example:
• ROLLUP(year, quarter, month),
ROLLUP(category,subcategory,name),
ROLLUP(region,subregion,country,state,city)
• Produces a total of 4x4x6=96 aggregate groups
• Compare to 2^11 = 2048 groupings by CUBE and 96 explicit group specifications
• Groups: (year,category,region), (quarter,category,region),
(month,category,region), …
October 10, 2023 PA220 DB for Analytics 51
Window Functions
• The window clause specifies an action to perform over a set of rows
• 3 sub clauses: Partitioning, ordering and aggregation grouping
• <aggregate function>
OVER ([PARTITION BY <column list>]
ORDER BY <sort column list>
[<aggregation grouping>])
• SELECT … ,
AVG(sales) OVER (PARTITION BY region
ORDER BY month ASC
ROWS 2 PRECEDING) AS SMA3, …
FROM …
• moving average of 3 rows
October 10, 2023 PA220 DB for Analytics 52
Ranking Operators
• Ranking operators in SQL
• Row numbering is the most basic ranking function
• Old style: ROW_NUMBER() returns a column that contains the row’s number within the
result set
• E.g., SELECT SalesOrderID , CustomerID , ROW_NUMBER()
OVER (ORDER BY SalesOrderID ) as RunningCount
FROM Sales WHERE SalesOrderID > 10000
ORDER BY SalesOrderID
October 10, 2023 PA220 DB for Analytics 53
Ranking Operators
• ROW_NUMBER doesn’t consider tied values
• Each 2 equal values get 2 different row numbers
• The behavior is nondeterministic
• Each tied value could have its number switched!
• We need something deterministic
October 10, 2023 PA220 DB for Analytics 54
Ranking Operators
• RANK and DENSE_RANK functions
• Allow ranking items in a group
• Syntax:
• RANK ( ) OVER ( [ query_partition_clause ] order_by_clause )
• DENSE_RANK ( ) OVER ( [ query_partition_clause ] order_by_clause )
• DENSE_RANK leaves no gaps in ranking sequence when there are ties
• PERCENT_RANK → (rank - 1) / (total rows - 1)
• CUME_DIST - the cumulative distribution
• the number of partition rows preceding (or peers with) the current row / total partition
rows
• The value ranges from 1/N to 1
October 10, 2023 PA220 DB for Analytics 55
Ranking Operators
• E.g.,
SELECT channel, calendar,
TO_CHAR(TRUNC(SUM(amount_sold), -6), '9,999,990’) AS sales,
RANK() OVER (ORDER BY TRUNC(amount_sold, -6)) DESC) AS rank,
DENSE_RANK() OVER (ORDER BY TRUNC(SUM(amount_sold), -6)) DESC) AS dense_rank
FROM sales, products
…
October 10, 2023 PA220 DB for Analytics 56
Ranking Operators
• Group ranking - RANK function can operate within groups: the rank
gets reset whenever the group changes
• A single query can contain more than one ranking function, each partitioning
the data into different groups.
• PARTITION BY clause
October 10, 2023 PA220 DB for Analytics 57
SELECT … RANK() OVER (PARTITION BY channel ORDER BY SUM(amount_sold) DESC) AS rank_by_channel
Ntile
• NTILE splits a set into equal-sized groups
• It divides an ordered partition into buckets and assigns a bucket number to
each row in the partition
• Buckets are calculated so that each bucket has exactly the same number of
rows assigned to it or at most 1 row more than the others
• NTILE(4) - quartile
• NTILE(100) - percentage
• Not a part of the SQL99 standard, but adopted by major vendors
October 10, 2023 PA220 DB for Analytics 58
SELECT … NTILE(3) OVER (ORDER BY sales) NT_3 FROM …
Window Frame
• Obtain a value of a particular row of a window frame defined by
window clause (PARTITION BY…)
• first_value(expression)
• last_value(expression)
• nth_value (expression)
October 10, 2023 PA220 DB for Analytics 59
SELECT … FIRST_VALUE(sales) OVER (ORDER BY sales) AS lowest_sale
CHANNEL CALENDAR SALES LOWEST_SALE
Direst sales 02.2016 10,000 4,000
Direst sales 03.2016 9,000 4,000
Internet 02.2016 6,000 4,000
Internet 03.2016 6,000 4,000
Partners 03.2016 4,000 4,000
SELECT … FIRST_VALUE(sales) OVER (PARTITION BY channel ORDER BY sales) AS lowest_sale_per_channel
Window Frame
• Access to a row that comes before the current row at a specified
physical offset with the current window frame (partition)
• LAG(expression [,offset [,default_value]])
• … after the current row
• LEAD(expression [,offset [,default_value]]
October 10, 2023 PA220 DB for Analytics 60
CHANNEL CALENDAR SALES PREV_SALE
Direst sales 02.2016 10,000 NULL
Direst sales 03.2016 9,000 10,000
Internet 02.2016 6,000 NULL
Internet 03.2016 6,000 6,000
Partners 03.2016 4,000 NULL
SELECT … LAG(sales, 1) OVER (PARTITION BY channel ORDER BY calendar) AS prev_sale
Data Densification
• Enrich the existing “holey” data with default values
• Goal: produce dense result for all products
• i.e., incl. the weeks 24, 25, and 26.
October 10, 2023 PA220 DB for Analytics 61
Data Densification
• Partitioned Outer Join
• Apply outer join on each partition
• Implemented in Oracle
October 10, 2023 PA220 DB for Analytics 62
Case Study: Grocery Store
• Stock Keeping Units (SKUs)
• Point Of Sale (POS) system
• Stores/Branches
• Promotions
• Task: Analyze how promotions affect sales
October 10, 2023 PA220 DB for Analytics 63
Case Study: DW Design Steps
• Choose the business process(es) to model
• Sales
• Choose the granularity of the business process
• Sales by Product by Store by Promotion by Day
• Low granularity is needed
• Are individual transactions necessary/feasible?
• Choose the dimensions
• Time, Branch (store), Advert (promotion), Product
• Choose the measures
• Dollar_sales, unit_sales, dollar_cost, customer_count
• Resisting normalization and preserving browsing
• Flat dimension tables makes browsing easy and fast
October 10, 2023 PA220 DB for Analytics 64
Case Study: Dimensions
• Time dimension
• Explicit time dimension is needed (events, holidays,..)
• Product dimension
• Many-level hierarchy allows drill-down/roll-up
• Many descriptive attributes (often more than 50)
• Branch (store) dimension
• Many descriptive attributes
• Advert (promotion) dimension
• Example of a causal dimension
• Used to see if promotions work/are profitable
• Ads, price reductions, end-of-aisle displays, coupons
October 10, 2023 PA220 DB for Analytics 65
Case Study: Measures (Facts)
• All additive across all dimensions
• Dollar_sales
• Unit_sales
• Dollar_cost
• Gross profit (derived)
• Computed from sales and cost: sales – cost
• Additive
• Gross margin (derived)
• Computed from gross profit and sales: (sales – cost)/cost
• Non-additive across all dimensions
• Customer_count
• Additive across time, promotion, and store
• Non-additive across product. Why?
• Semi-additive
October 10, 2023 PA220 DB for Analytics 66
Case Study: Data Model
t(time)
t_id (PK)
t_day_name
t_cal_month_num
t_cal_year
t_cal_week_num
t_date
t_holiday
…
p(roduct)
p_id
p_name
p_desc
p_cat
p_subcat
p_list_price
b(ranch)
b_id
b_name
b_city
b_country
b_country_code
b_region
…
a(dvert)
a_id
a_name
a_aisle_display
a_coupon
a_price_reduction
m(edia)
m_id
m_desc
m_class
…
c(ustomer)
c_id
c_first_name
c_last_name
c_country
c_country_code
c_region
s(ales)
s_p_id (FK)
s_c_id (FK)
s_t_id (FK)
s_m_id (FK)
s_a_id (FK)
s_b_id (FK)
s_quant_sold
s_dollar_sold
s_dollar_cost
s_custom_cnt
October 10, 2023 PA220 DB for Analytics 68
Case Study: Common View
• A common view is created to simplify queries
October 10, 2023 PA220 DB for Analytics 69
CREATE VIEW spcbatm
SELECT *
FROM sales s JOIN product p ON (s_p_id=p_id)
JOIN customer c ON (s_c_id=c_id)
JOIN branch b ON (s_b_id=b_id)
JOIN advert a ON (s_a_id=a_id)
JOIN time t ON (s_t_id=t_id)
JOIN media m ON (s_m_id=m_id)
Case Study: Roll up Example
• Rollup from right to left
• Computes and combines the following groupings
• m_desc, t_cal_month_desc, c_country_code
• m_desc, t_cal_month_desc
• m_desc
• October
10, 2023 PA220 DB for Analytics 70
SELECT m_desc, t_cal_month_desc, c_country_code,
SUM(s_dollar_sold)
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet')
AND t_cal_month_desc IN ('2000-09', '2000-10')
AND c_country_code IN ('GB', 'US')
GROUP BY ROLLUP(m_desc, t_cal_month_desc, c_country_code);
Case Study: Roll up Example
October 10, 2023 PA220 DB for Analytics 71
C_
Case Study: Partial Roll up
• m_desc is always present and not part of the rollup hierarchy
• Computes and combines the following groupings
• m_desc, t_cal_month_desc, c_country_code
• m_desc, t_cal_month_desc
• m_desc
October 10, 2023 PA220 DB for Analytics 72
SELECT m_desc, t_cal_month_desc, c_country_code,
SUM(s_dollar_sold)
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet')
AND t_cal_month_desc IN ('2000-09', '2000-10')
AND c_country_code IN ('GB', 'US')
GROUP BY m_desc, ROLLUP(t_cal_month_desc, c_country_code);
Case Study: Partial Roll up
October 10, 2023 PA220 DB for Analytics 73
C_ SUM(S_DOLLAR_SOLD)
Case Study: Cube Example
• Produces all possible roll-up combinations
• Computes and combines the following groupings
• m_desc, t_cal_month_desc, n_iso_code
• m_desc, t_cal_month_desc
• m_desc, n_iso_code
• t_cal_month, n_iso_code
• m_desc
• October
10, 2023 PA220 DB for Analytics 74
SELECT m_desc, t_cal_month_desc, c_country_code, SUM(s_dollar_sold)
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet')
AND t_cal_month_desc IN ('2000-09', '2000-10')
AND c_country_code IN ('GB', 'US')
GROUP BY CUBE(m_desc, t_cal_month_desc, c_country_code);
Case Study: Cube Example
October 10, 2023 PA220 DB for Analytics 75
SUM(S_DOLLAR_SOLD)
Grouping Sets
October 10, 2023 PA220 DB for Analytics 76
SELECT m_desc, t_cal_month_desc, c_country_code,
SUM(s_dollar_sold)
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet’)
AND t_cal_month_desc IN ('2000-09', '2000-10’)
AND c_country_code IN ('GB', 'US')
GROUP BY GROUPING SETS ((m_desc, t_cal_month_desc, c_country_code),
(m_desc, c_country_code),
(t_cal_month_desc, c_country_code));
Grouping Sets
October 10, 2023 PA220 DB for Analytics 77
SUM(S_DOLLAR_SOLD)C_
Grouping ID Example
• Replaces all NULLs from rollup with string ‘*’.
• Leaves NULL that are not the result of rollup untouched.
• Could easily make selective replacements of NULL.
October 10, 2023 PA220 DB for Analytics 78
SELECT
CASE WHEN GROUPING_ID(m_desc)=1 THEN ‘*' ELSE m_desc END,
CASE WHEN GROUPING_ID(c_country_code)=1 THEN ‘*' ELSE c_country_code END,
SUM(s_dollar_sold)
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet’)
AND t_cal_month_desc= '2000-09’
AND c_country_code IN ('GB', 'US')
GROUP BY CUBE(m_desc, c_country_code);
Grouping ID Example
October 10, 2023 PA220 DB for Analytics 79
SUM(S_DOLLAR_SOLD)
Ranking Example
• Rank the media (‘Internet’ versus ‘Direct sales’) used for selling
products according to their dollar sales. Use the number of unit sales
to break ties. Do the analysis for August until November 2000.
October 10, 2023 PA220 DB for Analytics 80
SELECT m_desc, t_cal_month_desc, SUM(s_dollar_sold), SUM(s_quant_sold),
RANK() OVER (ORDER BY SUM(s_dollar_sold) DESC,
SUM(s_quant_sold) DESC) AS Rank
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet'),
AND t_cal_month_desc IN ('2000-08', '2000-09’, '2000-10', '2000-11')
GROUP BY m_desc, t_cal_month_desc;
Ranking Example (2)
• Determine the two least and most successful sales media,
respectively (in terms of total amount sold).
October 10, 2023 PA220 DB for Analytics 81
SELECT * FROM (SELECT m_desc, SUM(s_dollar_sold),
RANK() OVER (ORDER BY SUM(s_dollar_sold)) worst,
RANK() OVER (ORDER BY SUM(s_dollar_sold) DESC) best
FROM spcbatm
GROUP BY m_desc)
WHERE worst < 3 OR best < 3;
SUM(S_DOLLAR_SOLD)
Ranking Example (3)
• Determine the output of the following statement:
October 10, 2023 PA220 DB for Analytics 82
SELECT c_id, p_id, RANK() OVER (ORDER BY p_id) AS r1,
RANK() OVER (ORDER BY c_id) AS r2,
RANK() OVER (ORDER BY 1) AS r3,
RANK() OVER (PARTITION BY c_id ORDER BY p_id) AS r4,
RANK() OVER (PARTITION BY p_id ORDER BY c_id) AS r5
FROM spcbatm
WHERE c_id in (214, 608, 699)
AND p_id in (42, 98, 123)
GROUP BY c_id, p_id;
Summary
• ROLAP is a good option
• exploits existing investments
• SQL:2003 has added a lot of support for OLAP operations
• SQL is not just select-from-where
• Extensions of GROUP BY clause
• rollup, grouping sets, cube
• functions to identify grouping
• Case study for Grocery store
October 10, 2023 PA220 DB for Analytics 83