ETL Process
PA220: Database systems for data analytics
December 14, 2022

Contents
• Overview of ETL
• Data Cleaning
• Loading Tips
• Issues
• Summary

ETL Process Overview
• Data is periodically brought from the ODS to the data warehouse.
• In most DW systems, the ETL process is the most complex part,
  • and also the most underestimated and time-consuming part.
• Often, 80% of development time is spent on ETL.

Data Staging Area
• Transit storage for data underway in the ETL process
• Transformations/cleansing are done here
• No user queries (though some systems allow them)
• Few, sequential operations on large data volumes
• Performed by central ETL logic
• Easily restarted
• No need for locking, logging, etc.
• RDBMS or flat files? (DBMSs have become better at this)
• Finished dimensions are copied from the staging area to the relevant marts

[Figure: DW architecture. Sources (e-shop, ERP, CRM, web logs, files) feed the data staging area (cleaning, integrating, transforming); the data warehouse stores raw data, metadata, dimensions, summary data, and pre-calculated aggregates; data marts (Sales, Marketing, Stock Management, Services) serve reporting, visualization, and analysis.]

ETL Process Types
• When do we run the ETL process?
• Considerations:
  • Overhead on the data warehouse and source sides
    • E.g., online propagation puts a permanent burden on both sides and cannot benefit from bulk-loading mechanisms.
  • Data staleness
    • Frequent updates reduce staleness but increase overhead.
  • Debugging, failure handling
    • With online/stream-based mechanisms, it may be more difficult to track down problems.
  • Different processes for different flavors of data?
    • E.g., periodic refresh may work well for small (dimension) tables.

Capturing Data Changes
• Detecting changes is a challenge:
• Audit columns (see the sketch after this list)
  • E.g., a “last modified” time stamp
  • Set time stamps or “new” flags on every row update. How?
  • Unset “new” flags on every load into the DW. Why?
• Full diff
  • Keep the old snapshot and diff it with the current version.
  • Thorough: will detect any change
  • Resource-intensive: large data volumes must be moved and scanned
  • Optimization: hashes/checksums to speed up the comparison
• Database log scraping
  • The database’s write-ahead log contains all change information.
  • Scraping the log may get messy, though.
  • Variant: create a message stream ODS → DW
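To make the audit-column approach concrete, here is a minimal sketch in PostgreSQL-flavored SQL. The orders table, the last_modified column, and the etl_runs bookkeeping table are hypothetical names, not part of any particular system.

```sql
-- Audit column: a trigger keeps "last_modified" current on every row update.
CREATE OR REPLACE FUNCTION touch_last_modified() RETURNS trigger AS $$
BEGIN
  NEW.last_modified := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_touch
BEFORE UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION touch_last_modified();

-- Incremental extraction: pull only rows changed since the last ETL run.
SELECT o.*
FROM   orders o
WHERE  o.last_modified > (SELECT max(started_at) FROM etl_runs);
```

Recording each run's start time in a table like etl_runs plays the same role as resetting the “new” flags above: advancing the high-water mark after every load is what keeps the next extraction from re-reading unchanged rows.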
Data Cleansing
• After extraction, data must be normalized and cleaned.

Data Quality (Revision)
• Data almost never has decent quality.
• Data in the DW must be:
  • Precise
    • DW data must match known numbers, or an explanation is needed.
  • Complete
    • The DW has all relevant data, and the users know it.
  • Consistent
    • No contradictory data: aggregates fit with detail data.
  • Unique
    • The same thing is called the same and has the same key (e.g., customers).
  • Timely
    • Data is updated “frequently enough”, and the users know when.

Data Cleansing
• Problem:
  • Real-world data is messy.
  • Are there consistency rules in the OLTP system?
  • A lot of data is still entered by people.
  • Data warehouses serve as an integration platform.
• Typical cleaning and normalization tasks:
  • Correct spelling errors.
  • Identify record matches and duplicates.
  • Resolve conflicts and inconsistencies.
  • Normalize (“conform”) data.

Data Cleansing – Primitives
• Similarity join
  • Brings together similar data
  • Used for record matching and deduplication
• Clustering
  • Puts items into groups based on “similarity”
  • E.g., pre-processing for deduplication
• Parsing
  • E.g., the source table has an ‘address’ column, whereas the target table has ‘street’, ‘zip’, and ‘city’ columns.
  • Might have to identify pieces of a string to normalize (e.g., “Road” → “Rd”).

Data Cleansing – Similarity Join
• Process of identifying duplicates
• Similarity measures:
  • edit distance, Jaccard coefficient, Soundex
• The similarity threshold is important:
  • it limits the number of candidates for duplicates!

Data Cleansing – Detecting Inconsistencies
• Data screening system:
• Column screens: test data within a column
  • Correct value ranges, value formatting, null values?
  • Detect random/noise values.
• Structure screens: relationships across columns
  • Do foreign key relationships hold?
  • Is a combination of columns a valid postal address?
• Business rule screens: is the data plausible according to business rules?
  • E.g., customer status X requires Y years of loyalty, Z EUR total revenue, etc.

Improving Data Quality
• Appoint “data stewards” responsible for data quality.
  • A given steward has the responsibility for certain tables.
  • Includes manual inspections and corrections!
• DW-controlled improvement
  • Default values: e.g., a “Not yet assigned 157” value acts as a note to the data steward.
• Source-controlled improvements
  • The optimal solution?
• Construct programs that check data quality.
  • Are totals as expected?
  • Do results agree with an alternative source?
• Do not fix all problems with data quality.
  • Allow management to see “weird” data in their reports?

Schema Integration
• Different source systems, types, and schemas must be integrated.
• Infer the mapping between schemas (automatically)?
• Tools:
  • Compare table and attribute names; consider synonyms and homonyms.
  • Infer data types/formats and mapping rules.
  • Techniques like similarity joins and deduplication.
• Still: often a lot of manual work is needed.

[Figure: star-schema model]

Data Transformation
• Source → staging table:
  • The tool depends on the data source (database, XML, flat files, etc.),
  • e.g., SQL, XQuery, Perl, awk, etc.
• Often:
  • extract to a flat file (e.g., CSV),
  • then bulk-load into the staging table.

Prepare Dimension Tables
• Checks:
  • Is the dimension row new?
  • Have attributes in the dimension changed?
  • Handle updates respecting the SCD type of the dimension.

Prepare Dimension Tables – Problems
• “Upsert” – update if exists, else insert (aka SQL-based update)
  • often a real performance killer
  • better: separate updates from bulk-loaded inserts (see the sketch after this list)
• Generate and look up dimension surrogate keys
  • e.g., use the key generator of the back-end DB
  • Maintain a “Dim Cross Ref” table in memory or in the back-end DB
• Dimensions must be updated before facts
  • The relevant dimension rows for new facts must be in place.
  • Special key considerations apply if the initial load must be performed again.
• May re-compute aggregates (Type 1 updates)
  • again, bulk loading/changing is a good choice
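A minimal sketch of both loading strategies in PostgreSQL-flavored SQL, assuming a Type 1 dimension. The dim_customer and stage_customer tables and all column names are hypothetical; the surrogate key comes from the back-end DB's key generator (an identity column), and the natural key carries a unique constraint so the upsert has something to conflict on.

```sql
-- Surrogate key generated by the back-end DB (identity column).
CREATE TABLE dim_customer (
  customer_key bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id  text NOT NULL UNIQUE,   -- natural key from the source system
  name         text,
  city         text
);

-- Variant 1: row-by-row "upsert" (simple, but often a real performance killer).
INSERT INTO dim_customer (customer_id, name, city)
SELECT customer_id, name, city FROM stage_customer
ON CONFLICT (customer_id) DO UPDATE
  SET name = EXCLUDED.name, city = EXCLUDED.city;

-- Variant 2: separate updates from inserts.
-- (a) Type 1 update of existing rows whose attributes changed.
UPDATE dim_customer d
SET    name = s.name, city = s.city
FROM   stage_customer s
WHERE  d.customer_id = s.customer_id
  AND  (d.name, d.city) IS DISTINCT FROM (s.name, s.city);

-- (b) Insert rows that are genuinely new; this half can be bulk-loaded.
INSERT INTO dim_customer (customer_id, name, city)
SELECT s.customer_id, s.name, s.city
FROM   stage_customer s
LEFT JOIN dim_customer d USING (customer_id)
WHERE  d.customer_id IS NULL;
```

Variant 2 touches each existing row at most once and leaves the insert half free to go through the bulk-load path, which is why the slide recommends it over row-by-row upserts.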
Loading Data – Performance Tips
1. Turn off logging
  • Databases maintain a write-ahead log to implement failure-tolerance mechanisms.
  • Row-by-row logging causes huge overhead.
2. Disable indexes and reindex after updates.
3. Pre-sort data
  • Depending on the system, this may speed up index construction.
  • Additional benefit: it may result in a better physical layout.
4. Truncate the table
  • when loading from scratch.
5. Enable “fast mode”
  • If the data is prepared properly, the database may use faster parsing mechanisms,
  • e.g., the “copy from” command.
6. Make sure the data is correct
  • Transformation, field truncation, and error reporting may slow down bulk loading significantly.
7. Temporarily disable integrity control
  • Avoid checking during the load, but afterwards do it in bulk, too,
  • e.g., foreign keys in the fact table.
8. Parallelization
  • Dimensions can be loaded concurrently.
  • Fact tables can be loaded concurrently.
  • Partitions can be loaded concurrently
    • when horizontal partitioning of fact tables is used.

Hints on ETL Design
• Do not try to implement all transformations in one step!
• Do one (or just a few) thing(s) at a time:
  • Copy source data one-to-one to the staging area.
  • Compute deltas
    • only if doing incremental loads.
  • Handle versions and DW keys
    • versions only if handling slowly changing dimensions.
  • Implement complex transformations.
  • Load dimensions.
  • Load facts.

Issues
• Files versus streams/pipes
  • Streams/pipes: no disk overhead, fast throughput
  • Files: easier restart, often the only possibility
• ETL tool or not
  • Code: easy start, co-existence with the IT infrastructure
  • Tool: better productivity on subsequent projects
• Load frequency
  • ETL time depends on data volumes:
    • a daily load is much faster than a monthly one.
  • This applies to all steps in the ETL process.
• Should the DW be online 24/7?
  • Use partitions or several sets of tables.

Summary
• ETL is very time-consuming (80% of the entire project).
• It needs to be implemented as a sequence of many small steps.
• Data quality is crucial and is fixed in ETL.
• Extraction of data from source systems might be very time-consuming;
  • an incremental approach is suggested.
• Transformation into the DW format includes many steps, such as
  • building keys, cleansing the data, handling inconsistent/duplicate data, etc.
• Load includes loading the data into the DW, updating indexes, computing pre-aggregates, etc. (see the closing sketch below).
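To tie the loading steps together, a closing sketch of a from-scratch fact load in PostgreSQL-flavored SQL, applying performance tips 1, 2, 4, 5, and 7 from above. The fact_sales and dim_date tables, the index and constraint names, and the file path are all hypothetical.

```sql
-- Tip 1: avoid write-ahead-log overhead during the load (PostgreSQL-specific).
ALTER TABLE fact_sales SET UNLOGGED;

-- Tip 4: loading from scratch, so truncate instead of deleting row by row.
TRUNCATE fact_sales;

-- Tips 2 and 7: drop the index and the foreign key; re-create them in bulk later.
DROP INDEX IF EXISTS fact_sales_date_idx;
ALTER TABLE fact_sales DROP CONSTRAINT IF EXISTS fact_sales_date_fk;

-- Tip 5: bulk-load a pre-sorted CSV file via the fast-path "copy from" command.
COPY fact_sales FROM '/staging/fact_sales.csv' WITH (FORMAT csv);

-- Rebuild the index and re-check integrity in one bulk pass each.
CREATE INDEX fact_sales_date_idx ON fact_sales (date_key);
ALTER TABLE fact_sales
  ADD CONSTRAINT fact_sales_date_fk
  FOREIGN KEY (date_key) REFERENCES dim_date (date_key);

-- Restore crash safety once the load has finished.
ALTER TABLE fact_sales SET LOGGED;
```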