ETL Process
PA220: Database systems for data analytics

Contents
• Overview of ETL
• Data Cleaning
• Loading Tips
• Issues
• Summary

ETL Process Overview
• Data is periodically brought from the ODS to the data warehouse.
• In most DW systems, the ETL process is the most complex part,
  • and also the most underestimated and time-consuming part.
  • Often, 80% of development time is spent on ETL.

Data Staging Area
• Transit storage for data underway in the ETL process
• Transformations/cleansing done here
• No user queries (though some systems allow them)
• Few, sequential operations on large data volumes
• Performed by central ETL logic
• Easily restarted
• No need for locking, logging, etc.
• RDBMS or flat files? (DBMSs have become better at this)
• Finished dimensions copied from the staging area to the relevant marts

[Figure: ETL architecture. Sources (transaction data, e-shop, ERP, CRM, services, web logs, files) feed the data staging area (cleaning, integrating, transforming of staging data), which loads the data warehouse (raw data, metadata, data dimensions, summary data, pre-calculated aggregate storage); data marts (sales, marketing, stock management) serve reporting, visualization, and analysis.]

ETL Process Types
• When do we run the ETL process?
• Considerations:
  • Overhead on the data warehouse and source sides
    • E.g., online propagation puts a permanent burden on both sides and cannot benefit from bulk-loading mechanisms.
  • Data staleness
    • Frequent updates reduce staleness but increase overhead.
  • Debugging, failure handling
    • With online/stream-based mechanisms, it may be more difficult to track down problems.
  • Different process for different flavors of data?
    • E.g., periodic refresh may work well for small (dimension) tables.

[Figure: star-schema model]

Data Extraction: Getting Data
• Source → staging table:
  • Tool selection depends on the data source: database, XML, flat files, etc.
  • Use SQL, XQuery, Perl, awk, etc. to query the source system.
• Often:
  • extract to a flat file (e.g., CSV),
  • then bulk-load it into a staging table.
• Data compression for large data transfers
• Data encryption if the transfer goes over public networks

Data Extraction: Capturing Data Changes
• Detecting changes is a challenge:
• Audit columns
  • E.g., a "last modified" timestamp
  • Set timestamps or "new" flags on every row update. How?
  • Unset "new" flags on every load into the DW. Why?
• Full diff
  • Keep the old snapshot and diff it with the current version.
  • Thorough: will detect any change.
  • Resource-intensive: large volumes need to be moved and scanned.
  • Optimization: hashes/checksums to speed up the comparison.
• Database log scraping
  • The database's write-ahead log contains all change information.
  • Scraping the log may get messy, though.
  • Variant: create a message stream ODS → DW.
• Message queue monitoring
  • The source system must use a messaging framework; then the overhead is low.
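A minimal sketch of the audit-column approach combined with the extract-to-flat-file / bulk-load pattern above, in PostgreSQL-flavored SQL. The table orders, its columns, the last_modified audit column, and the file path are invented for illustration; the cut-off timestamp of the previous extract would normally come from ETL metadata.

-- On the source system: export only rows changed since the previous extract.
COPY (
  SELECT order_id, customer_id, amount, status, last_modified
  FROM   orders
  WHERE  last_modified > TIMESTAMP '2023-10-15 00:00:00'
) TO '/tmp/orders_delta.csv' WITH (FORMAT csv, HEADER true);

-- On the staging side: bulk-load the flat file into a staging table.
CREATE TABLE IF NOT EXISTS stg_orders (
  order_id      bigint,
  customer_id   bigint,
  amount        numeric(12,2),
  status        text,
  last_modified timestamp
);
COPY stg_orders FROM '/tmp/orders_delta.csv' WITH (FORMAT csv, HEADER true);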
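A sketch of the full-diff variant with the hash/checksum optimization, again with hypothetical table and column names: the current extract (stg_customers) is compared against the previous snapshot (snap_customers), and only rows that are new or whose md5 row hash differs are passed on as changes.

SELECT cur.*
FROM   stg_customers cur
LEFT JOIN snap_customers prev ON prev.customer_id = cur.customer_id
WHERE  prev.customer_id IS NULL                      -- new row
   OR  md5(coalesce(cur.full_name, '')  || '|' || coalesce(cur.city, '')  || '|' || coalesce(cur.segment, ''))
    <> md5(coalesce(prev.full_name, '') || '|' || coalesce(prev.city, '') || '|' || coalesce(prev.segment, ''));
-- In practice the hash would be precomputed and stored with the snapshot,
-- so only the current extract has to be hashed on each run.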
Data Cleansing
• After extraction, data must be normalized and cleaned.

Data Quality (Revision)
• Data almost never has decent quality.
• Data in the DW must be:
  • Precise: DW data must match known numbers, or an explanation is needed.
  • Complete: the DW has all relevant data, and the users know it.
  • Consistent: no contradictory data; aggregates fit with detail data.
  • Unique: the same thing is called the same and has the same key (e.g., customers).
  • Timely: data is updated "frequently enough", and the users know when.

Data Cleansing
• Problem:
  • Real-world data is messy.
  • Consistency rules in the OLTP system?
  • A lot of data is still entered by people.
  • Data warehouses serve as an integration platform.
• Typical cleaning and normalization tasks:
  • Correct spelling errors.
  • Handle missing/null values.
  • Identify record matches and duplicates.
  • Resolve conflicts and inconsistencies.
  • Normalize ("conform") data.

Data Cleansing: Primitives
• Similarity join
  • Brings together similar data.
  • Used for record matching and deduplication.
• Clustering
  • Puts items into groups based on "similarity".
  • E.g., pre-processing for deduplication.
• Parsing
  • E.g., the source table has an 'address' column, whereas the target table has 'street', 'zip', and 'city' columns.
  • Might have to identify pieces of a string to normalize (e.g., "Road" → "Rd").

Data Cleansing: Similarity Join
• Process of identifying duplicates (see the SQL sketch after the screening slide below)
• Similarity measures: edit distance, Jaccard coefficient, Soundex
• The similarity threshold is important:
  • it limits the number of candidates for duplicates!

Data Cleansing: Detecting Inconsistencies
• Data (quality) screening system:
• Column screens: test data within a column
  • Correct value ranges, value formatting, null values?
  • Detect random/noise values.
• Structure screens: relationships across columns
  • Two or more columns implement a hierarchy (e.g., a series of m:n relationships).
  • Foreign-key relationships between tables.
  • A combination of columns forms a valid item, e.g., an existing postal address.
• Business rule screens: is the data plausible according to business rules?
  • E.g., customer status X requires Y years of loyalty, Z EUR total revenue, etc.
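The SQL sketch promised above: a similarity join producing candidate duplicate pairs. It assumes PostgreSQL with the pg_trgm extension (trigram similarity); the staging table, the full_name column, and the 0.8 threshold are illustrative, and edit distance (levenshtein from the fuzzystrmatch extension) or Soundex could serve as the measure instead.

CREATE EXTENSION IF NOT EXISTS pg_trgm;
SET pg_trgm.similarity_threshold = 0.8;              -- pairs above this count as similar

SELECT a.customer_id AS id_a,
       b.customer_id AS id_b,
       similarity(a.full_name, b.full_name) AS sim
FROM   stg_customers a
JOIN   stg_customers b
  ON   a.customer_id < b.customer_id                 -- emit each pair only once
 AND   a.full_name % b.full_name                     -- trigram similarity above the threshold
ORDER BY sim DESC;

A GIN trigram index (gin_trgm_ops) on full_name can make the % comparison feasible on larger tables.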
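Two of the screens above expressed as queries over the same hypothetical staging tables: a column screen for value ranges and nulls, and a structure screen for a violated foreign-key relationship. Rows they return would be logged or tagged rather than silently dropped.

-- Column screen: NULL, out-of-range, or unexpected values in a single column.
SELECT *
FROM   stg_orders
WHERE  amount IS NULL
   OR  amount < 0
   OR  status NOT IN ('open', 'shipped', 'cancelled');

-- Structure screen: staged fact rows whose customer has no matching dimension row.
SELECT f.*
FROM   stg_orders f
LEFT JOIN dim_customer d ON d.customer_nk = f.customer_id
WHERE  d.customer_nk IS NULL;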
Data Cleansing: Error Handling
• Halting the process on error
  • requires manual intervention: diagnose, then restart/resume the job or abort it.
• Create a suspense file
  • Log the errors in a side channel for later processing.
  • It is not clear when to handle its contents: fix the records and re-introduce them to the job?
  • Until these data items are restored, the overall DB integrity is questionable.
• Tag the data and continue
  • Bad fact records: create an audit dimension.
  • Bad dimension data: use unique error values.
  • The best solution whenever possible.

Data Cleansing: Error Handling
• A special error event schema can be created
  • as a result of "tag the data and continue".
• The grain corresponds to an individual error occurrence.
• Batch dimension: information about the job.
• Date dimension: not the minute and second of the error,
  • but rather the weekday, last day of the fiscal period, etc., used to constrain/summarize errors.
• Time of day: timestamp when the error occurred.

Data Cleansing: Error Handling
• Audit dimension
  • attached to the resulting fact table,
  • created during data cleansing,
  • stores audit conditions.
• Example
  • An ETL job finished with no error:
    • a new audit record describing it is created,
    • and all new fact records are associated with it.
  • If an error occurred (e.g., a value out of bounds):
    • another audit record is created, and the failing fact records get attached to it.

Improving Data Quality
• Appoint "data stewards" responsible for data quality.
  • A given steward has the responsibility for certain tables.
  • Includes manual inspections and corrections!
• DW-controlled improvement
  • Default values, e.g., "Not yet assigned 157", serve as a note to the data steward.
• Source-controlled improvements
  • The optimal solution?
• Construct programs that check data quality.
  • Are totals as expected?
  • Do results agree with an alternative source?
• Do not fix all problems with data quality.
  • Allow management to see "weird" data in their reports?

Data Transformation: Schema Integration
• Different source systems, types, and schemas must be integrated.
• Infer the mapping between schemas (automatically)?
• Tools:
  • compare table and attribute names; consider synonyms and homonyms,
  • infer data types/formats and mapping rules,
  • use techniques like similarity joins and deduplication.
• Still: often a lot of manual work is needed.

Data Loading: Prepare Dimension Tables
• Checks:
  • Is the dimension row new?
  • Have attributes in the dimension changed?
  • Handle updates respecting the SCD type of the dimension.
• Generate the surrogate keys.
• All done for each dimension.

Data Loading: Prepare Dimension Tables – Problems
• "Upsert" (update if exists, else insert; aka SQL-based update)
  • is often a real performance killer;
  • better: separate updates and bulk-load the inserts (see the SQL sketch after the performance tips below).
• Generate and find dimension surrogate keys
  • e.g., use the key generator of the back-end DB,
  • maintain a "Dim Cross Ref" table in memory or in the back-end DB.
• Dimensions must be updated before facts:
  • the relevant dimension rows for new facts must be in place.
• Special key considerations if the initial load must be performed again.
• May re-compute aggregates (Type 1 updates)
  • again, bulk loading/changing is a good choice.

Data Loading: Performance Tips
1. Turn off logging
  • Databases maintain a write-ahead log to implement failure-tolerance mechanisms.
  • Row-by-row logging causes huge overhead.
2. Disable indexes and reindex after updates
3. Pre-sort data
  • Depending on the system, this may speed up index construction.
  • Additional benefit: may result in a better physical layout.
4. Truncate table
  • When loading from scratch.
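The sketch referred to in the dimension-loading slide above: instead of a row-by-row upsert, changed rows are updated in one statement and genuinely new rows are bulk-inserted with surrogate keys drawn from a sequence of the back-end DB. Table and column names are hypothetical, and only Type 1 (overwrite) handling is shown; a Type 2 dimension would insert a new row version instead of updating.

CREATE SEQUENCE IF NOT EXISTS dim_customer_sk_seq;

-- 1) Update existing dimension rows whose attributes changed (SCD Type 1).
UPDATE dim_customer d
SET    city    = s.city,
       segment = s.segment
FROM   stg_customers s
WHERE  d.customer_nk = s.customer_id
  AND  (d.city IS DISTINCT FROM s.city OR d.segment IS DISTINCT FROM s.segment);

-- 2) Bulk-insert genuinely new rows with freshly generated surrogate keys.
INSERT INTO dim_customer (customer_sk, customer_nk, full_name, city, segment)
SELECT nextval('dim_customer_sk_seq'), s.customer_id, s.full_name, s.city, s.segment
FROM   stg_customers s
LEFT JOIN dim_customer d ON d.customer_nk = s.customer_id
WHERE  d.customer_nk IS NULL;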
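A sketch of how tips 1-4 might look for a fact table reloaded from scratch from a pre-sorted flat file, in PostgreSQL terms. The table, index, and file names are invented, and whether SET UNLOGGED / SET LOGGED pays off depends on table size, since switching back rewrites the table.

ALTER TABLE fact_sales SET UNLOGGED;          -- tip 1: skip write-ahead logging during the load
DROP INDEX IF EXISTS fact_sales_date_idx;     -- tip 2: no per-row index maintenance
TRUNCATE TABLE fact_sales;                    -- tip 4: cheap when loading from scratch

-- tip 3: the file is assumed to be pre-sorted by date_key before loading
COPY fact_sales FROM '/tmp/fact_sales_sorted.csv' WITH (FORMAT csv);

CREATE INDEX fact_sales_date_idx ON fact_sales (date_key);   -- rebuild once, after the load
ALTER TABLE fact_sales SET LOGGED;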
Loading Data – Performance Tips
5. Enable "fast mode"
  • If the data is prepared properly, the database may use faster parsing mechanisms,
  • e.g., the "copy from" command.
6. Make sure the data is correct
  • Transformation, field truncation, and error reporting may slow down bulk loading significantly.
7. Temporarily disable integrity control
  • Avoid checking during the load, but do it afterwards, in bulk.
  • E.g., foreign keys in the fact table.

Loading Data – Performance Tips
8. Parallelization
  • Dimensions can be loaded concurrently.
  • Fact tables can be loaded concurrently.
  • Partitions can be loaded concurrently
    • when horizontal partitioning of fact tables is used.

Hints on ETL Design
• Do not try to implement all transformations in one step!
• Do one (or just a few) thing(s) at a time:
  • Copy source data one-to-one to the staging area.
  • Compute deltas
    • only if doing an incremental load.
  • Handle versions and generate DW keys
    • versions only if handling slowly changing dimensions.
  • Implement complex transformations.
  • Load dimensions.
  • Load facts.

Issues
• Files versus streams/pipes
  • Streams/pipes: no disk overhead, fast throughput.
  • Files: easier restart, often the only possibility.
• ETL tool or not
  • Code: easy start, co-existence with the IT infrastructure.
  • Tool: better productivity on subsequent projects.
• Load frequency
  • ETL time depends on the processed data volumes.
  • A daily load is much faster than a monthly one.
  • This applies to all steps in the ETL process.
• Should the DW be on-line 24/7?
  • Use partitions or several sets of tables.

Summary
• ETL is very time-consuming (80% of the entire project).
• It needs to be implemented as a sequence of many small steps.
• Data quality is crucial and is fixed in ETL.
• Extraction of data from source systems might be very time-consuming;
  • an incremental approach is suggested.
• Transformation into the DW format includes many steps, such as
  • building keys, cleansing the data, handling inconsistent/duplicate data, etc.
• Load includes loading the data into the DW, updating indexes, computing pre-aggregates, etc.