1. Aggregates Awareness
- Q: How does aggregate awareness help users? (2 marks)
- A: Aggregate awareness lets front-end tools answer queries from pre-built summary tables instead of the detailed fact table. It can also compute higher-level aggregates from lower-level, more detailed aggregates.
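As a rough illustration, aggregate awareness can be sketched as a routing step that answers a query from the smallest summary table covering the requested grouping columns; the table names and column sets below are hypothetical.

```python
# Hypothetical catalog of pre-built summary tables and the columns
# each one is aggregated by (all names are illustrative).
SUMMARY_TABLES = {
    "sales_by_day_store": {"day", "store"},
    "sales_by_month_store": {"month", "store"},
    "sales_by_month": {"month"},
}

def pick_table(group_by_cols):
    """Pick the summary table with the fewest grouping columns (i.e. the
    most summarized one) that still covers the requested columns."""
    candidates = [(len(cols), name)
                  for name, cols in SUMMARY_TABLES.items()
                  if set(group_by_cols) <= cols]
    if not candidates:
        return "sales_fact"  # no aggregate covers it: use the base fact table
    return min(candidates)[1]
```

A real tool would also exploit dimension hierarchies (e.g. rolling days up to months); this sketch only matches column sets.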
2. One-to-One vs One-to-Many Transformation
- Q: Difference between one-to-one and one-to-many transformation (2 marks)
- A:
- One-to-One (Scalar): Simple mapping from one set of values to another using straightforward rules.
- One-to-Many: A single source element maps to several columns in the data warehouse. Example: parsing a 6×30 address field into street, city, state, zip.
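A minimal sketch of the one-to-many case, assuming the source field is a comma-separated "street, city, state, zip" string (the layout and target column names are assumptions):

```python
def split_address(raw):
    """One-to-many transformation: one source field becomes four
    target columns. Assumes a 'street, city, state, zip' layout."""
    street, city, state, zip_code = [part.strip() for part in raw.split(",")]
    return {"street": street, "city": city, "state": state, "zip": zip_code}
```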
3. Cube Creation in ROLAP
- Q: How is a cube created in ROLAP? (3 marks)
- A: In ROLAP the cube is a logical entity, not a stored structure: it contains fact values at a certain aggregation level at the intersection of dimensions, and is produced at query time by SQL aggregate (GROUP BY) queries against the star schema, optionally backed by pre-computed summary tables.
4. Timestamps
- Q: What are timestamps? (3 marks)
- A: Timestamp columns indicate when a row was last modified, helping to identify the latest data in operational systems.
5. Factors Behind Poor Data Quality (5 marks)
- System conversions/migrations
- Heterogeneous systems integration
- Poor source DB design
- Data aging
- Incomplete info from customers
- Input errors
- Localization issues
- Lack of data policies
Types of problems: dummy values, missing data, multipurpose fields, cryptic/contradicting data, reused keys, non-unique identifiers.
6. Denormalization Technique
- Q: Which denormalization technique reduces foreign keys? (5 marks)
- A: Collapsing tables: combines tables in one-to-one relationships to reduce storage, update time, and number of foreign keys/indexes.
7. Data Granularity
- Q: Difference between low granularity and high granularity? (2 marks)
- A:
- Low granularity: highly detailed data, more storage, slower queries.
- High granularity: summarized data, faster queries, less storage.
8. Summarization During Transformation
- Q: Reason for summarization during transformation (3 marks)
- A: Users may not need the lowest level of detail; storing daily sales by product and store may suffice. Summarization reduces storage needs and improves query performance.
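The idea can be sketched as a roll-up from transaction grain to daily grain; the record layout below is an assumption.

```python
from collections import defaultdict

def summarize_daily_sales(transactions):
    """Summarize transaction-level rows to (day, product, store) grain.
    Transaction-level detail is discarded, which is what saves storage."""
    totals = defaultdict(float)
    for t in transactions:
        totals[(t["day"], t["product"], t["store"])] += t["amount"]
    return dict(totals)
```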
9. Clustering & Associative Rules
- Q: How do clustering and association rules work? (3 marks)
- A:
- Clustering: groups similar items into homogeneous clusters from heterogeneous data. Example: all apples go into one cluster, even if their colors differ.
- Association rules: discover items that frequently occur together, expressed as if-then rules. Example: market-basket analysis may find that customers who buy bread also tend to buy butter.
10. Splitting Single Field
- Q: Why split name/address into separate fields? (5 marks)
- A:
- Improves indexing and query performance.
- Allows analysis by components: street, city, sector, etc.
11. Additive vs Non-Additive Facts (5 marks)
- Additive Facts: Sum gives correct results (e.g., quantity sold, total sales).
- Non-Additive Facts: Sum gives incorrect results (e.g., averages, ratios, % discount).
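A tiny numeric check of why averages are non-additive while quantities are; the figures are made up.

```python
rows = [
    {"qty": 10, "sales": 200.0},  # per-row average price: 20.0
    {"qty": 30, "sales": 300.0},  # per-row average price: 10.0
]

# Additive: summing quantities across rows is meaningful.
total_qty = sum(r["qty"] for r in rows)                      # 40

# Non-additive: summing the per-row averages gives nonsense.
sum_of_averages = sum(r["sales"] / r["qty"] for r in rows)   # 30.0

# The correct average must be rebuilt from additive components.
true_avg = sum(r["sales"] for r in rows) / total_qty         # 500 / 40 = 12.5
```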
12. Expression Partitioning
- Q: How does expression partitioning improve access?
- A: Rows are assigned to partitions by an expression over column values; queries that need only the matching partitions can skip the rest (partition elimination), improving query performance.
13. HOLAP Features
- Q: What is HOLAP?
- A: Hybrid OLAP combining MOLAP cubes for small dimensions and relational ROLAP for large dimensions. Supports multidimensional analysis while retrieving detailed data via SQL when needed.
14. Data Validation & Profiling
- Q: How to validate data? Purpose of data profiling?
- A:
- Data Profiling: Identify data type, transformations required, errors, and out-of-domain values.
- Ensures correct, usable data for analytics.
15. Full vs Incremental Extraction
- Full Extraction: Complete extraction, no tracking of changes, data as-is.
- Incremental Extraction: Extracts only new/changed data since last extraction, reflecting temporal changes.
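Incremental extraction is often implemented by filtering on a last-modified timestamp; a sketch under that assumption (the `modified_at` column name is hypothetical):

```python
from datetime import datetime

def incremental_extract(rows, last_run):
    """Return only rows changed since the previous extraction run,
    based on a last-modified timestamp column."""
    return [r for r in rows if r["modified_at"] > last_run]
```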
16. Merge/Purge Problem
- Q: What is merge/purge problem?
- A: Duplicate records in merged datasets must be identified and removed to prevent errors in analysis.
17. Data Quality Definitions
- Intrinsic Quality: Accuracy; data correctly reflects real-world objects.
- Realistic Quality: Utility/value of data to meet enterprise objectives; supports decision-making effectively.
18. Erroneous Data
- Q: Define with example
- A: Data that is incorrect or duplicates exist, causing cost/errors. Example: Duplicate addresses in mailing lists causing multiple mailings.
19. Aggregation
- Q: Why is aggregation one-way?
- A: Aggregates summarize data, original values cannot be recovered. Example: 3+2+1 = 6, original numbers not recoverable from 6.
20. Fact Table
- Q: What is a fact table?
- A: Central table in a dimensional model storing measurable data (facts) with foreign keys linking to dimension tables.
21. Active DWH
- Q: What is Active DWH?
- A: Supports near-real-time decision-making; processes updates and queries quickly.
22. Lexical Errors
- Q: Define lexical error.
- A: Syntactically dirty data. Example: a row has fewer columns than the table schema expects.
23. ELT
- Q: What is ELT and why used?
- A: Extract, Load, Transform; data transformed inside DWH to exploit parallelism and avoid separate transformation tools.
24. Physical Extraction
- Online: Extract directly from source system.
- Offline: Extract from staged structure (flat files, logs, transportable tables).
25. Data Cleansing Steps
- Elementizing
- Standardizing
- Verifying
- Matching
- Householding
- Documenting
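The first two cleansing steps can be sketched for an address field; the abbreviation table is illustrative, not exhaustive.

```python
def standardize(token):
    """Standardizing: map common abbreviations to one canonical form."""
    canon = {"st": "street", "st.": "street", "rd": "road", "ave": "avenue"}
    return canon.get(token.lower(), token.lower())

def elementize(raw_address):
    """Elementizing: break a free-form field into atomic tokens,
    then standardize each token."""
    return [standardize(tok) for tok in raw_address.replace(",", " ").split()]
```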
26. Cube Partitioning
- Q: When to partition cube?
- A: To overcome MOLAP space limits, distribute logical cube across multiple physical cubes, improve scalability while keeping partitioning invisible to users.
27. OLAP Types
- HOLAP: Hybrid, combines MOLAP + ROLAP
- MOLAP: Prebuilt cubes, fast, proprietary
- ROLAP: SQL-based, scalable
- DOLAP: Desktop OLAP, mobile-friendly, small datasets
28. Data Duplication in Source
- Can cause: false frequency distributions, double counting, misleading aggregates, untrustworthy analysis, financial risk in credit evaluations.
MIDTERM Spring 2011 – CS614
1. Effects of source data duplication on analysis in a DW (5 marks)
- Duplication can result in false frequency distributions, incorrect aggregates due to double counting, and difficulty in identifying unique entities (e.g., credit card customers).
- Leads to misleading statistics, bad decisions, and additional costs.
2. Features of Dimensional Modeling (5 marks)
- DM presents data in a standard, intuitive structure that allows high-performance access.
- Based on facts (numeric/additive measures) and dimensions (attributes describing facts).
- Fact table: central table with multipart key.
- Dimension tables: smaller tables with single-part keys linking to fact table.
- Forms star schema structure.
- Example: Average salary of customers age > 35 with >5 years experience.
3. Benefits of CDC in modern systems (3 marks)
- Immediate change detection.
- No loss of history, flat files not required.
- No incremental online I/O needed for log tape.
- Log tape captures all updates, can be processed offline.
4. Statistical analyzer: distributive, algebraic, holistic transformations (3 marks)
- Answer context not fully given; typically:
- Distributive: aggregation can be partitioned.
- Algebraic: functions calculated using algebraic formulae (e.g., average).
- Holistic: cannot be computed incrementally (e.g., median).
- Statistical analyzers usually use algebraic and holistic measures depending on metrics.
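The three classes can be demonstrated on partitioned data: SUM and COUNT are distributive, AVG is algebraic (derivable from those pieces), and MEDIAN is holistic.

```python
import statistics

partitions = [[3, 5, 7], [1, 9]]

# Distributive: per-partition results combine directly.
part_sums = [sum(p) for p in partitions]      # [15, 10]
part_counts = [len(p) for p in partitions]    # [3, 2]

# Algebraic: AVG is computed from the distributive pieces (sum, count).
avg = sum(part_sums) / sum(part_counts)       # 25 / 5 = 5.0

# Holistic: MEDIAN cannot be combined from per-partition medians;
# it needs the complete data set.
median = statistics.median([x for p in partitions for x in p])
```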
5. Round robin distribution is pre-defined? (2 marks)
- Not pre-defined. Spreads data evenly, but does not allow partition elimination.
- Used mainly for temporary tables.
6. MOLAP vs DOLAP (2 marks)
- MOLAP: OLAP with multidimensional data structure, precomputed cubes.
- DOLAP: OLAP for desktop environments, mobile-friendly, small datasets.
7. Offline vs Online extraction (2 marks)
- Online: Direct from source system or intermediate system.
- Offline: Data staged outside source system (flat files, redo logs, transportable tables).
8. Difference between ER and DM (5 marks)
- ER: Entity-relationship model, transactional design.
- DM: Dimensional model, for analytics and query performance; focuses on facts and dimensions.
9. Orrโs Laws of Data Quality (5 marks)
- Data that is not used cannot remain correct for very long.
- Quality depends on use, not just collection.
- Data quality is no better than its most stringent use.
- Quality issues increase with system age.
- The less likely an event is to occur, the more traumatic it is when it finally happens.
21. Snowflake Schema (2 marks)
- Fact table centralized, connected to multiple dimension tables.
- Dimension tables may be normalized, giving a snowflake shape.
22. Why aggregation & summarization are required? (2 marks)
- To resolve grain mismatch, reduce storage needs, and because detailed data may lose utility.
23. Condition for smart tools to construct less detailed aggregates (3 marks)
- Works when metrics are additive.
- Full aggregate awareness allows less detailed aggregates from detailed aggregates without accessing base data.
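For additive metrics, a less detailed aggregate can be rolled up from a more detailed one without touching base data; the daily-totals dictionary below is made up.

```python
from collections import defaultdict

# Detailed aggregate: quantity sold per (day, product).
daily = {("2011-01-01", "P1"): 10,
         ("2011-01-02", "P1"): 15,
         ("2011-02-01", "P1"): 7}

# Less detailed aggregate built purely from the detailed aggregate:
# roll each day key up to its YYYY-MM month prefix and add quantities.
monthly = defaultdict(int)
for (day, product), qty in daily.items():
    monthly[(day[:7], product)] += qty
```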
24. Web scraping (3 marks)
- Extracting data from web pages, which may include junk (HTML tags, inconsistent formats).
- Problems: limited query interfaces, free text fields, rapid page changes.
25. Data loading strategies (5 marks)
- Full data refresh: Block insert into empty table.
- Incremental refresh: Block insert into populated tables.
- Trickle/continuous feed: Row-level inserts/updates for streaming data.
26. Drawbacks of MOLAP & curse of dimensionality (5 marks)
- Maintenance: every item aggregated into every cube.
- Storage: many cubes take a lot of space.
- Scalability: difficult when dimension cardinality > ~64,000.
- Curse of dimensionality: combinatorial explosion of cubes, sparse cells.
- Solutions: virtual cubes, partitioned cubes.
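The combinatorial explosion behind the curse of dimensionality is easy to quantify: even ignoring hierarchy levels, d dimensions already yield 2^d possible group-by combinations (cuboids).

```python
def cuboid_count(d):
    """Number of distinct group-by combinations over d dimensions:
    each dimension is either included in the grouping or rolled up."""
    return 2 ** d
```

With hierarchy levels the count is even larger: it becomes the product of (levels + 1) over all dimensions.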
21. MOLAP features (2 marks)
- Pre-built cubes allow fast retrieval, slicing, dicing, complex calculations.
22. Steps in cleansing data (Marks missing)
- Typically: elementizing, standardizing, verifying, matching, householding, documenting.
23. Features of Star Schema (3 marks)
- Dimension hierarchies in single table per dimension.
- Single fact table simplifies joins, improves performance.
- Tradeoff: may require more storage.
24. Multi-pass BSN approach (3 marks)
- Several independent passes, each sorting on a different key and scanning with a small sliding window.
- Matched pairs from all passes are merged, then transitive closure is applied to identify duplicate sets.
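A sketch of the approach; the record fields and the toy similarity rule are assumptions (real systems use edit distance and domain rules). Two passes on different sort keys are unioned, approximating the multi-pass step before transitive closure.

```python
def similar(a, b):
    """Toy similarity rule: case-insensitive name match."""
    return a["name"].lower() == b["name"].lower()

def bsn_pass(records, key, window=3):
    """One sorted-neighborhood pass: sort on a key, then compare each
    record only with its neighbors inside a small sliding window."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            if similar(rec, ordered[j]):
                pairs.add(frozenset((rec["id"], ordered[j]["id"])))
    return pairs

records = [
    {"id": 1, "name": "Ali", "city": "Lahore"},
    {"id": 2, "name": "ali", "city": "Karachi"},
    {"id": 3, "name": "Sara", "city": "Lahore"},
]
# Two independent passes on different keys, then union (multi-pass idea).
pairs = (bsn_pass(records, key=lambda r: r["name"])
         | bsn_pass(records, key=lambda r: r["city"]))
```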
25. Benefits of HOLAP & DOLAP over MOLAP & ROLAP (5 marks)
- HOLAP: Combines relational (ROLAP) and cube (MOLAP) for scalability and performance.
- DOLAP: Mobile/desktop-friendly, handles small datasets efficiently.
26. Relationship between data quality & application value (5 marks)
- Data quality impacts usefulness of applications.
- Effort spent improving quality should be aligned with the most critical business objectives, since perfecting all data everywhere is not economical.