
VU Past Papers CS614 – Data Warehousing Solved Subjective (Spring/Fall 2011 & 2012)


1. Aggregates Awareness

  • Q: How does aggregate awareness help users? (2 marks)
  • A: Aggregate awareness lets front-end tools use pre-built summary tables automatically instead of scanning base data. It can also compute higher-level aggregates from lower-level (more detailed) aggregates.

2. One-to-One vs One-to-Many Transformation

  • Q: Difference between one-to-one and one-to-many transformation (2 marks)
  • A:
    • One-to-One (Scalar): Simple mapping from one set of values to another using straightforward rules.
    • One-to-Many: A single source element maps to several columns in the data warehouse. Example: parsing a 6×30 address field into street, city, state, and zip columns.
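As a sketch of the one-to-many case, the snippet below parses a single free-form address field into separate warehouse columns. The comma-separated layout and the column names are assumptions for illustration; real source layouts vary.

```python
# One-to-many transformation: one source field feeds several target columns.
def split_address(raw: str) -> dict:
    """Map a single address field to many warehouse columns."""
    parts = [p.strip() for p in raw.split(",")]
    # Pad with empty strings so short addresses still yield four columns.
    street, city, state, zip_code = (parts + [""] * 4)[:4]
    return {"street": street, "city": city, "state": state, "zip": zip_code}

row = split_address("12 Canal Road, Lahore, Punjab, 54000")
```

A one-to-one (scalar) transformation, by contrast, would map the whole field to a single target value with a simple rule.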

3. Cube Creation in ROLAP

  • Q: How is a cube created in ROLAP? (3 marks)
  • A: In ROLAP the cube is a logical entity, not a precomputed physical structure: it contains fact values at a certain aggregation level at the intersection of dimensions, and is materialized on demand through SQL (joins and GROUP BY) against the relational schema.

4. Timestamps

  • Q: What are timestamps? (3 marks)
  • A: Timestamp columns indicate when a row was last modified, helping to identify the latest data in operational systems.

5. Factors Behind Poor Data Quality (5 marks)

  • System conversions/migrations
  • Heterogeneous systems integration
  • Poor source DB design
  • Data aging
  • Incomplete info from customers
  • Input errors
  • Localization issues
  • Lack of data policies

Types of problems: dummy values, missing data, multipurpose fields, cryptic/contradicting data, reused keys, non-unique identifiers.


6. Denormalization Technique

  • Q: Which denormalization technique reduces foreign keys? (5 marks)
  • A: Collapsing tables: combines tables in one-to-one relationships to reduce storage, update time, and number of foreign keys/indexes.
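A minimal in-memory sketch of collapsing tables, using two hypothetical "tables" in a one-to-one relationship (customer and profile are assumed names): merging them into one record removes the foreign-key lookup entirely.

```python
# Two tables in a 1:1 relationship, both keyed by customer id.
customers = {1: {"name": "Ali"}, 2: {"name": "Sara"}}
profiles = {1: {"segment": "retail"}, 2: {"segment": "corporate"}}

# Collapsing: merge each pair of rows into a single denormalized row,
# eliminating the foreign key and the join it would require.
collapsed = {
    cid: {**customers[cid], **profiles.get(cid, {})}
    for cid in customers
}
```

After collapsing, a query for a customer's segment reads one row instead of joining two tables.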

7. Data Granularity

  • Q: Difference between low granularity and high granularity? (2 marks)
  • A:
    • Low granularity: highly detailed data, more storage, slower queries.
    • High granularity: summarized data, faster queries, less storage.

8. Summarization During Transformation

  • Q: Reason for summarization (3 marks)
  • A: Users may not need lowest-level detail; storing daily sales by product/store may suffice. Summarization reduces storage and improves query performance.
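The roll-up described above can be sketched as follows; the transaction rows and grain (daily by product and store) are illustrative assumptions.

```python
from collections import defaultdict

# Transaction-level detail rows: (date, product, store, quantity).
sales = [
    ("2011-05-01", "P1", "S1", 10),
    ("2011-05-01", "P1", "S1", 5),
    ("2011-05-01", "P2", "S1", 7),
]

# Summarize during transformation: three detail rows shrink to two
# daily rows, reducing storage and speeding up later queries.
daily = defaultdict(int)
for day, product, store, qty in sales:
    daily[(day, product, store)] += qty
```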

9. Clustering & Associative Rules

  • Q: How do clustering and association rules work? (3 marks)
  • A:
    • Clustering: groups similar items into homogeneous clusters drawn from heterogeneous data. Example: all apples go into one cluster even if their colors differ.
    • Association rules: discover items that frequently occur together. Example: market-basket analysis may find that customers who buy bread also tend to buy butter.

10. Splitting Single Field

  • Q: Why split name/address into separate fields? (5 marks)
  • A:
    • Improves indexing and query performance.
    • Allows analysis by components: street, city, sector, etc.

11. Additive vs Non-Additive Facts (5 marks)

  • Additive Facts: Sum gives correct results (e.g., quantity sold, total sales).
  • Non-Additive Facts: Sum gives incorrect results (e.g., averages, ratios, % discount).
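The difference can be demonstrated numerically; the two store rows below are made-up figures. Quantity sums correctly, while summing per-store discount percentages gives a meaningless number: the correct overall percentage must be recomputed from the additive components.

```python
stores = [
    {"qty": 100, "sales": 900.0, "list_value": 1000.0},   # 10% discount
    {"qty": 300, "sales": 2400.0, "list_value": 3000.0},  # 20% discount
]

# Additive fact: summing quantities across stores is correct.
total_qty = sum(s["qty"] for s in stores)

# Non-additive fact: summing the percentages is wrong (0.1 + 0.2 = 0.3).
summed_pct = sum(1 - s["sales"] / s["list_value"] for s in stores)

# Correct: rebuild the ratio from the additive sums (1 - 3300/4000).
true_pct = 1 - sum(s["sales"] for s in stores) / sum(s["list_value"] for s in stores)
```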

12. Expression Partitioning

  • Q: How do we gain from expression partitioning?
  • A: Rows are assigned to partitions by an expression over their values, so queries can be directed to only the relevant partitions (partition elimination), improving query performance.

13. HOLAP Features

  • Q: What is HOLAP?
  • A: Hybrid OLAP combining MOLAP cubes for small dimensions and relational ROLAP for large dimensions. Supports multidimensional analysis while retrieving detailed data via SQL when needed.

14. Data Validation & Profiling

  • Q: How to validate data? Purpose of data profiling?
  • A:
    • Data Profiling: Identify data type, transformations required, errors, and out-of-domain values.
    • Ensures correct, usable data for analytics.

15. Full vs Incremental Extraction

  • Full Extraction: Complete extraction, no tracking of changes, data as-is.
  • Incremental Extraction: Extracts only new/changed data since last extraction, reflecting temporal changes.
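A small sketch of incremental extraction using a last-modified timestamp; the rows and the saved extraction timestamp are hypothetical values.

```python
from datetime import datetime

# Source rows carry a timestamp column recording the last modification.
rows = [
    {"id": 1, "last_modified": datetime(2011, 5, 1, 9, 0)},
    {"id": 2, "last_modified": datetime(2011, 5, 2, 14, 30)},
    {"id": 3, "last_modified": datetime(2011, 5, 3, 8, 15)},
]

# Timestamp saved by the ETL process at the end of the previous run.
last_extract = datetime(2011, 5, 2, 0, 0)

# Incremental extraction: pull only rows changed since the last run.
delta = [r for r in rows if r["last_modified"] > last_extract]
```

A full extraction would simply take all of `rows` each time, with no change tracking.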

16. Merge/Purge Problem

  • Q: What is merge/purge problem?
  • A: Duplicate records in merged datasets must be identified and removed to prevent errors in analysis.
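One common merge/purge tactic is to normalize key fields before comparing, so records differing only in case or punctuation are recognized as the same entity. The records and the normalization rule below are illustrative assumptions.

```python
def normalize(name: str, address: str) -> tuple:
    """Build a comparison key: lowercase, alphanumeric characters only."""
    clean = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    return (clean(name), clean(address))

records = [
    ("M. Ahmed", "12 Canal Road"),
    ("m ahmed", "12, canal road"),   # duplicate of the first record
    ("S. Khan", "7 Mall Road"),
]

# Purge: keep the first record seen for each normalized key.
seen, unique = set(), []
for rec in records:
    key = normalize(*rec)
    if key not in seen:
        seen.add(key)
        unique.append(rec)
```

Real merge/purge also needs approximate matching (misspellings, abbreviations), not just exact keys.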

17. Data Quality Definitions

  • Intrinsic Quality: Accuracy; data correctly reflects real-world objects.
  • Realistic Quality: Utility/value of data to meet enterprise objectives; supports decision-making effectively.

18. Erroneous Data

  • Q: Define with example
  • A: Data that is incorrect or duplicates exist, causing cost/errors. Example: Duplicate addresses in mailing lists causing multiple mailings.

19. Aggregation

  • Q: Why is aggregation one-way?
  • A: Aggregates summarize data, original values cannot be recovered. Example: 3+2+1 = 6, original numbers not recoverable from 6.

20. Fact Table

  • Q: What is a fact table?
  • A: Central table in a dimensional model storing measurable data (facts) with foreign keys linking to dimension tables.

21. Active DWH

  • Q: What is Active DWH?
  • A: Supports near-real-time decision-making; processes updates and queries quickly.

22. Lexical Errors

  • Q: Define lexical error.
  • A: Syntactically dirty data, i.e., data that violates the expected structure. Example: a row has fewer columns than the table schema defines.

23. ELT

  • Q: What is ELT and why used?
  • A: Extract, Load, Transform; data transformed inside DWH to exploit parallelism and avoid separate transformation tools.

24. Physical Extraction

  • Online: Extract directly from source system.
  • Offline: Extract from staged structure (flat files, logs, transportable tables).

25. Data Cleansing Steps

  1. Elementizing
  2. Standardizing
  3. Verifying
  4. Matching
  5. Householding
  6. Documenting
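The first two steps can be sketched as follows: elementizing breaks a free-form field into atomic elements, and standardizing maps each variant to one canonical form. The lookup table and sample address are assumptions for illustration.

```python
# Hypothetical standardization table mapping variants to canonical forms.
STANDARD = {"st": "Street", "st.": "Street", "rd": "Road", "rd.": "Road"}

def elementize(address: str) -> list:
    """Break a free-form address field into atomic tokens."""
    return address.replace(",", " ").split()

def standardize(tokens: list) -> list:
    """Replace known variants with their canonical form."""
    return [STANDARD.get(t.lower(), t) for t in tokens]

tokens = standardize(elementize("12 Canal Rd., Lahore"))
```

Verifying, matching, householding, and documenting then operate on these cleaned elements.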

26. Cube Partitioning

  • Q: When to partition cube?
  • A: Partition when the cube exceeds MOLAP space limits: distribute the logical cube across multiple physical cubes to improve scalability, while keeping the partitioning invisible to users.

27. OLAP Types

  • HOLAP: Hybrid, combines MOLAP + ROLAP
  • MOLAP: Prebuilt cubes, fast, proprietary
  • ROLAP: SQL-based, scalable
  • DOLAP: Desktop OLAP, mobile-friendly, small datasets

28. Data Duplication in Source

  • Can cause: false frequency distributions, double counting, misleading aggregates, untrustworthy analysis, financial risk in credit evaluations.

MIDTERM Spring 2011 – CS614

1. Effects of source data duplication on analysis in a DW (5 marks)

  • Duplication can result in false frequency distributions, incorrect aggregates due to double counting, and difficulty in identifying unique entities (e.g., credit card customers).
  • Leads to misleading statistics, bad decisions, and additional costs.
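A quick numerical illustration of the double-counting effect (the customer rows are made up): the same customer loaded twice under slightly different spellings inflates the sales aggregate and distorts the frequency distribution.

```python
from collections import Counter

# The second row duplicates the first under a variant spelling.
rows = [("ahmed", 500), ("AHMED ", 500), ("khan", 300)]

# Naive aggregate double-counts ahmed's sales: 1300 instead of 800.
naive_total = sum(amount for _, amount in rows)

# Normalizing the names exposes the duplicate in the frequency count.
freq = Counter(name.strip().lower() for name, _ in rows)
```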

2. Features of Dimensional Modeling (5 marks)

  • DM presents data in a standard, intuitive framework that allows high-performance access.
  • Based on facts (numeric/additive measures) and dimensions (attributes describing facts).
  • Fact table: central table with multipart key.
  • Dimension tables: smaller tables with single-part keys linking to fact table.
  • Forms star schema structure.
  • Example of a query DM answers naturally: average salary of customers older than 35 with more than 5 years of experience.

3. Benefits of CDC in modern systems (3 marks)

  • Immediate change detection.
  • No loss of history, flat files not required.
  • No incremental online I/O needed for log tape.
  • Log tape captures all updates, can be processed offline.

4. Statistical analyzer: distributive, algebraic, holistic transformations (3 marks)

  • Distributive: the aggregate over partitioned data can be computed by combining the aggregates of the partitions (e.g., SUM, COUNT, MIN, MAX).
  • Algebraic: computed from a fixed number of distributive aggregates (e.g., AVERAGE = SUM / COUNT).
  • Holistic: cannot be computed from partial aggregates; the full data set is needed (e.g., MEDIAN, MODE).
  • A statistical analyzer typically uses algebraic and holistic measures, depending on the metric.
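The three classes can be demonstrated on partitioned data (the partitions below are arbitrary): SUM and COUNT combine exactly across partitions, AVERAGE derives from those two partials, but MEDIAN needs the full data.

```python
import statistics

# The same data set split across three arbitrary partitions.
partitions = [[1, 2, 9], [4, 5], [6, 7, 8, 3]]

# Distributive: combine per-partition partials directly.
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)

# Algebraic: derived from a fixed number of distributive aggregates.
average = total / count  # AVERAGE = SUM / COUNT

# Holistic: the median of medians is not the true median in general;
# the full data set must be assembled.
flat = sorted(x for p in partitions for x in p)
true_median = statistics.median(flat)
```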

5. Round robin distribution is pre-defined? (2 marks)

  • Not pre-defined. Spreads data evenly, but does not allow partition elimination.
  • Used mainly for temporary tables.
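A minimal sketch of round-robin placement: rows go to partitions in turn, giving an even spread, but because placement is unrelated to the data values, a query filtering on `row_id` cannot eliminate any partition.

```python
from itertools import cycle

# Three empty partitions; rows are dealt out like cards.
partitions = [[], [], []]
target = cycle(partitions)
for row_id in range(10):
    next(target).append(row_id)

sizes = [len(p) for p in partitions]  # evenly spread: 4, 3, 3
```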

6. MOLAP vs DOLAP (2 marks)

  • MOLAP: OLAP with multidimensional data structure, precomputed cubes.
  • DOLAP: OLAP for desktop environments, mobile-friendly, small datasets.

7. Offline vs Online extraction (2 marks)

  • Online: Direct from source system or intermediate system.
  • Offline: Data staged outside source system (flat files, redo logs, transportable tables).

8. Difference between ER and DM (5 marks)

  • ER: Entity-relationship model, transactional design.
  • DM: Dimensional model, for analytics and query performance; focuses on facts and dimensions.

9. Orrโ€™s Laws of Data Quality (5 marks)

  • Data that is unused cannot be correct.
  • Quality depends on use, not just collection.
  • Data quality is no better than its most stringent use.
  • Quality issues increase with system age.
  • Rare events have high impact when they occur.

21. Snowflake Schema (2 marks)

  • Fact table centralized, connected to multiple dimension tables.
  • Dimension tables may be normalized, giving a snowflake shape.

22. Why aggregation & summarization are required? (2 marks)

  • To resolve grain mismatch, reduce storage needs, and because detailed data may lose utility.

23. Condition for smart tools to construct less detailed aggregates (3 marks)

  • Works when metrics are additive.
  • Full aggregate awareness allows less detailed aggregates from detailed aggregates without accessing base data.

24. Web scraping (3 marks)

  • Extracting data from web pages, which may include junk (HTML tags, inconsistent formats).
  • Problems: limited query interfaces, free text fields, rapid page changes.

25. Data loading strategies (5 marks)

  • Full data refresh: Block insert into empty table.
  • Incremental refresh: Block insert into populated tables.
  • Trickle/continuous feed: Row-level inserts/updates for streaming data.

26. Drawbacks of MOLAP & curse of dimensionality (5 marks)

  • Maintenance: every item aggregated into every cube.
  • Storage: many cubes take a lot of space.
  • Scalability: difficult when dimension cardinality > ~64,000.
  • Curse of dimensionality: combinatorial explosion of cubes, sparse cells.
  • Solutions: virtual cubes, partitioned cubes.

21. MOLAP features (2 marks)

  • Pre-built cubes allow fast retrieval, slicing, dicing, complex calculations.

22. Steps in cleansing data (Marks missing)

  • Typically: elementizing, standardizing, verifying, matching, householding, documenting.

23. Features of Star Schema (3 marks)

  • Dimension hierarchies in single table per dimension.
  • Single fact table simplifies joins, improves performance.
  • Tradeoff: may require more storage.

24. Multi-pass BSN approach (3 marks)

  • Run the basic sorted-neighborhood method several times independently, each pass with a small window and a different sort key.
  • Union the matching pairs from all passes and apply transitive closure to identify the full sets of duplicates.
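A toy sketch of the multi-pass idea, assuming four records keyed by id with a name and a zip code, and a deliberately crude similarity rule: each pass sorts on a different key and compares only neighbors, then the matched pairs are merged by transitive closure.

```python
records = {1: ("ahmed", "54000"), 2: ("ahmad", "54000"),
           3: ("khan", "46000"), 4: ("ahmed", "54001")}

def matches(a, b):
    """Crude similarity rule for illustration only."""
    name_a, zip_a = records[a]
    name_b, zip_b = records[b]
    return name_a[:4] == name_b[:4] or zip_a == zip_b

def one_pass(sort_key, window=2):
    """Sort on one key and compare records within a small window."""
    order = sorted(records, key=sort_key)
    pairs = set()
    for i, rid in enumerate(order):
        for other in order[i + 1:i + window]:
            if matches(rid, other):
                pairs.add(frozenset((rid, other)))
    return pairs

# Two independent passes: sort by name, then by zip; union the matches.
pairs = one_pass(lambda r: records[r][0]) | one_pass(lambda r: records[r][1])

# Transitive closure: merge overlapping pairs into duplicate clusters.
clusters = []
for p in pairs:
    p = set(p)
    overlapping = [c for c in clusters if c & p]
    for c in overlapping:
        clusters.remove(c)
        p |= c
    clusters.append(p)
```

Here neither single pass links all three variants of the same customer, but the closure over both passes does.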

25. Benefits of HOLAP & DOLAP over MOLAP & ROLAP (5 marks)

  • HOLAP: Combines relational (ROLAP) and cube (MOLAP) for scalability and performance.
  • DOLAP: Mobile/desktop-friendly, handles small datasets efficiently.

26. Relationship between data quality & application value (5 marks)

  • Data quality directly determines how useful and trustworthy the applications built on the data are.
  • Effort spent improving quality should be aligned with the data that supports the most critical business objectives.
Contents Copyrights Reserved By T4Tutorials