1. Aggregates Awareness
- Q: How does aggregate awareness help users? (2 marks)
- A: Aggregate awareness lets front-end tools answer queries from pre-built summary tables instead of the detailed fact table. It can also compute higher-level aggregates from lower-level, more detailed aggregates.
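As a rough illustration, aggregate awareness can be sketched as a routing step that answers a query from the smallest summary table covering the requested grouping columns; the table names and column sets below are hypothetical.

```python
# Hypothetical catalog of pre-built summary tables and the columns
# each one is aggregated by (all names are illustrative).
SUMMARY_TABLES = {
    "sales_by_day_store": {"day", "store"},
    "sales_by_month_store": {"month", "store"},
    "sales_by_month": {"month"},
}

def pick_table(group_by_cols):
    """Pick the summary table with the fewest grouping columns (i.e. the
    most summarized one) that still covers the requested columns."""
    candidates = [(len(cols), name)
                  for name, cols in SUMMARY_TABLES.items()
                  if set(group_by_cols) <= cols]
    if not candidates:
        return "sales_fact"  # no aggregate covers it: use the base fact table
    return min(candidates)[1]
```

A real tool would also exploit dimension hierarchies (e.g. rolling days up to months); this sketch only matches column sets.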
2. One-to-One vs One-to-Many Transformation
- Q: Difference between one-to-one and one-to-many transformation (2 marks)
- A:
- One-to-One (Scalar): Simple mapping from one set of values to another using straightforward rules.
- One-to-Many: A single source element maps to several columns in the data warehouse. Example: parsing a 6×30 address field into street, city, state, zip.
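A minimal sketch of the one-to-many case, assuming the source field is a comma-separated "street, city, state, zip" string (the layout and target column names are assumptions):

```python
def split_address(raw):
    """One-to-many transformation: one source field becomes four
    target columns. Assumes a 'street, city, state, zip' layout."""
    street, city, state, zip_code = [part.strip() for part in raw.split(",")]
    return {"street": street, "city": city, "state": state, "zip": zip_code}
```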
3. Cube Creation in ROLAP
- Q: How is a cube created in ROLAP? (3 marks)
- A: In ROLAP the cube is a logical entity, not a stored structure: it contains fact values at a certain aggregation level at the intersection of dimensions, and is produced at query time by SQL aggregate (GROUP BY) queries against the star schema, optionally backed by pre-computed summary tables.
4. Timestamps
- Q: What are timestamps? (3 marks)
- A: Timestamp columns indicate when a row was last modified, helping to identify the latest data in operational systems.
5. Factors Behind Poor Data Quality (5 marks)
- System conversions/migrations
- Heterogeneous systems integration
- Poor source DB design
- Data aging
- Incomplete info from customers
- Input errors
- Localization issues
- Lack of data policies
Types of problems: dummy values, missing data, multipurpose fields, cryptic/contradicting data, reused keys, non-unique identifiers.
6. Denormalization Technique
- Q: Which denormalization technique reduces foreign keys? (5 marks)
- A: Collapsing tables: combines tables in one-to-one relationships to reduce storage, update time, and number of foreign keys/indexes.
7. Data Granularity
- Q: Difference between low granularity and high granularity? (2 marks)
- A:
- Low granularity: highly detailed data, more storage, slower queries.
- High granularity: summarized data, faster queries, less storage.
8. Summarization During Transformation
- Q: Reason for summarization during transformation (3 marks)
- A: Users may not need the lowest level of detail; storing daily sales by product and store may suffice. Summarization reduces storage needs and improves query performance.
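The idea can be sketched as a roll-up from transaction grain to daily grain; the record layout below is an assumption.

```python
from collections import defaultdict

def summarize_daily_sales(transactions):
    """Summarize transaction-level rows to (day, product, store) grain.
    Transaction-level detail is discarded, which is what saves storage."""
    totals = defaultdict(float)
    for t in transactions:
        totals[(t["day"], t["product"], t["store"])] += t["amount"]
    return dict(totals)
```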
9. Clustering & Associative Rules
- Q: How do clustering and association rules work? (3 marks)
- A:
- Clustering: groups similar items into homogeneous clusters from heterogeneous data. Example: all apples go into one cluster, even if their colors differ.
- Association rules: discover items that frequently occur together, expressed as if-then rules. Example: market-basket analysis may find that customers who buy bread also tend to buy butter.
10. Splitting Single Field
- Q: Why split name/address into separate fields? (5 marks)
- A:
- Improves indexing and query performance.
- Allows analysis by components: street, city, sector, etc.
11. Additive vs Non-Additive Facts (5 marks)
- Additive Facts: Sum gives correct results (e.g., quantity sold, total sales).
- Non-Additive Facts: Sum gives incorrect results (e.g., averages, ratios, % discount).
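A tiny numeric check of why averages are non-additive while quantities are; the figures are made up.

```python
rows = [
    {"qty": 10, "sales": 200.0},  # per-row average price: 20.0
    {"qty": 30, "sales": 300.0},  # per-row average price: 10.0
]

# Additive: summing quantities across rows is meaningful.
total_qty = sum(r["qty"] for r in rows)                      # 40

# Non-additive: summing the per-row averages gives nonsense.
sum_of_averages = sum(r["sales"] / r["qty"] for r in rows)   # 30.0

# The correct average must be rebuilt from additive components.
true_avg = sum(r["sales"] for r in rows) / total_qty         # 500 / 40 = 12.5
```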
12. Expression Partitioning
- Q: How does expression partitioning improve access?
- A: Rows are assigned to partitions by an expression over column values; queries that need only the matching partitions can skip the rest (partition elimination), improving query performance.
13. HOLAP Features
- Q: What is HOLAP?
- A: Hybrid OLAP combining MOLAP cubes for small dimensions and relational ROLAP for large dimensions. Supports multidimensional analysis while retrieving detailed data via SQL when needed.
14. Data Validation & Profiling
- Q: How to validate data? Purpose of data profiling?
- A:
- Data Profiling: Identify data type, transformations required, errors, and out-of-domain values.
- Ensures correct, usable data for analytics.
15. Full vs Incremental Extraction
- Full Extraction: Complete extraction, no tracking of changes, data as-is.
- Incremental Extraction: Extracts only new/changed data since last extraction, reflecting temporal changes.
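Incremental extraction is often implemented by filtering on a last-modified timestamp; a sketch under that assumption (the `modified_at` column name is hypothetical):

```python
from datetime import datetime

def incremental_extract(rows, last_run):
    """Return only rows changed since the previous extraction run,
    based on a last-modified timestamp column."""
    return [r for r in rows if r["modified_at"] > last_run]
```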
16. Merge/Purge Problem
- Q: What is merge/purge problem?
- A: Duplicate records in merged datasets must be identified and removed to prevent errors in analysis.
17. Data Quality Definitions
- Intrinsic Quality: Accuracy; data correctly reflects real-world objects.
- Realistic Quality: Utility/value of data to meet enterprise objectives; supports decision-making effectively.
18. Erroneous Data
- Q: Define with example
- A: Data that is incorrect or duplicates exist, causing cost/errors. Example: Duplicate addresses in mailing lists causing multiple mailings.
19. Aggregation
- Q: Why is aggregation one-way?
- A: Aggregates summarize data, original values cannot be recovered. Example: 3+2+1 = 6, original numbers not recoverable from 6.
20. Fact Table
- Q: What is a fact table?
- A: Central table in a dimensional model storing measurable data (facts) with foreign keys linking to dimension tables.
21. Active DWH
- Q: What is Active DWH?
- A: Supports near-real-time decision-making; processes updates and queries quickly.
22. Lexical Errors
- Q: Define lexical error.
- A: Syntactically dirty data. Example: a row has fewer columns than the table schema expects.
23. ELT
- Q: What is ELT and why used?
- A: Extract, Load, Transform; data transformed inside DWH to exploit parallelism and avoid separate transformation tools.
24. Physical Extraction
- Online: Extract directly from source system.
- Offline: Extract from staged structure (flat files, logs, transportable tables).
25. Data Cleansing Steps
- Elementizing
- Standardizing
- Verifying
- Matching
- Householding
- Documenting
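The first two cleansing steps can be sketched for an address field; the abbreviation table is illustrative, not exhaustive.

```python
def standardize(token):
    """Standardizing: map common abbreviations to one canonical form."""
    canon = {"st": "street", "st.": "street", "rd": "road", "ave": "avenue"}
    return canon.get(token.lower(), token.lower())

def elementize(raw_address):
    """Elementizing: break a free-form field into atomic tokens,
    then standardize each token."""
    return [standardize(tok) for tok in raw_address.replace(",", " ").split()]
```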
26. Cube Partitioning
- Q: When to partition cube?
- A: To overcome MOLAP space limits, distribute logical cube across multiple physical cubes, improve scalability while keeping partitioning invisible to users.
27. OLAP Types
- HOLAP: Hybrid, combines MOLAP + ROLAP
- MOLAP: Prebuilt cubes, fast, proprietary
- ROLAP: SQL-based, scalable
- DOLAP: Desktop OLAP, mobile-friendly, small datasets
28. Data Duplication in Source
- Can cause: false frequency distributions, double counting, misleading aggregates, untrustworthy analysis, financial risk in credit evaluations.
MIDTERM Spring 2011 – CS614
1. Effects of source data duplication on analysis in a DW (5 marks)
- Duplication can result in false frequency distributions, incorrect aggregates due to double counting, and difficulty in identifying unique entities (e.g., credit card customers).
- Leads to misleading statistics, bad decisions, and additional costs.
2. Features of Dimensional Modeling (5 marks)
- DM presents data in a standard, intuitive structure that allows high-performance access.
- Based on facts (numeric/additive measures) and dimensions (attributes describing facts).
- Fact table: central table with multipart key.
- Dimension tables: smaller tables with single-part keys linking to fact table.
- Forms star schema structure.
- Example: Average salary of customers age > 35 with >5 years experience.
3. Benefits of CDC in modern systems (3 marks)
- Immediate change detection.
- No loss of history, flat files not required.
- No incremental online I/O needed for log tape.
- Log tape captures all updates, can be processed offline.
4. Statistical analyzer: distributive, algebraic, holistic transformations (3 marks)
- Answer context not fully given; typically:
- Distributive: aggregation can be partitioned.
- Algebraic: functions calculated using algebraic formulae (e.g., average).
- Holistic: cannot be computed incrementally (e.g., median).
- Statistical analyzers usually use algebraic and holistic measures depending on metrics.
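The three classes can be demonstrated on partitioned data: SUM and COUNT are distributive, AVG is algebraic (derivable from those pieces), and MEDIAN is holistic.

```python
import statistics

partitions = [[3, 5, 7], [1, 9]]

# Distributive: per-partition results combine directly.
part_sums = [sum(p) for p in partitions]      # [15, 10]
part_counts = [len(p) for p in partitions]    # [3, 2]

# Algebraic: AVG is computed from the distributive pieces (sum, count).
avg = sum(part_sums) / sum(part_counts)       # 25 / 5 = 5.0

# Holistic: MEDIAN cannot be combined from per-partition medians;
# it needs the complete data set.
median = statistics.median([x for p in partitions for x in p])
```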
5. Round robin distribution is pre-defined? (2 marks)
- Not pre-defined. Spreads data evenly, but does not allow partition elimination.
- Used mainly for temporary tables.
6. MOLAP vs DOLAP (2 marks)
- MOLAP: OLAP with multidimensional data structure, precomputed cubes.
- DOLAP: OLAP for desktop environments, mobile-friendly, small datasets.
7. Offline vs Online extraction (2 marks)
- Online: Direct from source system or intermediate system.
- Offline: Data staged outside source system (flat files, redo logs, transportable tables).
8. Difference between ER and DM (5 marks)
- ER: Entity-relationship model, transactional design.
- DM: Dimensional model, for analytics and query performance; focuses on facts and dimensions.
9. Orrโs Laws of Data Quality (5 marks)
- Data that is not used cannot remain correct for very long.
- Quality depends on use, not just collection.
- Data quality is no better than its most stringent use.
- Quality issues increase with system age.
- The less likely an event is to occur, the more traumatic it is when it finally happens.
21. Snowflake Schema (2 marks)
- Fact table centralized, connected to multiple dimension tables.
- Dimension tables may be normalized, giving a snowflake shape.
22. Why aggregation & summarization are required? (2 marks)
- To resolve grain mismatch, reduce storage needs, and because detailed data may lose utility.
23. Condition for smart tools to construct less detailed aggregates (3 marks)
- Works when metrics are additive.
- Full aggregate awareness allows less detailed aggregates from detailed aggregates without accessing base data.
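For additive metrics, a less detailed aggregate can be rolled up from a more detailed one without touching base data; the daily-totals dictionary below is made up.

```python
from collections import defaultdict

# Detailed aggregate: quantity sold per (day, product).
daily = {("2011-01-01", "P1"): 10,
         ("2011-01-02", "P1"): 15,
         ("2011-02-01", "P1"): 7}

# Less detailed aggregate built purely from the detailed aggregate:
# roll each day key up to its YYYY-MM month prefix and add quantities.
monthly = defaultdict(int)
for (day, product), qty in daily.items():
    monthly[(day[:7], product)] += qty
```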
24. Web scraping (3 marks)
- Extracting data from web pages, which may include junk (HTML tags, inconsistent formats).
- Problems: limited query interfaces, free text fields, rapid page changes.
25. Data loading strategies (5 marks)
- Full data refresh: Block insert into empty table.
- Incremental refresh: Block insert into populated tables.
- Trickle/continuous feed: Row-level inserts/updates for streaming data.
26. Drawbacks of MOLAP & curse of dimensionality (5 marks)
- Maintenance: every item aggregated into every cube.
- Storage: many cubes take a lot of space.
- Scalability: difficult when dimension cardinality > ~64,000.
- Curse of dimensionality: combinatorial explosion of cubes, sparse cells.
- Solutions: virtual cubes, partitioned cubes.
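The combinatorial explosion behind the curse of dimensionality is easy to quantify: even ignoring hierarchy levels, d dimensions already yield 2^d possible group-by combinations (cuboids).

```python
def cuboid_count(d):
    """Number of distinct group-by combinations over d dimensions:
    each dimension is either included in the grouping or rolled up."""
    return 2 ** d
```

With hierarchy levels the count is even larger: it becomes the product of (levels + 1) over all dimensions.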
21. MOLAP features (2 marks)
- Pre-built cubes allow fast retrieval, slicing, dicing, complex calculations.
22. Steps in cleansing data (Marks missing)
- Typically: elementizing, standardizing, verifying, matching, householding, documenting.
23. Features of Star Schema (3 marks)
- Dimension hierarchies in single table per dimension.
- Single fact table simplifies joins, improves performance.
- Tradeoff: may require more storage.
24. Multi-pass BSN approach (3 marks)
- Several independent passes, each sorting on a different key and scanning with a small sliding window.
- Matched pairs from all passes are merged, then transitive closure is applied to identify duplicate sets.
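A sketch of the approach; the record fields and the toy similarity rule are assumptions (real systems use edit distance and domain rules). Two passes on different sort keys are unioned, approximating the multi-pass step before transitive closure.

```python
def similar(a, b):
    """Toy similarity rule: case-insensitive name match."""
    return a["name"].lower() == b["name"].lower()

def bsn_pass(records, key, window=3):
    """One sorted-neighborhood pass: sort on a key, then compare each
    record only with its neighbors inside a small sliding window."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            if similar(rec, ordered[j]):
                pairs.add(frozenset((rec["id"], ordered[j]["id"])))
    return pairs

records = [
    {"id": 1, "name": "Ali", "city": "Lahore"},
    {"id": 2, "name": "ali", "city": "Karachi"},
    {"id": 3, "name": "Sara", "city": "Lahore"},
]
# Two independent passes on different keys, then union (multi-pass idea).
pairs = (bsn_pass(records, key=lambda r: r["name"])
         | bsn_pass(records, key=lambda r: r["city"]))
```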
25. Benefits of HOLAP & DOLAP over MOLAP & ROLAP (5 marks)
- HOLAP: Combines relational (ROLAP) and cube (MOLAP) for scalability and performance.
- DOLAP: Mobile/desktop-friendly, handles small datasets efficiently.
26. Relationship between data quality & application value (5 marks)
- Data quality impacts usefulness of applications.
- Effort spent improving quality should be aligned with the most critical business objectives, since perfecting all data everywhere is not economical.