DW/BI fundamentals

Dimensional modeling

DM is identifying the dimensions in which the data should be projected. E.g.: product, time, geography. This allows to build the "cube" in which the data will then be allowed to be sliced.

Should start simple.

Normalized models can't be used in BI systems, because too complex to understand for users, and not efficient to query.

Two major types of models are:

Cubes are often more optimized and performant.

Recommendation is to load detailed, atomic info into a star schema, add optional OLAP cubes populated from the star schema.

Star schema:

erDiagram
    Sales-Facts {
        string productKey
        string customerKey
        float salesDollars
        int salesAmount
    }
    Sales-Facts ||--o| Date-Dimension: PK
    Date-Dimension {
        string dateKey
        int day
        int dayOfWeek
        int month
        int year
    }    Sales-Facts ||--o| Product-Dimension: PK
    Product-Dimension {
        string productKey
        string productName
        string brandName
        string categoryName
    }
    Sales-Facts ||--o| Customer-Dimension: PK
    Customer-Dimension {
        string customerKey
        string customerName
    }

There is one dimensional model per business process.

Dimensional models have no bias per how we want to slice or analyze the data.

Facts

Facts are Performance measurements

E.g. each sale

Referential integrity is when all rows in facts match their dimensions.

Dimensions

Textual context around a measurement event. Describe who what when where how and why.

Sometimes unclear if something is a fact or dim attribute. Decide by

Architecture

ETL

ETL contains:

Extract = read and understand source data.

Transform = Cleanse, dedupe, combine sources.

Load = structure physically and load into dimensional models.

Ensure quality. Validate conformance with 1:1 and 1:many business rules.

Cleanup is done once in the backroom (i.e. ETL).

Presentation area

Where the data is organized, stored, made available for querying by business community. ETL is off limit, so this is the DW/BI environment as far as business community is concerned.

Strong considerations:

Dimensional modeling techniques

Gather requirements and data realities: meet with business to understand requirements, and software to understand data sources.

Build dimensional model collaboratively, in workshops.

Four steps:

  1. Select business process
  2. Declare the grain (what does a single fact represent)
  3. Identify the dimensions (who, what, when, how, why, ...)
  4. Identify the facts (measurements we record in each entry of the fact table)

Fact table is entirely based on a physical activity, not influenced by eventual reports that may be produced

Facts can be additive, semi-additive, non-additive.

Avoid null values for dimensions, instead reference a default value.

Facts can be transactional (something happened) or periodic (measurements at intervals) or accumulating (object whose status changes, e.g. order line which can be added, removed, etc.) or factless (pure boolean events without measurements).

Consolidated facts tables combine multiple facts of the same grain into one table. E.g. consolidated sales = actuals + forecasts.

Dimensions