DW/BI fundamentals
- Understand business user - responsibilities, decisions that need be made
- Deliver high quality, relevant, accessible information
- Make sure data is accurate and can be trusted
- Consistency
- Sustain the environment
Dimensional modeling
DM is identifying the dimensions in which the data should be projected. E.g.: product, time, geography. This allows to build the "cube" in which the data will then be allowed to be sliced.
Should start simple.
Normalized models can't be used in BI systems, because too complex to understand for users, and not efficient to query.
Two major types of models are:
- Relational: star-schema. Facts are the center of the star (e.g. sales), dimensions are the branches (e.g. market dimension, date dimension, product dimension)
- Multi-dimensional: OLAP cube = online analytical processing. Each side of the cube is a dimension, inside the cube are fact (e.g. market, date and product are each a side)
Cubes are often more optimized and performant.
Recommendation is to load detailed, atomic info into a star schema, add optional OLAP cubes populated from the star schema.
Star schema:
erDiagram
Sales-Facts {
string productKey
string customerKey
float salesDollars
int salesAmount
}
Sales-Facts ||--o| Date-Dimension: PK
Date-Dimension {
string dateKey
int day
int dayOfWeek
int month
int year
} Sales-Facts ||--o| Product-Dimension: PK
Product-Dimension {
string productKey
string productName
string brandName
string categoryName
}
Sales-Facts ||--o| Customer-Dimension: PK
Customer-Dimension {
string customerKey
string customerName
}
There is one dimensional model per business process.
Dimensional models have no bias per how we want to slice or analyze the data.
Facts
Facts are Performance measurements
E.g. each sale
- Tend to be deep in rows and narrow in columns.
- There should be a single fact table per dimensional model to make sure that everything is measured in a consistent way.
- Each entry should be of the same "grain" (to avoid double-counting things)
- Have additive properties (e.g. amount). Sometimes semi-additive (e.g. balance,). Sometimes non-additive (e.g. unit price)
- There shouldn't be textual values in facts, unless it is unique to the fact (in which case it might be useless since hard to analyze). Text should go to dimensions.
- Of three "grain" categories:
- Transactions
- Periodic snap
- Accumulating snap
- Accessed through dimensions joined to it.
- Composite key is composed of a subset of foreign keys.
Referential integrity is when all rows in facts match their dimensions.
Dimensions
Textual context around a measurement event. Describe who what when where how and why.
- Often have many columns or attributes.
- Tend to have fewer rows than facts.
- Identified by a single primary key.
- Serve as primary source for query constraint, groupings, filters, labels.
- Identified by the word "by" (sales by product and geo → means product and geo must be in dimensions)
- Source of clarity
- Should consist of real words instead of abbreviations and codes
- Should include units
- Should not combine aspects (e.g. dont combine colour and size into a single code, instead have a column for colour and one for size)
- Are typically highly denormalized and have redundant data (e.g. product brand name can be denormalized into the product table to facilitate querying). This is called snowflaking.
Sometimes unclear if something is a fact or dim attribute. Decide by
- Is it used in calculations and take lots of values (Fact)
- Discretely valued and constant, participates to row labeling (Dim)
- If something that should be a dim changes often, it can be made a fact (e.g. if product price is very changing)
- Sometimes they can be either, and sometimes even redundantly stored in both.
Architecture
ETL
ETL contains:
- Work area
- Data structures
- Processes.
Extract = read and understand source data.
Transform = Cleanse, dedupe, combine sources.
Load = structure physically and load into dimensional models.
Ensure quality. Validate conformance with 1:1 and 1:many business rules.
Cleanup is done once in the backroom (i.e. ETL).
Presentation area
Where the data is organized, stored, made available for querying by business community. ETL is off limit, so this is the DW/BI environment as far as business community is concerned.
Strong considerations:
- It should be using a dimensional schema (either OLAP or star-schema) ; so that it's easy to understand and use.
- It should contain atomic data, rather than pre-calculated aggregates ; so that it can serve un-predictable requests. It may also store some aggregates for performance purposes.
- Should be structured around business process measurement events - correspond to physical data capture events, not to answer the question of the day.
- Is cross functional, i.e. a single fact table for a process that spans across departments.
- Uses common, conformed dimensions.
Dimensional modeling techniques
Gather requirements and data realities: meet with business to understand requirements, and software to understand data sources.
Build dimensional model collaboratively, in workshops.
Four steps:
- Select business process
- Declare the grain (what does a single fact represent)
- Identify the dimensions (who, what, when, how, why, ...)
- Identify the facts (measurements we record in each entry of the fact table)
Fact table is entirely based on a physical activity, not influenced by eventual reports that may be produced
Facts can be additive, semi-additive, non-additive.
- Non additive, where possible, should store the additive components.
Avoid null values for dimensions, instead reference a default value.
Facts can be transactional (something happened) or periodic (measurements at intervals) or accumulating (object whose status changes, e.g. order line which can be added, removed, etc.) or factless (pure boolean events without measurements).
Consolidated facts tables combine multiple facts of the same grain into one table. E.g. consolidated sales = actuals + forecasts.
Dimensions
- Should have a single, integer PK. Avoid natural values, because they might change and all that. Instead use an int starting from 0.
- Only exception is dates, which can be very predictable and so, use something like a string or int representing the date 20230102. Datetime stamp might also be used as a fact column. For time of day, use a separate dimension table "time-of-day".
- Durable keys = unique key for a physical thing that might have multiple records, e.g. an employee who was rehired.
- Multiple hierarchies can be represented in a single dimension table. E.g. for dates, day>month>year and day>week>quarter>fy.
- Don't use nulls in FK columns of facts,