← All writing
Star Schema8 min read · Blog

Designing a star schema, step by step

It's easy to agree that the star schema is the right shape for analytics. It's harder to sit down in front of a messy source system and actually build one. The gap is method. Dimensional modeling has a well-worn four-step process, and following it in order is the difference between a star that lasts and a pile of tables that happen to have a fact in the middle.

This is the practical companion to the case for why the star wins. Here's how.

Step 1: Pick the business process

Start with a verb, not a table. A star schema models a business process — an activity the organization does and wants to measure: placing an order, shipping a package, opening a support ticket, charging a card. Not "the customers table." A process.

Why this first? Because the process determines everything downstream — what an event is, what you measure, what context applies. Teams that skip this and start from "what tables do we have" end up modeling their source schema instead of their business, and a model that mirrors the source inherits all of its accidents.

Step 2: Declare the grain

Now state what one row of the fact table represents — one sentence, no "usually." "One row per order line." "One row per package per scan." This is the most consequential decision in the whole model; I've made the full case for it separately, but the short version is: the grain fixes what every measure means and which joins are safe. Pick the most atomic grain the process supports — one row per individual event. You can always aggregate up; you can never disaggregate down what you never captured.

Step 3: Identify the dimensions

With the grain fixed, ask: by what would someone want to slice this? Every answer is a dimension. For an order line: the customer, the product, the date, the store, the promotion. Dimensions are the "by" in every question — "revenue by region by month."

Each dimension should be a single, conformed table reused across every fact that references it — one customer, one date, shared everywhere. This is what lets metrics line up across processes, and it's the property that makes them survive reorgs.

Step 4: Identify the facts

Finally, the measures — the numbers you'll aggregate. For an order line: quantity, unit price, extended amount, discount. The discipline here is additivity: a fact is most useful when you can SUM it across every dimension. Quantities and amounts usually qualify; ratios and percentages don't (you average or recompute those from their components, never sum them). Store the additive building blocks, derive the ratios at query time.

Why the order matters

Process → grain → dimensions → facts isn't arbitrary. Each step constrains the next. The process bounds the grain; the grain determines which dimensions apply and what the facts mean; the facts fall out of the grain almost mechanically. Do them out of order — facts first, grain last — and you reverse-engineer meaning from whatever SQL you happened to write, which is exactly how mixed grain and orphaned measures sneak in.

Get the four steps right and the rest of this arc — fact table types, durable dimensions, surrogate keys — is detail work on a sound foundation. Get them wrong and no amount of detail saves it.

Also asListensoonSlidessoonPodcastsoonVideosoon

Have data that should be doing more?

Tell me about the pipeline that breaks, the metric nobody trusts, or the analysis stuck in a notebook. Let's operationalize it.