# Tuesday, 31 August 2010

See also:

In Part I we looked at when you should build your data warehouse and concluded that you should build it sooner rather than later to take advantage of reporting and view optimization. Today we will look at your options to build your data warehouse schema.

When architecting a data warehouse, you have two basic options: build a flat “reporting” table for each operation you are performing, or build with BI/cubes in mind and implement a “star” or “snowflake” schema. Let’s take a quick look at the first option and then we will take a look at the star and snowflake schemas.

Whenever the business requests a complex report, developers usually slow down the system with a complex SQL statement or operation. For example, pretend in our order entry system (OLTP) the business wants a report that says this: show me the top ten customers in each market including their overall rank. You would usually have to perform a query like this:

  1. Complex joins for unique customer
  2. Rollup the sales
  3. Ranking functions to determine overall rank
  4. Partition functions to segment the rank by country
  5. Standard aggregates to get the sales
  6. Dump all of this to a work table in order to pull out the top 10 (if you don’t do this, you will lose the overall rank)

A typical SQL statement to do steps 1-5 would look like this:

With CTETerritory
   Select cr.Name as CountryName, CustomerID, 
                Sum(TotalDue) As TotalAmt
   From Sales.SalesOrderHeader soh 
   inner join Sales.SalesTerritory ter
   on soh.TerritoryID=ter.TerritoryID
   inner join Person.CountryRegion cr 
   on cr.CountryRegionCode=ter.CountryRegionCode
   Group By cr.Name, CustomerID
Select *, Rank() Over (Order by TotalAmt DESC) as OverallRank,
Rank() Over
     (Partition By CountryName Order By TotalAmt DESC,
            CustomerID DESC) As NationalRank
From CTETerritory

Argh! No wonder developers hate SQL and want to use ORMs! (I challenge the best ORM to make this query!)

Instead you can create a table, lets call it SalesRankByRegion, with the fields: CountryName, CustomerID, TotalSales, OverallRank, and NationalRank, and use the above SQL as part of a synchronization/load script to fill your reporting table on a regular basis. Then your SQL statement for the above query looks like this:

SELECT * FROM SalesRankByRegion
WHERE CustomerNationalRank Between 1 and 10
ORDER BY CountryName, CustomerNationalRank

The results look like:


That is more like it! A simple select statement is easier for the developer to write, the ORM to map, and the system to execute.

The SalesRankByRegion table is a vast improvement over having to query all of the OLTP tables (by my count there are three tables plus the temp table). While this approach has its appeal, very quickly, your reporting tables will start to proliferate.

Your best option is to follow one of the two industry standards for data warehouse tables, a “star” or a “snowflake’ schema. Using a schema like this gives you a few advantages. They are more generic than the SalesRankByRegion, which was a table built for one query/report, giving you the ability to run many different reports off each table. Another advantage is that you will have the ability to build cubes very easily off of a star or snowflake schema as opposed to a bunch of SalesRankByRegion tables.

The design pattern for building true data warehouse tables are to build a “fact” table, or a table that contains detail level (or aggregated) “facts” about something in the real world, like an order or customer for example. Inside of the fact table you will also have “measures” or a numeric value that represents a “fact.” To support your fact table you will have “dimension” tables. Dimensions are a structure that will categorize your data, usually in the form of a hierarchy. A dimension table for example could be “time” with a hierarch of OrderYear, OrderQuarter, OrderMonth, OrderDate, OrderTime.

There are tons of tutorials on the internet that show you how to build a star or snowflake schema and the difference between them, so I will not repeat them here. (You may want to start here.) I’ll give you the high level on a simple star schema here.

Let’s say we have an order entry system, such as Northwind (in the Microsoft SQL Server sample database.) You can have a fact table that revolves around an order. You can then have three (or more) fact tables that focus on: time, product, and salesperson. The time dimension would roll-up the order date by year, quarter, month, and date. The product dimension would roll-up the product by the product and category. (In most systems you would have a much deeper hierarchy for products.) The salesperson dimension would be roll-up of the employee, the employee manager and the department they work in. The key in each of these tables would then be foreign keys in the fact table, along with the measure (or numerical data describing the fact.)

There is an example similar to this in Programming SQL Server 2008, a book where I am a co-author. Here is modified version of that demo:

Dimension tables:

CREATE TABLE [dwh].[DimTime] (
[TimeKey] [int] IDENTITY (1, 1) NOT NULL Primary Key,
[OrderDate] [datetime] NULL ,
[Year] [int] NULL ,
[Quarter] [int] NULL ,
[Month] [int] NULL 

CREATE TABLE [dwh].[DimProduct] (
[ProductID] [int] not null Primary Key,
[ProductName] nvarchar(40) not null,
[UnitPrice] [money] not null,
[CategoryID] [int] not null,
[CategoryName] nvarchar(15) not null

CREATE TABLE [dwh].[DimEmployee] (
EmployeeID int not null Primary Key,
EmployeeName nvarchar(30) not null,
EmployeeTitle nvarchar(30),
ManagerName nvarchar(30)

Fact table:
CREATE TABLE [dwh].FactOrder (
[PostalCode] [nvarchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[ProductID] [int] NOT NULL ,
[EmployeeId] [int] NOT NULL ,
[ShipperId] [int] NOT NULL ,
[Total Sales] [money] NULL ,
[Discount] [float] NULL ,
[Unit Sales] [int] NULL ,
[TimeKey] [int] NOT NULL 

We have the basis of a star schema. Now we have to fill those tables and keep them up to date. That is a topic for Part III.

Tuesday, 31 August 2010 07:44:02 (Eastern Daylight Time, UTC-04:00)
"Argh! No wonder developers hate SQL and want to use ORMs!"

What's wrong with that query? To a SQL developer (like me), absolutely nothing. Its well crafted, formatted well for readability and readily demonstrates exactly what it is doing.

To people that use SQL there is absolutely nothing wrong with it. Perhaps your allusion should be:
"If you don't know or like SQL, get someone in who does"
as opposed to
"If you don't know or like SQL, use an ORM"

That latter opinion is one I see all the time and it angers me to the core. Hiding complexity beneath a layer of abstraction does not remove a problem, it only masks it (and often exacerbates it).

Let me state I have nothing against ORMs when used with good reason but if the justification for using them is "I don't like SQL" then one should be taking a long hard look at themselves and question whether they really should be working on that project. I'd thank you too, Stephen, for not propogating the myth that ORMs let .Net devs build what ordinarily they wouldn't be able to.
Saturday, 11 September 2010 09:20:00 (Eastern Daylight Time, UTC-04:00)
Good article. I'm glad to have some backing here. I've had some intense arguments with pure OLTP people who were not familiar with the fact that de-normalization is 'the norm' in data warehouses. When one in particular saw my flattened data warehouse, he started talking to me as if I were a child and begain to explain 3rd normal form to me as if I'd never heard of it. Oh, brother.

Our detail level data is flattened to include the various redundant header-level pieces and these tables are generated nightly in the ETL using the more complex sql that one could never hope to teach a CFO, for instance. The requests for ad-hoc reports that used to be the bane of my daily existence has dropped to next to nothing.

Well, what about the wasted space, he incredulously asked. I had anticipated this question, and had done the math. At our cost for server-class Raid space, taking the the redundant data bytes times the # of records, I'm looking at in the $100 range for just the extra space required on redundant fields for 10 years of data! Oooo...I hope we don't break the budget!

In my situtation I am the only software/dba/developer person in a 500 mil. company, so instead of building hundreds of canned reports or a comprehensive reporting system, which I don't have time for in this lifetime, I invested time training department heads on very simple sql which they have no problem with since they never need to do a join.

Comments are closed.