# Friday, September 3, 2010

See also:

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system and in Part II we looked at how to architect a data warehouse’s table schema. In Part III, we looked at where to put the data warehouse tables. In Part IV, we are going to look at how to populate those tables and keep them in sync with your OLTP system. Today, our last part in this series, we will take a quick look at the benefits of building the data warehouse before we need it for cubes and BI by exploring our reporting and other options.

As I said in Part I, you should plan on building your data warehouse when you architect your system up front. Doing so gives you a platform for building reports, or even application such as web sites off the aggregated data. As I mentioned in Part II, it is much easier to build a query and a report against the rolled up table than the OLTP tables.

To demonstrate, I will make a quick Pivot Table using SQL Server 2008 R2 PowerPivot for Excel (or just PowerPivot for short!). I have showed how to use PowerPivot before on this blog, however, I usually was going against a SQL Server table, SQL Azure table, or an OData feed. Today we will use a SQL Server table, but rather than build a PowerPivot against the OLTP data of Northwind, we will use our new rolled up Fact table.

To get started, I will open up PowerPivot and import data from the data warehouse I created in Part II. I will pull in the Time, Employee, and Product dimension tables as well as the Fact table.


Once the data is loaded into PowerPivot, I am going to launch a new PivotTable.


PowerPivot understands the relationships between the dimension and fact tables and places the tables in the designed shown below. I am going to drag some fields into the boxes on the PowerPivot designer to build a powerful and interactive Pivot Table. For rows I will choose the category and product hierarchy and sum on the total sales. I’ll make the columns (or pivot on this field) the month from the Time dimension to get a sum of sales by category/product by month. I will also drag in Year and Quarter in my vertical and horizontal slicers for interactive filtering. Lastly I will place the Employee field in the Report Filter pane, giving the user the ability to filter by employee.


The results look like this, I am dynamically filtering by 1997, third quarter and employee name Janet Leverling.


This is a pretty powerful interactive report build in PowerPivot using the four data warehouse tables. If there was no data warehouse, this Pivot table would have been very hard for an end user to build. Either they or a developer would have to perform joins to get the category and product hierarchy as well as more joins to get the order details and sum of the sales. In addition, the breakout and dynamic filtering by Year and Quarter, and display by month, are only possible by the DimTime table, so if there were no data warehouse tables, the user would have had to parse out those DateParts. Just about the only thing the end user could have done without assistance from a developer or sophisticated query is the employee filter (and even that would have taken some PowerPivot magic to display the employee name, unless the user did a join.)

Of course Pivot Tables are not the only thing you can create from the data warehouse tables you can create reports, ad hoc query builders, web pages, and even an Amazon style browse application. (Amazon uses its data warehouse to display inventory and OLTP to take your order.)

I hope you have enjoyed this series, enjoy your data warehousing.

posted on Friday, September 3, 2010 4:14:33 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [1] Trackback
# Thursday, September 2, 2010

See also:

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system and in Part II we looked at how to architect a data warehouse’s table schema. In Part III, we looked at where to put the data warehouse tables. Today we are going to look at how to populate those tables and keep them in sync with your OLTP system.

No matter where your data warehouse is located, the biggest challenge with a data warehouse, especially one where you are going to do real time reporting off of, is that the data is published from the transactional system (OLTP). By definition, the data warehouse is out of date compared to the OLTP system.

Usually when you ask the boss, “How much latency can you accept between the OLTP and data warehouse systems” the boss will reply: none. While that is impossible, the more time to develop and the more money you have will to develop said system, the closer to real time you can get. Always bring up the American Express example used in Part I for proof that your system can accept at least some latency.

If you remember the code examples from Part II, most of the time you have to query OLTP data, aggregate it, and then load it into your new schema. This is known as the process of extraction, transformation, and loading (ETL) data from the OLTP system into the data warehouse. The workflow usually goes like this: at a pre-set time (at every change, hourly, nightly, or weekly) query the OLTP data (extraction) and then aggregate and flatten it out (transformation) and then copy the transformed data to the data warehouse (star or snowflake) tables (load).

You have several options ranging from near real time ETL (very complex and expensive) to nightly data dumps (least complex and least expensive.) You can also perform full ETL (wipe out the data warehouse tables completely and refill on each operation) and partial ETL (only add incremental data as per a time series.) The actual load can be done via database triggers on an add/update/delete for near real time to simple scheduled SQL batches that wipe data and run your load scripts. Most likely you will want to take a hybrid approach, depending on the maturity of your system. You have three basic options, ranging from least complex to most complex:

  • Direct database dump
  • ETL tool
  • Database triggers

When you have a long time series to publish your data, say nightly, or weekly, you can do a direct database dump. The process would be pretty straightforward. At a regular interval (or manually) a process would start that would query the OLTP database and perform all of the aggregations, etc and then load it into a staging data warehouse database, then wipe out the production data warehouse and load the data in.

Another option is to use an ETL tool. A good example is SQL Server Integration Services (SSIS) if you are using Microsoft SQL Server. (Actually SSIS will work with multiple database, you just need a SQL Server host.) A modern ETL tool will give you the ability to segment the work into logical groups, have a control flow based on success and failure of a condition, and allows rollbacks.


A typical workflow with an ETL tool is that the ETL will run on a schedule (or based on a condition, such as a message arriving from a queue or a record written to a admin table) and have a parameter(s) passed to it. This parameter is usually a time series and the ETL will perform all of the extraction on data from the OLTP database filtered by that parameter. An ETL tool is the most likely solution you will employ, especially if you have to make frequent updates from your OLTP system to your data warehouse.

Another option is to use database triggers. For those of you that don’t know a lot about triggers, well they can lead to evil. ;) That said, they are events that fire when data changes. You can then write SQL code to run when the data changes, even ETL code. Triggers are hard to debug and difficult to maintain, so I would only suggest that you use a trigger when you need “real time” updates to your data warehouse and even then, the trigger should only write a record into an admin table that your ETL process is polling to get started.

A common design pattern with a sophisticated ETL is to do all of the ETL to a staging data warehouse and allow a check mechanism to verify that the ETL was successful. Once this check is performed (either by computer or by human, depending on the system), you can then push the data from the staging tables to the production tables. There is a SQL operator specifically for this process, the MERGE operator. MERGE looks at two tables and looks at their joins and compares the data and will do add, updates, inserts, and deletes to keep the two tables in sync. Here is how we would do that with a MERGE statement and our fact table from Part II.

--MERGE the FACT Local to FACT Remote
MERGE dbo.FactOrder as t--target
  USING Northwind.dwh.FactOrder as s--source
  ON t.ProductID = s.ProductID AND
  t.EmployeeID = s.EmployeeID AND
  t.ShipperID = s.ShipperID AND
  t.TimeKey = s.TimeKey --joining keys

--record exists in target, but data different
WHEN MATCHED AND (s.OrderDate != t.OrderDate OR 
   s.PostalCode != t.PostalCode OR
   s.[Total Sales] != t.[Total Sales] OR
   s.Discount != t.Discount OR
   s.[Unit Sales] != t.[Unit Sales]) THEN
       UPDATE SET t.OrderDate = s.OrderDate, 
       t.PostalCode = s.PostalCode,
       t.[Total Sales] = s.[Total Sales],
       t.Discount = s.Discount,
t.[Unit Sales] = s.[Unit Sales]

   --only in target, get rid of it

   --record only exists in source
   INSERT VALUES (s.OrderDate, s.PostalCode, s.ProductID,s.EmployeeID,
   s.ShipperID, s.[Total Sales], s.Discount, s.[Unit Sales], s.[TimeKey]);--required!!

Tomorrow we will warp up with the application development options.

posted on Thursday, September 2, 2010 5:58:03 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [1] Trackback
# Wednesday, September 1, 2010

See also:

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system and in Part II we looked at how to architect a data warehouse’s table schema. Today we are going to look at where to put your data warehouse tables.

Let’s look at the location of your data warehouse. Usually as your system matures, it follows this pattern:

  • Segmenting your data warehouse tables into their own isolated schema inside of the OLTP database
  • Moving the data warehouse tables to their own physical database
  • Moving the data warehouse database to its own hardware

When you bring a new system online, or start a new BI effort, to keep things simple you can put your data warehouse tables inside of your OLTP database, just segregated from the other tables. You can do this a variety of ways, most easily is using a database schema (ie dbo), I usually use dwh as the schema. This way it is easy for your application to access these tables as well as fill them and keep them in sync. The advantage of this is that your data warehouse and OLTP system is self-contained and it is easy to keep the systems in sync.

As your data warehouse grows, you may want to isolate your data warehouse further and move it to its own database. This will add a small amount of complexity to the load and synchronization, however, moving the data warehouse tables to their own table brings some benefits that make the move worth it. The benefits include implementing a separate security scheme. This is also very helpful if your OLTP database scheme locks down all of the tables and will not allow SELECT access and you don’t want to create new users and roles just for the data warehouse. In addition, you can implement a separate backup and maintenance plan, not having your date warehouse tables, which tend to be larger, slow down your OLTP backup (and potential restore!). If you only load data at night, you can even make the data warehouse database read only. Lastly, while minor, you will have less table clutter, making it easier to work with.

Once your system grows even further, you can isolate the data warehouse onto its own hardware. The benefits of this are huge, you can have less I/O contention on the database server with the OLTP system. Depending on your network topology, you can reduce network traffic. You can also load up on more RAM and CPUs. In addition you can consider different RAID array techniques for the OLTP and data warehouse servers (OLTP would be better with RAID 5, data warehouse RAID 1.)

Once you move your data warehouse to its own database or its own database server, you can also start to replicate the data warehouse. For example, let’s say that you have an OLTP that works worldwide but you have management in offices in different parts of the world. You can reduce network traffic by having all reporting (and what else do managers do??) run on a local network against a local data warehouse. This only works if you don’t have to update the date warehouse more than a few times a day.

Where you put your data warehouse is important, I suggest that you start small and work your way up as the needs dictate.

posted on Wednesday, September 1, 2010 5:39:54 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [0] Trackback
# Tuesday, August 31, 2010

See also:

In Part I we looked at when you should build your data warehouse and concluded that you should build it sooner rather than later to take advantage of reporting and view optimization. Today we will look at your options to build your data warehouse schema.

When architecting a data warehouse, you have two basic options: build a flat “reporting” table for each operation you are performing, or build with BI/cubes in mind and implement a “star” or “snowflake” schema. Let’s take a quick look at the first option and then we will take a look at the star and snowflake schemas.

Whenever the business requests a complex report, developers usually slow down the system with a complex SQL statement or operation. For example, pretend in our order entry system (OLTP) the business wants a report that says this: show me the top ten customers in each market including their overall rank. You would usually have to perform a query like this:

  1. Complex joins for unique customer
  2. Rollup the sales
  3. Ranking functions to determine overall rank
  4. Partition functions to segment the rank by country
  5. Standard aggregates to get the sales
  6. Dump all of this to a work table in order to pull out the top 10 (if you don’t do this, you will lose the overall rank)

A typical SQL statement to do steps 1-5 would look like this:

With CTETerritory
   Select cr.Name as CountryName, CustomerID, 
                Sum(TotalDue) As TotalAmt
   From Sales.SalesOrderHeader soh 
   inner join Sales.SalesTerritory ter
   on soh.TerritoryID=ter.TerritoryID
   inner join Person.CountryRegion cr 
   on cr.CountryRegionCode=ter.CountryRegionCode
   Group By cr.Name, CustomerID
Select *, Rank() Over (Order by TotalAmt DESC) as OverallRank,
Rank() Over
     (Partition By CountryName Order By TotalAmt DESC,
            CustomerID DESC) As NationalRank
From CTETerritory

Argh! No wonder developers hate SQL and want to use ORMs! (I challenge the best ORM to make this query!)

Instead you can create a table, lets call it SalesRankByRegion, with the fields: CountryName, CustomerID, TotalSales, OverallRank, and NationalRank, and use the above SQL as part of a synchronization/load script to fill your reporting table on a regular basis. Then your SQL statement for the above query looks like this:

SELECT * FROM SalesRankByRegion
WHERE CustomerNationalRank Between 1 and 10
ORDER BY CountryName, CustomerNationalRank

The results look like:


That is more like it! A simple select statement is easier for the developer to write, the ORM to map, and the system to execute.

The SalesRankByRegion table is a vast improvement over having to query all of the OLTP tables (by my count there are three tables plus the temp table). While this approach has its appeal, very quickly, your reporting tables will start to proliferate.

Your best option is to follow one of the two industry standards for data warehouse tables, a “star” or a “snowflake’ schema. Using a schema like this gives you a few advantages. They are more generic than the SalesRankByRegion, which was a table built for one query/report, giving you the ability to run many different reports off each table. Another advantage is that you will have the ability to build cubes very easily off of a star or snowflake schema as opposed to a bunch of SalesRankByRegion tables.

The design pattern for building true data warehouse tables are to build a “fact” table, or a table that contains detail level (or aggregated) “facts” about something in the real world, like an order or customer for example. Inside of the fact table you will also have “measures” or a numeric value that represents a “fact.” To support your fact table you will have “dimension” tables. Dimensions are a structure that will categorize your data, usually in the form of a hierarchy. A dimension table for example could be “time” with a hierarch of OrderYear, OrderQuarter, OrderMonth, OrderDate, OrderTime.

There are tons of tutorials on the internet that show you how to build a star or snowflake schema and the difference between them, so I will not repeat them here. (You may want to start here.) I’ll give you the high level on a simple star schema here.

Let’s say we have an order entry system, such as Northwind (in the Microsoft SQL Server sample database.) You can have a fact table that revolves around an order. You can then have three (or more) fact tables that focus on: time, product, and salesperson. The time dimension would roll-up the order date by year, quarter, month, and date. The product dimension would roll-up the product by the product and category. (In most systems you would have a much deeper hierarchy for products.) The salesperson dimension would be roll-up of the employee, the employee manager and the department they work in. The key in each of these tables would then be foreign keys in the fact table, along with the measure (or numerical data describing the fact.)

There is an example similar to this in Programming SQL Server 2008, a book where I am a co-author. Here is modified version of that demo:

Dimension tables:

CREATE TABLE [dwh].[DimTime] (
[TimeKey] [int] IDENTITY (1, 1) NOT NULL Primary Key,
[OrderDate] [datetime] NULL ,
[Year] [int] NULL ,
[Quarter] [int] NULL ,
[Month] [int] NULL 

CREATE TABLE [dwh].[DimProduct] (
[ProductID] [int] not null Primary Key,
[ProductName] nvarchar(40) not null,
[UnitPrice] [money] not null,
[CategoryID] [int] not null,
[CategoryName] nvarchar(15) not null

CREATE TABLE [dwh].[DimEmployee] (
EmployeeID int not null Primary Key,
EmployeeName nvarchar(30) not null,
EmployeeTitle nvarchar(30),
ManagerName nvarchar(30)

Fact table:
CREATE TABLE [dwh].FactOrder (
[PostalCode] [nvarchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[ProductID] [int] NOT NULL ,
[EmployeeId] [int] NOT NULL ,
[ShipperId] [int] NOT NULL ,
[Total Sales] [money] NULL ,
[Discount] [float] NULL ,
[Unit Sales] [int] NULL ,
[TimeKey] [int] NOT NULL 

We have the basis of a star schema. Now we have to fill those tables and keep them up to date. That is a topic for Part III.

posted on Tuesday, August 31, 2010 7:30:42 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [2] Trackback
# Monday, August 30, 2010

Most developers are scared of “Business Intelligence” or BI. Most think that BI consists of cubes, pivot/drill down apps, and analytical decision support systems. While those are very typical outcomes of a BI effort, many people forget about the first step, the data warehouse.

Typically this is what happens with a BI effort. A system is built, usually a system that deals with transactions. We call this an OLTP or on-line transaction processing system. Some time passes and reports are bolted on and some business analysts build some pivot tables from “raw dumps” of data. As the system grows, reports start to slow since the system is optimized to deal with one record at a time. Someone, usually a CTO type says: “we need a BI system.” A development effort is then spent to build a data warehouse and cubes, and some kind of analytical system on top of those cubes.

I make the argument that developers and project planners should embrace the data warehouse up front. When you design your OLTP system, also design the supporting data warehouse, even if you have no intention of building a full-fledged BI application with cubes and the like. This way you have two distinct advantages. First is that you have a separate system that is optimized for reporting. This system will allow the rapid creation of many new reports as well take the load off the OLTP system. Second, when you do decide to build a BI system based on cubes, you will already have the hard part done, building the data warehouse and supporting ETL.

Since a data warehouse uses more of a flatter data model (more on this in Part II), you can even design your application to use both the OLTP and data warehouse as data sources. For example, when you have highly normalized, 3rd normal form transactional tables to support transactions, it is never easy to use those tables for reporting and displaying of information. Those tables are optimized and indexed to support retrieving and editing (or adding/deleting) one record at a time. When you try to do things in aggregate, you start to stress your system, since it was designed to deal with one record at a time.

This design pattern is already in use today at many places. Consider your credit card company for example. I use American Express, and I never see my transactions show up for at least 24 hours. If go buy something and I phone American Express and say “what was my last transaction” they will tell you right away. If you look online, you will not see that transaction until the next business day. Why? When you call the customer service representative, they are looking at the OLTP system, pulling up one record at a time. When you are looking online, you are looking at the data warehouse, a system optimized for viewing lots of data in a reporting environment.

You can take this to an extreme, if you ran an e-commerce site, you can power your product catalog view portion of the site with the data warehouse and the purchase (inventory) system with the OLTP model. Optimize the site for browsing (database reads) and at the same time have super-fast e-commerce (database writes.) Of course you have to keep the purchasing/inventory (OLTP) and product display (data warehouse) databases in sync. I’ll talk about that in Part III. Next, I will take a look at how to build the data warehouse.

posted on Monday, August 30, 2010 7:33:02 PM (Eastern Daylight Time, UTC-04:00)  #    Comments [1] Trackback
# Wednesday, August 25, 2010

A while ago I was asked by the publisher to be a tech editor of A Practical Guide to Distributed Scrum. Since agile luminaries like Ken Schwaber and Scott Ambler were also tech editors, I was honored to be chosen as well. Reviewing this book was a great experience and I have re-read the book since it was published (even thought I was paid to be a tech editor/reviewer, the publisher sent me a free copy when the book was published. Cool!)

8-25-2010 6-20-37 PM

You can learn a lot about using Scrum in a distributed environment from reading this book, it is the gold standard. If you have remote employees, off shore developers, or just a lot of offices where the product owner is in one location and the development team in another, this book is for you. The authors walk you through the process of setting up scrum in a distributed environment including planning, user stories, and the daily scrum. They give practical advice on how to deal with the problems specific to distributed teams using scrum, including most importantly communication and coordination. The authors are from IBM and show some of the techniques used at IBM with their remote employees, offices, and contractors.

I have been doing scrum in a distributed environment for almost 5 years now, and still learned quite a bit by reading this book. I encourage you to read it too.

posted on Wednesday, August 25, 2010 6:50:07 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [1] Trackback
# Tuesday, August 24, 2010

I recently read Kanban by David J Anderson. David is credited with implementing some of the first Kanban agile systems at various companies. In Kanban, he gives a great overview of what Kanban is, how it grew out of the a physical manufacturing process at Toyota, and offers practical advice on how to implement Kanban at your organization. David also shows you how to set up a Kanban Board and provides several ways to model your system and manage the board.

In addition, David walks us through what the Lean movement is and how it relates to agile software development. He makes a very convincing case for tracking work in progress (WIP) and basing your system around that. Kanban attempts to limit WIP for better throughput. David freely admits that there is no actual scientific evidence as of yet that proves smaller WIP increases productivity and quality, however, he offers up his case studies as well as others.


What I found very helpful is that David reviews the popular Scrum agile methodology and pokes some holes in it. He shows some of the weaknesses of time boxing (the “sprint”), estimating,  and the daily scrum and offers up alternatives via Kanban. David reminds us that agile is a set of values, not a set of rules. (Some people using Scrum today don’t like any change, they are so invested in Scrum that they forget that Scrum is about change.)  Scrum forces you to throw out completely your current system and replace it with Scrum. Kanban allows you to keep your existing process and make changes, changes that revolve around communication, WIP, and flow. Kanban will let your current methodology evolve, not complete revolutionize it.

I used a crude, early version of Kanban a few years ago at my startup in New York. (A blog post will come on this next month.) I also used Scrum pretty extensively over the past few years and realize that neither system is perfect. Kanban is more flexible and Scrum (in my opinion) is easier to get estimates to managers who value “deadlines”.  (What managers don’t?) There are strengths and weaknesses of both and David points this out in his book. A few people mix and match and use a “Scrum-ban” system. Personally I have seen the best success with Kanban and doing system maintenance and Scrum for greenfield start-ups with new teams.

If you are practicing any agile methodology or want to improve your current system, read Kanban. It is worth a try, even if you only implement a few ideas from the book.

posted on Tuesday, August 24, 2010 3:35:54 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [0] Trackback
# Monday, August 16, 2010

I will be speaking at my 15th Software Developers Conference in the Netherlands on October 25th and 26th. For some reason the Dutch keep asking me to come back, even though I make fun of the Dutch pretty much full time. The SDC is special for me; the very first international conference that I ever spoke at was the SDC in 1998. I have been back every year (except 2000) and even did a few of the smaller one day conferences. Over the years I have done some crazy things, including showing up for my session after just coming back from the Red Light District in Amsterdam. (Hey what happens in Amsterdam, stays in Amsterdam…) Richard Campbell and I once did a session called “Mid-evening Technical session with Beer.” The abstract said “Bring beer and hear Richard and Steve talk about the latest technology.”


This year I will be doing a Scrum v Kanban v XP v Whatever smack down that will really be a Q&A lead by Remi, Joel, and me. I will also be doing a RIA Services 101 talk, no slides, just demos.  If you are in Europe this fall, swing by.

posted on Monday, August 16, 2010 4:03:16 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [1] Trackback