# Wednesday, 01 September 2010

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system and in Part II we looked at how to architect a data warehouse’s table schema. Today we are going to look at where to put your data warehouse tables.

Usually, as your system matures, the location of your data warehouse follows this pattern:

  • Segmenting your data warehouse tables into their own isolated schema inside of the OLTP database
  • Moving the data warehouse tables to their own physical database
  • Moving the data warehouse database to its own hardware

When you bring a new system online, or start a new BI effort, you can keep things simple by putting your data warehouse tables inside your OLTP database, just segregated from the other tables. You can do this a variety of ways; the easiest is to use a database schema (the same mechanism as dbo), and I usually use dwh as the schema name. This makes it easy for your application to access the warehouse tables as well as fill them, and the advantage is that your data warehouse and OLTP system are self-contained and easy to keep in sync.
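
For example, on SQL Server a minimal sketch of this segregation might look like the following (SalesRankByRegion is the reporting table from Part II):

-- Create a schema to hold the data warehouse tables
CREATE SCHEMA dwh AUTHORIZATION dbo;
GO

-- New warehouse tables get created in dwh from the start,
-- and existing ones can be moved under it
ALTER SCHEMA dwh TRANSFER dbo.SalesRankByRegion;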

As your data warehouse grows, you may want to isolate it further and move it to its own database. This adds a small amount of complexity to the load and synchronization; however, moving the data warehouse tables to their own database brings benefits that make the move worth it. You can implement a separate security scheme, which is very helpful if your OLTP database locks down all of its tables and does not allow SELECT access, and you don't want to create new users and roles just for the data warehouse. You can also implement a separate backup and maintenance plan, so your data warehouse tables, which tend to be larger, don't slow down your OLTP backup (and potential restore!). If you only load data at night, you can even make the data warehouse database read-only. Lastly, while minor, you will have less table clutter, making each database easier to work with.
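
As a hedged sketch of what the move buys you (the SalesDWH database name is made up for this example):

-- The warehouse gets its own backup schedule, independent of the OLTP backups
BACKUP DATABASE SalesDWH TO DISK = N'D:\Backups\SalesDWH.bak';

-- Nightly load: open the database for writes, load, then lock it down again
ALTER DATABASE SalesDWH SET READ_WRITE WITH ROLLBACK IMMEDIATE;
-- ...run the load/synchronization scripts here...
ALTER DATABASE SalesDWH SET READ_ONLY WITH ROLLBACK IMMEDIATE;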

Once your system grows even further, you can isolate the data warehouse onto its own hardware. The benefits of this are huge: you get less I/O contention with the OLTP system on the database server, and depending on your network topology you can reduce network traffic. You can also load the data warehouse server up with more RAM and CPUs. In addition, you can consider different RAID configurations for the OLTP and data warehouse servers to match their workloads (the write-heavy OLTP server generally does better on a mirrored or striped array such as RAID 10, while the read-heavy data warehouse can tolerate the write penalty of RAID 5.)

Once you move your data warehouse to its own database or its own database server, you can also start to replicate it. For example, let's say that your OLTP system is used worldwide, but you have management in offices in different parts of the world. You can reduce network traffic by having all reporting (and what else do managers do??) run on a local network against a local copy of the data warehouse. This only works if you don't have to update the data warehouse more than a few times a day.

Where you put your data warehouse is important. I suggest that you start small and work your way up as the needs dictate.

posted on Wednesday, 01 September 2010 05:39:54 (Eastern Daylight Time, UTC-04:00)  #    Comments [0] Trackback
# Tuesday, 31 August 2010

In Part I we looked at when you should build your data warehouse and concluded that you should build it sooner rather than later, to take advantage of optimized reporting and views. Today we will look at your options for building your data warehouse schema.

When architecting a data warehouse, you have two basic options: build a flat “reporting” table for each operation you are performing, or build with BI/cubes in mind and implement a “star” or “snowflake” schema. Let’s take a quick look at the first option, and then we will look at the star and snowflake schemas.

Whenever the business requests a complex report, developers usually end up slowing down the system with a complex SQL statement or operation. For example, pretend the business wants this report from our order entry (OLTP) system: show me the top ten customers in each market, including their overall rank. You would usually have to build a query involving:

  1. Complex joins to identify unique customers
  2. Rollup the sales
  3. Ranking functions to determine overall rank
  4. Partition functions to segment the rank by country
  5. Standard aggregates to get the sales
  6. Dump all of this to a work table in order to pull out the top 10 (if you don’t do this, you will lose the overall rank)

A typical SQL statement to do steps 1-5 would look like this:

WITH CTETerritory AS
(
   -- Steps 1, 2, and 5: join, group, and aggregate sales by country and customer
   SELECT cr.Name AS CountryName, soh.CustomerID,
          SUM(soh.TotalDue) AS TotalAmt
   FROM Sales.SalesOrderHeader soh
   INNER JOIN Sales.SalesTerritory ter
      ON soh.TerritoryID = ter.TerritoryID
   INNER JOIN Person.CountryRegion cr
      ON cr.CountryRegionCode = ter.CountryRegionCode
   GROUP BY cr.Name, soh.CustomerID
)
-- Steps 3 and 4: rank overall and within each country
SELECT *,
   RANK() OVER (ORDER BY TotalAmt DESC) AS OverallRank,
   RANK() OVER (PARTITION BY CountryName
                ORDER BY TotalAmt DESC, CustomerID DESC) AS NationalRank
FROM CTETerritory

Argh! No wonder developers hate SQL and want to use ORMs! (I challenge the best ORM to make this query!)

Instead you can create a table, let’s call it SalesRankByRegion, with the fields CountryName, CustomerID, TotalSales, OverallRank, and NationalRank, and use the SQL above as part of a synchronization/load script that fills your reporting table on a regular basis. Then your SQL statement for the above query looks like this:

SELECT * FROM SalesRankByRegion
WHERE NationalRank BETWEEN 1 AND 10
ORDER BY CountryName, NationalRank

The results look like:

[Results grid: each country's top ten customers with their total sales, overall rank, and national rank]

That is more like it! A simple select statement is easier for the developer to write, the ORM to map, and the system to execute.
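
For completeness, here is a minimal sketch of the load side of SalesRankByRegion, assuming a simple scheduled truncate-and-reload (the column sizes are illustrative):

CREATE TABLE SalesRankByRegion (
   CountryName nvarchar(50) NOT NULL,
   CustomerID int NOT NULL,
   TotalSales money NOT NULL,
   OverallRank int NOT NULL,
   NationalRank int NOT NULL
)

-- Scheduled load: empty the reporting table, then refill it with the ranking query
TRUNCATE TABLE SalesRankByRegion;

WITH CTETerritory AS
(
   SELECT cr.Name AS CountryName, soh.CustomerID,
          SUM(soh.TotalDue) AS TotalAmt
   FROM Sales.SalesOrderHeader soh
   INNER JOIN Sales.SalesTerritory ter
      ON soh.TerritoryID = ter.TerritoryID
   INNER JOIN Person.CountryRegion cr
      ON cr.CountryRegionCode = ter.CountryRegionCode
   GROUP BY cr.Name, soh.CustomerID
)
INSERT INTO SalesRankByRegion
   (CountryName, CustomerID, TotalSales, OverallRank, NationalRank)
SELECT CountryName, CustomerID, TotalAmt,
   RANK() OVER (ORDER BY TotalAmt DESC),
   RANK() OVER (PARTITION BY CountryName
                ORDER BY TotalAmt DESC, CustomerID DESC)
FROM CTETerritory;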

The SalesRankByRegion table is a vast improvement over having to query all of the OLTP tables (by my count, three tables plus the work table). But while this approach has its appeal, your reporting tables will very quickly start to proliferate.

Your best option is to follow one of the two industry standards for data warehouse tables: a “star” or a “snowflake” schema. Using a schema like this gives you a few advantages. The tables are more generic than SalesRankByRegion, which was built for one query/report, so you can run many different reports off each table. You will also be able to build cubes very easily off a star or snowflake schema, as opposed to off a pile of SalesRankByRegion-style tables.

The design pattern for building true data warehouse tables is to build a “fact” table: a table that contains detail-level (or aggregated) “facts” about something in the real world, like an order or a customer. Inside the fact table you will also have “measures,” numeric values that describe a fact. To support your fact table you will have “dimension” tables. A dimension is a structure that categorizes your data, usually in the form of a hierarchy. A “time” dimension, for example, could have the hierarchy OrderYear, OrderQuarter, OrderMonth, OrderDate, OrderTime.

There are tons of tutorials on the internet that show you how to build a star or snowflake schema and the difference between them, so I will not repeat them. (You may want to start here.) I’ll just give you the high level on a simple star schema.

Let’s say we have an order entry system, such as Northwind (the Microsoft SQL Server sample database). You can have a fact table that revolves around an order. You can then have three (or more) dimension tables that focus on time, product, and salesperson. The time dimension would roll up the order date by year, quarter, month, and date. The product dimension would roll up the product by product and category. (In most systems you would have a much deeper hierarchy for products.) The salesperson dimension would be a roll-up of the employee, the employee’s manager, and the department they work in. The key of each of these dimension tables would then be a foreign key in the fact table, alongside the measures (the numerical data describing the fact.)

There is an example similar to this in Programming SQL Server 2008, a book where I am a co-author. Here is a modified version of that demo:

Dimension tables:

CREATE TABLE [dwh].[DimTime] (
[TimeKey] [int] IDENTITY (1, 1) NOT NULL Primary Key,
[OrderDate] [datetime] NULL ,
[Year] [int] NULL ,
[Quarter] [int] NULL ,
[Month] [int] NULL 
) 

CREATE TABLE [dwh].[DimProduct] (
[ProductID] [int] not null Primary Key,
[ProductName] nvarchar(40) not null,
[UnitPrice] [money] not null,
[CategoryID] [int] not null,
[CategoryName] nvarchar(15) not null
) 

CREATE TABLE [dwh].[DimEmployee] (
EmployeeID int not null Primary Key,
EmployeeName nvarchar(30) not null,
EmployeeTitle nvarchar(30),
ManagerName nvarchar(30)
)

Fact table:

-- ProductID, EmployeeId, ShipperId, and TimeKey are the dimension keys;
-- [Total Sales], [Discount], and [Unit Sales] are the measures
CREATE TABLE [dwh].[FactOrder] (
[PostalCode] [nvarchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[ProductID] [int] NOT NULL ,
[EmployeeId] [int] NOT NULL ,
[ShipperId] [int] NOT NULL ,
[Total Sales] [money] NULL ,
[Discount] [float] NULL ,
[Unit Sales] [int] NULL ,
[TimeKey] [int] NOT NULL 
)
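
To show the payoff, here is a hedged example of one report you could run against this schema; the same join pattern serves many other reports:

-- Sales by year and product category, straight off the star schema
SELECT t.[Year], p.CategoryName, SUM(f.[Total Sales]) AS TotalSales
FROM dwh.FactOrder f
INNER JOIN dwh.DimTime t ON f.TimeKey = t.TimeKey
INNER JOIN dwh.DimProduct p ON f.ProductID = p.ProductID
GROUP BY t.[Year], p.CategoryName
ORDER BY t.[Year], p.CategoryName;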

We have the basis of a star schema. Now we have to fill those tables and keep them up to date. That is a topic for Part III.

posted on Tuesday, 31 August 2010 07:30:42 (Eastern Daylight Time, UTC-04:00)  #    Comments [2] Trackback
# Monday, 30 August 2010

Most developers are scared of “Business Intelligence” or BI. Most think that BI consists of cubes, pivot/drill down apps, and analytical decision support systems. While those are very typical outcomes of a BI effort, many people forget about the first step, the data warehouse.

Typically this is what happens with a BI effort. A system is built, usually a system that deals with transactions; we call this an OLTP, or on-line transaction processing, system. Some time passes, reports are bolted on, and some business analysts build pivot tables from “raw dumps” of data. As the system grows, reports start to slow down, since the system is optimized to deal with one record at a time. Someone, usually a CTO type, says: “we need a BI system.” A development effort is then spent to build a data warehouse, cubes, and some kind of analytical system on top of those cubes.

I make the argument that developers and project planners should embrace the data warehouse up front. When you design your OLTP system, also design the supporting data warehouse, even if you have no intention of building a full-fledged BI application with cubes and the like. This gives you two distinct advantages. First, you have a separate system that is optimized for reporting, which allows the rapid creation of many new reports as well as taking load off the OLTP system. Second, when you do decide to build a BI system based on cubes, you will already have the hard part done: building the data warehouse and the supporting ETL.

Since a data warehouse uses a flatter data model (more on this in Part II), you can even design your application to use both the OLTP database and the data warehouse as data sources. Highly normalized, third normal form transactional tables are never easy to use for reporting and displaying information: they are optimized and indexed to support retrieving and editing (or adding/deleting) one record at a time. When you try to do things in aggregate, you start to stress the system, because it was never designed for that.
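
To make the contrast concrete, here is a hedged sketch (table and column names are borrowed from the SQL Server sample database):

-- The access pattern OLTP tables are optimized for: one record at a time
SELECT *
FROM Sales.SalesOrderHeader
WHERE SalesOrderID = 43659;

-- The aggregate pattern reporting needs, which scans and groups many rows
SELECT CustomerID, SUM(TotalDue) AS TotalSales
FROM Sales.SalesOrderHeader
GROUP BY CustomerID;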

This design pattern is already in use today at many places. Consider your credit card company, for example. I use American Express, and I never see my transactions show up online for at least 24 hours. If I go buy something, phone American Express, and ask “what was my last transaction,” they will tell me right away. If I look online, I will not see that transaction until the next business day. Why? The customer service representative is looking at the OLTP system, pulling up one record at a time. Online, I am looking at the data warehouse, a system optimized for viewing lots of data in a reporting environment.

You can take this to an extreme: if you run an e-commerce site, you can power the product catalog portion of the site with the data warehouse and the purchase (inventory) system with the OLTP model. You optimize the site for browsing (database reads) and at the same time have super-fast purchasing (database writes.) Of course, you have to keep the purchasing/inventory (OLTP) and product display (data warehouse) databases in sync; I’ll talk about that in Part III. Next, I will take a look at how to build the data warehouse.

posted on Monday, 30 August 2010 19:33:02 (Eastern Daylight Time, UTC-04:00)  #    Comments [1] Trackback
# Wednesday, 25 August 2010

A while ago I was asked by the publisher to be a tech editor of A Practical Guide to Distributed Scrum. Since agile luminaries like Ken Schwaber and Scott Ambler were also tech editors, I was honored to be chosen as well. Reviewing this book was a great experience, and I have re-read it since it was published. (Even though I was paid to be a tech editor/reviewer, the publisher sent me a free copy when the book was published. Cool!)

You can learn a lot about using Scrum in a distributed environment from reading this book; it is the gold standard. If you have remote employees, offshore developers, or just a lot of offices where the product owner is in one location and the development team in another, this book is for you. The authors walk you through the process of setting up Scrum in a distributed environment, including planning, user stories, and the daily scrum. They give practical advice on how to deal with the problems specific to distributed Scrum teams, most importantly communication and coordination. The authors are from IBM and show some of the techniques IBM uses with its remote employees, offices, and contractors.

I have been doing Scrum in a distributed environment for almost 5 years now, and I still learned quite a bit by reading this book. I encourage you to read it too.

posted on Wednesday, 25 August 2010 06:50:07 (Eastern Daylight Time, UTC-04:00)  #    Comments [1] Trackback
# Tuesday, 24 August 2010

I recently read Kanban by David J. Anderson. David is credited with implementing some of the first Kanban agile systems at various companies. In Kanban, he gives a great overview of what Kanban is and how it grew out of a physical manufacturing process at Toyota, and offers practical advice on how to implement Kanban at your organization. David also shows you how to set up a Kanban board and provides several ways to model your system and manage the board.

In addition, David walks us through what the Lean movement is and how it relates to agile software development. He makes a very convincing case for tracking work in progress (WIP) and basing your system around it; Kanban attempts to limit WIP for better throughput. David freely admits that there is as yet no actual scientific evidence proving that smaller WIP increases productivity and quality; however, he offers up his own case studies as well as others’.

What I found very helpful is that David reviews the popular Scrum agile methodology and pokes some holes in it. He shows some of the weaknesses of time boxing (the “sprint”), estimating, and the daily scrum, and offers up alternatives via Kanban. David reminds us that agile is a set of values, not a set of rules. (Some people using Scrum today don’t like any change; they are so invested in Scrum that they forget that Scrum is about change.) Scrum forces you to throw out your current system completely and replace it with Scrum. Kanban allows you to keep your existing process and make changes that revolve around communication, WIP, and flow. Kanban will let your current methodology evolve, not completely revolutionize it.

I used a crude, early version of Kanban a few years ago at my startup in New York. (A blog post on this will come next month.) I have also used Scrum pretty extensively over the past few years and realize that neither system is perfect. Kanban is more flexible, while Scrum (in my opinion) makes it easier to give estimates to managers who value “deadlines”. (What managers don’t?) Both have strengths and weaknesses, and David points this out in his book. A few people mix and match and use a “Scrum-ban” system. Personally, I have seen the best success with Kanban for system maintenance and Scrum for greenfield start-ups with new teams.

If you are practicing any agile methodology or want to improve your current system, read Kanban. It is worth a try, even if you only implement a few ideas from the book.

posted on Tuesday, 24 August 2010 03:35:54 (Eastern Daylight Time, UTC-04:00)  #    Comments [0] Trackback
# Monday, 16 August 2010

I will be speaking at my 15th Software Developers Conference in the Netherlands on October 25th and 26th. For some reason the Dutch keep asking me to come back, even though I make fun of the Dutch pretty much full time. The SDC is special for me; the very first international conference that I ever spoke at was the SDC in 1998. I have been back every year since (except 2000) and have even done a few of the smaller one-day conferences. Over the years I have done some crazy things, including showing up for my session just after coming back from the Red Light District in Amsterdam. (Hey, what happens in Amsterdam stays in Amsterdam…) Richard Campbell and I once did a session called “Mid-evening Technical Session with Beer.” The abstract said “Bring beer and hear Richard and Steve talk about the latest technology.”

This year I will be doing a Scrum v Kanban v XP v Whatever smackdown that will really be a Q&A led by Remi, Joel, and me. I will also be doing a RIA Services 101 talk: no slides, just demos. If you are in Europe this fall, swing by.

posted on Monday, 16 August 2010 04:03:16 (Eastern Daylight Time, UTC-04:00)  #    Comments [1] Trackback
# Friday, 13 August 2010

Thursday, August 19, 2010
Building Windows Phone 7 Games in 3D with XNA Game Studio 4.0

Subject: 
You must register at https://www.clicktoattend.com/invitation.aspx?code=149726 in order to be admitted to the building and attend.
Why would you be forced to buy a Mac and learn yet another language to write mobile games? The truth is you can reuse your finely honed .NET and C# skills to write games that will run on Windows, Xbox 360, and the hot new kid on the block: Windows Phone 7. Enter XNA Game Studio 4.0. Join ActiveNick in this session as your fast track to the world of mobile game development, where we jump right away into the fun stuff. We’ll go through a quick recap of XNA Game Studio and dive right in. No, we won’t be building any Atari 2600-style 2D games; let’s mess around with the cool 3D stuff. We’ll cover designing games for mobile phones, adapting desktop & console XNA code for Windows Phone 7, tapping into the phone hardware, and working with media assets and the Content Processing Pipeline, and we’ll basically cover as much demo code as the evening will allow. Forget SharePoint and Entity Framework; this is the kind of coding you signed up for when you decided to go pro as a coding geek.

Speaker: 
Nickolas Landry, Infusion
Nickolas Landry is Practice Manager in New York for Infusion Development, a Microsoft Gold Partner which offers quality software development services, developer training, and consulting services for large corporations and agencies in North America, the UK, and Dubai (www.infusion.com). Known for his dynamic and engaging style, he is a frequent speaker at major software development conferences worldwide, a member of the INETA and MSDN Canada Speakers Bureaus, and a 6-year Microsoft MVP on Device Application Development. With over 18 years of professional experience, a software architect by trade, and a career almost entirely dedicated to Microsoft technologies, Nick specializes in .NET mobility, Bing Maps & Location Intelligence, High-Performance Computing (HPC), game development with XNA, and Smart Clients. He has written multiple articles for CoDe Magazine and several .NET mobility courses for Microsoft, has been a technical editor for many books, and holds several professional certifications from Microsoft and IBM. www.twitter.com/ActiveNick

Date: 
Thursday, August 19, 2010

Time: 
Reception 6:00 PM, Program 6:15 PM

Location:  
Microsoft, 1290 Avenue of the Americas (the AXA building - bet. 51st/52nd Sts.), 6th floor

Directions:
B/D/F/V to 47th-50th Sts./Rockefeller Ctr
1 to 50th St./Bway
N/R/W to 49th St./7th Ave.

posted on Friday, 13 August 2010 03:42:24 (Eastern Daylight Time, UTC-04:00)  #    Comments [0] Trackback
# Thursday, 12 August 2010

This coming October, I will be speaking at DevReach in Sofia, Bulgaria. DevReach is a great event and will be entering its 5th year. It is a two-day event with A-list speakers (excluding myself, of course). World-famous Scott Stanfield is the keynote speaker this year, and there will be some great BI content presented by Andrew Brust. Joel, Remi, and I will be leading a Scrum/Agile/Kanban/Scrum-but “smackdown” talk/discussion. At only 200 euros, it is the best bargain in Europe! You can register here.

I have spoken at all previous DevReach events and will keep speaking there until they tell me they don’t want me anymore. DevReach is special to me: at the first DevReach, I was able to play a very small role in helping conference founder Martin Kulov recruit some speakers and plan the event. It was also at that first DevReach that I first met my current employer, Telerik.

I liked it so much, I stayed. ;) Watch out, it could happen to you….

posted on Thursday, 12 August 2010 06:41:58 (Eastern Daylight Time, UTC-04:00)  #    Comments [0] Trackback