# Sunday, September 12, 2010

On Monday I am leaving for a 3-week trek to visit the Hillary School in Khumjung, Nepal, as well as Everest Base Camp. The goal of this trip is to raise money for the Education Elevated charity that I have been working with (see last year’s school work project write-up here.)

We’ve raised a ton of money so far, but you can still donate here. For those of you who have donated already, thanks!

See you when I am off the mountain. ;)

posted on Sunday, September 12, 2010 11:57:10 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [2] Trackback
# Saturday, September 11, 2010

If you use Telerik software, why not tell the world how much you like it by voting in the 2010 DevProConnections Community Choice Awards? Voting is open now, and Telerik is nominated in 16 of the 26 total categories! While it’s great to take home awards like our recent Best of TechEd 2010 trophy, there is something unique about letting the voters on the Interwebs determine your fate.

What do you need to do? Vote today!

And then remind your co-workers, Twitter followers, Facebook friends, LinkedIn connections, DNUG members, and anyone else you see to vote in this year’s awards, too. To help you successfully cast your ballot for Telerik’s nominations this year, here is a quick voting guide listing each category and the Telerik product nominated in it:

  • Add-In (Telerik JustCode)
  • Charting & Graphics Tool (Telerik RadChart for ASP.NET AJAX)
  • Community Resource (Telerik’s Community)
  • Component Set (Telerik RadControls for ASP.NET AJAX)
  • Content Management System (Telerik Sitefinity Web CMS)
  • Grid (Telerik RadGrid for ASP.NET AJAX)
  • Navigation Control (Telerik RadMenu for ASP.NET AJAX)
  • Online Editor (Telerik RadEditor for ASP.NET AJAX)
  • Printing/Reporting Tool (Telerik Reporting)
  • Project Management/Defect Tracking (Telerik TeamPulse)
  • Scheduling/Calendar Tool (Telerik RadScheduler for ASP.NET AJAX)
  • Silverlight Product (Telerik RadControls for Silverlight)
  • Testing/QA Tool (Telerik WebUI Test Studio)
  • Training (Telerik Training)
  • Utility (Telerik OpenAccess ORM)
  • Free Tool (Telerik OpenAccess ORM Express)

Vote in the DevProConnections Community Choice Awards. (Voting closes September 21st.)

posted on Saturday, September 11, 2010 9:50:35 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [1] Trackback
# Friday, September 10, 2010

On Saturday, November 6th, 2010, from 8:00 AM until 6:00 PM, the Fairfield / Westchester developer community will be holding their 4th annual Code Camp! The event will be hosted by CITI, a unit of The University of Connecticut School of Business, on the Stamford, CT Campus.

The continuing goal of the Code Camp series is to provide an intensive developer-to-developer learning experience that is fun and technically stimulating. The primary focus is on delivering programming information and sample code that can be put to practical use. The event is free.

This is an event by the developer community, for the developer community. The content is original and developed by you. Let's work together to make this event a success.

To apply for a speaking slot, please first register as a speaker here: http://bit.ly/fwccspeaker.

Then, using the email address you registered with on the speaker page, add as many abstracts as you’d like here: http://bit.ly/fwccsession. You can submit abstracts on anything related to .NET development.

posted on Friday, September 10, 2010 4:12:51 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [0] Trackback
# Monday, September 6, 2010

Thursday, September 16, 2010
Paul Sheriff: Unit Testing Basics in Visual Studio

Subject:
You must register at https://www.clicktoattend.com/invitation.aspx?code=149443 in order to be admitted to the building and attend.
Everyone knows that they should be writing better test cases for their applications, but how many of us really do it? In Visual Studio, unit testing is an integrated part of the development environment, so there is no longer any reason to avoid test-driven development and automated unit testing. In this seminar you will learn how to architect your applications to make testing quicker and easier, and how to use the tools in Visual Studio to help you do the testing.
You will learn:
1. How to architect for test-driven development
2. How to create test cases
3. How to use the Visual Studio unit testing tools

Speaker:
Paul D. Sheriff is the President of PDSA, Inc. (www.pdsa.com), a Microsoft Partner in Southern California. Paul acts as the Microsoft Regional Director for Southern California, assisting the local Microsoft offices with several of their events each year and serving as an evangelist for them. Paul has authored literally hundreds of books, webcasts, videos, and articles on .NET, WPF, Silverlight, and SQL Server. Paul can be reached via email at PSheriff@pdsa.com. Check out Paul's new code generator 'Haystack' at www.CodeHaystack.com.

Date:
Thursday, September 16, 2010

Time:
Reception 6:00 PM, Program 6:15 PM

Location: 
Microsoft, 1290 Avenue of the Americas (the AXA building - bet. 51st/52nd Sts.), 6th floor

Directions:
B/D/F/V to 47th-50th Sts./Rockefeller Ctr
1 to 50th St./Bway
N/R/W to 49th St./7th Ave.

posted on Monday, September 6, 2010 10:03:29 PM (Eastern Daylight Time, UTC-04:00)  #    Comments [0] Trackback
# Friday, September 3, 2010

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system, and in Part II we looked at how to architect a data warehouse’s table schema. In Part III we looked at where to put the data warehouse tables, and in Part IV we looked at how to populate those tables and keep them in sync with your OLTP system. Today, in the last part of this series, we will take a quick look at the benefits of building the data warehouse before we need it for cubes and BI by exploring our reporting and other options.

As I said in Part I, you should plan on building your data warehouse when you architect your system up front. Doing so gives you a platform for building reports, or even applications such as web sites, off the aggregated data. As I mentioned in Part II, it is much easier to build a query and a report against the rolled-up tables than against the OLTP tables.

To demonstrate, I will make a quick PivotTable using SQL Server 2008 R2 PowerPivot for Excel (or just PowerPivot for short!). I have shown how to use PowerPivot before on this blog; however, I was usually going against a SQL Server table, a SQL Azure table, or an OData feed. Today we will use a SQL Server table, but rather than build against the OLTP data of Northwind, we will use our new rolled-up fact table.

To get started, I will open up PowerPivot and import data from the data warehouse I created in Part II. I will pull in the Time, Employee, and Product dimension tables as well as the Fact table.

[Screenshot: importing the dimension and fact tables into PowerPivot]

Once the data is loaded into PowerPivot, I am going to launch a new PivotTable.

[Screenshot: creating a new PivotTable from the PowerPivot window]

PowerPivot understands the relationships between the dimension and fact tables and places the tables in the designer shown below. I am going to drag some fields into the boxes on the PowerPivot designer to build a powerful and interactive PivotTable. For rows I will choose the category and product hierarchy and sum on the total sales. I’ll make the columns (or pivot on this field) the month from the Time dimension to get a sum of sales by category/product by month. I will also drag Year and Quarter into my vertical and horizontal slicers for interactive filtering. Lastly, I will place the Employee field in the Report Filter pane, giving the user the ability to filter by employee.

[Screenshot: the PivotTable field layout in the PowerPivot designer]

The results look like this; I am dynamically filtering by 1997, the third quarter, and the employee Janet Leverling.

[Screenshot: the resulting PivotTable, filtered by year, quarter, and employee]

This is a pretty powerful interactive report built in PowerPivot using the four data warehouse tables. If there were no data warehouse, this PivotTable would have been very hard for an end user to build. Either they or a developer would have had to perform joins to get the category and product hierarchy, as well as more joins to get the order details and the sum of the sales. In addition, the breakout and dynamic filtering by Year and Quarter, and the display by month, are only possible because of the DimTime table, so if there were no data warehouse tables, the user would have had to parse out those DateParts. Just about the only thing the end user could have done without assistance from a developer or a sophisticated query is the employee filter (and even that would have taken some PowerPivot magic to display the employee name, unless the user did a join.)
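
To see why the star schema makes this so easy, here is a minimal sketch of the same rollup written directly against the data warehouse tables from Part II (the table and column names come from that post; the filter values simply mirror the PivotTable above):

SELECT p.CategoryName, p.ProductName, t.[Month],
       SUM(f.[Total Sales]) AS TotalSales
FROM dwh.FactOrder f
   INNER JOIN dwh.DimProduct p ON p.ProductID = f.ProductID
   INNER JOIN dwh.DimTime t ON t.TimeKey = f.TimeKey
   INNER JOIN dwh.DimEmployee e ON e.EmployeeID = f.EmployeeID
WHERE t.[Year] = 1997
  AND t.[Quarter] = 3
  AND e.EmployeeName = N'Janet Leverling'
GROUP BY p.CategoryName, p.ProductName, t.[Month]
ORDER BY p.CategoryName, p.ProductName, t.[Month]

Three simple joins and one GROUP BY; against the raw Northwind OLTP tables the same report would need joins across orders, order details, products, categories, and employees, plus DATEPART logic for the year, quarter, and month buckets.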

Of course, PivotTables are not the only thing you can create from the data warehouse tables: you can also create reports, ad hoc query builders, web pages, and even an Amazon-style browse application. (Amazon uses its data warehouse to display inventory and its OLTP system to take your order.)

I hope you have enjoyed this series. Enjoy your data warehousing!

posted on Friday, September 3, 2010 4:14:33 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [1] Trackback
# Thursday, September 2, 2010

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system and in Part II we looked at how to architect a data warehouse’s table schema. In Part III, we looked at where to put the data warehouse tables. Today we are going to look at how to populate those tables and keep them in sync with your OLTP system.

No matter where your data warehouse is located, the biggest challenge with a data warehouse, especially one you are going to do real-time reporting off of, is that the data has to be published from the transactional (OLTP) system. By definition, the data warehouse is out of date compared to the OLTP system.

Usually when you ask the boss, “How much latency can you accept between the OLTP and data warehouse systems?” the boss will reply: none. While that is impossible, the more time and money you have to develop said system, the closer to real time you can get. Always bring up the American Express example used in Part I for proof that your system can accept at least some latency.

If you remember the code examples from Part II, most of the time you have to query the OLTP data, aggregate it, and then load it into your new schema. This is the process of extracting, transforming, and loading (ETL) data from the OLTP system into the data warehouse. The workflow usually goes like this: at a pre-set time (on every change, hourly, nightly, or weekly), query the OLTP data (extract), aggregate and flatten it out (transform), and then copy the transformed data to the data warehouse (star or snowflake) tables (load).

You have several options, ranging from near-real-time ETL (very complex and expensive) to nightly data dumps (least complex and least expensive). You can also perform a full ETL (wipe out the data warehouse tables completely and refill them on each operation) or a partial ETL (only add incremental data based on a time series). The actual load can be done via anything from database triggers on each add/update/delete for near real time, to simple scheduled SQL batches that wipe the data and run your load scripts. Most likely you will want to take a hybrid approach, depending on the maturity of your system. You have three basic options, ranging from least complex to most complex:

  • Direct database dump
  • ETL tool
  • Database triggers

When you have a long time series on which to publish your data, say nightly or weekly, you can do a direct database dump. The process is pretty straightforward: at a regular interval (or manually), a process starts that queries the OLTP database, performs all of the aggregations, etc., loads the results into a staging data warehouse database, and then wipes out the production data warehouse and loads the data in.
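
As a rough sketch, assuming the staging copy lives in a database called StagingDW (a hypothetical name) and uses the dwh.FactOrder table from Part II, a nightly full reload could be as simple as:

-- Hypothetical nightly full reload: wipe the production fact table and
-- refill it from the staging copy (table and column names from Part II).
BEGIN TRANSACTION;

TRUNCATE TABLE dwh.FactOrder;

INSERT INTO dwh.FactOrder
    (PostalCode, ProductID, EmployeeID, ShipperID,
     [Total Sales], Discount, [Unit Sales], TimeKey)
SELECT PostalCode, ProductID, EmployeeID, ShipperID,
       [Total Sales], Discount, [Unit Sales], TimeKey
FROM StagingDW.dwh.FactOrder;

COMMIT TRANSACTION;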

Another option is to use an ETL tool. A good example is SQL Server Integration Services (SSIS) if you are using Microsoft SQL Server. (Actually, SSIS will work with multiple databases; you just need a SQL Server host.) A modern ETL tool will give you the ability to segment the work into logical groups, drive the control flow based on the success or failure of a condition, and roll back when something goes wrong.

[Screenshot: an SSIS package control flow]

A typical workflow with an ETL tool is that the ETL runs on a schedule (or based on a condition, such as a message arriving from a queue or a record being written to an admin table) and has one or more parameters passed to it. This parameter is usually a time series, and the ETL will perform all of the extraction on data from the OLTP database filtered by that parameter. An ETL tool is the most likely solution you will employ, especially if you have to make frequent updates from your OLTP system to your data warehouse.
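
For example, a partial extraction step inside the ETL might be nothing more than a query against the Northwind OLTP tables filtered by the time-series parameter (the @LastLoadDate variable and its value here are hypothetical; in a real package it would be passed in by the schedule):

-- Hypothetical incremental extraction: pull only the orders placed on or
-- after the last load date, to be transformed and loaded downstream.
DECLARE @LastLoadDate datetime = '2010-08-31';

SELECT o.OrderID, o.OrderDate, o.ShipPostalCode,
       od.ProductID, o.EmployeeID, o.ShipVia,
       od.UnitPrice, od.Quantity, od.Discount
FROM dbo.Orders o
   INNER JOIN dbo.[Order Details] od ON od.OrderID = o.OrderID
WHERE o.OrderDate >= @LastLoadDate;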

Another option is to use database triggers. For those of you that don’t know a lot about triggers, well, they can lead to evil. ;) That said, they are events that fire when data changes, and you can write SQL code, even ETL code, to run when that happens. Triggers are hard to debug and difficult to maintain, so I would only suggest using a trigger when you need “real time” updates to your data warehouse, and even then the trigger should only write a record into an admin table that your ETL process polls to get started.
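
A minimal sketch of that pattern might look like the following; the dwh.EtlQueue admin table, the trigger name, and the Northwind Orders table as the source are all illustrative assumptions:

-- Hypothetical admin table the ETL process polls.
CREATE TABLE dwh.EtlQueue (
    QueueID    int IDENTITY(1,1) PRIMARY KEY,
    TableName  sysname NOT NULL,
    ChangedOn  datetime NOT NULL DEFAULT GETDATE(),
    Processed  bit NOT NULL DEFAULT 0
);
GO

-- The trigger does no ETL work itself; it only records that something changed.
CREATE TRIGGER dbo.trgOrders_NotifyEtl
ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dwh.EtlQueue (TableName) VALUES (N'Orders');
END;
GO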

A common design pattern with a sophisticated ETL is to do all of the ETL into a staging data warehouse and have a check mechanism verify that the ETL was successful. Once this check is performed (either by computer or by human, depending on the system), you can push the data from the staging tables to the production tables. There is a SQL operator specifically for this process: the MERGE operator. MERGE compares two tables on the join keys you specify and performs the inserts, updates, and deletes needed to keep them in sync. Here is how we would do that with a MERGE statement and our fact table from Part II.

--MERGE the local (staging) fact table into the remote (production) fact table.
--Note: this version of FactOrder also carries an OrderDate column, a slight
--variation on the Part II definition.
MERGE dbo.FactOrder AS t --target
USING Northwind.dwh.FactOrder AS s --source
   ON t.ProductID = s.ProductID AND
      t.EmployeeID = s.EmployeeID AND
      t.ShipperID = s.ShipperID AND
      t.TimeKey = s.TimeKey --joining keys

--record exists in target, but the data is different
WHEN MATCHED AND (s.OrderDate != t.OrderDate OR
   s.PostalCode != t.PostalCode OR
   s.[Total Sales] != t.[Total Sales] OR
   s.Discount != t.Discount OR
   s.[Unit Sales] != t.[Unit Sales]) THEN
       UPDATE SET t.OrderDate = s.OrderDate,
       t.PostalCode = s.PostalCode,
       t.[Total Sales] = s.[Total Sales],
       t.Discount = s.Discount,
       t.[Unit Sales] = s.[Unit Sales]

WHEN NOT MATCHED BY SOURCE THEN
   --record only exists in target, get rid of it
   DELETE

WHEN NOT MATCHED BY TARGET THEN
   --record only exists in source, so add it
   INSERT (OrderDate, PostalCode, ProductID, EmployeeID, ShipperID,
           [Total Sales], Discount, [Unit Sales], TimeKey)
   VALUES (s.OrderDate, s.PostalCode, s.ProductID, s.EmployeeID,
           s.ShipperID, s.[Total Sales], s.Discount, s.[Unit Sales],
           s.TimeKey); --the terminating semicolon is required!

Tomorrow we will wrap up with the application development options.

posted on Thursday, September 2, 2010 5:58:03 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [1] Trackback
# Wednesday, September 1, 2010

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system and in Part II we looked at how to architect a data warehouse’s table schema. Today we are going to look at where to put your data warehouse tables.

Let’s look at the location of your data warehouse. Usually as your system matures, it follows this pattern:

  • Segmenting your data warehouse tables into their own isolated schema inside of the OLTP database
  • Moving the data warehouse tables to their own physical database
  • Moving the data warehouse database to its own hardware

When you bring a new system online, or start a new BI effort, you can keep things simple by putting your data warehouse tables inside of your OLTP database, just segregated from the other tables. You can do this in a variety of ways; the easiest is to use a separate database schema (instead of the default dbo, I usually use dwh as the schema). This way it is easy for your application to access these tables, as well as to fill them and keep them in sync. The advantage of this approach is that your data warehouse and OLTP system are self-contained, which makes keeping them in sync easy.
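
For example, creating the schema (the dwh name simply follows the convention used throughout this series) is a one-liner, and the warehouse tables then hang off of it:

-- Create a separate schema for the warehouse tables inside the OLTP database.
CREATE SCHEMA dwh AUTHORIZATION dbo;
GO
-- The warehouse tables from Part II (dwh.DimTime, dwh.DimProduct,
-- dwh.DimEmployee, dwh.FactOrder) then live here, segregated from the
-- OLTP tables in the dbo schema.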

As your data warehouse grows, you may want to isolate it further and move it to its own database. This adds a small amount of complexity to the load and synchronization; however, moving the data warehouse tables to their own database brings some benefits that make the move worth it. One benefit is that you can implement a separate security scheme. This is also very helpful if your OLTP database locks down all of its tables and will not allow SELECT access, and you don’t want to create new users and roles just for the data warehouse. In addition, you can implement a separate backup and maintenance plan, so your data warehouse tables, which tend to be larger, don’t slow down your OLTP backup (and a potential restore!). If you only load data at night, you can even make the data warehouse database read-only. Lastly, while minor, you will have less table clutter, making the databases easier to work with.
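
As a small illustration of the read-only idea, assuming the warehouse has been moved into its own database named NorthwindDW (a hypothetical name), the nightly load job could flip it back and forth like this:

-- After the nightly load completes, lock the warehouse down as read-only.
ALTER DATABASE NorthwindDW SET READ_ONLY WITH ROLLBACK IMMEDIATE;

-- Just before the next load window, make it writable again.
ALTER DATABASE NorthwindDW SET READ_WRITE WITH ROLLBACK IMMEDIATE;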

Once your system grows even further, you can isolate the data warehouse onto its own hardware. The benefits of this are huge: you get less I/O contention with the OLTP system on the database server, and depending on your network topology, you can reduce network traffic. You can also load up on more RAM and CPUs. In addition, you can consider different RAID array techniques for the OLTP and data warehouse servers (OLTP would be better with RAID 5, the data warehouse with RAID 1.)

Once you move your data warehouse to its own database or its own database server, you can also start to replicate the data warehouse. For example, let’s say that your OLTP system works worldwide but you have management in offices in different parts of the world. You can reduce network traffic by having all reporting (and what else do managers do??) run on a local network against a local data warehouse. This only works if you don’t have to update the data warehouse more than a few times a day.

Where you put your data warehouse is important; I suggest that you start small and work your way up as your needs dictate.

posted on Wednesday, September 1, 2010 5:39:54 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [0] Trackback
# Tuesday, August 31, 2010

In Part I we looked at when you should build your data warehouse and concluded that you should build it sooner rather than later to take advantage of reporting and view optimization. Today we will look at your options for building your data warehouse schema.

When architecting a data warehouse, you have two basic options: build a flat “reporting” table for each operation you are performing, or build with BI/cubes in mind and implement a “star” or “snowflake” schema. Let’s take a quick look at the first option and then we will take a look at the star and snowflake schemas.

Whenever the business requests a complex report, developers usually slow down the system with a complex SQL statement or operation. For example, pretend that in our order entry system (OLTP) the business wants a report that says: show me the top ten customers in each market, including their overall rank. You would usually have to do something like this:

  1. Complex joins for unique customer
  2. Rollup the sales
  3. Ranking functions to determine overall rank
  4. Partition functions to segment the rank by country
  5. Standard aggregates to get the sales
  6. Dump all of this to a work table in order to pull out the top 10 (if you don’t do this, you will lose the overall rank)

A typical SQL statement to do steps 1-5 would look like this:

WITH CTETerritory
AS
(
   SELECT cr.Name AS CountryName, CustomerID,
          SUM(TotalDue) AS TotalAmt
   FROM Sales.SalesOrderHeader soh
      INNER JOIN Sales.SalesTerritory ter
         ON soh.TerritoryID = ter.TerritoryID
      INNER JOIN Person.CountryRegion cr
         ON cr.CountryRegionCode = ter.CountryRegionCode
   GROUP BY cr.Name, CustomerID
)
SELECT *,
   RANK() OVER (ORDER BY TotalAmt DESC) AS OverallRank,
   RANK() OVER (PARTITION BY CountryName
                ORDER BY TotalAmt DESC, CustomerID DESC) AS NationalRank
FROM CTETerritory

Argh! No wonder developers hate SQL and want to use ORMs! (I challenge the best ORM to make this query!)

Instead you can create a table, let’s call it SalesRankByRegion, with the fields CountryName, CustomerID, TotalSales, OverallRank, and NationalRank, and use the above SQL as part of a synchronization/load script to fill your reporting table on a regular basis. Then your SQL statement for the above query looks like this:

SELECT * FROM SalesRankByRegion
WHERE NationalRank BETWEEN 1 AND 10
ORDER BY CountryName, NationalRank

The results look like:

[Screenshot: the query results from SalesRankByRegion]

That is more like it! A simple select statement is easier for the developer to write, the ORM to map, and the system to execute.
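
For reference, here is a minimal sketch of what the SalesRankByRegion reporting table itself might look like (the column names come from the list above; the data types are assumptions):

CREATE TABLE dbo.SalesRankByRegion (
    CountryName   nvarchar(50) NOT NULL,
    CustomerID    int NOT NULL,
    TotalSales    money NOT NULL,
    OverallRank   int NOT NULL,
    NationalRank  int NOT NULL
);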

The SalesRankByRegion table is a vast improvement over having to query all of the OLTP tables (by my count there are three tables plus the temp table). While this approach has its appeal, your reporting tables will start to proliferate very quickly.

Your best option is to follow one of the two industry standards for data warehouse tables: a “star” or a “snowflake” schema. Using a schema like this gives you a few advantages. These schemas are more generic than SalesRankByRegion, which was a table built for one query/report, giving you the ability to run many different reports off each table. Another advantage is that you will have the ability to build cubes very easily off of a star or snowflake schema, as opposed to off a bunch of SalesRankByRegion-style tables.

The design pattern for building true data warehouse tables is to build a “fact” table: a table that contains detail-level (or aggregated) “facts” about something in the real world, like an order or a customer. Inside of the fact table you will also have “measures,” numeric values that represent a fact. To support your fact table you will have “dimension” tables. A dimension is a structure that categorizes your data, usually in the form of a hierarchy. A dimension table, for example, could be “time,” with a hierarchy of OrderYear, OrderQuarter, OrderMonth, OrderDate, OrderTime.

There are tons of tutorials on the internet that show you how to build a star or snowflake schema and the difference between them, so I will not repeat them here. (You may want to start here.) I’ll give you the high level on a simple star schema here.

Let’s say we have an order entry system, such as Northwind (the Microsoft SQL Server sample database). You can have a fact table that revolves around an order. You can then have three (or more) dimension tables that focus on: time, product, and salesperson. The time dimension would roll up the order date by year, quarter, month, and date. The product dimension would roll up the product by product and category. (In most systems you would have a much deeper hierarchy for products.) The salesperson dimension would be a roll-up of the employee, the employee’s manager, and the department they work in. The key of each of these dimension tables would then be a foreign key in the fact table, along with the measures (the numerical data describing the fact).

There is an example similar to this in Programming SQL Server 2008, a book where I am a co-author. Here is a modified version of that demo:

Dimension tables:

CREATE TABLE [dwh].[DimTime] (
    [TimeKey] [int] IDENTITY (1, 1) NOT NULL PRIMARY KEY,
    [OrderDate] [datetime] NULL,
    [Year] [int] NULL,
    [Quarter] [int] NULL,
    [Month] [int] NULL
)

CREATE TABLE [dwh].[DimProduct] (
    [ProductID] [int] NOT NULL PRIMARY KEY,
    [ProductName] [nvarchar](40) NOT NULL,
    [UnitPrice] [money] NOT NULL,
    [CategoryID] [int] NOT NULL,
    [CategoryName] [nvarchar](15) NOT NULL
)

CREATE TABLE [dwh].[DimEmployee] (
    [EmployeeID] [int] NOT NULL PRIMARY KEY,
    [EmployeeName] [nvarchar](30) NOT NULL,
    [EmployeeTitle] [nvarchar](30),
    [ManagerName] [nvarchar](30)
)

Fact table:

CREATE TABLE [dwh].[FactOrder] (
    [PostalCode] [nvarchar](10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [ProductID] [int] NOT NULL,
    [EmployeeID] [int] NOT NULL,
    [ShipperID] [int] NOT NULL,
    [Total Sales] [money] NULL,
    [Discount] [float] NULL,
    [Unit Sales] [int] NULL,
    [TimeKey] [int] NOT NULL
)

We have the basis of a star schema. Now we have to fill those tables and keep them up to date. That is a topic for Part III.

posted on Tuesday, August 31, 2010 7:30:42 AM (Eastern Daylight Time, UTC-04:00)  #    Comments [2] Trackback