Power Testing ETL with Power BI – The Process

On November 11, 2014July 15, 2019 By SteveIn Excel, Integration, Office 2013, Power BI, Power Pivot, Power Query, Power Testing with Power BI, Pragmatic Works, SSIS, Syndicated, Visual Studio2 Comments

This is a short blog series on using Power BI tools to support testing ETL processes. I have presented on this subject at few SQL Saturdays over the past few years and am finally succumbing to multiple request to turn it into a blog post. Realizing the amount of content is more than I typically would put into a single post, I will be putting together this short series to cover the material. The first post is this one. It will walk through the entire process at a high level. I will follow this post with a deeper look at Power Query’s role in the process. The third post will cover Power Pivot and building out test cases. Finally, we will wrap the series up with some visualization ideas for Excel and Power View. You can find all the posts as they come online here. Let’s get started.

The Problem Area

Why use Power BI to test ETL? While working as the architect on an ETL project for moving data from third party web service to an on-premise financial solution, we needed to put together a testing strategy that could be implemented by non-developers on the project. Our situation was that our project was “too small” to engage our QA team but the requirement for reusable testing needed to be fulfilled. Our project team consisted of a BI architect (that would be me), an ETL developer, and a business analyst (Chuck Whittemore).

NOTE: We are testing the data transformations and data load. This is not intended for auditing or performance. There are other tools for reviewing those including the built in reporting in SSIS and Pragmatic Works’ BI xPress tool. If you are tracking whether a package fails or succeeds, you should use either of these options not this process.

The Big Idea

The BA and I were discussing options for testing and we theorized that we could use a new add-in for Excel (Power Query, still in preview at the time) with Power Pivot to build out tests. The key to success on this project is that we needed to be able to test with non-developer tools, no SQL Server Management Studio or SSIS could be involved in the testing. The primary reason for this is that he would be doing the testing. We also did not want to recreate every step in the ETL process the same way. So, time to put theory into practice. We determined that we would create test cases in Visual Studio then build out tests to match those cases in Excel using the Power BI add-ins. He would do the work in Excel and we, the developer and I, would provide technical support as needed.

The Recommended Tools

Before we dig into the process, I want to lay out the tools used for development and for testing. While this solution can use other tools, it is important to know what we used in practice to create our solution.

ETL Development Tools

The ETL development was done using SQL Server Integration Services (SSIS). At the time, we needed to use Script tasks to consume the web service content. The financial system used a custom load process that we dumped formatted data into a file for the system to pick up and load.

In the examples, I use in the presentations and will lay out here, I will be using a text file to SQL Server implementation. While complex ETL problems are common and hard to test, this simplified version is easier to follow in examples. You should be able to apply the principles used here to test any solution.

Testing Tools

The testing development for the referenced project consisted of Excel with Power Query and Power Pivot. Power Query was in preview at the time, so we had some of the performance issues and early bugs to work through. None of these issues, prevented us from completing the project.

The presentation solution relies on the latest version of Power Query (which changes every month) and Power Pivot in Excel 2013. Most of the examples are easy to follow, but you should be able to solve most transformation tests with the combination of Power Query and Power Pivot. Definitely do not discount the capabilities of Power Query and the fact that new functionality is being added each month.

Team Foundation Server/Visual Studio Online

Both projects use the online version of TFS. If you are currently not using a source control and work tracking solution, I highly recommend you look at the online version of TFS. It will allow you up to 5 users free and give you ability to use source control, create test plans, create test cases, log bugs and track changes. These are key features necessary to complete a good solution that can be managed and tracked.

The Process

I am going to walk through my demo to build out the process steps. This will allow you to see examples. I will call out any thing of relevance related to the project here as well.

1. Business Rules

The first part of any project, especially in ETL, is to understand the business rules. If you are working with a data warehouse project, this may be fairly well documented in a dimensional model. In both of our cases here, we are moving data from one system to another. The transformations and business rules are primarily driven by the target system. Here are some examples of business rules in the media library sample project.

Author names are stored in separate columns – FirstName and LastName
If an author’s name include a middle name or initial or some variation, this combination should be stored in the first name column. For example, J.R.R. Tolkien would be stored as follows:
— FirstName: J.R.R.
— LastName: Tolkien
Copyright year should be stored as a 4 digit value
Page numbers should not exceed 1000

Every project has some type of business rules. It is hard to build out transformations and create test cases without these rules.

2. Source to Target Map

This is the single most important document for the tester. It tells the tester how the developer is getting from source to destination and what type of data massaging needs to be handled. Typically, people use some variation of the example created by the Kimball Group over the years.

3. Developing SSIS

The developer begins the process of creating the SSIS package. He will be using the Source to Target Map as his guide and will update that document to handle special cases in the data as needed. Ideally he is working in a development environment that will allow for test build outs as well.

4. Creating Test Plans and Test Cases

The tester creates test plans and test cases in TFS. These tests are based on business rules and the source to target map. Depending on both the complexity of the solution and the time to develop, some test cases could be did the table move the correct data field for field and row count. This method can be particularly useful when working with large tables or simple data flows. However, you should have a test case for every transformation that massages the data. This will insure that the data is being transformed as expected.

Keep in mind, this solution will support test cases for each field in a data load if required. The tester and architect should evaluate what is the appropriate amount of coverage to guarantee the highest level of quality in the data transform. As always, there is a diminishing rate of return if you “test everything” at the lowest level. It will be expensive in terms of cost of development when the chance for error is minimal. It will also take substantially longer to test everything. You need to understand and be able to articulate how the testing was accomplished and your level of confidence in the results.

5. Building the Tests

This is the most extensive part of the process besides the SSIS development. I will not go into all the details here, but will walk through the overall process and principles. I will provide detailed examples in the follow up posts as noted above.

Let’s start with the end result. Chuck and I were able to determine that we could use DAX to create comparative formulas on data that could be brought into Power Pivot from both the source and the destination. Essentially, we wanted to use math to determine the results of the tests. So in our example, we use a formula like “if Source.CopyrightYear = Destination.CopyrightYear, then it passes, else it fails.” Depending on how you want to measure, pass could be 0 or 1. Then we add the values up to determine if data passed or failed the test. We can even tell you failure rates.

In order to get the data in a comparative state, we needed each table in the destination with a table that matched from the source. However, it is very common that sources and destinations are not one-for-one table matches. This is where Power Query comes in. Using Power Query in our example set we bring in the text file and massage or shape the data to look like the destination. Most importantly, we need to apply all business rules and transformations to the source. Once this is done, we do no massaging on the destination data. This allows us to compare what the ETL process did with what our tests say it should have done.

A key part of being able to compare is the ability to relate the two tables in Power Pivot. You need to be able to match natural keys or derived keys between the two sources. The relationship should be from the destination table to the source table. Without this relationship, you will not be able to build the calculations for the tests. Keep in mind the goal is to get our source to look like expected results. Any data in the destination should match the source in our scenario.

Once both tables are created and loaded into Power Pivot, we can complete the tests using DAX. In some cases, we create calculations on both tables to be compared. A classic example is row count. We count the number of rows in the source table and the destination table. Then we create a calculation on the destination to compare values. This meets the requirements of a row count test case (e.g. all data was successfully imported).

Another example of a test is to compare the content in a field from source to destination. This is where we use a lot of conditional logic to verify the contents of a field in a row is the same in both tables. Calculated columns (not measures) are used to create the comparison results. The conditional statement should result in a number. This is important in order to create a measure that sums up the results to determine if errors exist or not. If you choose success to be 1, then you will check your results against the row count to determine if there are errors. If you choose failure to be 1, then a nonzero count means you have errors. There is no right or wrong way to handle this, you would choose based on visualization techniques. Most of the time, using 1 for failures is fine. However, if you want to create KPIs, you will likely need success to be one so you have a good target to work with.

6. Testing the Initial Load

Once you have created the tests, you are ready to test the initial load. You will connect to both sources. Ideally, your source will not change so you can redo the test multiple times, but this will work regardless. Refresh the data which may require rerunning the Power Query query. Once you have refreshed the data you should be able to check the calculations in a simple pivot table to determine what tests have succeeded or failed. This is the beauty of this solution. Each subsequent execution of SSIS, you will be able to refresh your data and review your results to determine how successful the ETL is.

A side effect of this work is that the developer can review the test results in Excel and Power Pivot with you to more easily find the discrepancy in the data transform. In some cases, the tests are in error as well. It is important that the developer and tester work together to determine cause as well. A good team will be able to work through issues rather quickly.

7. Recording Bugs and Issues

You will need to go back to Visual Studio to change the pass/fail for each test. If a test fails you can log a bug for the developer and you that information to determine if it was fixed prior to a subsequent run. It is likely that multiple sprints will be required to complete the work so you can work with your team to determine the best ways to communicate what is ready. If you track the work in TFS, you will queries are available to help you see what work has been completed.

You can determine if the fix worked and then set the test results accordingly. This will help show progress on the project as well.

8. Visualizing the Results

You can visualize your results using KPIs, conditional formatting and even Power View. If you have a project that needs to be easily evaluated you can publish your results to SharePoint and use charts and graphs to show how accurate the process is so far.

We will dig into visualization options more in a following blog post.

Tracking Test History

No solution is perfect and that is true here as well. One of the most common questions is how do we see the historical results? This solution does not easily provide for that. I am looking at options, but for the moment the idea is that the history will be tracked through TFS. However, you could save the workbook after each iteration. This will give you some history, but you would want to make sure that you don’t refresh data on a historical workbook or the results would be overwritten.

Some final thoughts.

Power Query is not an ETL tool. It’s target destination is always the same – Power Pivot. While it’s ease of use makes it appear to be a tool to be used for ETL, it is not there yet. However, it is in its ease of use that we have a place to work with it here.

My plan is to have some deeper technical dives into parts of the solution in the future.

T-SQL Window Functions – Part 4: Analytic Functions

On July 16, 2014July 16, 2014 By SteveIn Microsoft SQL Server, Oracle, SQL Server 2012, T-SQL

This is a reprint with some revisions of a series I originally published on LessThanDot. You can find the links to the original blogs on my Series page.

In the final installment of my series on SQL window functions, we will explore using analytic functions. Analytic functions were introduced in SQL Server 2012 with the expansion of the OVER clause capabilities. Analytic functions fall in to two primary categories: values at a position and percentiles. Four of the functions, LAG, LEAD, FIRST_VALUE and LAST_VALUE find a row in the partition and returns the desired value from that row. CUME_DIST and PERCENT_RANK break the partition into percentiles and return a rank value for each row. PERCENTILE_CONT and PERCENTILE_DISC a value at the requested percentile in the function for each row. All of the functions and examples in this blog will only work with SQL Server 2012.
Once again, the following CTE will be used as the query in all examples throughout the post:

with CTEOrders as
(select cast(1 as int) as OrderID, cast(‘3/1/2012’ as date) as OrderDate, cast(10.00 as money) as OrderAmt, ‘Joe’ as CustomerName
union select 2, ‘3/1/2012’, 11.00, ‘Sam’
union select 3, ‘3/2/2012’, 10.00, ‘Beth’
union select 4, ‘3/2/2012’, 15.00, ‘Joe’
union select 5, ‘3/2/2012’, 17.00, ‘Sam’
union select 6, ‘3/3/2012’, 12.00, ‘Joe’
union select 7, ‘3/4/2012’, 10.00, ‘Beth’
union select 8, ‘3/4/2012’, 18.00, ‘Sam’
union select 9, ‘3/4/2012’, 12.00, ‘Joe’
union select 10, ‘3/4/2012’, 11.00, ‘Beth’
union select 11, ‘3/5/2012’, 14.00, ‘Sam’
union select 12, ‘3/6/2012’, 17.00, ‘Beth’
union select 13, ‘3/6/2012’, 19.00, ‘Joe’
union select 14, ‘3/7/2012’, 13.00, ‘Beth’
union select 15, ‘3/7/2012’, 16.00, ‘Sam’
)
select OrderID
,OrderDate
,OrderAmt
,CustomerName
from CTEOrders;

Position Value Functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE

Who has not needed to use values from other rows in the current row for a report or other query? A prime example is needing to know what the last order value was to calculate growth or just show the difference in the results. This has never been easy in SQL Server until now. All of these functions require the use of the OVER clause and the ORDER BY clause. They all use the current row within the partition to find the row at the desired position.

The LAG and LEAD functions allow you to specify the offset or how many rows to look forward or backward and they support a default value in cases where the value returned would be null. These functions do not support the use of ROWS or RANGE in the OVER clause. The FIRST_VALUE and LAST_VALUE allow you to further define the partition using ROWS or RANGE if desired.

The following example illustrates all of the functions with various variations on the parameters and settings.

select OrderID
,OrderDate
,OrderAmt
,CustomerName
,LAG(OrderAmt) OVER (PARTITION BY CustomerName ORDER BY OrderID) as PrevOrdAmt
,LEAD(OrderAmt, 2) OVER (PARTITION BY CustomerName ORDER BY OrderID) as NextTwoOrdAmt
,LEAD(OrderDate, 2, ‘9999-12-31’) OVER (PARTITION BY CustomerName ORDER BY OrderID) as NextTwoOrdDtNoNull
,FIRST_VALUE(OrderDate) OVER (ORDER BY OrderID) as FirstOrdDt
,LAST_VALUE(CustomerName) OVER (PARTITION BY OrderDate ORDER BY OrderID) as LastCustToOrdByDay
from CTEOrders

Results (with shortened column names):

ID	OrderDate	Amt	Cust	PrevOrdAmt	NextTwoAmt	NextTwoDt	FirstOrd	LastCust
1	3/1/2012	10	Joe	NULL	12	3/3/2012	3/1/2012	Joe
2	3/1/2012	11	Sam	NULL	18	3/4/2012	3/1/2012	Sam
3	3/2/2012	10	Beth	NULL	11	3/4/2012	3/1/2012	Beth
4	3/2/2012	15	Joe	10	12	3/4/2012	3/1/2012	Joe
5	3/2/2012	17	Sam	11	14	3/5/2012	3/1/2012	Sam
6	3/3/2012	12	Joe	15	19	3/6/2012	3/1/2012	Joe
7	3/4/2012	10	Beth	10	17	3/6/2012	3/1/2012	Beth
8	3/4/2012	18	Sam	17	16	3/7/2012	3/1/2012	Sam
9	3/4/2012	12	Joe	12	NULL	12/31/9999	3/1/2012	Joe
10	3/4/2012	11	Beth	10	13	3/7/2012	3/1/2012	Beth
11	3/5/2012	14	Sam	18	NULL	12/31/9999	3/1/2012	Sam
12	3/6/2012	17	Beth	11	NULL	12/31/9999	3/1/2012	Beth
13	3/6/2012	19	Joe	12	NULL	12/31/9999	3/1/2012	Joe
14	3/7/2012	13	Beth	17	NULL	12/31/9999	3/1/2012	Beth
15	3/7/2012	16	Sam	14	NULL	12/31/9999	3/1/2012	Sam

If you really like subselects, you can also mix in some subselects and have a very creative SQL statement. The following statement uses LAG and a subselect to find the first value in a partition. I am showing this to illustrate some more of the capabilities of the function in case you have a solution that requires this level of complexity.

select OrderID
,OrderDate
,OrderAmt
,CustomerName
,LAG(OrderAmt, (
select count(*)-1
from CTEOrders c
where c.CustomerName = CTEOrders.CustomerName
and c.OrderID <= CTEOrders.OrderID), 0)
OVER (PARTITION BY CustomerName ORDER BY OrderDate, OrderID) as FirstOrderAmt
FROM CTEOrders

Percentile Functions: CUME_DIST, PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC

As I wrap up my discussion on window functions, the percentile based functions were the functions I knew the least about. While I have already used the position value functions above, I have not yet needed to use the percentiles. So, that meant I had to work with them for a while so I could share their usage and have some samples for you to use.

The key differences in the four function have to do with ranks and values. CUME_DIST and PERCENT_RANK return a ranking value while PERCENTILE_CONT and PERCENTILE_DISC return data values.

CUME_DIST returns a value that is greater than zero and lest than or equal to one (>0 and <=1) and represents the percentage group that the value falls into based on the order specified. PERCENT_RANK returns a value between zero and one as well (>= 0 and <=1). However, in PERCENT_RANK the first group is always represented as 0 whereas in CUME_DIST it represents the percentage of the group. Both functions return the last percent group as 1. In both cases, as the ranking percentages move from lowest to highest, each group’s percent value includes all of the earlier values in the calculation as well.

The following statement shows both of the functions using the default partition to determine the rankings of order amounts within our dataset.

select OrderID
,OrderDate
,OrderAmt
,CustomerName
,CUME_DIST() OVER (ORDER BY OrderAmt) CumDist
,PERCENT_RANK() OVER (ORDER BY OrderAmt) PctRank
FROM CTEOrders

Results:

OrderID	OrderDate	OrderAmt	CustomerName	CumDist	PctRank
1	3/1/2012	10	Joe	0.2	0
3	3/2/2012	10	Beth	0.2	0
7	3/4/2012	10	Beth	0.2	0
2	3/1/2012	11	Sam	0.33333333	0.214285714
10	3/4/2012	11	Beth	0.33333333	0.214285714
6	3/3/2012	12	Joe	0.46666667	0.357142857
9	3/4/2012	12	Joe	0.46666667	0.357142857
14	3/7/2012	13	Beth	0.53333333	0.5
11	3/5/2012	14	Sam	0.6	0.571428571
4	3/2/2012	15	Joe	0.66666667	0.642857143
15	3/7/2012	16	Sam	0.73333333	0.714285714
5	3/2/2012	17	Sam	0.86666667	0.785714286
12	3/6/2012	17	Beth	0.86666667	0.785714286
8	3/4/2012	18	Sam	0.93333333	0.928571429
13	3/6/2012	19	Joe	1	1

The last two functions, PERCENTILE_CONT and PERCENTILE_DISC, return the value at the percentile requested. PERCENTILE_CONT will return the true percentile value whether it exists in the data or not. For instance, if the percentile group has the values 10 and 20, it will return 15. If PERCENTILE_DISC, is applied to the same group it will return 10. It will return the smallest value in the percentile group, which in this case is 10. Both functions ignore NULL values and do not use the ORDER BY, ROWS, or RANGE clauses with the PARTITION BY clause. Instead, WITHIN GROUP is introduced which must contain a numeric data type and ORDER BY clause. Only one column can be specified here. Both functions need a percentile value which can be between 0.0 and 1.0.

The following script illustrates a couple of variations. The first two functions return the median of the default partition. Then next two return the median value for each day. Finally, the last two functions return the low and high values within the partition. The values segmented by the date partition highlight the key difference between the two functions.

select OrderID as ID
,OrderDate as ODt
,OrderAmt as OAmt
,CustomerName as CName
,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY OrderAmt) OVER() PerCont05
,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY OrderAmt) OVER() PerDisc05
,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY OrderAmt) OVER(PARTITION BY OrderDate) PerContDt
,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY OrderAmt) OVER(PARTITION BY OrderDate) PerDiscDt
,PERCENTILE_CONT(0) WITHIN GROUP (ORDER BY OrderAmt) OVER() PerCont0
FROM CTEOrders

Results

ID	ODt	OAmt	CName	PerCont05	PerDisc05	PerContDt	PerDiscDt	PerCont0
1	3/1/2012	10	Joe	13	13.00	10.5	10.00	10
2	3/1/2012	11	Sam	13	13.00	10.5	10.00	10
3	3/2/2012	10	Beth	13	13.00	15.0	15.00	10
4	3/2/2012	15	Joe	13	13.00	15.0	15.00	10
5	3/2/2012	17	Sam	13	13.00	15.0	15.00	10
6	3/3/2012	12	Joe	13	13.00	12.0	12.00	10
7	3/4/2012	10	Beth	13	13.00	11.5	11.00	10
10	3/4/2012	11	Beth	13	13.00	11.5	11.00	10
9	3/4/2012	12	Joe	13	13.00	11.5	11.00	10
8	3/4/2012	18	Sam	13	13.00	11.5	11.00	10
11	3/5/2012	14	Sam	13	13.00	14.0	14.00	10
12	3/6/2012	17	Beth	13	13.00	18.0	17.00	10
13	3/6/2012	19	Joe	13	13.00	18.0	17.00	10
14	3/7/2012	13	Beth	13	13.00	14.5	13.00	10
15	3/7/2012	16	Sam	13	13.00	14.5	13.00	10

As I wrap up this post, I have to give a shout out to my daughter, Kristy, who is an honors math student. She helped me get my head around this last group of functions. Her honors math work and some statistical work she had done in science helped provide additional insight into the math behind the functions. (Kristy – you rock!)

Series Wrap Up

I hope this series helps everyone understand the power and flexibility in the window functions made available in SQL Server 2012. If you happen to use Oracle, I know that many of these functions or there equivalent are also available in 11g and they also appear to be in 10g. I have to admit my first real production usage was with Oracle 11g but has since used them with SQL Server 2012. The expanded functionality in SQL Server 2012 is just one more reason to upgrade to the latest version.

T-SQL Window Functions – Part 3: Aggregate Functions

On July 7, 2014July 7, 2014 By SteveIn Microsoft SQL Server, SQL Server 2012, T-SQL

This is a reprint with some revisions of a series I originally published on LessThanDot. You can find the links to the original blogs on my Series page.

This is part 3 in my series on SQL window functions. In this post, we will explore using aggregation functions with T-SQL windows. SQL Server supports most of the aggregation functions such as SUM and AVG in this context with the exceptions of GROUPING and GROUPING_ID. However, prior to SQL Server 2012 only the PARTITION BY clause was supported which greatly limited the usability of aggregate window functions. When support for the ORDER BY clause was introduced in SQL Server 2012, more complex business problems such as running totals could be solved without the extensive use of cursors or nested select statement. In my experience, I used to try various ways to get around this limitation including pushing the data to .NET as it could solve this problem more efficiently. However, this was not always possible when working with reporting. Now that we are able to use SQL to solve the problem, more complex and low-performing solutions can be replaced with these window functions.

Once again, the following CTE will be used as the query in all examples throughout the post:

Using PARTITION BY with Aggregate Functions

SQL Server 2005 and the newer versions supports the usage of the PARTITION BY clause by itself. This allowed for some simple aggregate windows. The following example shows SUM and AVG for different partitions of data. The third function actually creates and average using a SUM and COUNT function.

select CustomerName
,OrderDate
,OrderAmt
,SUM(OrderAmt) OVER (PARTITION BY CustomerName) CustomerTotal
,AVG(OrderAmt) OVER (PARTITION BY OrderDate) AvgDailyAmt
,CAST(COUNT(OrderID) OVER (PARTITION BY OrderDate) as decimal(8,3)) / CAST(COUNT(OrderID) OVER() as decimal(8,3)) PctOfTotalPerDay
from CTEOrders
order by OrderDate;

NOTE: The COUNT aggregate returns an integer value. In order to return the decimal, the values need to be explicitly converted to decimal types. Otherwise, the result was rounding to zero for all results in this sample.

Results

CustomerName	OrderDate	OrderAmt	CustomerTotal	AvgDailyAmt	PctOfTotalPerDay
Joe	3/1/2012	10	68	10.5	0.133333333
Sam	3/1/2012	11	76	10.5	0.133333333
Sam	3/2/2012	17	76	14	0.2
Joe	3/2/2012	15	68	14	0.2
Beth	3/2/2012	10	61	14	0.2
Joe	3/3/2012	12	68	12	0.066666667
Joe	3/4/2012	12	68	12.75	0.266666667
Beth	3/4/2012	10	61	12.75	0.266666667
Beth	3/4/2012	11	61	12.75	0.266666667
Sam	3/4/2012	18	76	12.75	0.266666667
Sam	3/5/2012	14	76	14	0.066666667
Beth	3/6/2012	17	61	18	0.133333333
Joe	3/6/2012	19	68	18	0.133333333
Beth	3/7/2012	13	61	14.5	0.133333333
Sam	3/7/2012	16	76	14.5	0.133333333

Using Subselects

Subselect statements in SQL Server are supported, but harder to optimize in SQL Server versus Oracle. Until window functions were introduced all of the situations above could be solved by subselects, but performance would degrade as the results needed to work with larger sets of data. With the improved functionality in SQL Server 2012, you should not need to use subselects to return row-based aggregations. Besides the performance implications, maintenance will also be much simpler as the SQL becomes more transparent. For reference, here is the subselect syntax to return the same results as above:

select cte.CustomerName
, cte.OrderDate
, cte.OrderAmt
, (select SUM(OrderAmt) from CTEOrders where CustomerName = cte.CustomerName) CustomerTotal
, (select cast(COUNT(OrderID) as decimal(8,3)) from CTEOrders where OrderDate = cte.OrderDate) / (select cast(COUNT(OrderID) as decimal(8,3)) from CETOrders) AvgDailyAmt
from CETOrders cte
order by cte.OrderDate;

While it is possible to solve the same function using the subselects, the code is already getting messier and with data sets larger than what we have here, you would definitely see performance degradation.

Some Thoughts on GROUP BY

While I am digressing, I wanted to also highlight some details concerning GROUP BY. The one the biggest difficulties working with the GROUP BY clause and aggregates, every column must either be a part of the GROUP BY or have an aggregation associated with it. The window functions help solve this problem as well.
In the following examples, the first query returns the sum of the amount by day. This is pretty standard logic when working with aggregated queries in SQL.

select OrderDate
,sum(OrderAmt) as DailyOrderAmt
from CTEOrders
group by OrderDate;

However, if you wanted to see more details, but not include them in the aggregation, the following will not work.

select OrderDate
,OrderID
,OrderAmt
,sum(OrderAmt) as DailyOrderAmt
from CTEOrders
group by OrderDate
,OrderID
,OrderAmt;

This SQL statement will return each row individually with the sum at the detail level. You can solve this using the subselect above which is not recommended or you can use a window function.

select OrderDate
,OrderID
,OrderAmt
,sum(OrderAmt) OVER (PARTITION BY OrderDate) as DailyOrderAmt
from CTEOrders

As you can see here and in previous examples the OVER clause allows you to manage the grouping based on the context specified in relationship to the current row.

One other twist on the GROUP BY clause. First, I need to give credit to Itzik Ben-Gan for calling this to my attention at one of our Minnesota SQL Server User Group meetings. In his usual fashion he was showing some T-SQL coolness and he showed an interesting error when using the OVER clause with the GROUP BY clause.

The following will return an error because the first expression is an aggregate, but the second expression which is using the OVER clause is not. Also note that in this example the OVER clause is being evaluated for the entire set of data.

select sum(OrderAmt)
, sum(OrderAmt) over() as TotalOrderAmt
from CTEOrders
group by CustomerName

The expression above returns the following error:
Column ‘CTEOrders.OrderAmt’ is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause

The goal of the statement above was to show the customer’s total order amount with the overall order amount. The following statement resolves this issue because it is aggregating the aggregates. The window is now summing the aggregated amount which are grouped on the customer name.

select sum(OrderAmt)
, sum(sum(OrderAmt)) over() as TotalOrderAmt
from CTEOrders
group by CustomerName

Thanks again to Itzik for bringing this problem and resolution to my attention.

Aggregates with ORDER BY

With the expansion of the OVER clause to include ORDER BY support with aggregates, window functions increased their value substantially. One of the key business problems this allowed us to solve was a running aggregate.

The first example shows how to get a running total by CustomreName based on OrderDate and OrderID.

select OrderID
,OrderDate
,OrderAmt
,CustomerName
,SUM(OrderAmt) OVER (PARTITION BY CustomerName ORDER BY OrderDate, OrderID) as RunningByCustomer
from CTEOrders
ORDER BY CustomerName, OrderDate;

Results

OrderID	OrderDate	OrderAmt	CustomerName	RunningByCustomer
3	3/2/2012	10	Beth	10
7	3/4/2012	10	Beth	20
10	3/4/2012	11	Beth	31
12	3/6/2012	17	Beth	48
14	3/7/2012	13	Beth	61
1	3/1/2012	10	Joe	10
4	3/2/2012	15	Joe	25
6	3/3/2012	12	Joe	37
9	3/4/2012	12	Joe	49
13	3/6/2012	19	Joe	68
2	3/1/2012	11	Sam	11
5	3/2/2012	17	Sam	28
8	3/4/2012	18	Sam	46
11	3/5/2012	14	Sam	60
15	3/7/2012	16	Sam	76

This next example is more creative. It begins to show how powerful the window functions are. In this statement, we are going to return the annual running total aggregated by day. The differentiator here is that we use a DATEPART function in the OVER clause to achieve the desired results.

select OrderID
,OrderDate
,OrderAmt
,CustomerName
,SUM(OrderAmt) OVER (PARTITION BY datepart(yyyy, OrderDate) ORDER BY OrderDate) as AnnualRunning
from CTEOrders
ORDER BY OrderDate;

Results

OrderID	OrderDate	OrderAmt	CustomerName	AnnualRunning
1	3/1/2012	10	Joe	21
2	3/1/2012	11	Sam	21
3	3/2/2012	10	Beth	63
4	3/2/2012	15	Joe	63
5	3/2/2012	17	Sam	63
6	3/3/2012	12	Joe	75
7	3/4/2012	10	Beth	126
8	3/4/2012	18	Sam	126
9	3/4/2012	12	Joe	126
10	3/4/2012	11	Beth	126
11	3/5/2012	14	Sam	140
12	3/6/2012	17	Beth	176
13	3/6/2012	19	Joe	176
14	3/7/2012	13	Beth	205
15	3/7/2012	16	Sam	205

The ORDER BY clause creates an expanding group within the partition. In the examples above, the partition was the customer. Within each partition, ordered groups based on OrderDate and OrderID are “created”. At each row, the OrderDate and OrderID groups are aggregated up to the current row’s group thus producing the running total. If more than one row has the same order grouping, all of the rows in the group are aggregated into the total as shown in the second example above with the days and years.

Aggregates with ROWS

The ROWS clause is used to further define the partition by specifying which physical rows to include based on their proximity to the current row. As noted in the first post in the series, ROWS requires the ORDER BY clause as this determines the orientation of the partition.

The following example uses the FOLLOWING keywords to find the next two purchases that the customer made.

select OrderID
,OrderDate
,OrderAmt
,CustomerName
,SUM(OrderAmt) OVER (PARTITION BY CustomerName ORDER BY OrderDate, OrderID ROWS BETWEEN 1 FOLLOWING and 2 FOLLOWING) as NextTwoAmts
from CTEOrders
order by CustomerName, OrderDate, OrderID;

Results

OrderID	OrderDate	OrderAmt	CustomerName	NextTwoAmts
3	3/2/2012	10	Beth	21
7	3/4/2012	10	Beth	28
10	3/4/2012	11	Beth	30
12	3/6/2012	17	Beth	13
14	3/7/2012	13	Beth	NULL
1	3/1/2012	10	Joe	27
4	3/2/2012	15	Joe	24
6	3/3/2012	12	Joe	31
9	3/4/2012	12	Joe	19
13	3/6/2012	19	Joe	NULL
2	3/1/2012	11	Sam	35
5	3/2/2012	17	Sam	32
8	3/4/2012	18	Sam	30
11	3/5/2012	14	Sam	16
15	3/7/2012	16	Sam	NULL

As we noted in the first blog, the last two rows in the partition only contain partial values. For example, order 12 contains the sum of only one order, 14, and order 14 has now rows following it in the partition and returns NULL as a result. When working with the ROWS clause this must be taken into account.

Aggregates with RANGE

Lastly, adding the RANGE to the OVER clause allows you to create aggregates which go to the beginning or end of the partition. RANGE is commonly used with UNBOUNDED FOLLOWING which goes to the end of the partition and UNBOUNDED PRECEDING which goes to the beginning of the partition. One of the most common use would be to specify the rows from the beginning of the partition to the current row which allows for aggregations such as year to date.

In the example below, we are calculating the average order size over time to the current row. This could be a very effective in a trending report.

select OrderID
,OrderDate
,OrderAmt
,CustomerName
,AVG(OrderAmt) OVER (ORDER BY OrderID RANGE BETWEEN UNBOUNDED PRECEDING and CURRENT ROW) as AvgOrderAmt
from CTEOrders
order by OrderDate;

Results

OrderID	OrderDate	OrderAmt	CustomerName	AvgOrderAmt
1	3/1/2012	10	Joe	10
2	3/1/2012	11	Sam	10.5
3	3/2/2012	10	Beth	10.333333
4	3/2/2012	15	Joe	11.5
5	3/2/2012	17	Sam	12.6
6	3/3/2012	12	Joe	12.5
7	3/4/2012	10	Beth	12.142857
8	3/4/2012	18	Sam	12.875
9	3/4/2012	12	Joe	12.777777
10	3/4/2012	11	Beth	12.6
11	3/5/2012	14	Sam	12.727272
12	3/6/2012	17	Beth	13.083333
13	3/6/2012	19	Joe	13.538461
14	3/7/2012	13	Beth	13.5
15	3/7/2012	16	Sam	13.666666

As you can see, the latest versions of OVER clause supports powerful yet simple aggregations which can help in a multitude of reporting and business solutions. Up next, the last blog in the series – Analytic Functions which are all new in SQL Server 2012.

T-SQL Window Functions – Part 2: Ranking Functions

On June 26, 2014June 27, 2014 By SteveIn Microsoft SQL Server, SQL Server 2012, T-SQL

This is a reprint with some revisions of a series I originally published on LessThanDot. You can find the links to the original blogs on my Series page.

This is part 2 in my series on SQL window functions. In this post, we will explore using ranking functions. SQL Server support four different ranking functions which are supported in SQL Server versions 2005 and forward. All of these functions require the use of the OVER clause. The following functions are classified as ranking functions: ROW_NUMBER, RANK, DENSE_RANK, NTILE.

Once again, the following CTE will be used as the query in all examples throughout the post:

ROW_NUMBER

The ROW_NUMBER function will return a row number for each row within the partition based on the partition and order. This function requires the use of the ORDER BY clause. However, it is often used without the PARTITION BY clause as it will number the entire result set. If PARTITION BY is used, then the row numbering starts over within the partition. The following code shows how both of these work.

select CustomerName
, OrderDate
, OrderAmt
,ROW_NUMBER() OVER(ORDER BY CustomerName) RowNumByCust
,ROW_NUMBER() OVER(PARTITION BY OrderDate ORDER BY CustomerName) RowNumPart
from CTEOrders
order by CustomerName;

RANK and DENSE_RANK

While these are different functions with even different rules, it is easier to understand the difference when put side by side. RANK and DENSE_RANK will order the rows based on the specified partition and apply a rank or number to them. Both RANK and DENSE_RANK will assign the same rank to “ties”. For example if rows 3 and 4 have the same value in the partition, they will have the same rank. The difference is how it handles the next rank number in the series. RANK does a “true” ordering and will apply the ranking based on the number of rows and skip numbers that are ties. So, if you have a tie between the third and fourth row and the first two rows and the final row are unique the ranking is as follows: 1, 2, 3, 3, 5. As you can see, 4 is missing. DENSE_RANK keeps the tie as well, however it does not skip any numbers in the sequence. Here is the same example set based on using DENSE_RANK: 1, 2, 3, 3, 4. As with the ROW_NUMBER function, the ORDER BY is required for these functions.

select CustomerName
, OrderDate
, OrderAmt
,RANK() OVER (ORDER BY CustomerName) RankByCust
,DENSE_RANK() OVER (ORDER BY CustomerName) DenseByCust
,RANK() OVER (PARTITION BY CustomerName ORDER BY OrderDate) RankByCustDt
,DENSE_RANK() OVER (PARTITION BY CustomerName ORDER BY OrderDate) DenseByCustDt
from CTEOrders
order by CustomerName;

NTILE

The last of the ranking functions is NTILE. NTILE groups the data into ordered and ranked groups based on the ORDER BY clause. The number of groups used in the ranking are specified in the function itself. So if you specify four groups to produce a quartile ranking, four ranked values, one through four, will be assigned to each group based on the order. In many cases, the total numbers of rows is not divisible by the number of groups chosen. For instance, if the number of groups is 4 but the total number of rows is 39, then the first three groups will return 10 rows and the final group will only return 9 rows. The function will always frontload the results so the earliest groups will have the “extra” rows. If a PARTITION BY clause is used, the NTILE ranking will applied within each partition. As with the other ranking functions, the ORDER BY clause is required to use this function.

select CustomerName
,OrderDate
,OrderAmt
,NTILE(4) OVER (ORDER BY OrderDate) NtileByDt
,NTILE(2) OVER (PARTITION BY OrderDate ORDER BY OrderAMt) NtileByDtAmt
from CTEOrders
order by CustomerName;

Up next, using aggregate functions with window functions.

T-SQL Window Functions – Part 1: The OVER() Clause

On June 26, 2014June 26, 2014 By SteveIn Microsoft SQL Server, SQL Server 2012, T-SQL

This is a reprint with some revisions of a series I originally published on LessThanDot. You can find the links to the original blogs on my Series page.

SQL Window functions were introduced in SQL Server 2005. At the time, only a small set of functionality was available. Window functions fill a need in the aggregation story for SQL Server. Window functions allow the developer to use row level aggregations without the penalty of using cursors to accomplish a similar task. Window functions allow you to segment a set of rows and then apply a function to that set of rows. In many cases, you may choose an aggregation function. However, other functions are also available including ranking and analytic functions. In this four-part series, I will start by breaking apart the OVER clause which is the key to understanding window functions in SQL Server. The following posts will expand on each group of window functions which use the OVER clause – ranking, aggregate, and analytic.

Up until 2008 R2, SQL Server only supported a subset of window functions which focused on ranking and some aggregation functions. With SQL Server 2012, Microsoft greatly expanded its support for window functions in T-SQL thus making fairly complex operations such as running totals much simpler to accomplish.

The following CTE will be used as the query in all examples throughout the post:

Defining the Window with the OVER() Clause

The purpose of the OVER clause is to define the window over which the function will be applied. The default functionality is to define the window for the entire table. As shown in the example below, the result is the same as if you had used an aggregation function without a GROUP BY clause. Some of what makes OVER unique includes the fact different windows can be specified in a single SELECT statement as will be shown in later sections.

…
select CustomerName
, SUM(OrderAmt) Over () as NoParm
from CTEOrders;

The OVER clause takes three additional arguments which change the scope of the clause — PARTITION BY, ORDER BY, and ROWS or RANGE.

PARTITION BY

The PARTITION BY clause is used to reduce the scope of the window to which the aggregation applies. This clause partitions the default or entire result set into partitions based on the criteria specified. The function will be applied to each partition independently. At this point, we begin to see the real power of window functions.

While similar to the GROUP BY clause, the PARTITION BY clause allows you to specify different partitions within the same select statement. The following example builds on the illustration above with no partition specified and a partition on the CustomerName field.

select CustomerName
, SUM(OrderAmt) OVER() as NoParm
, SUM(OrderAmt) OVER (PARTITION BY CustomerName) as PartByName
from CTEOrders;

ORDER BY

The ORDER BY clause is used to order how the partitions apply the window function. For instance, when an ORDER BY is present in our current illustration, it will sum all of Beth’s order amounts, but will then add Joe’s to the total as he is the next in order. In the following illustration, no partition is specified so the partition is still the whole table. You can now see how partitions and ordering affect the window function.

select CustomerName
,SUM(OrderAmt) OVER () as NoParm
,SUM(OrderAmt) OVER (PARTITION BY CustomerName) as PartByName
,SUM(OrderAmt) OVER (ORDER BY CustomerName) as OrderByName
from CTEOrders;

As you can see, if no partition is specified the order will apply to the entire result set or the default partition. This is particularly important to understand when using ranking functions with the OVER clause.

Before SQL Server 2012, the ORDER BY clause could only be used with ranking functions. With the release of SQL Server 2012, the ORDER BY clause can now be used with ranking, aggregate, and analytic functions.

The following image visually illustrates the how the PARTITION BY and ORDER BY clauses aggregate and rank data.

ROWS or RANGE

The final clauses that can be used to define a partition are the ROWS and RANGE functions. This functionality was introduced in SQL Server 2012. These functions will specify a set of rows based on the position of the current row and as a result both functions require the use of the ORDER BY clause. Because each partition is anchored to the current row, the rows or range specifications are based on their proximity to the current row. A number of key words can be used with this clause to define the window frame used by the window functions.

• CURRENT ROW: That’s what it means. It identifies the current row as part of the ROWS or RANGE. This key word is supported by ROWS and RANGE.
• UNBOUNDED PRECEDING: Go to the beginning of the partition. This is supported by both ROWS and RANGE.
• UNBOUNDED FOLLOWING: Go to the end of the partition. This is supported by both ROWS and RANGE.
• n PRECEDING: This is used to specify the number of rows before the current row. This is only supported by the ROWS clause.
• n FOLLOWING: This is used to specify the number of rows after the current row. This is only supported by the ROWS clause.

select CustomerName
,SUM(OrderAmt) OVER () as NoParm
,SUM(OrderAmt) OVER (PARTITION BY CustomerName) as PartByName
,SUM(OrderAmt) OVER (ORDER BY CustomerName) as OrderByName
,SUM(OrderAmt) OVER (PARTITION BY CustomerName ORDER BY OrderDate, OrderID RANGE BETWEEN UNBOUNDED PRECEDING and CURRENT ROW) as RangeByDate
,SUM(OrderAmt) OVER (PARTITION BY CustomerName ORDER BY OrderDate, OrderID ROWS BETWEEN 1 FOLLOWING and 2 FOLLOWING) as NextTwoAmts
from CTEOrders;

A couple of other notes regarding ROWS and RANGE. In cases where the partition returns no rows, the value returned is NULL by default as seen in rows 5 and 10 above. Also, if the partition only returns fewer rows than specified it will only apply to rows within the range. In the example above, row 4 NextTwoAmts column will only return the sum of row 5 as it is the only row included in the partition.

Finally, ROWS and RANGE react differently when only the CURRENT ROW is specified. When CURRENT ROW is used by itself with the RANGE clause it will apply the function based on the partitioning and ordering. However, when used with the ROWS clause, it will only return the current row regardless of the partition specification.

select CustomerName
,OrderDate
,OrderAmt
,SUM(OrderAmt) OVER (PARTITION BY CustomerName ORDER BY OrderDate RANGE CURRENT ROW) as RangeRow
,SUM(OrderAmt) OVER (PARTITION BY CustomerName ORDER BY OrderDate ROWS CURRENT ROW) as RowRow
from CTEOrders;

The next three related posts I will discuss ranking functions, aggregate functions and analytic functions.

The Problem Area

The Big Idea

The Recommended Tools

ETL Development Tools

Testing Tools

Team Foundation Server/Visual Studio Online

The Process

1. Business Rules

2. Source to Target Map

3. Developing SSIS

4. Creating Test Plans and Test Cases

5. Building the Tests

6. Testing the Initial Load

7. Recording Bugs and Issues

8. Visualizing the Results

Tracking Test History

Some final thoughts.

Share this:

Position Value Functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE

Results (with shortened column names):

Percentile Functions: CUME_DIST, PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC

Results:

Results

Series Wrap Up

Share this:

Using PARTITION BY with Aggregate Functions

Using Subselects

Some Thoughts on GROUP BY

Aggregates with ORDER BY

Aggregates with ROWS

Aggregates with RANGE

Share this:

ROW_NUMBER

RANK and DENSE_RANK

NTILE

Share this:

Defining the Window with the OVER() Clause

PARTITION BY

ORDER BY

ROWS or RANGE

Share this: