This blog post digs into the details of shaping the data with Power Query and Power Pivot in order to build out the test cases. In the previous post, you were able to get a sense of the bigger picture and how the pieces work together. This post will focus entirely on creating the source table that will be used.
One of the most difficult parts of testing the data in an ETL process is that the data needs to be transformed to match the results of the ETL process. Typically this is done using a combination of tools including SQL, Excel, and even Access. The solution I am proposing will use Power Query to do the initial massaging of the data and Power Pivot to put any finishing touches in place.
Understanding the Requirements
The first thing that has to be understood are the requirements. Those requirements are driven from the business rules and the Source to Target Map. Because we are focusing on a non-developer to deliver this work, we need to move away from developer centric tools and into the world of Excel and Power BI.
Building Out the Power Query Query
Power Query is an excellent choice for this work. It allows us to transform or shape the data through a series of steps. What really makes this compelling is that Power Query is a “no code” solution. Once the tester or analyst is familiar with the tool, they understand that most operations can be accomplished using short cut or right-click menus during the design process. Here is the indepth look at what it will take to take the multiple authors in the source and separate them into multiple columns using Power Query.
Step 1 – Find the data source
In our case the data source is a CSV file. You can download that file here. This link will opens an Excel file with the pipe-delimited values that will be used as the source. I would recommend saving it as a .csv file as it is easier to work with in Power Query.
Here is the data raw so you can see what we will be working with:
Wish List|John Locke Books|Thriller|0011|215|John Locke
Riders of the Pale Horse|Bethany House Publishers|Christian Fiction|1994|348|T. Davis Bunn
HTML Pocket Reference|O’Reilly|Technology|2000|92000|Jennifer Niederst
Renegade|Thomas Nelson|Juvenile Fiction|2007|245|Ted DekKer
Gutenberg to Google|Billion Soul Publishing|Missions|2009|272|James Davis
Sinner|Thomas Nelson|Suspense|2008|386|Ted DekKer
SQL Server Analysis Services 2012 Cube Development Cookbook|Packt Publishing|Technology|2013|324|Baya Dewald, Steve Hughes, Paul Turley
Troubleshooting SQL Server – A Guide for the Accidental DBA|Redgate Books|Technology|2011|358|Jonathan Kehaysias, Ted Krueger
Step 2 – Open Power Query in Excel and Connect to the CSV File
Select the Power Query tab and select the From File option on the ribbon. Pick the From CSV option. Select the booklist.csv file and click OK. The result will be a preview of the data, which in our case is all the data. You can see it has created the Source, First Row as Header and Changed Type steps. If it did not do this for you automatically, you may need to set the delimiter and specify that the header is the first row.
Step 3 – Shape the Data in Power Query to Match Our ETL Process
In Power Query, we are going to split the author list and the author names. We also will apply some trimming to the data. In all we will apply ten (10) steps to query. Power Query works like an ETL tool as it shapes or transforms the data a step at a time.
Splitting the AuthorNames column
In this step, we will create a column for each author name. Our destination supports up to five authors. Our source has up to three. Right click on the AuthorNames column, select Split Column, then By Delimiter.
You can leave the defaults in the dialog and click OK.
This will result in three columns being created as AuthorNames.1, AuthorNames.2 and AuthorNames.3. Power Query does the next step which changes the data type to match what it sees in the resulting data.
Splitting the Author’s Names into First and Last Name Column
You will need to repeat this three times, once for each AuthorNames column. What is different is that we need to match a couple of business rules:
1. Author names will be stored as AuthorFName and AuthorLName for up to 5 authors (e.g. AuthorFName1).
2. Authors with middle initials or middle names or variations thereof should store these values with the first name. For example, J.R.R. Tolkien would store “J.R.R.” in the AuthorFName column and his last name, “Tolkien”, will be stored in the AuthorLName column.
Understanding these rules clarify how we should split these columns. Like before we will select to split the AuthorNames.1 column. However, in the delimiter dialog we will use a space as a delimiter and we will also choose the right most delimiter. This will pick the first space from the right, essentially the last name and everything else will be separated.
We will repeat the process for each column. The last step for this process is to rename columns to something meaningful for us to reference later such as the target field names like AuthorFName1. This will make the steps later simpler to follow.
Trim Author First Names for Authors after First Author
The final step we need to do is to apply a trim to the AuthorFName2 and AuthorFName3 columns. When the data is split, leading spaces were retained. In my demos, this is “discovered” as a mismatch in the test scenario. This would be an example of an easy miss for someone not used to some of the nuances of ETL. Keep in mind that we will test the tests as well throughout this process. This is a simple fix in Power Query – Right Click the affected columns and select Transform then Trim. Problem solved.
At this point, we have completed our work in Power Query. Up to this point, you may have seen the results of your query in an Excel spreadsheet. You actually want to load the data to a Power Pivot model. If you right-click on the query in the Workbook Queries panel, you can change the Load To target.
Select Load to Data Model and then we will finish the source data using Power Pivot.
Step 4 – Fix Additional Issues Using Calculated Columns in Power Pivot
Open the Power Pivot model in Excel. You should see data from your Power Query query as one tab of the data. While we have massaged some of the data there are still a few data issues that need to be resolved to match business rules.
3 – Copyright years must be stored as 4 digit values.
4 – Page counts should not exceed 1000.
If you look at the source data you will notice that one of the books has a two digit year for the Copyright. This should not be imported as it does not meet the rule. In our case, we will set the value to NULL in the ETL process. The same is true for one of the book page counts, it is 92,000 which greatly exceeds the maximum page count allowed by the business rule. It too will be set to NULL. The idea here is that row value checks are easily handled in Power Pivot with DAX and calculated columns.
To resolve the copyright year issue we are using the following DAX to create a new column called “Copyright Year”:
=IF([Copyright] < 1900, BLANK(), [Copyright])
To resolve the page count issue, we use the following DAX and create a “Pages” column:
Now we have fixed the remaining issues that violate business rules in the Power Pivot model.
Step 5 – Add Some Calculated Measures and Columns that Can Be Used for Data Validation
The final step is to add some calculations that will help us do some basic load testing. The first is just the row count. In this case, I created two measures: Source Book Count and Source Distinct Book Count (This handles a business rule that says a title can only be imported once). We can use these measures to verify that the expected data made it from source to destination. Both of these measures were created in the calculation area in Power Pivot using the Autosum functions from the ribbon. The resulting DAX is noted below.
Source Book Count:
Source Book Count:=COUNTA([BookName])
Source Distinct Book Count:
Source Distinct Book Count:=DISTINCTCOUNT([BookName])
The last calculation we need to create is the Author Count calculated column. This needs to be a column as each row could have a different number of authors. Based on what we know with the data, we will count instances of AuthorLName columns that are not NULL to determine the number of authors.
=IF(ISBLANK([AuthorLName1]),0,1)+ IF(ISBLANK([AuthorLName2]),0,1)+ IF(ISBLANK([AuthorLName3]),0,1)
This calculation would need to be modified if the source had an row with more than three columns.
Shaping Is Complete
The source transformation is now complete in the test scenario. A key point is that no code per se was written. While some DAX was required, it was fairly straightforward and likely the most complicated part of setting up the source table for testing.
Next up, creating the tests with Power Pivot and DAX.