Power BI: Data Quality Checks Using Python & SQL

Picture this: someone hands you a Power BI report and asks you to run data quality checks on it. There are a few ways to make sure your measures match what is in the source data system, but for this demo we are going to use Python, SQL, and Excel to perform our data quality checks in one batch. To do that, we are going to build a Python script that can call the Power BI REST APIs, connect to a SQL Server database, and connect to Excel to grab the queries and to push the quality check results back into Excel for final review. To find a sample Excel file and the final Python script, please refer to my GitHub.

To build this, you’ll need a few pieces:

  1. A way to run Power BI REST APIs
    • Requires an Azure AD app – if you can’t create one, you will need someone in your company to create one and give you its client ID and the credentials to use it with
  2. Ability to run Python scripts (Visual Studio Code will be used in this demo)
  3. Access to a Power BI dataset that is published to the Power BI service (Premium is not needed)
  4. Access to a SQL Server
  5. A list of DAX functions that you want to test in Excel
  6. A list of SQL queries that should have the same result as your DAX functions in Excel

Authentication

To start, you’ll need to register an application for Power BI within Azure. First, create and register your app in the Azure portal. This link will take you through the manual process step by step, including how to add permissions so the application can access the Power BI service. You’ll need to give your app the delegated permission Dataset.ReadWrite.All or Dataset.Read.All, as well as Tenant.Read.All or Tenant.ReadWrite.All, which will require an admin to grant (check the required scope in the API documentation). One important limitation: the user account you use with the app cannot have MFA enabled. If that makes your IT security team squeamish, remind them that the app can be given read-only access to only Power BI’s metadata. No proprietary data is at risk, and write-back is not necessary to accomplish our goals.

Let’s test out running the REST API using a Python script! To do that, we will first generate an access token and then call the REST API using that token.

Adding Libraries in Python

If you have not used some of the libraries that will be imported for this demo, you will need to run “pip install [library name]” in your command prompt. If you don’t have pip, follow the directions at this link and/or modify your Python installation to include pip. Newer versions often have it enabled by default, but double-check that you have added Python to your environment variables, or you will get errors saying Python isn’t installed when you try to run commands in the command prompt.
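If you want a quick way to see which of these libraries you already have before running anything, a small check like the sketch below can help. The module names are the ones imported later in this post, and pypiwin32 is the pip package that provides the win32com module.

import importlib.util

#modules used later in this post, mapped to the pip package that provides them
required = {"requests": "requests", "adal": "adal", "pymssql": "pymssql",
            "pandas": "pandas", "openpyxl": "openpyxl", "win32com": "pypiwin32"}
for module_name, pip_name in required.items():
    if importlib.util.find_spec(module_name) is None:
        print(f"Missing '{module_name}' - run: pip install {pip_name}")
    else:
        print(f"'{module_name}' is already installed")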

Getting the REST API Token

Alrighty, let’s go to Visual Studio Code and build out the authentication portion of our Python script. We are going to use the adal library to make a REST API call that will get the token for us. Make sure to put single or double quotes around your Client ID, Username, and Password to format them as text. Keep in mind, our final code will get these values from the Excel file instead of hard coding them into our script. That also means you can keep these values in a separate Excel file if your IT team would prefer to keep them out of the final code (which is highly recommended).

import adal
authority_url = 'https://login.windows.net/common'
resource_url = 'https://analysis.windows.net/powerbi/api'
client_id = '<INSERT CLIENT ID>'
username = '<INSERT USERNAME>'
password = '<INSERT PASSWORD>'
context = adal.AuthenticationContext(authority=authority_url,
                                     validate_authority=True,
                                     api_version=None)
token = context.acquire_token_with_username_password(resource=resource_url,
                                                     client_id=client_id,
                                                     username=username,
                                                     password=password)
access_token = token.get('accessToken')
print(access_token)

Awesome, now we have a way to generate an access token to use for our future API calls!

Alrighty, let’s try this out with a simple REST API call – getting a list of the workspaces (called groups in the API for some reason) in my tenant. I only have a couple workspaces, but this will let us test out passing our token to get back a response. Add the following lines to the end of the code above:

import requests

get_workspaces_url = 'https://api.powerbi.com/v1.0/myorg/groups'
header = {'Authorization': f'Bearer {access_token}'}
r = requests.get(url=get_workspaces_url, headers=header)
r.raise_for_status()
print(r.text)

You should see a JSON response listing your workspaces. There are a lot of elements in this response, but we can look at the name values and confirm that the code is working as expected. Exciting stuff!
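If you’d rather pull out just the workspace names instead of scanning the raw text, a quick parse like this should work (the Groups response comes back as a JSON object with a "value" array of workspace objects):

import json

workspaces = json.loads(r.text).get('value', [])
for workspace in workspaces:
    print(workspace.get('name')) #each workspace object includes a name element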

Connecting to a SQL Server database

To connect to SQL, we are going to use the pymssql library. See below for the format we will need to connect to our SQL Server database and run a sample query.

#SQL Server Authentication
import pymssql 

sql_server = 'server.database.windows.net'
sql_user = 'Sample'
sql_password = 'Password'
sql_database = 'DatabaseName'

sql_con = pymssql.connect(sql_server,sql_user,sql_password,sql_database)
sql_cursor = sql_con.cursor(as_dict=True)

#execute SQL script
sql_cursor.execute('SELECT TOP 1 SalesKey FROM dbo.FactSales')
for row in sql_cursor: 
    print(row)

Your result will be a dictionary keyed by column name, with the query result as the value. This will be important when we compare it to our Power BI DAX query result.
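For example, because we created the cursor with as_dict=True, you can grab a single value by its column name. This is just a quick illustration reusing the sample query from above:

#each row comes back as a dictionary keyed by column name, e.g. {'SalesKey': 1}
sql_cursor.execute('SELECT TOP 1 SalesKey FROM dbo.FactSales')
first_row = sql_cursor.fetchone()
print(first_row['SalesKey'])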

Comparing DAX Query with SQL Query

Alrighty, now to the main event! We will run a DAX query against our Power BI data model and compare it to a SQL query to ensure our measure is working as expected. For this demo, I’ll keep it pretty simple. I have a measure in my dataset, Sales Amount = SUM(FactSales[SalesAmount]). This should equal the result from the SQL query SELECT SUM(SalesAmount) 'Sales Amount' FROM FactSales. If you get errors using the Execute Queries REST API, please review the limitations section of the Microsoft Execute Queries documentation.

Python Code to Run SQL Script

#SQL Script Execution
sql_cursor.execute('SELECT SUM(SalesAmount) as SalesAmount FROM dbo.FactSales')
for row in sql_cursor: 
    print(row["SalesAmount"])

Python Code to Run PBI Execute Queries REST API

#PBI Execute Queries
import json

pbi_dataset_id = '10c3f357-9b2a-42f4-8438-748295b47b9b'
execute_queries_url = f'https://api.powerbi.com/v1.0/myorg/datasets/{pbi_dataset_id}/executeQueries'
dax_query_json_body = {
                "queries":
                        [
                            {
                            "query": 'EVALUATE ROW("Sales_Amount", \'_Measures\'[Sales Amount])'
                            }
                        ]
                        
                }
eq = requests.post(url=execute_queries_url, headers=header,json=dax_query_json_body)
eq.raise_for_status()
query_json = eq.text.encode().decode('utf-8-sig')
print("JSON Response ", json.loads(query_json))

Awesome! Now I can confirm that my Power BI DAX measure returns the same result as my SQL query. But what if I want to do more than one at a time? In the next section, we will flip the script to use an Excel file as both our input and our output. That way, all you would need to do is fill in two input columns – one with the SQL query and one with the corresponding DAX query – and the output would be a third column that returns the difference between the two (ideally, 0).

Use Excel File as Input and Output

To start, we are going to build an Excel template to use with this script. For ease of use, we are going to use formatted tables so end users can append new measures as needed (a copy is available on our GitHub for download). We will make two sheets – one for credentials and one for the data quality checks. On the Credentials sheet, make one table that holds both the SQL credentials and the PBI credentials. I’m going to name the table “Creds”.

On the Quality Check tab, we will make one more table with columns for the Power BI DAX query, the SQL query, the result of the DAX query, the result of the SQL query, and the difference between the two. Let’s avoid using spaces in our column and table names to keep our coding simple. We’ll call this table “Quality”.
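For reference, these are the column names the rest of this post assumes on each tab (the Difference column name is just a placeholder – the script below only writes to the PBI_Result and SQL_Result columns):

#column names in the "Creds" table on the Credentials sheet
creds_columns = ["SQL_Server", "SQL_Database", "SQL_User", "SQL_Password",
                 "PBI_ClientID", "PBI_Username", "PBI_Password", "PBI_DatasetID"]
#column names in the "Quality" table on the Quality Check sheet
quality_columns = ["PBI_DAX_Query", "SQL_Query", "PBI_Result", "SQL_Result", "Difference"]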

Alrighty, now to pull these values into Python! Let’s start with grabbing our credentials. For this section, we need to import the openpyxl library and use it to pull in our tables. To start, we are going to open our file and go to the worksheet named “Credentials”. From there, we will see what tables are in that worksheet.

#Grab data from our excel file
from openpyxl import load_workbook

#Read file
wb = load_workbook(r"C:\Users\KristynaHughes\Downloads\PythonDataQualityChecker.xlsx") 
#we need the r at the beginning since "\" is a special character in python
sheets = wb.sheetnames
print(sheets) #see the sheets in our workbook
#access specific sheet called Credentials
cred_ws = wb['Credentials']
print(cred_ws.title)
cred_tables = cred_ws.tables.values()
for table in cred_tables:
    print(table.displayName)

Next, we are going to build a mapping dictionary that pairs each Excel table with a pandas dataframe. If you’re not familiar with Python, a dataframe is essentially Python’s version of a SQL view or Excel table that can store and manipulate tabular data. Once we get our two tables into dataframes, we can start to pull out values as inputs for the code we wrote earlier.

Don’t worry, at the end of this blog is the entire final script.

import pandas

cred_mapping = {} #this will help us build a dataframe from our table
for entry, data_boundary in cred_ws.tables.items():
    #parse the data within the table's ref boundary
    data = cred_ws[data_boundary]
    #extract the data
    #the inner list comprehension gets the values for each cell in a row
    content = [[cell.value for cell in ent] #each ent is a row of cells in the table
               for ent in data
          ]
    header = content[0] #the column names
    rest = content[1:] #the contents excluding the header row
    #create dataframe with the column names
    #and pair table name with dataframe
    cred_df = pandas.DataFrame(rest, columns = header)
    cred_mapping[entry] = cred_df
print(cred_df)

server = cred_df._get_value(0,"SQL_Server")
db = cred_df._get_value(0,"SQL_Database")
print(server)
print(db)

Sweet, now we can get our SQL_Server and SQL_Database values directly from the spreadsheet instead of hard coding them in Python! To code out the rest of the variables, simply copy the code above and replace “server” and “SQL_Server” with the variable name and column name that you want to return, as shown in the sketch below.
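Following that pattern, the rest of the credential variables look something like this (these column names match the Creds table used in the final script at the end of this post):

sqluser = cred_df._get_value(0,"SQL_User")
sqlpassword = cred_df._get_value(0,"SQL_Password")
pbiclientid = cred_df._get_value(0,"PBI_ClientID")
pbiusername = cred_df._get_value(0,"PBI_Username")
pbipassword = cred_df._get_value(0,"PBI_Password")
pbidatasetid = cred_df._get_value(0,"PBI_DatasetID")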

Alrighty, now we need to pull in our data quality spreadsheet and set up some for loops that can help iterate through the list of Power BI and SQL queries and return some outputs to our Excel sheet.

Note: Your SQL query will need every column named (aliased), or you will get an error when you run the cursor over it. Also, for this section we will need to import a couple more libraries. If you don’t already have win32com installed, you’ll need to run pip install pypiwin32 in your command prompt.
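As a quick illustration of that note, an aggregate without an alias is exactly the kind of query that trips up the dictionary cursor, so alias every computed column:

#works: the result dictionary will have a 'SalesAmount' key
good_query = "SELECT SUM(SalesAmount) AS SalesAmount FROM dbo.FactSales"
#errors: the summed column has no name, so the dictionary cursor has nothing to key it by
bad_query = "SELECT SUM(SalesAmount) FROM dbo.FactSales"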

This script is pretty awesome to see work because we will finally get some values back into our Excel file! The first section uses the win32com library to open Excel in edit mode instead of read-only like we have been doing. After we open the worksheet for editing, we will grab our SQL queries, iterate through them to get the results, and put those results back into Excel. If that doesn’t get the nerd in you excited, I’m not sure what will!

from win32com.client import Dispatch; import os
#Using the win32com library, we can open excel in edit mode
xl = Dispatch("Excel.Application") #opens excel for us to edit
xl.Visible = True
excel_file = r"C:\Users\KristynaHughes\Downloads\PythonDataQualityChecker.xlsx" #same file we loaded with openpyxl earlier
edit_wb = xl.Workbooks.Open(excel_file)
edit_qc_sh = edit_wb.Worksheets("Quality Check")

#Running SQL queries and putting the results back into Excel
#quality_df is built from the "Quality" table the same way cred_df was built from "Creds" (see the final script)
sql_df = quality_df.dropna() #removing any blank rows so we don't run blank sql queries lol
sql_queries = sql_df.loc[:,"SQL_Query"] 
#the first argument (:) signifies which rows we want to index (: indicates all columns), and the second argument lets us index the column we want
#print(sql_queries) #this returns the rows in our sql query column
rownumber = 1
for query in sql_queries:
    sql_cursor.execute(str(query))
    #print (sql_cursor.fetchall()[0])
    for row in sql_cursor:
        #print(row) #this is our dictionary created from the sql query result
        key = list(row.items())[0]
        rownumber += 1 #puts our sql responses in the right rows
        sql_result = key[1]
        quality_df.at[rownumber,"SQL_Result"] = sql_result #this will put our results into the proper cell in our dataframe
        edit_qc_sh.Range(f"D{rownumber}").Value = sql_result #this will put our results in the right excel cell
        #print(key[1]) #returns just our result values
        
edit_wb.Save() #saving our values back to excel

Okay, time to do the same thing with our DAX queries and wrap this project up. You may have noticed that the DAX queries in the sample data are not exactly what you would use in your measure. What we are doing is telling DAX how we want the data back; in this case we want it to return one row (hence the EVALUATE ROW at the beginning) based on measures I have in a table called “_Measures”. If you aren’t sure how to make the DAX query work for this API, feel free to put the measure you want to test in a card visual and use the Performance Analyzer to get the query. A good way to see if your DAX query will work is to test it out in DAX Studio.
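If it helps, here is a small, purely illustrative helper that wraps a measure reference in that EVALUATE ROW pattern; the '_Measures' table name is specific to this demo’s model, so swap in your own:

def measure_to_dax_query(measure_name, table_name="_Measures"):
    #EVALUATE ROW returns a single-row table, which is what the Execute Queries API expects here
    alias = measure_name.replace(" ", "_")
    return f'EVALUATE ROW("{alias}", \'{table_name}\'[{measure_name}])'

print(measure_to_dax_query("Sales Amount"))
#EVALUATE ROW("Sales_Amount", '_Measures'[Sales Amount])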

Python time! We are going to tweak our code from earlier by renaming our variable “header” to “pbiheader” so we can reference it more clearly in our JSON request code. Again, feel free to skip to the end of this post for the final code if you have any questions (or just need this to work, fast). First, we need all the pieces to execute our POST request. Once we have the URL, the DAX query, and the JSON body, we can iterate through the list of queries from our Excel file.

#Running queries and putting results back into Excel
qdf = quality_df.dropna() #removing any blank rows so we don't run blank sql queries lol
sqlrownumber = 1
pbirownumber = -1

#Running SQL Queries
sql_queries = qdf.loc[:,"SQL_Query"] 
for query in sql_queries:
    sql_cursor.execute(str(query))
    #print (sql_cursor.fetchall()[0])
    for row in sql_cursor:
        #print(row) #this is our dictionary created from the sql query result
        key = list(row.items())[0] #gets the first item from the row dictionary
        sqlrownumber += 1 #puts our sql responses in the right rows
        sql_result = key[1] #grabs just the result of our query
        quality_df.at[sqlrownumber,"SQL_Result"] = sql_result #this will put our results into the proper cell in our dataframe
        edit_qc_sh.Range(f"D{sqlrownumber}").Value = sql_result #this will put our results in the right excel cell
        #print(key[1]) #returns just our result values
        
#Running PBI Queries
pbi_queries = quality_df.loc[:,"PBI_DAX_Query"]
execute_queries_url = f'https://api.powerbi.com/v1.0/myorg/datasets/{pbidatasetid}/executeQueries'
for items in pbi_queries:
    pbirownumber += 1
    pbiexcelrownumber = pbirownumber+2
    list_pbiquery = list(pbi_queries.items())[pbirownumber]
    #print(list_pbiquery)
    item_pbiquery = list_pbiquery[1]
    dax_query_json_body = {
                "queries":
                        [
                            {
                            "query": item_pbiquery
                            }
                        ]  
                }
    pbi_response = requests.post(url=execute_queries_url, headers=pbiheader,json=dax_query_json_body)
    query_json = pbi_response.text.encode().decode('utf-8-sig') #allows us to read the json response
    pbi_values = query_json.replace("}]}]}]}","") #drops the trailing encoding from json
    #print(query_json)
    pbi_result = pbi_values.split(":")[4] #grabs just the result of our query
    print(pbi_result) #grabs just the result of our query
    quality_df.at[pbiexcelrownumber,"PBI_Result"] = pbi_result #this will put our results into the proper cell in our dataframe
    edit_qc_sh.Range(f"C{pbiexcelrownumber}").Value = pbi_result #this will put our results in the right excel cell

edit_wb.Save() #saving our values back to excel

Pretty great! Now we have a functioning data quality testing tool where Excel holds all our inputs and outputs and all we need to do in Python is hit Run.

Final Script

'''---------------README------------------
Full instructions available: https://dataonwheels.wordpress.com/?s=Power+BI%3A+Data+Quality+Checks+Using+Python+%26+SQL
The goal of this script is to compare DAX measures to corresponding SQL queries.
To accomplish this, it uses an Excel file containing authentication variables and query values.
This idea came from seeing the Execute Queries REST API in Power BI and wanting to use Python to perform our data quality checks efficiently.
To connect to a Power BI data model, we need to pass an authentication token generated through an Azure AD app.
To run a DAX query and SQL query, we need to loop through our Excel file and then pass the queries into the API and SQL calls respectively.
Finally, this script takes those query results and puts them back into the Excel file.

I have left some print() lines commented out; feel free to uncomment them to troubleshoot or walk through the code step-by-step.

You will need to swap out the excel_file variable with your Excel file path. Other than that, the rest of the variables are managed inside your Excel file.
Once you have your variables and queries in Excel, hit run and you're good to go.
'''

#---------------------------------------#
#       Import libraries needed  
#---------------------------------------#
import requests
import adal
import json
import pymssql 
import pandas
import openpyxl
from win32com.client import Dispatch; import os

#---------------------------------------#
#    Build out authentication steps  
#---------------------------------------#

#----- Authentication Variables from Excel -----#

#Grab authentication data from our excel file
from openpyxl import load_workbook
#we need the r at the beginning since "\" is a special character in python
excel_file = r"C:\Users\KristynaHughes\Downloads\PythonDataQualityChecker.xlsx"
wb = load_workbook(excel_file) 
sheets = wb.sheetnames
#print(sheets) #see the sheets in our workbook
cred_ws = wb['Credentials'] #access specific sheet called Credentials
cred_mapping = {} #this will help us build a dataframe from our table
for entry, data_boundary in cred_ws.tables.items():
    #parse the data within the ref boundary
    data = cred_ws[data_boundary]
    #extract the data 
    #the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent] #each ent is a row of cells in the table
               for ent in data
          ]
    header = content[0] #the column names
    rest = content[1:] #the contents excluding the header row
    cred_df = pandas.DataFrame(rest, columns = header) #this is the dataframe we will use to get our creds
    cred_mapping[entry] = cred_df
#Use the dataframe to set up credential variables we can use later in the script
sqlserver = str(cred_df._get_value(0,"SQL_Server"))
sqldb = str(cred_df._get_value(0,"SQL_Database"))
sqluser = str(cred_df._get_value(0,"SQL_User"))
sqlpassword = str(cred_df._get_value(0,"SQL_Password"))
pbiclientid = str(cred_df._get_value(0,"PBI_ClientID"))
pbiusername = str(cred_df._get_value(0,"PBI_Username"))
pbipassword = str(cred_df._get_value(0,"PBI_Password"))
pbidatasetid = str(cred_df._get_value(0,"PBI_DatasetID"))
#check to make sure your variables are correct by uncommenting the next line
#print(sqlserver,sqldb,sqluser,sqlpassword,pbiclientid,pbiusername,pbipassword,pbidatasetid)


#----- Power BI REST API Authentication -----#

authority_url = 'https://login.windows.net/common'
resource_url = 'https://analysis.windows.net/powerbi/api'
context = adal.AuthenticationContext(authority=authority_url,
                                     validate_authority=True,
                                     api_version=None)
token = context.acquire_token_with_username_password(resource=resource_url,                                                     
                                                     username=pbiusername,
                                                     password=pbipassword,
                                                     client_id=pbiclientid)
access_token = token.get('accessToken')
pbiheader = {'Authorization': f'Bearer {access_token}'}
#try out the get workspaces REST API using our access token by uncommenting the next few lines
#get_workspaces_url = 'https://api.powerbi.com/v1.0/myorg/groups'
#r = requests.get(url=get_workspaces_url, headers=pbiheader)
#r.text will give us the response text for our get request, pretty neat!
#print(r.text)


#----- SQL Server Authentication -----#

try:
    sql_con = pymssql.connect(sqlserver,sqluser,sqlpassword,sqldb)
    sql_cursor = sql_con.cursor(as_dict=True)
except Exception as e:
    raise Exception(e)


#---------------------------------------#
#  Build out data quality check steps 
#---------------------------------------#

#----- Read excel to get quality check queries into a dataframe -----#

quality_ws = wb['Quality Check']
quality_mapping = {} #this will help us build a dataframe from our table
for entry, data_boundary in quality_ws.tables.items(): #grabs data dynamically from our table
    data = quality_ws[data_boundary] #parse the data within the ref boundary
    #the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent] #each ent is a row of cells in the table
               for ent in data
          ]
    header = content[0] #the column names
    rest = content[1:] #the contents excluding the header row
    qualitydf = pandas.DataFrame(rest, columns = header) #this is the dataframe we will use to get our quality check queries
    quality_df = qualitydf.fillna(' ') #replace empty cells (NaN) with a blank space so the table reads cleanly
    quality_mapping[entry] = quality_df
#print(quality_df)


#----- Open excel file in edit mode -----#

xl = Dispatch("Excel.Application") #opens excel for us to edit
xl.Visible = True
edit_wb = xl.Workbooks.Open(excel_file)
edit_qc_sh = edit_wb.Worksheets("Quality Check")

#----- Set variables to use in our iterators -----#

qdf = quality_df.dropna() #removing any blank rows so we don't run blank sql queries lol
sqlrownumber = 1
pbirownumber = -1

#----- Run SQL queries and put results back into excel -----#

sql_queries = qdf.loc[:,"SQL_Query"] 
for query in sql_queries:
    sql_cursor.execute(str(query))
    #print (sql_cursor.fetchall()[0])
    for row in sql_cursor:
        #print(row) #this is our dictionary created from the sql query result
        key = list(row.items())[0] #gets the first item from the row dictionary
        sqlrownumber += 1 #puts our sql responses in the right rows
        sql_result = key[1] #grabs just the result of our query
        quality_df.at[sqlrownumber,"SQL_Result"] = sql_result #this will put our results into the proper cell in our dataframe
        edit_qc_sh.Range(f"D{sqlrownumber}").Value = sql_result #this will put our results in the right excel cell
        #print(key[1]) #returns just our result values

#----- Run PBI DAX queries and put results back into excel -----#

pbi_queries = quality_df.loc[:,"PBI_DAX_Query"]
#print(pbi_queries)
execute_queries_url = f'https://api.powerbi.com/v1.0/myorg/datasets/{pbidatasetid}/executeQueries'
for items in pbi_queries:
    pbirownumber += 1
    pbiexcelrownumber = pbirownumber+2
    list_pbiquery = list(pbi_queries.items())[pbirownumber]
    #print(list_pbiquery)
    item_pbiquery = list_pbiquery[1]
    dax_query_json_body = {
                "queries":
                        [
                            {
                            "query": item_pbiquery
                            }
                        ]  
                }
    pbi_response = requests.post(url=execute_queries_url, headers=pbiheader,json=dax_query_json_body)
    query_json = pbi_response.text.encode().decode('utf-8-sig') #allows us to read the json response
    pbi_values = query_json.replace("}]}]}]}","") #drops the trailing encoding from json
    #print(query_json)
    pbi_result = pbi_values.split(":")[4] #grabs just the result of our query
    #print(pbi_result) #grabs just the result of our query
    quality_df.at[pbiexcelrownumber,"PBI_Result"] = pbi_result #this will put our results into the proper cell in our dataframe
    edit_qc_sh.Range(f"C{pbiexcelrownumber}").Value = pbi_result #this will put our results in the right excel cell

#----- Save our changes back to the excel file -----#

edit_wb.Save() #saving our values back to excel
print("All done")

Additional Resources:

This was my first experience with Python, and below are all the links that helped me build out this demo and learn more about how Python operates. Hope you find these as useful as I did!

Last Non-NULL Date in SQL Server

The simplest of requests are often the most difficult to execute. For example, a finance team needs to know every time a customer did not invoice for 90 days in the past 2 years. The simplicity of the ask is deceiving. Tracking differences across multiple dimensions (customer and invoice date in this case) and accounting for NULL values in the changing dimension (aka when a customer did not invoice on a day) appears to be hopeless without the support of a CRM code change. But have no fear, complicated SQL is here!

Testing Scenario: the business would like you to create a customer attrition report. To do this, you need to find gaps in invoice dates per customer and determine when and how many customers go “inactive” and are “reactivated” in the past two years. A customer is deemed “inactive” whenever more than 90 days have passed since their last invoice. This can occur multiple times in one year, so a customer can be “reactivated” multiple times in one year.

Resources Needed:

  1. SQL Server Access to the needed data elements
    • In this scenario, this consists of invoice date by customer. You can swap this out for any other date range or any other unique ID.
  2. Business logic
    • In this scenario, activations in a year = any time a customer invoices for the first time in a 90-day period. You can swap the customer field for any other dimension such as sales rep, carrier, business segment, etc. You can also swap out invoice date for any date field such as creation date, pickup date, paid date, delivery date, etc.
  3. Start and End dates
  4. Ability to use CTEs/Temp Tables
    • This really comes into play if you are trying to create a Direct Query based report in Power BI or using any other reporting tools that do not allow calling Temp Tables. If you hit this limitation, then you will need to leverage a database/code solution instead of the method below.

Notes:

  • If your SQL Server instance is newer than 2016, then you will not need to use the custom date temp table and can use IGNORE NULLS within the MAX OVER statement (see the alternative line in the final SQL code below).
  • The process below lays out each portion of the final query, but feel free to skip ahead to the end for the final SQL statement if you don’t need each section explained.

Process:

  1. Set up parameters
    • DECLARE @StartDate DATE = '2019-01-01'
      DECLARE @EndDate DATE = GETDATE()
      DECLARE @ActivationRange INT = 90 --notates how many days can be between invoice dates before a customer is deemed "inactive".
  2. Create a date/calendar table. Check with your DBAs first to make sure they haven’t already created something similar that you can use; all you need is a list of sequential calendar dates with no gaps.
    • ;WITH cte AS (
      SELECT @StartDate AS myDate
      UNION ALL
      SELECT DATEADD(day,1,myDate) as myDate
      FROM cte
      WHERE DATEADD(day,1,myDate) <= @EndDate
      )
      SELECT myDate 'CalendarDate'
      INTO #Calendar
      FROM cte
      OPTION (MAXRECURSION 0) --this works around the default 100-level recursion limit
  3. If you need to partition by a dimension other than date, such as customer in this scenario, you will need to create a table to grab that dimension’s values as well. After this, you’ll need to create a bridge table that will have a value for every date in your range and every customer (or other dimension) value as well.
    • --Customer Table
      SELECT DISTINCT
      DA.AccountsKey
      ,DA.CompanyID
      ,DA.CompanyName
      ,MIN(FSF.InvoiceDateKey) 'FirstInvoiceDate'
      INTO #Companies
      FROM DimAccount DA
      JOIN ShipmentFacts FSF ON FSF.AccountKey = DA.AccountsKey
      WHERE FSF.InvoiceDateKey IS NOT NULL
      GROUP BY
      DA.AccountsKey
      ,DA.CompanyID
      ,DA.CompanyName
    • --Bridge Table that combines both Customer and Date values
      SELECT DISTINCT
      C.CalendarDate
      ,Comp.CompanyID
      ,Comp.CompanyName
      ,MIN(Comp.FirstInvoiceDate) 'FirstInvoiceDate'
      ,CONCAT(C.CalendarDate,Comp.CompanyID) 'ID'
      INTO #Bridge
      FROM #Calendar C, #Companies Comp
      GROUP BY
      C.CalendarDate
      ,Comp.CompanyID
      ,Comp.CompanyName
      ,CONCAT(C.CalendarDate,Comp.CompanyID)
  4. Next, we need to create our unique IDs that combine all the dimensions we are hoping to account for in our “IGNORE NULLS” scenario. In this test case, we need to create one ID that grabs the actual dates a customer invoiced on and another for all the dates in our range that a customer could have possibly invoiced on. Then, we join the two together to grab the last time a customer invoiced and ignore those pesky NULL values. This is the section where having SQL Server 2016 or later will do you a lot of favors (see the code below).
    • --Actual Invoiced Dates by Customer
      SELECT DISTINCT
      FSF.InvoiceDateKey
      ,DA.CompanyName
      ,DA.CompanyID
      ,CONCAT(FSF.InvoiceDateKey,DA.CompanyId) 'ID'
      INTO #ShipmentData
      FROM ShipmentFacts FSF
      JOIN #Companies DA ON DA.AccountsKey = FSF.AccountKey
      WHERE FSF.InvoiceDateKey BETWEEN @StartDate AND @EndDate
    • --Joining together and filling in the NULLs with the previous invoice date by customer
      SELECT DISTINCT
      C.ID
      ,S.ID 'ShipData'
      ,CAST( SUBSTRING( MAX( CAST (C.ID AS BINARY(4)) + CAST(S.ID AS BINARY(20))) OVER (PARTITION BY C.CompanyID ORDER BY C.ID ROWS UNBOUNDED PRECEDING),5,20) AS varchar) 'PreviousInvoiceDateKey'
      --ALTERNATIVE FOR POST SQL Server 2012--
      --,CAST( SUBSTRING( MAX( CAST (C.ID AS BINARY(4)) + CAST(S.ID AS BINARY(20))) IGNORE NULLS OVER (PARTITION BY C.CompanyID ORDER BY C.ID ROWS UNBOUNDED PRECEDING),5,20) AS varchar) 'PreviousInvoiceDateKey'

      INTO #RunningDates
      FROM #Bridge C
      LEFT JOIN #ShipmentData S ON S.ID = C.ID
  5. The rest of the code is based on business logic, so please use at your discretion and edit for your own needs.

Full SQL Code:

DECLARE @StartDate DATE = '2019-01-01'
DECLARE @EndDate DATE = GETDATE()
DECLARE @ActivationRange INT = 90 --notates how many days can be between invoice dates before a customer is deemed "inactive"
;WITH cte AS (
SELECT @StartDate AS myDate
UNION ALL
SELECT DATEADD(day,1,myDate) as myDate
FROM cte
WHERE DATEADD(day,1,myDate) <= @EndDate
)
SELECT myDate 'CalendarDate'
INTO #Calendar
FROM cte
OPTION (MAXRECURSION 0)


SELECT DISTINCT
DA.AccountsKey
,DA.CompanyID
,DA.CompanyName
,MIN(FSF.InvoiceDateKey) 'FirstInvoiceDate'
INTO #Companies
FROM DimAccount DA
JOIN ShipmentFacts FSF ON FSF.AccountKey = DA.AccountsKey
WHERE FSF.InvoiceDateKey >= '2000-01-01'
GROUP BY
DA.AccountsKey
,DA.CompanyID
,DA.CompanyName


SELECT DISTINCT
C.CalendarDate
,Comp.CompanyID
,Comp.CompanyName
,MIN(Comp.FirstInvoiceDate) 'FirstInvoiceDate'
,CONCAT(C.CalendarDate,Comp.CompanyID) 'ID'
INTO #Bridge
FROM #Calendar C, #Companies Comp
GROUP BY
C.CalendarDate
,Comp.CompanyID
,Comp.CompanyName
,CONCAT(C.CalendarDate,Comp.CompanyID)

SELECT DISTINCT
FSF.InvoiceDateKey
,DA.CompanyName
,DA.CompanyID
,CONCAT(FSF.InvoiceDateKey,DA.CompanyId) 'ID'
INTO #ShipmentData
FROM ShipmentFacts FSF
JOIN #Companies DA ON DA.AccountsKey = FSF.AccountKey
WHERE FSF.InvoiceDateKey BETWEEN @StartDate AND @EndDate

SELECT DISTINCT
C.ID
,S.ID 'ShipData'
,CAST( SUBSTRING( MAX( CAST (C.ID AS BINARY(4)) + CAST(S.ID AS BINARY(20))) OVER (PARTITION BY C.CompanyID ORDER BY C.ID ROWS UNBOUNDED PRECEDING),5,20) AS varchar) 'PreviousInvoiceDateKey'
--ALTERNATIVE FOR POST SQL Server 2012--
--,CAST( SUBSTRING( MAX( CAST (C.ID AS BINARY(4)) + CAST(S.ID AS BINARY(20))) IGNORE NULLS OVER (PARTITION BY C.CompanyID ORDER BY C.ID ROWS UNBOUNDED PRECEDING),5,20) AS varchar) 'PreviousInvoiceDateKey'
INTO #RunningDates
FROM #Bridge C
LEFT JOIN #ShipmentData S ON S.ID = C.ID


SELECT DISTINCT
R.ID
,R.ShipData
,R.PreviousInvoiceDateKey
,LEFT(R.PreviousInvoiceDateKey,10) 'PreviousInvoiceDate'
,LEFT(R.ID,10) 'DateKey'
,RIGHT(R.ID,5) 'CompanyId'
,B.FirstInvoiceDate
INTO #ActivationData
FROM #RunningDates R
LEFT JOIN #Bridge B ON B.ID = R.ID

SELECT DISTINCT
A.ID
,A.DateKey
,A.CompanyId
,A.PreviousInvoiceDate
,YEAR(A.DateKey) 'Year'
,YEAR(A.FirstInvoiceDate) 'InitialActivationYear'
,CASE WHEN YEAR(A.DateKey) = YEAR(A.FirstInvoiceDate) THEN 1 ELSE 0 END 'IsActivationYear'
,DATEDIFF(Day,A.PreviousInvoiceDate,A.DateKey) 'DaysSinceInvoice'
,CASE WHEN DATEDIFF(Day,A.PreviousInvoiceDate,A.DateKey) = @ActivationRange THEN 1 ELSE 0 END 'IsInactive'
,CASE WHEN DATEDIFF(Day,A.PreviousInvoiceDate,A.DateKey) = @ActivationRange THEN A.DateKey ELSE NULL END 'InactiveDate'
INTO #ActivationDetails
FROM #ActivationData A

SELECT DISTINCT
D.Year
,D.CompanyId
,SUM(D.IsInactive) 'InactivatedPeriods'
,MAX(D.IsActivationYear) 'IsFirstActivationYear'
,MAX(D.DaysSinceInvoice) 'BiggestGapInInvoicing (Days)'
,MAX(D.InactiveDate) 'LastInactiveDate'
,MAX(D.PreviousInvoiceDate) 'LastInvoiceDate'
,CASE WHEN MAX(D.InactiveDate) > MAX(D.PreviousInvoiceDate) THEN -1 ELSE 0 END 'NotActiveAtEndOfYear'

--to grab the activations per customer per year follow equation below
-- Activations = [InactivatedPeriods] + [NotActiveAtEndOfYear] + [IsFirstActivationYear] --this part will be done in Power BI
FROM #ActivationDetails D
GROUP BY
D.Year
,D.CompanyId


DROP TABLE #Calendar
DROP TABLE #Companies
DROP TABLE #Bridge
DROP TABLE #ShipmentData
DROP TABLE #RunningDates
DROP TABLE #ActivationData
DROP TABLE #ActivationDetails

Additional Resource:

I Wrote a Book – Hands-On SQL Server 2019 Analysis Services

While not the first time I have authored, this is the first book that I wrote as the sole author. Analysis Services is the product I built my business intelligence career on, and I was happy to take on the project when I was approached by Packt.

I think one of my favorite questions is how much research time I put in for this book. The right answer is almost 20 years. I started working with Analysis Services when it was called OLAP Services, and that was a long time ago. Until Power Pivot for Excel and tabular model technology were added to the mix, I worked in the multidimensional model. I was one of the few, or so it seems, who enjoyed working in the multidimensional database world, including working with MDX (multidimensional expressions). However, I was very aware that tabular models with the VertiPaq engine were the model of the future. Analysis Services has continued to be a significant part of the BI landscape, and this book gives you the opportunity to try it out for yourself.

This book is designed for those who have been involved in business intelligence work but have been working mostly in self-service or end-user tools. Now you are ready to take your model to the next level, and that is where Analysis Services comes into play. As part of Packt’s Hands-On series, I focused on getting going with Analysis Services from install to reporting. Microsoft has developer editions of the software which allow you to do a complete walkthrough of everything in the book in a step-by-step fashion. You will start the process by getting the tools installed, downloading sample data, and building out a multidimensional model. Once you have that model built out, we build a similar model using tabular model technology. We follow that up by building reports and visualizations in both Excel and Power BI. No journey is complete without working through security and administration basics. If you want to learn by doing, this is the book for you.

If you are interested in getting the book, you can order it from Amazon or Packt. From November 20, 2020 through December 20, 2020, you can get a 25% discount using this code – 25STEVEN – or by using this link directly.

I want to thank the technical editors that worked with me to make sure the content and the steps worked as expected – Alan Faulkner, Dan English, and Manikandan Kurup. Their attention to detail raised the quality of the book significantly and was greatly appreciated.

I have to also thank Tazeen Shaikh who was a great content editor to work with. When she joined the project, my confidence in the quality of the final product increased as well. She helped me sort out some of the formatting nuances and coordinated the needed changes to the book. Her work on the book with me was greatly appreciated. Finally, many thanks to Kirti Pisat who kept me on track in spite of COVID impacts throughout the writing of the book this year.

I hope you enjoy the book!

Starting and Stopping SQL Server with PowerShell

Have you ever had this issue? You are trying to explore new features of SQL Server or you want to install the latest version for some testing, but SQL Server is a resource hog and kills the performance of your PC. I have been evaluating SQL Server 2019, including both flavors of Analysis Services. That means I am installing the Developer edition of SQL Server on my laptop (a Yoga 900) running Windows 10.


I have dealt with this issue in the past by opening the Services window, setting the Startup Type to Manual, and then turning the services on when I wanted to work with them. In my current use case, I did not want to manage that with a lot of clicks. So, my thought was this should be easy with PowerShell. Of course, nothing is easy when it is not something you do every day. So I started digging.

I found it is definitely possible, but you need to know the steps that matter, including making sure that PowerShell will execute the scripts on your PC. So here are the steps and code.

Empowering Your PC to Run Your PowerShell

The first issue is that PowerShell will not execute scripts on your PC until you allow it. You will need to open PowerShell as an Administrator. You can leave the window open once we are done here. The following command will allow scripts created on your PC to be executed on your PC.

Set-ExecutionPolicy RemoteSigned

Shutting Down SQL Server with PowerShell

I chose to create two PowerShell files that allow me to execute these steps for all the services I am managing. In my example, I have three SQL Server instances: Database Engine Services, Analysis Services in multidimensional mode, and Analysis Services in tabular mode.

The code required to shut down SQL Server is noted below.

Set-Service 'MSSQL$DOWSQL2019' -StartupType Disabled
Stop-Service -Name 'MSSQL$DOWSQL2019' -Force

Let’s break it down. The first line sets the StartupType to Disabled. This will prevent the service from restarting on a reboot; you will need to be intentional about restarting it. The second line is the command to stop the service. The “Force” flag will shut down dependent services, like SQL Agent, if you have them running.

You will need to know the service name. SQL Server Database Engine services are typically named “MSSQL$” followed by the instance name. If you are using Analysis Services, the naming uses “MSOLAP$” as the prefix.

You can run these scripts directly in your PowerShell window; I did this while testing them. I wrote them in Visual Studio Code but had issues executing them from there. You may be fine – I just wanted to share my experience. Once I had my three instances scripted, I saved them in a file called SQLServerOff.ps1.

Turning Them on with PowerShell

The process and the code are similar. You first need to enable the services and then start them up.

Set-Service 'MSSQL$DOWSQL2019' -StartupType Manual
Start-Service -Name 'MSSQL$DOWSQL2019' 

I chose the startup type “Manual” so I still need to turn them on myself. It is possible to set it to Automatic if you want the services to start on a reboot, for instance. I then saved these commands in a file called SQLServerOn.ps1.

Running the PS1 Files

There are a couple of options for executing your file. While I was testing, I used the PowerShell window I had open. To execute your file, replace “YOUR PATH” with the full path to the script.

PS C:\WINDOWS\system32> & "YOUR PATH\SQLServicesOn.ps1"

While that is a good pattern, I was still looking for a “one-click” solution. Some more digging led me to a way to execute a PowerShell file from a shortcut. I did this by going to the desktop, right-clicking an empty space, and creating a new, blank shortcut. Here is the command you can use for the shortcut:

powershell.exe -noexit -File "YOUR PATH\SQLServicesOn.ps1"

Add the snippet above, with the proper path to your script, to the shortcut location. Once you have named your shortcut and saved it, open the properties on the shortcut. Go to the advanced settings and set it to run as Administrator (this will not work without that setting turned on). You can now add the shortcut to a place convenient for you, such as your taskbar.

While I focused on SQL Server services, this will work for other services as well. Enjoy!

Cosmos DB for the Data Professional

Cosmos DB is one of the fastest growing Azure services in 2018. As its popularity grows, data professionals are faced with a changing reality in the world of data. Data is no longer contained in relational databases as a general rule. We saw the start of this with Hadoop data storage, but no one ever referred to Hadoop as a database. Sure, Hive and other Hadoop-based technologies made the data look like a database, but we (data professionals) were able to keep our distance. What’s changed?

The Cloud, Data, and Databases

As the cloud reaches more and more businesses, traditional data stores are being reconsidered. We now have data stored in Azure – Azure Data Lake, Azure Storage, Azure Database Services (SQL, PostgreSQL, MySQL), Azure Data Warehouse, and now Cosmos DB. Cosmos DB is the globalized version of Azure DocumentDB (more about that later). If we are to grow our skill sets and careers as cloud data professionals, we need to know more about the other ways data is stored and used. I want to summarize some things that we need to be aware of with Cosmos DB. If your business uses it or plans to, and you are a data pro, you will need to know this.

Introducing Cosmos DB

Azure Cosmos DB is Microsoft’s globally distributed, multi-model database.


Source: https://docs.microsoft.com/en-us/azure/cosmos-db/introduction 

I will break down key components of Cosmos DB with a data professional in mind. There are a lot of aspects of Cosmos DB that make it very cool, but you will want to understand this when you get the call to fix the database.

Multi-model Database Service

Currently Cosmos DB supports four database models. This is like having four different database servers in one. I liken it to having SQL Server Database Engine and SQL Server Analysis Services using the same underlying engine where it only “looks different.” Cosmos DB refers to these as APIs. The API is chosen when the database is created, which optimizes the portal and database for use with that API. Other APIs can be used to query the data, but it is not optimal. Here are the four models supported and the APIs that support them.

Cosmos DB models

  • Key Value Pair: This is exactly as it sounds. The API is implemented with the Azure Table Storage APIs.
  • Wide Column or Column Family: This stores data similar to relational, but there is no row consistency (each row can look different). Cosmos DB uses the Cassandra API to support this model. (For more information on Cassandra click here.)
  • Documents: This model is based on JSON document storage. Cosmos DB currently supports two APIs for this model: SQL, which is the DocumentDB API, and MongoDB. These are the most common models used in Cosmos DB today. DocumentDB is the “parent” of Cosmos DB, which was rebranded.
  • Graph: Graph databases are used to map relationships in data and were made popular with Facebook for instance. Microsoft uses the open source Gremlin API to support the Graph Database Model.

None of these databases are traditional row/column stores. They are all variations of NoSQL databases.

Turnkey Global Distribution

This is a key attribute for Cosmos DB. Cosmos DB can be easily distributed around the world. Click the data center you want to replicate to and Cosmos DB takes care of the rest. Cosmos DB uses a single write node and multiple read nodes. However, because Cosmos DB was built with global distribution in mind, you can easily and safely move the write node as well. This allows you to “chase the sun” and keep write operations happening “locally”.

Data Consistency

Data consistency is a primary concern of any data professional. The following tables compare Cosmos DB Consistency Levels with SQL Server Isolation Levels. These are not a one for one match, but demonstrate the different concerns between the systems.

 

Cosmos DB Consistency Level | Guarantees | SQL Server Isolation Level | Dirty Read | Non-repeatable Read | Phantom
Strong | Reads are guaranteed to return the most recent version of an item. | Serializable | No | No | No
Bounded Staleness | Consistent prefix or read order. Reads lag behind writes by prefixes (K versions) or a time (t) interval. | Snapshot | No | No | No
Session | Consistent prefix. Monotonic reads, monotonic writes, read-your-writes, write-follows-reads. | Repeatable Read | No | No | Yes
Consistent Prefix | Updates returned are some prefix of all the updates, with no gaps. Reads are not out of order. | Read Committed | No | Yes | Yes
Eventual | Out of order reads. | Read Uncommitted | Yes | Yes | Yes

As you can see, there are some similarities. These options are important to understand. In Cosmos DB, the more consistent you need the data, the higher the latency in the distributed data. As a result, most Cosmos DB solutions start with Session consistency, as it gives a good, consistent user experience while reducing latency in the read replicas.

Throughput

I am not going to dig into this much, but you need to understand that Request Units (RUs) are used to guarantee throughput in Cosmos DB. As a baseline, Microsoft recommends thinking that a 1 KB JSON file will require 1 RU. The capacity is reserved for each second, and you pay for what you reserve, not what you use. If you exceed capacity in a given second, your request will be throttled. RUs are provisioned by region and can vary by region as a result, but they are not shared between regions. This will require you to understand usage patterns in each region where you have a replica.
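As a rough back-of-the-envelope illustration of how that baseline plays out, here is a sketch in Python. The workload numbers and the ~5 RU per 1 KB write are assumptions for the example, not published figures to rely on.

#hypothetical workload for one region
point_reads_per_second = 500   #~1 KB documents, ~1 RU each per the baseline above
writes_per_second = 100        #writes cost more; assume ~5 RU per 1 KB write for this sketch

reserved_rus = point_reads_per_second * 1 + writes_per_second * 5
print(f"Reserve at least {reserved_rus} RU/s in this region") #1,000 RU/s in this example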

Scaling and Partitions

Within Cosmos DB, partitions are used to distribute your data for optimal read and write operations. It is recommended to create a granular key with highly distinct values. The partitions are managed for you: Cosmos DB will split or merge partitions to keep the data properly distributed. Keep in mind that your key needs to support distributed writes and distributed reads.
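One common way to get a granular, highly distinct key is a synthetic key that combines properties of the document. This is only an illustrative sketch; the property names are made up.

#combine two properties into one high-cardinality partition key value
def synthetic_partition_key(customer_id, order_month):
    return f"{customer_id}_{order_month}"

print(synthetic_partition_key("contoso", "2018-04")) #contoso_2018-04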

Indexing

By default, everything is indexed. It is possible to use index policies to influence the index operations. Index policies are modified for storage, write performance, and read or query performance. You need to understand your data very well to make these adjustments. You can include or exclude documents or paths, configure the index type, and configure the index update mode. You do not have the same level of flexibility in indexes as you would find in traditional relational database solutions.

Security

Cosmos DB is an Azure data storage solution which means that the data at rest is encrypted by default and data is encrypted in transit. If you need RBAC, Azure Active Directory (AAD) is supported in Cosmos DB.

SLAs

I think that the SLAs Microsoft provides with Cosmos DB are a key differentiator for them. Here is the short summary of guarantees Microsoft provides:

  • Latency: 99.99% of P99 Latency Attainment (based on hours over the guarantee)
    • Reads under 10 ms
    • Writes under 15 ms
  • Availability
    • All up – 99.99% by month
    • Read – 99.999% by month
  • Throughput – 99.99% based on reserved RUs (number of failures to meet reserved amount)
  • Consistency – 99.99% based on setting

These are financially backed SLAs from Microsoft. Imagine providing these SLAs for your own databases. This is very impressive.

Wrap Up

For more information, check out Microsoft’s online documentation on Cosmos DB.

I presented this material at the April 2018 PASS MN User Group Meeting. The presentation can be found here.