SQL Saturday Atlanta 2024

Hey there happy coders! Last weekend I had the pleasure of speaking at the SQL Saturday Atlanta event in Georgia! It was an awesome time of seeing data friends and getting to make some new friends. If you live near a SQL Saturday event and are looking for a great way to learn new things, I can’t recommend SQL Saturdays enough. They are free and an excellent way to meet people who can help you face challenges in a new way. Below are my notes from the various sessions I attended as well as the materials from my own session. Enjoy!

Link to the event – https://sqlsaturday.com/2024-04-20-sqlsaturday1072/#schedule

My session – Real-Time Analytics in Fabric

Thank you so much to everyone who came out to see my first ever session on Real-Time Analytics in Fabric! We had a couple of glitches in the Logic App creation, but had an excellent time troubleshooting together as a group. Please check out my GitHub for the slide deck as well as all the code used in the demonstration. Happy coding!

Link to my GitHub – https://github.com/Anytsirk12/DataOnWheels/tree/main/Real-Time%20Analytics%20in%20Fabric

Introduction to SQL Server Essential Concepts by Bradley Ball

Goal is to understand the core things about databases to enable deeper understanding.

ACID = atomicity, consistency, isolation, and durability.
Atomicity = either all of it commits or none of it commits. Consistency = my data must be in a consistent state before a transaction completes. Isolation = a transaction must operate independently from other transactions. Durability = once a transaction commits, it stays committed, which is handled by the complex logging that is the transaction log. The log has all the commit history and ensures accurate data.

Transaction isolation levels – serializable, read committed (SQL Server default), read uncommitted, repeatable read, snapshot isolation. You can set some of these at the db level. Serializable = blocks anything trying to get data on the same page; you can set this at the transaction level. Read committed = I can’t read data that another transaction is currently changing. Read uncommitted = a dirty read, grabs data that isn’t committed. Repeatable read = nobody uses this lol. It’s a shared lock that holds for a longer period of time than the typical microsecond. Shared lock means everyone can read the data. Snapshot isolation = Oracle and PostgreSQL use this. Everything is an append and update. If I have 4 transactions, 1 update and 3 reads, usually the update would block the reads, but this redirects readers to a copy of the data in TempDB as of the point in time they requested to read it (aka before the update commits).

DMV = dynamic management view. Bradley used one that allows you to see active sessions in your database. We can see that a read query is blocked by an uncommitted transaction. We can see the wait_type = LCK_M_S and the blocking_session_id which is our uncommitted transaction. To get around this, he can run the read script and get a dirty read by setting the isolation level to read uncommitted. To unblock the original request, he can use ROLLBACK TRANSACTION to allow it to unlock that page of data.
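Here is a rough sketch of that blocking demo, rebuilt from my notes rather than copied from Bradley's scripts. The table dbo.Customers and its columns are hypothetical stand-ins, and it assumes the database uses the default locking READ COMMITTED behavior (not read committed snapshot). Run each part in its own query window.

```sql
-- Window 1: open a transaction and leave it uncommitted
-- (dbo.Customers, City, and CustomerID are hypothetical names)
BEGIN TRANSACTION;
UPDATE dbo.Customers
SET City = 'Atlanta'
WHERE CustomerID = 1;

-- Window 2: under READ COMMITTED this read waits on the lock held by window 1
SELECT CustomerID, City
FROM dbo.Customers
WHERE CustomerID = 1;

-- Window 3: use a DMV to find blocked requests and the session blocking them
SELECT r.session_id,
       r.status,
       r.wait_type,            -- expect LCK_M_S for the blocked read
       r.wait_time,
       r.blocking_session_id   -- the session holding the uncommitted transaction
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0;

-- Window 2 (alternative): accept a dirty read instead of waiting
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT CustomerID, City
FROM dbo.Customers
WHERE CustomerID = 1;

-- Window 1: release the lock so the original blocked read can finish
ROLLBACK TRANSACTION;
```

The dirty read returns the uncommitted value, so after the rollback that value never officially existed. That is the trade-off to keep in mind before reaching for READ UNCOMMITTED.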

How does SQL Server work on the inside? We have a relational engine and a storage engine. The user interacts with the SNI (SQL Server Network Interface), which translates the request to the relational engine. User > SNI > Relational Engine [command parser > optimizer (if not in the plan cache, otherwise goes straight to storage engine) > query executor] > Storage Engine [access methods (knows where all data is) > buffer manager (checks the data cache, but if not found then goes to the disk and pulls that into the buffer pool data cache)]. This gets extremely complicated for other processes like in-memory OLTP. The SQL OS is what orchestrates all these items.

SQL OS – pre-emptive scheduling (operating system) & cooperative, non-preemptive scheduling (SQL OS; accumulates wait stats to identify why something is running slower).

Locks, latches, waits. Locks are like a stop light (row, page, and table escalation). If you lock a row, it will lock a page. Latches are who watches the locks/watchmen. It’s a lock for locks. Waits are cooperative scheduling. If a query takes too long, it will give up its place in line willingly. That creates a signal wait which signals there’s too much lined up.

SQL data hierarchy. Records are a row. Records are on a data page (8 KB). Extents are 8 pages (64 KB). It’s faster to read extents than pages. Allocation bitmaps are 1s and 0s that signify data on a data page, which enables even faster data reads – a single 8 KB allocation page can govern roughly 4 GB of data. IAM chains and allocation units allow quick navigation of pages. Every table is divided into in-row data, row overflow data (rows larger than 8,060 bytes), and LOB data (large objects like VARCHAR(MAX) and images).

Allocation units are made of 3 types:
1. IN_ROW_DATA (also known as HoBTs or Heap or B-Trees)
2. LOB_DATA (also known as LOBs or large object data)
3. ROW_OVERFLOW_DATA (also known as SLOBs, small large object data)

Heaps vs Tables. Oracle stores data as a heap which is super fast to insert. In SQL, these have bad performance due to clustered indexes and inserting new data. This is very situational. A table is either heap or clustered index, cannot be both. But heaps can have non-clustered indexes.

B-Tree allows you to get to the record with fewer reads by following a logic tree (think h is before j so we don’t need to read records after j). Heaps create a 100% table scan without a clustered index. In the demo, adding the clustered index dropped that significantly to only 1 read instead of the 8000 reads.

Recovery models – full, bulk logged, simple (on-prem). In the cloud everything is full by default. Full means everything is backed up. Bulk means you can’t recover the data but you can rerun the input process. Simple means you can get a snapshot but you can’t do any point in time restore. This will largely be determined by any SLAs you have.

Transaction log. This will constantly be overwritten. Your log should be at least 2.5 times as large as your largest cluster. DBCC SQLPERF(logspace) will get you the log size and space used for the various dbs. Selecting from the log is never recommended since it creates a lock and logs are always running, so don’t do this in prod lol. Rebuilding indexes will grow your transaction log massively. To free up space in the transaction log, you have to run a backup log operation, which is why those are super important.
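For reference, a minimal sketch of checking log usage and then backing up the log. [YourDatabase] and the backup path are hypothetical placeholders, and the log backup assumes the database is in the full or bulk logged recovery model with a full backup already taken.

```sql
-- Log size and percent of log space used for every database on the instance
DBCC SQLPERF(LOGSPACE);

-- Recovery model per database, plus what (if anything) is preventing log reuse
SELECT name,
       recovery_model_desc,
       log_reuse_wait_desc
FROM sys.databases;

-- Back up the log so the inactive portion can be reused
-- ([YourDatabase] and the file path are hypothetical placeholders)
BACKUP LOG [YourDatabase]
TO DISK = N'D:\Backups\YourDatabase_log.trn';
```

If log_reuse_wait_desc shows LOG_BACKUP, that is usually the hint that a log backup is what the database is waiting on.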

Fun tip, when creating a table you can put DEFAULT 'some value' at the end of a column definition to give the column a default value if one is not provided. Pretty cool.
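A quick sketch of that tip using a throwaway temp table. The table and column names are made up for illustration.

```sql
-- #Orders and its columns are hypothetical names for this demo
CREATE TABLE #Orders (
    OrderID     INT         NOT NULL,
    OrderStatus VARCHAR(20) NOT NULL DEFAULT 'Pending'  -- used when no value is supplied
);

-- OrderStatus is omitted here, so the default is applied
INSERT INTO #Orders (OrderID) VALUES (1);

-- An explicit value still wins over the default
INSERT INTO #Orders (OrderID, OrderStatus) VALUES (2, 'Shipped');

SELECT OrderID, OrderStatus
FROM #Orders
ORDER BY OrderID;

DROP TABLE #Orders;
```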

You can use file group or piecemeal restores to restore hot data much faster, then go back and restore older, cold data afterward. To restore, you must have zero locks on the db. While restoring, the database is not online. Note, if you do a file group restore, you cannot query data that is in an unrestored file group, so queries like SELECT * will not work.

Tales from the field has a ton of YouTube videos on these subjects as well.

Lessons Learned using Fabric Data Factory dataflow by Belinda Allen

What are dataflows? Dataflows are a low-code interface tool for ingesting data from hundreds of data sources, transforming your data using 300+ data transformations. The goal is to allow for more people to manipulate and use data within your organization. At the heart, it’s Power Query.

Why start with dataflows as a citizen developer? It’s Power Query and you know that. It’s low-code data transformation. Excellent for migrating Power BI reports to Fabric.

Lots of great discussion about when it makes sense to use a dataflow gen2.

You can copy and paste Power Query from Power BI by going into the advanced editor OR you can hold shift and select all the queries you want, then ctrl c, then go to Power Query for a dataflow gen2 in the online service and hit ctrl v, and it will populate with all your tables! Pretty neat. You can also make your relationships within the online portal.

DBA Back to Basics: Getting Started with Performance Tuning by John Sterrett

For the code visit: https://johnsterrett.com/blog-series/sql-server-performance-root-cause-analysis/

Goal of today’s session – arm anyone who is new to performance tuning with processes and skills to solve common problems.

Basic query runtime. SQL has something called wait stats that tells you what caused the query to be slow. When you run a massive query, it will go into a suspended state while it waits on a resource, such as reading from disk instead of from the memory cache (T1). After that, you’re in a runnable state (T2). Finally, you get to run it (T3).

Basic bottlenecks = memory, disk, CPU, network, locking blocking & deadlocks. Adding memory is typically the fastest way to improve performance.

Identify performance problems happening right now:

EXEC sp_whoisactive. This is an open source script that gives you insight into who’s running what right now. You can get this from https://whoisactive.com. The cool thing about this is there are more ways to run it than just EXEC sp_whoisactive. Identify what’s consuming the most CPU from the CPU column. There are also some parameters you can use like @sort_order: EXEC sp_whoIsActive @sort_order = '[CPU] DESC', @get_task_info = 2. The task info parameter will give more information in a wait_info column. The best command is EXEC sp_whoIsActive @help = 1. This provides ALL the documentation on what it does. Adam (the creator) also has a 30 day blog series on everything it can do for you! One option to make things run faster is to kill the process causing the issue lol.

How to handle blocking. You can do explicit transactions with BEGIN TRANSACTION which will lock the table. At the end, you need to either COMMIT or ROLLBACK or else that lock holds. SQL uses pessimistic locking as default so it won’t let you read data that’s locked – it will simply wait and spin until that lock is removed. You can use exec sp_whoisactive @get_plans = 1 to get the execution plan. Be careful, the wait_info can be deceptive since the thing that takes the most time may not be the problem. It may be blocked by something else, so check the blocking_session_id to be sure. Also check the status and open_tran_count to see if something is sleeping and not committed. Keep in mind that the sql_text will only show you the last thing that ran in that session. So if you run a select in the same session (query window) as the original update script, it won’t be blocked and can run, and THAT query will show up in who is active, which can be super confusing. To resolve this issue, you can use ROLLBACK in that session to drop that UPDATE statement.

To find blocking queries use EXEC sp_whoIsActive @find_block_leaders = 1, @sort_order = '[blocked_session_count] DESC'.
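For easy copy and paste, here are the sp_WhoIsActive calls from this section in one place. This assumes you have already installed the procedure from https://whoisactive.com on the instance. Run one call at a time.

```sql
-- Full built-in documentation for every parameter and output column
EXEC sp_WhoIsActive @help = 1;

-- Who is running what right now, heaviest CPU consumers first,
-- with extra task-level detail in the wait_info column
EXEC sp_WhoIsActive
     @sort_order = '[CPU] DESC',
     @get_task_info = 2;

-- Include the execution plan for each active request
EXEC sp_WhoIsActive @get_plans = 1;

-- Surface the sessions at the head of blocking chains
EXEC sp_WhoIsActive
     @find_block_leaders = 1,
     @sort_order = '[blocked_session_count] DESC';
```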

Identifying top offenders over the long term:

There’s a feature in SQL 2016 forward called Query Store which persists data for you even after you restart the server. It’s essentially a black box for SQL. Query Store is on by default in SQL 2022 and online servers. It’s available for express edition as well. Be sure to triple check this is on, because if you migrated servers it will keep the original settings from the old server. If you right click on the DB, you can navigate to query store and turn it on by setting Operation Mode (Requested) to Read write. Out of the box is pretty good, but you can adjust how often it refreshes and how much history it keeps. To see if it’s enabled, you should see Query Store as a folder under the db in SSMS.
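If you would rather script it than click through SSMS, something like the sketch below should do the same thing. [YourDatabase] is a placeholder for your own database name.

```sql
-- Turn Query Store on and set the requested operation mode to read write
-- ([YourDatabase] is a hypothetical placeholder)
ALTER DATABASE [YourDatabase]
SET QUERY_STORE = ON (OPERATION_MODE = READ_WRITE);

-- Triple check the setting from inside the database itself
USE [YourDatabase];

SELECT desired_state_desc,
       actual_state_desc,          -- should read READ_WRITE
       current_storage_size_mb,
       max_storage_size_mb,
       stale_query_threshold_days  -- how much history is kept
FROM sys.database_query_store_options;
```

Comparing desired_state_desc to actual_state_desc is a handy check after a migration, since Query Store can drop to read only on its own when it runs out of space.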

Under query store, you can select Top Resource Consuming queries. There’s lots of configuration options including time interval and what metric. SQL Server 2017 and newer have a Query Wait Statistics report as well to see what was causing pain. It’ll show you what queries were running in the blocking session. You won’t get who ran the query from query store, but you can write sp_whoisactive to a table that automatically loops (very cool). This will have overhead on top of your db, so be mindful of that.
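The Top Resource Consuming Queries report can also be approximated straight from the Query Store catalog views. This is my own sketch and not a query from the session; it ranks by total CPU over all retained history and assumes Query Store is enabled on the current database.

```sql
-- Top 10 queries by total CPU time across all retained Query Store history
SELECT TOP (10)
       q.query_id,
       qt.query_sql_text,
       SUM(rs.count_executions)                    AS total_executions,
       SUM(rs.avg_cpu_time * rs.count_executions)  AS total_cpu_time_us  -- microseconds
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q
     ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p
     ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs
     ON rs.plan_id = p.plan_id
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_cpu_time_us DESC;
```

Swapping avg_cpu_time for avg_duration or avg_logical_io_reads gives the same ranking by a different metric, much like the metric dropdown in the report.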

Intro to execution plans:

Keep in mind, SQL’s goal is to get you a “good enough” plan, not necessarily the best plan. Follow the thick lines. That’s where things are happening. Cost will tell you the percentage of the total time taken.

Key lookups. It’s a fancy way to say you have an index, so we can skip the table and go straight to the data you have indexed. BUT if there’s a nested loop, then there are additional columns in the select statement that aren’t in the index, so it’s doing that key lookup for every value. More indexes can make your select statements worse if the optimizer uses an index that isn’t best for your query.

Index tuning process.
1. Identify tables in query
2. Identify columns being selected
3. Identify filters (JOIN and WHERE)
4. Find total rows for each table in the query
5. Find selectivity (rows with filter/table rows)
6. Enable statistics io, time, and the actual execution plan
7. Run the query and document your findings
8. Review existing indexes for filters and columns selected
9. Add index for lowest selectivity adding the selected columns as included columns
10. Run the query again and document findings
11. Compare findings with baseline (step 7)
12. Repeat last 5 steps as needed
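Steps 4 and 5 can be done in one pass. The sketch below is my own illustration; dbo.FactOnlineSales, StoreKey, and the filter value 199 are assumed names and values, so swap in the table and filter from your own query.

```sql
-- Total rows, rows matching the filter, and the resulting selectivity
-- (dbo.FactOnlineSales, StoreKey, and 199 are hypothetical)
SELECT COUNT(*) AS TotalRows,
       SUM(CASE WHEN StoreKey = 199 THEN 1 ELSE 0 END) AS FilteredRows,
       CAST(SUM(CASE WHEN StoreKey = 199 THEN 1 ELSE 0 END) AS DECIMAL(18, 6))
           / NULLIF(COUNT(*), 0) AS Selectivity  -- NULLIF avoids divide by zero on an empty table
FROM dbo.FactOnlineSales;
```

A small number here means the filter keeps only a small slice of the table, which is what makes that column a good candidate for the index key in step 9.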

To see existing indexes, you can run sp_help 'tableName'. In the example, there’s an index key on OnlineSalesKey but that field is not used in our filter context (joins and where statements) in the query. The order of fields in an index does matter because it looks in that order.

Brent Ozar made a SP you can use called sp_blitzIndex that will give you a ton of info on an index for a table including stats, usage, and compression. It also includes Create TSQL and Drop TSQL for that index to alter the table.

To turn on stats, use SET STATISTICS IO, TIME ON at the beginning of the query. Be sure to also include the actual execution plan (estimated doesn’t always match what actually happened). Now we can benchmark. When you are done, use SET STATISTICS IO OFF and SET STATISTICS TIME OFF. Then create a nonclustered index with our filter context columns.
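Putting the benchmark and the index together looks roughly like this. It is a sketch under assumptions, not John's demo code (his scripts are at the link above): the table, columns, filter value, and index name are all hypothetical.

```sql
-- 1. Baseline: capture reads and timings in the Messages tab
SET STATISTICS IO, TIME ON;

SELECT DateKey, SalesAmount          -- columns being selected
FROM dbo.FactOnlineSales             -- hypothetical table
WHERE StoreKey = 199;                -- filter column

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

-- 2. Index the filter column and carry the selected columns as included columns
--    so the query can be answered without a key lookup
CREATE NONCLUSTERED INDEX IX_FactOnlineSales_StoreKey
ON dbo.FactOnlineSales (StoreKey)
INCLUDE (DateKey, SalesAmount);

-- 3. Run step 1 again and compare logical reads and elapsed time to the baseline
```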

Derby City Data Days

It was awesome to see the Kentucky data community come out for the first Derby City Data Days in Louisville, KY! Bringing together communities from Ohio, Tennessee, and Kentucky, the Derby City Data Days event was an excellent follow-up to Data Tune in March and deepened relationships and networks made at the Nashville event. In case you missed it, below are my notes from the sessions I attended as well as the resources for my session. Be sure to check out these speakers if you see them speaking at a conference near you!

Building Self-Service Data Models in Power BI by John Ecken

What and why: we need to get out of the way of business insights. If you try to build one size fits all, it fits none. Make sure you keep your data models simple and streamlined.

Security is paramount to a self-service data model. RLS is a great option so folks only see their own data. You can provide access to the underlying data model for read and build which enables them to create their own reports off the data they have access to. If you give your user contributor access, then RLS will go away for that user. Keep in mind, business users need pro license OR need to be in a premium workspace.

One really great option is for people to use the analyze in Excel option to interact with the most popular BI tool – Excel. This allows them to build pivot tables that can refresh whenever needed. You can also directly connect to Power BI datasets from their organization! You can set up the display field option as well to get information about the record you connect to. Pretty slick! With this, security still applies from RLS which is awesome.

Data modeling basics – clean up your model by hiding or removing unnecessary columns (i.e. sorting columns). Relationships matter. Configure your data types intentionally. Appropriate naming is vital to business user success. Keep in mind where to do your transformations – SQL vs DAX (think Roche’s Maxim). Be sure to default your aggregations logically as well (year shouldn’t be summed up).

Power BI Measures – creations, quick create, measure context, time-based functions. Whenever possible, make explicit measures (using DAX) and hide the column that it was created off of so people utilize the measure you intended. Make sure you add descriptions, synonyms (for Copilot and Q&A), featured tables, and folders of measures. The functionality of featured tables makes it wise to use folders of measures within your fact tables.

John likes to use LOOKUP to pull dims back into the fact table so he ends up with as few tables as possible. There are drawbacks to this such as slower performance and model bloat, but the goal is for end users who don’t have data modeling experience or understanding. Not sure I agree with this method since it’s not scalable at all and destroys the purpose of a data model. Make sure you hide columns you don’t want end users to interact with.

To turn on a featured table, go to the model view, then go to the Properties pane and toggle the “Is featured table” button. It will require a description, the label that will populate, and the key column (cannot be hidden) that the business user will put in Excel as a reference to call records off of.

The PIVOT() TSQL Operations by Jeff Foushee

GitHub: https://github.com/jbfoushee/MyPresentations/tree/main/TSQL_Pivot_Operators

Be sure to look at his GitHub for the awesome source code!

Come to Louisville on May 9th to see a presentation on JSON and TSQL.

The goal of this is to avoid FULL OUTER JOINs. Stacking FULL OUTER JOINs is extremely unscalable since maintenance would be terrible. We will avoid this by using pivot. Pivot means fewer rows, more columns. PIVOT promotes data to column headers.

You get to decide how the tuple that’s created on the pivot is aggregated (count, min, max, sum, avg, etc.). Exactly one aggregate can be applied, one column can be aggregated, and one column’s values can be promoted into the column headers.

PIVOT ( SUM(Col1) FOR [ID] IN ([ID_value_1], [ID_value_2], etc.) )
SUM = the aggregate, ID = the column that will become more columns, the IN values = the column values from ID that will be promoted into the column header.
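To make that template concrete, here is a small self-contained example. It is my own sketch with made-up data, not one of Jeff's samples (those are in his GitHub repo).

```sql
-- #Sales and its data are hypothetical
CREATE TABLE #Sales (Region VARCHAR(10), SalesYear INT, Amount INT);

INSERT INTO #Sales (Region, SalesYear, Amount)
VALUES ('East', 2023, 100),
       ('East', 2024, 150),
       ('West', 2023, 80),
       ('West', 2023, 20);

-- SUM = the aggregate, SalesYear = the column whose values become headers,
-- [2023] and [2024] = the values promoted into column headers
SELECT Region, [2023], [2024]
FROM (SELECT Region, SalesYear, Amount FROM #Sales) AS src
PIVOT (SUM(Amount) FOR SalesYear IN ([2023], [2024])) AS pvt
ORDER BY Region;

-- East has a total for both years; West has a 2023 total (80 + 20)
-- and NULL for 2024 because it has no 2024 rows

DROP TABLE #Sales;
```

Selecting only the needed columns in the inner query matters, because every column that is not aggregated or pivoted becomes part of the grouping.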

Time for a 3 column pivot. For this, we are doing a two column pivot and ignoring one of the fields. You can even pivot on computed fields but make sure you include the values in that inclusion clause. Be careful about adding unnecessary data.

How do you manage the VTCs (the column values that end up as column headers)? Option 1 – don’t. Option 2 – explicitly request the ones of interest and provision for future extras. Option 3 – use dynamic SQL! You can use cursor, XML, etc. Check out his ppt deck from github for code samples!
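For option 3, one possible shape of a dynamic pivot is sketched below. This is my own take using STRING_AGG (SQL Server 2017 and later) instead of the cursor or XML approaches from the session, and the table and data are made up.

```sql
-- #Sales and its data are hypothetical
CREATE TABLE #Sales (Region VARCHAR(10), SalesYear INT, Amount INT);

INSERT INTO #Sales (Region, SalesYear, Amount)
VALUES ('East', 2023, 100), ('East', 2024, 150), ('West', 2025, 75);

DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

-- Build the header list from whatever values exist today, e.g. [2023], [2024], [2025]
-- QUOTENAME wraps each value in brackets and guards against odd characters
SELECT @cols = STRING_AGG(QUOTENAME(CAST(SalesYear AS NVARCHAR(10))), ', ')
               WITHIN GROUP (ORDER BY SalesYear)
FROM (SELECT DISTINCT SalesYear FROM #Sales) AS y;

SET @sql = N'SELECT Region, ' + @cols + N'
FROM (SELECT Region, SalesYear, Amount FROM #Sales) AS src
PIVOT (SUM(Amount) FOR SalesYear IN (' + @cols + N')) AS pvt
ORDER BY Region;';

-- The temp table is visible inside the dynamic batch because it was created in this session
EXEC sys.sp_executesql @sql;

DROP TABLE #Sales;
```

The trade-off is that the result set's column list changes as the data changes, so anything downstream that expects fixed columns needs to be ready for that.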

An n-column PIVOT works by essentially creating a 2-column pivot (at the end of the day it’s only two columns that ever get pivoted) and knowing which you want split into new columns.

Ugly side of PIVOT = lookups. The more fields you need to add in from additional tables, the worse performance will be. Your best option there would be to do a group by, then pivot. Another limitation is you can’t use a function in your pivot aggregation (SUM() vs SUM() *10). Get your raw data clean then pivot.

Time for UNPIVOT!

Unpivot = convert horizontal data to vertical. Fewer columns, more rows. Unpivot demotes column headers back into data.

Be very very careful with the data type you’re about to create. Remember lowest common denominator: all the values must be able to fit in one common datatype without overflow, truncation, or collation conflicts.

UNPIVOT( newColPropertyValue FOR newColPropertyName IN ([originalCol1], [originalCol2], etc.) )

You need to make sure all your original columns have the same datatype. NULLs get automatically dropped out. If they are needed, you can convert them using an ISNULL function to a string value or int value depending on your need.
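A small self-contained example of the template, including the NULL behavior. The table and data are my own made-up illustration.

```sql
-- #Scores and its data are hypothetical; all three score columns share one data type
CREATE TABLE #Scores (StudentID INT, Math INT, Science INT, Art INT);

INSERT INTO #Scores (StudentID, Math, Science, Art)
VALUES (1, 90, 85, NULL),
       (2, 70, 95, 88);

-- Score = the new value column, Subject = the new name column,
-- [Math], [Science], [Art] = the original columns being demoted into rows
SELECT StudentID, Subject, Score
FROM #Scores
UNPIVOT (Score FOR Subject IN ([Math], [Science], [Art])) AS unpvt
ORDER BY StudentID, Subject;

-- Returns 5 rows, not 6: student 1's NULL Art score is dropped

DROP TABLE #Scores;
```

To keep that sixth row, wrap the columns in ISNULL inside a derived table before the UNPIVOT, using a placeholder value that cannot be mistaken for real data.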

There’s also an option for XML-based unpivot.

Cross Apply = acquires a subset of data for each row found by the outer query. Behaves like an inner join, if no subset is found then the outer row disappears from the result set. Outer Apply is similar but it’s more like a left join. Cross Apply does keep your NULL values. You can also use a STRING_SPLIT with cross apply.
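Here is a sketch of both CROSS APPLY uses mentioned above: a VALUES-based unpivot that keeps NULLs, and a STRING_SPLIT. The tables and data are hypothetical, and STRING_SPLIT needs database compatibility level 130 or higher.

```sql
-- #Scores, #Tags, and their data are hypothetical
CREATE TABLE #Scores (StudentID INT, Math INT, Science INT, Art INT);
INSERT INTO #Scores VALUES (1, 90, 85, NULL), (2, 70, 95, 88);

-- Unpivot with CROSS APPLY: each outer row gets a 3-row subset built from its own columns
SELECT s.StudentID, v.Subject, v.Score
FROM #Scores AS s
CROSS APPLY (VALUES ('Math',    s.Math),
                    ('Science', s.Science),
                    ('Art',     s.Art)) AS v (Subject, Score)
ORDER BY s.StudentID, v.Subject;
-- Returns 6 rows: the NULL Art score for student 1 is kept

CREATE TABLE #Tags (PostID INT, TagList VARCHAR(100));
INSERT INTO #Tags VALUES (1, 'sql,pivot'), (2, 'fabric');

-- One output row per delimited value in each row's list
SELECT t.PostID, x.value AS Tag
FROM #Tags AS t
CROSS APPLY STRING_SPLIT(t.TagList, ',') AS x;

DROP TABLE #Scores;
DROP TABLE #Tags;
```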

Multi-Unpivot normalizes hard-core denormalized data. You just add more UNPIVOT lines after your initial FROM statement! Make sure you have a WHERE statement to drop any records that don’t align at a column level. Something like WHERE LEFT(element1,8) = LEFT(element2, 8).

My Session – Time for Power BI To Git CI/CD

Thanks to everyone that attended my session! Had some great questions and conversations! Here’s the link to my github with the slide deck from today: https://github.com/Anytsirk12/DataOnWheels/tree/main/Power%20BI%20CICD

Medallion Architecture for Fabric by Steve Hughes

This session was awesome! We were able to watch a series of Fabric 5 minute videos and had an amazing discussion between them about options for building out a Fabric infrastructure using medallion techniques. Check out Steve’s YouTube channel for his Fabric 5 playlist and to learn more about his experience working with ALS.

Last Non-NULL Date in SQL Server

The simplest of requests are often the most difficult to execute. For example, a finance team needs to know every time a customer did not invoice for 90 days in the past 2 years. The simplicity of the ask is deceiving. Tracking differences across multiple dimensions (customer and invoice date in this case) and accounting for NULL values in the changing dimension (aka when a customer did not invoice on a day) appears to be hopeless without the support of a CRM code change. But have no fear, complicated SQL is here!

Testing Scenario: the business would like you to create a customer attrition report. To do this, you need to find gaps in invoice dates per customer and determine when and how many customers go “inactive” and are “reactivated” in the past two years. A customer is deemed “inactive” whenever there are greater than 90 days since the last invoice. This can occur multiple times in one year, so a customer can be “reactivated” multiple times in one year.

Resources Needed:

  1. SQL Server Access to the needed data elements
    • In this scenario, this consists of invoice date by customer. You can swap this out for any other date range or any other unique ID.
  2. Business logic
    • In this scenario, activations in a year = anytime a customer has invoiced for the first time in a 90 day period. You can swap customer field for any dimension such as sales rep, carrier, business segment, etc. You can also swap out invoice date for any date field such as creation date, pickup date, paid date, delivery date, etc.
  3. Start and End dates
  4. Ability to use CTE’s/Temp Tables
    • This really comes into play if you are trying to create a Direct Query based report in Power BI or using any other reporting tools that do not allow calling Temp Tables. If you hit this limitation, then you will need to leverage a database/code solution instead of the method below.

Notes:

  • If your SQL Server instance is 2022 or later, then you will not need the binary MAX workaround and can use LAST_VALUE with IGNORE NULLS in the OVER statement instead (see alternative line in the final SQL code below). IGNORE NULLS is not supported on MAX or on versions before SQL Server 2022.
  • The process below lays out each portion of the final query, but feel free to skip ahead to the end for the final sql statement if you don’t need each section explained.

Process:

  1. Set up parameters
    • DECLARE @StartDate DATE = '2019-01-01'
      DECLARE @EndDate DATE = GETDATE()
      DECLARE @ActivationRange INT = 90 --notates how many days can be between invoice dates before a customer is deemed "inactive".
  2. Create a date/calendar table. Check with your DBA’s first to make sure they haven’t already created something similar that you can use, all you need is a list of sequential calendar dates with no gaps.
    • ;WITH cte AS (
      SELECT @StartDate AS myDate
      UNION ALL
      SELECT DATEADD(day,1,myDate) as myDate
      FROM cte
      WHERE DATEADD(day,1,myDate) <= @EndDate
      )
      SELECT myDate 'CalendarDate'
      INTO #Calendar
      FROM cte
      OPTION (MAXRECURSION 0) --this works around the usual 100 recursion row limit
  3. If you need to partition by a dimension other than date, such as customer in this scenario, you will need to create a table to grab that dimension’s values as well. After this, you’ll need to create a bridge table that will have a value for every date in your range and every customer (or other dimension) value as well.
    • --Customer Table
      SELECT DISTINCT
      DA.AccountsKey
      ,DA.CompanyID
      ,DA.CompanyName
      ,MIN(FSF.InvoiceDateKey) 'FirstInvoiceDate'
      INTO #Companies
      FROM DimAccount DA
      JOIN ShipmentFacts FSF ON FSF.AccountKey = DA.AccountsKey
      WHERE FSF.InvoiceDateKey IS NOT NULL
      GROUP BY
      DA.AccountsKey
      ,DA.CompanyID
      ,DA.CompanyName
    • --Bridge Table that combines both Customer and Date values
      SELECT DISTINCT
      C.CalendarDate
      ,Comp.CompanyID
      ,Comp.CompanyName
      ,MIN(Comp.FirstInvoiceDate) 'FirstInvoiceDate'
      ,CONCAT(C.CalendarDate,Comp.CompanyID) 'ID'
      INTO #Bridge
      FROM #Calendar C, #Companies Comp
      GROUP BY
      C.CalendarDate
      ,Comp.CompanyID
      ,Comp.CompanyName
      ,CONCAT(C.CalendarDate,Comp.CompanyID)
  4. Next, we need to create our unique IDs that combine all the dimensions we are hoping to account for in our “IGNORE NULLS” scenario. In this test case, we need to create one ID that grabs the actual dates a customer invoiced on and another for all the dates in our range that a customer could have possibly invoiced on. Then, we join the two together to grab the last time a customer invoiced and ignore those pesky NULL values. This is the section where having SQL Server 2022 or later will do you a lot of favors (see code below).
    • --Actual Invoiced Dates by Customer
      SELECT DISTINCT
      FSF.InvoiceDateKey
      ,DA.CompanyName
      ,DA.CompanyID
      ,CONCAT(FSF.InvoiceDateKey,DA.CompanyId) 'ID'
      INTO #ShipmentData
      FROM ShipmentFacts FSF
      JOIN #Companies DA ON DA.AccountsKey = FSF.AccountKey
      WHERE FSF.InvoiceDateKey BETWEEN @StartDate AND @EndDate
    • --Joining together and filling in the NULLS with the previous invoiced date by customer
      SELECT DISTINCT
      C.ID
      ,S.ID 'ShipData'
      ,CAST( SUBSTRING( MAX( CAST (C.ID AS BINARY(4)) + CAST(S.ID AS BINARY(20))) OVER (PARTITION BY C.CompanyID ORDER BY C.ID ROWS UNBOUNDED PRECEDING),5,20) AS varchar) 'PreviousInvoiceDateKey'
      --ALTERNATIVE FOR SQL Server 2022 AND LATER (IGNORE NULLS works with LAST_VALUE, not MAX)--
      --,LAST_VALUE(S.ID) IGNORE NULLS OVER (PARTITION BY C.CompanyID ORDER BY C.ID ROWS UNBOUNDED PRECEDING) 'PreviousInvoiceDateKey'

      INTO #RunningDates
      FROM #Bridge C
      LEFT JOIN #ShipmentData S ON S.ID = C.ID
  5. The rest of the code is based on business logic, so please use at your discretion and edit for your own needs.
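If the binary casting in step 4 is hard to follow, here is the same "carry the last non-NULL date forward" idea boiled down to a tiny self-contained example. The table and data are made up for illustration, it uses a single customer so there is no PARTITION BY, and the second query needs SQL Server 2022 or later.

```sql
-- #Daily is a hypothetical mini version of the bridge table joined to invoice data
CREATE TABLE #Daily (CalendarDate DATE NOT NULL, InvoiceDate DATE NULL);

INSERT INTO #Daily (CalendarDate, InvoiceDate)
VALUES ('2024-01-01', '2024-01-01'),
       ('2024-01-02', NULL),
       ('2024-01-03', NULL),
       ('2024-01-04', '2024-01-04'),
       ('2024-01-05', NULL);

-- Works on older versions: invoice dates only ever increase along the calendar,
-- and MAX skips NULLs, so a running MAX is the last non-NULL invoice date
SELECT CalendarDate,
       InvoiceDate,
       MAX(InvoiceDate) OVER (ORDER BY CalendarDate
                              ROWS UNBOUNDED PRECEDING) AS PreviousInvoiceDate
FROM #Daily
ORDER BY CalendarDate;

-- SQL Server 2022 and later: say what you mean with IGNORE NULLS
SELECT CalendarDate,
       InvoiceDate,
       LAST_VALUE(InvoiceDate) IGNORE NULLS
           OVER (ORDER BY CalendarDate
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS PreviousInvoiceDate
FROM #Daily
ORDER BY CalendarDate;

DROP TABLE #Daily;
```

Both queries fill Jan 2 and Jan 3 with 2024-01-01 and Jan 5 with 2024-01-04. The running MAX version only works because the value being carried forward never decreases, which is why the LAST_VALUE version is the safer choice when it is available.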

Full SQL Code:

DECLARE @StartDate DATE = '2019-01-01'
DECLARE @EndDate DATE = GETDATE()
DECLARE @ActivationRange INT = 90 --notates how many days can be between invoice dates before a customer is deemed "inactive"
;WITH cte AS (
SELECT @StartDate AS myDate
UNION ALL
SELECT DATEADD(day,1,myDate) as myDate
FROM cte
WHERE DATEADD(day,1,myDate) <= @EndDate
)
SELECT myDate 'CalendarDate'
INTO #Calendar
FROM cte
OPTION (MAXRECURSION 0)


SELECT DISTINCT
DA.AccountsKey
,DA.CompanyID
,DA.CompanyName
,MIN(FSF.InvoiceDateKey) 'FirstInvoiceDate'
INTO #Companies
FROM DimAccount DA
JOIN ShipmentFacts FSF ON FSF.AccountKey = DA.AccountsKey
WHERE FSF.InvoiceDateKey >= '2000-01-01'
GROUP BY
DA.AccountsKey
,DA.CompanyID
,DA.CompanyName


SELECT DISTINCT
C.CalendarDate
,Comp.CompanyID
,Comp.CompanyName
,MIN(Comp.FirstInvoiceDate) 'FirstInvoiceDate'
,CONCAT(C.CalendarDate,Comp.CompanyID) 'ID'
INTO #Bridge
FROM #Calendar C, #Companies Comp
GROUP BY
C.CalendarDate
,Comp.CompanyID
,Comp.CompanyName
,CONCAT(C.CalendarDate,Comp.CompanyID)

SELECT DISTINCT
FSF.InvoiceDateKey
,DA.CompanyName
,DA.CompanyID
,CONCAT(FSF.InvoiceDateKey,DA.CompanyId) 'ID'
INTO #ShipmentData
FROM ShipmentFacts FSF
JOIN #Companies DA ON DA.AccountsKey = FSF.AccountKey
WHERE FSF.InvoiceDateKey BETWEEN @StartDate AND @EndDate

SELECT DISTINCT
C.ID
,S.ID 'ShipData'
,CAST( SUBSTRING( MAX( CAST (C.ID AS BINARY(4)) + CAST(S.ID AS BINARY(20))) OVER (PARTITION BY C.CompanyID ORDER BY C.ID ROWS UNBOUNDED PRECEDING),5,20) AS varchar) 'PreviousInvoiceDateKey'
--ALTERNATIVE FOR SQL Server 2022 AND LATER (IGNORE NULLS works with LAST_VALUE, not MAX)--
--,LAST_VALUE(S.ID) IGNORE NULLS OVER (PARTITION BY C.CompanyID ORDER BY C.ID ROWS UNBOUNDED PRECEDING) 'PreviousInvoiceDateKey'
INTO #RunningDates
FROM #Bridge C
LEFT JOIN #ShipmentData S ON S.ID = C.ID


SELECT DISTINCT
R.ID
,R.ShipData
,R.PreviousInvoiceDateKey
,LEFT(R.PreviousInvoiceDateKey,10) 'PreviousInvoiceDate'
,LEFT(R.ID,10) 'DateKey'
,RIGHT(R.ID,5) 'CompanyId'
,B.FirstInvoiceDate
INTO #ActivationData
FROM #RunningDates R
LEFT JOIN #Bridge B ON B.ID = R.ID

SELECT DISTINCT
A.ID
,A.DateKey
,A.CompanyId
,A.PreviousInvoiceDate
,YEAR(A.DateKey) 'Year'
,YEAR(A.FirstInvoiceDate) 'InitialActivationYear'
,CASE WHEN YEAR(A.DateKey) = YEAR(A.FirstInvoiceDate) THEN 1 ELSE 0 END 'IsActivationYear'
,DATEDIFF(Day,A.PreviousInvoiceDate,A.DateKey) 'DaysSinceInvoice'
,CASE WHEN DATEDIFF(Day,A.PreviousInvoiceDate,A.DateKey) = @ActivationRange THEN 1 ELSE 0 END 'IsInactive'
,CASE WHEN DATEDIFF(Day,A.PreviousInvoiceDate,A.DateKey) = @ActivationRange THEN A.DateKey ELSE NULL END 'InactiveDate'
INTO #ActivationDetails
FROM #ActivationData A

SELECT DISTINCT
D.Year
,D.CompanyId
,SUM(D.IsInactive) 'InactivatedPeriods'
,MAX(D.IsActivationYear) 'IsFirstActivationYear'
,MAX(D.DaysSinceInvoice) 'BiggestGapInInvoicing (Days)'
,MAX(D.InactiveDate) 'LastInactiveDate'
,MAX(D.PreviousInvoiceDate) 'LastInvoiceDate'
,CASE WHEN MAX(D.InactiveDate) > MAX(D.PreviousInvoiceDate) THEN -1 ELSE 0 END 'NotActiveAtEndOfYear'

--to grab the activations per customer per year follow equation below
-- Activations = [InactivatedPeriods] + [NotActiveAtEndOfYear] + [IsFirstActivationYear] --this part will be done in Power BI
FROM #ActivationDetails D
GROUP BY
D.Year
,D.CompanyId


DROP TABLE #Calendar
DROP TABLE #Companies
DROP TABLE #Bridge
DROP TABLE #ShipmentData
DROP TABLE #RunningDates
DROP TABLE #ActivationData
DROP TABLE #ActivationDetails

Additional Resource:

I Wrote a Book – Hands-On SQL Server 2019 Analysis Services

While not the first time I have authored, this is the first book that I wrote as the sole author. Analysis Services is the product I built my career in business intelligence on and was happy to take on the project when I was approached by Packt.

I think one of my favorite questions is about how much research time I put in for this book. The right answer is almost 20 years. I started working with Analysis Services when it was called OLAP Services and that was a long time ago. Until Power Pivot for Excel and tabular model technology was added to the mix, I worked in the multidimensional model. I was one of the few, or so it seems, that enjoyed working in the multidimensional database world including working with MDX (multidimensional expressions). However, I was very aware that tabular models with the Vertipaq engine were the model of the future. Analysis Services has continued to be a significant part of the BI landscape and this book gives you the opportunity to try it out for yourself.

This book is designed for those who are most recently involved in business intelligence work but have been working more in the self-service or end user tools. Now you are ready to take your model to the next level and that is where Analysis Services comes into play. As part of Packt’s Hands On series, I focused on getting going with Analysis Services from install to reporting. Microsoft has developer editions of the software which allow you to do a complete walk through of everything in the book in a step by step fashion. You will start the process by getting the tools installed, downloading sample data, and building out a multidimensional model. Once you have that model built out, we build a similar model using tabular model technology. We follow that up by building reports and visualizations in both Excel and Power BI. No journey is complete without working through security and administration basics. If you want to learn by doing, this is the book for you.

If you are interested in getting the book, you can order it from Amazon or Packt. From November 20, 2020 through December 20, 2020, you can get a 25% discount by using this code – 25STEVEN or by using this link directly.

I want to thank the technical editors who worked with me to make sure the content and the steps worked as expected – Alan Faulkner, Dan English, and Manikandan Kurup. Their attention to detail raised the quality of the book significantly and was greatly appreciated.

I have to also thank Tazeen Shaikh who was a great content editor to work with. When she joined the project, my confidence in the quality of the final product increased as well. She helped me sort out some of the formatting nuances and coordinated the needed changes to the book. Her work on the book with me was greatly appreciated. Finally, many thanks to Kirti Pisat who kept me on track in spite of COVID impacts throughout the writing of the book this year.

I hope you enjoy the book!

PASSMN June 2020 – Data Classification with SQL Server and Azure

I presented at the virtual Minnesota SQL Server User Group meeting on June 16, 2020. The topic was data classification with SQL Server 2019 and Azure SQL Database.

Data Classification Basics

Data classification in both SQL Server and Azure allows you to discover and label data based on information type and sensitivity. Information type is a way to describe the content of the data at a high level. This includes types such as Address, Name, Networking, and Credit Card. By tagging your columns with types, you will be able to easily see the types of data stored in your tables. You can also label the sensitivity. This includes labels such as Confidential and Confidential - GDPR.

Using SQL Server 2019 and SSMS 18.4+

For on-premises implementations, you can use SQL Server Management Studio. I would recommend that you use SSMS 18.4 or greater, as it has the most capability. SQL Server 2019 includes the sys.sensitivity_classifications system catalog view, so you can query it to see which fields have been labeled.
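As a minimal sketch of what querying that view can look like, the statement below joins sys.sensitivity_classifications to sys.objects and sys.columns to list each labeled column. The column aliases are my own choice, and the query assumes you run it in the database that holds the classified columns.

```sql
-- List every classified column with its label, information type, and rank.
-- major_id is the object (table) ID and minor_id is the column ID.
SELECT
    SCHEMA_NAME(o.schema_id) AS schema_name,
    o.name                   AS table_name,
    c.name                   AS column_name,
    sc.label,
    sc.information_type,
    sc.rank_desc
FROM sys.sensitivity_classifications AS sc
JOIN sys.objects AS o
    ON sc.major_id = o.object_id
JOIN sys.columns AS c
    ON sc.major_id = c.object_id
   AND sc.minor_id = c.column_id
ORDER BY schema_name, table_name, column_name;
```

If nothing has been classified yet, the query simply returns no rows.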

To get started, open up SSMS. Right-click the database and choose Tasks > Data Discovery and Classification > Classify Data. This will allow you to view the Data Classification window in SQL Server. You will get a list of recommendations and the ability to add custom classifications in your SQL Server database.

Finding the Data Discovery and Classification Options in SSMS

The Data Classification view in SSMS

Once you have classified some of your data, you are able to view a report that shows the coverage of the classification work you have done.

Data Classification Report in SSMS

Adding Data Classification in Azure SQL Database

Azure SQL Database supports similar functionality for discovering and classifying data. The primary differences are (1) it requires Advanced Data Security which costs $15/month per server and (2) audit logging support is built in.

You can find this in the Azure portal with your SQL Database.

Advanced Data Security in Azure SQL Database

As you can see above, you get a visual here initially. Click the Data Discovery & Classification panel to open a classification window similar to the one in SSMS. This will allow you to discover and classify your data.

The key difference is turning on auditing and logging information about people querying the classified data. In the Security section in your SQL Database view in the Azure portal, choose Auditing. You can now add auditing to your server or database. (Click here for information about setting up Auditing.) I chose to use Log Analytics which is in preview. Log Analytics has a dashboard which shows activity in your database with this data.

Log Analytics Dashboard which Shows Access to Sensitive Data

You can click into the dashboard to dig into details. You can also use the Log Analytics query features to build your own queries to further analyze the data. The details contain who accessed the information, their IP address, and what was accessed. You can build more reports from that information to support more sophisticated auditing.

Final Thoughts

I think that there is still work to be done on SQL Server to better support auditing. Azure is ahead of the game in this area. More importantly, Azure logging is a platform level solution. You should be able to integrate your logging from the applications to the database in Azure.

You do have the ability to update the policy in SQL Server with a JSON file. I recommend you export the file and modify it. In Azure, you can update the information policy in the Security Center. Updating this policy allows you to discover data or information that you want to classify based on rules you set up. This should be part of your data governance plan.

One other follow-up from the meeting. The question was raised about Visual Studio support in database projects. The answer is “sort of”. First, you need to make sure your project is targeting SQL Server 2019 or Azure SQL Database. Once that is set, you can use the following code to add the classification manually, or you can apply it to your database and do a Schema Compare to bring it in.

ADD SENSITIVITY CLASSIFICATION TO
    [SalesLT].[Customer].[FirstName]
    WITH (LABEL = 'Confidential - GDPR', LABEL_ID = 'fe62dcde-72c0-475c-b1af-fb8de4c8fc7e', INFORMATION_TYPE = 'Name', INFORMATION_TYPE_ID = '57845286-7598-22f5-9659-15b24aeb125e', RANK = MEDIUM);

You will need to know the GUIDs for the labels and types in your solution to do this manually. However, once this is done, you can see the information in the Properties window for the field as well.

Data Classification Properties in Visual Studio
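One way to look up those GUIDs, assuming at least one column has already been classified with each label and type (for example through SSMS or the Azure portal), is to read them back from the sys.sensitivity_classifications catalog view. This is only a sketch; it returns the IDs actually in use in the current database, not the full information protection policy.

```sql
-- Return the distinct label / information type pairs in use,
-- together with the GUIDs needed for ADD SENSITIVITY CLASSIFICATION.
SELECT DISTINCT
    sc.label,
    sc.label_id,
    sc.information_type,
    sc.information_type_id
FROM sys.sensitivity_classifications AS sc;
```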

The key thing to be aware of is that the properties are read only. You have to use the code to change them or do the changes in the database and use Schema Compare to bring them in.
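Here is a rough sketch of changing a classification in code, reusing the column and GUIDs from the example above. As I understand it, running ADD SENSITIVITY CLASSIFICATION against a column that is already classified overwrites the existing classification, and DROP SENSITIVITY CLASSIFICATION removes it. The change to RANK = HIGH is only for illustration.

```sql
-- Change the rank on an already classified column.
-- Assumption: re-adding a classification replaces the existing one.
ADD SENSITIVITY CLASSIFICATION TO
    [SalesLT].[Customer].[FirstName]
    WITH (LABEL = 'Confidential - GDPR', LABEL_ID = 'fe62dcde-72c0-475c-b1af-fb8de4c8fc7e', INFORMATION_TYPE = 'Name', INFORMATION_TYPE_ID = '57845286-7598-22f5-9659-15b24aeb125e', RANK = HIGH);

-- Remove the classification from the column entirely.
DROP SENSITIVITY CLASSIFICATION FROM
    [SalesLT].[Customer].[FirstName];
```

In a database project you would keep only the ADD statement in the script; the DROP is shown for cases where you make the change directly in the database and then use Schema Compare.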

Thanks again to those of you who joined us at the meeting. Here is the slide deck from that meeting. I look forward to sharing more with all of you later.