SQL Saturday Atlanta 2024

Hey there happy coders! Last weekend I had the pleasure of speaking at the SQL Saturday Atlanta event in Georgia! It was an awesome time of seeing data friends and making some new ones. If you live near a SQL Saturday event and are looking for a great way to learn new things, I can’t recommend SQL Saturdays enough. They are free and an excellent way to meet people who can help you face challenges in a new way. Below are my notes from the various sessions I attended as well as the materials from my own session. Enjoy!

Link to the event – https://sqlsaturday.com/2024-04-20-sqlsaturday1072/#schedule

My session – Real-Time Analytics in Fabric

Thank you so much to everyone who came out to see my first ever session on Real-Time Analytics in Fabric! We had a couple of glitches in the Logic App creation, but had an excellent time troubleshooting together as a group. Please check out my GitHub for the slide deck as well as all the code used in the demonstration. Happy coding!

Link to my GitHub – https://github.com/Anytsirk12/DataOnWheels/tree/main/Real-Time%20Analytics%20in%20Fabric

Introduction to SQL Server Essential Concepts by Bradley Ball

The goal is to understand the core concepts of how databases work so you can build a deeper understanding of everything on top of them.

ACID = atomicity, consistency, isolation, and durability.
Atomicity = either all of it commits or none of it commits. Consistency = my data must be in a consistent state before a transaction completes. Isolation = a transaction must operate independently from other transactions. Durability = guaranteed through the transaction log. The log has all the commit history and ensures accurate data.

Transaction isolation levels – serializable, read committed (SQL Server default), read uncommitted, repeatable read, and snapshot isolation. Some of these can be set at the database level.
Serializable = blocks anything trying to get data on the same page; you can set this at the transaction level.
Read committed = I can’t read data while another transaction is modifying it.
Read uncommitted = a dirty read; it grabs data that isn’t committed yet.
Repeatable read = nobody uses this lol. It’s a shared lock that holds for a longer period of time than the typical microsecond (a shared lock means everyone can read the data).
Snapshot isolation = what Oracle and Postgres use. Everything is an append and update. If I have 4 transactions, 1 update and 3 reads, the update would usually block the reads, but this redirects readers to a copy of the data in TempDB at the point in time they requested the read (aka before the update commits).
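For reference, a minimal T-SQL sketch of setting an isolation level (database and table names are hypothetical):

-- Allow dirty reads for this session
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM dbo.SomeTable;

-- Snapshot isolation must first be enabled at the database level
ALTER DATABASE [YourDatabase] SET ALLOW_SNAPSHOT_ISOLATION ON;
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;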

DMV = dynamic management view. Bradley used one that allows you to see active sessions in your database. We can see that a read query is blocked by an uncommitted transaction: the wait_type is LCK_M_S and the blocking_session_id points to our uncommitted transaction. To get around this, he can run the read script with the isolation level set to read uncommitted and get a dirty read. To unblock the original request, he can run ROLLBACK TRANSACTION in the blocking session to release the lock on that page of data.
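I don’t have the exact script he used, but as a rough sketch, a query along these lines against sys.dm_exec_requests surfaces the same wait_type and blocking_session_id columns:

-- Current requests, their wait types, and who is blocking them
SELECT r.session_id, r.status, r.command, r.wait_type, r.blocking_session_id
FROM sys.dm_exec_requests AS r
WHERE r.session_id > 50;  -- filters out most system sessions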

How does SQL Server work on the inside? We have a relational engine and a storage engine. The user interacts with the SNI (SQL Server Network Interface), which translates the request for the relational engine. User > SNI > Relational Engine [command parser > optimizer (skipped if the plan is already in the plan cache) > query executor] > Storage Engine [access methods (knows where all the data is) > buffer manager (checks the data cache; if the data is not found, it goes to disk and pulls it into the buffer pool data cache)]. This gets extremely complicated for other processes like in-memory OLTP. The SQL OS is what orchestrates all of these items.

SQL OS – the operating system uses pre-emptive scheduling, while the SQL OS uses cooperative scheduling (and accumulates wait stats to identify why something is running slower).

Locks, latches, waits. Locks are like a stop light, with row, page, and table escalation (lock enough rows and the lock escalates to the page). Latches are the watchmen of the locks – a lock for locks. Waits come from cooperative scheduling: if a query takes too long, it willingly gives up its place in line. That creates a signal wait, which signals there’s too much lined up.

SQL data hierarchy. Records are a row. Records live on a data page (8 KB). Extents are 8 pages (64 KB). It’s faster to read extents than individual pages. Allocation bitmaps are 1s and 0s that signify which extents hold data, which enables even faster reads – a single 8 KB allocation page can govern a very large range of data. IAM chains and allocation units allow quick navigation of pages. Every table is divided into in-row data, row-overflow data (rows larger than 8,060 bytes), and LOB data (large objects like VARCHAR(MAX) and images).

Allocation units are made of 3 types:
1. IN_ROW_DATA (also known as HoBTs or Heap or B-Trees)
2. LOB_DATA (also known as LOBs or large object data)
3. ROW_OVERFLOW_DATA (also known as SLOBs, small large object data)

Heaps vs tables. Oracle stores data as a heap, which is super fast to insert. In SQL Server, heaps often perform badly compared to tables with clustered indexes, especially around inserting and then reading new data – but this is very situational. A table is either a heap or a clustered index; it cannot be both. Heaps can still have non-clustered indexes.

B-Tree allows you to get to the record with less reads by following a logic tree (think h is before j so we don’t need to read records after j). Heaps create a 100% table scan without a clustered index. Adding the clustered index dropped that significantly to only 1 read instead of the 8000 reads.

Recovery models – full, bulk-logged, and simple (on-prem). In the cloud everything is full by default. Full means every transaction is fully logged and can be backed up. Bulk-logged means you can’t recover the bulk-loaded data from the log, but you can rerun the input process. Simple means you can restore from a backup snapshot but you can’t do any point-in-time restore. Which one you use will largely be determined by any SLAs you have.

Transaction log. This is constantly overwritten. Your log should be at least 2.5 times as large as your largest clustered index. DBCC SQLPERF(LOGSPACE) will show you the space available for logs in the various dbs. Selecting from the log is not recommended since it creates a lock and the log is always in use, so don’t do this in prod lol. Rebuilding indexes will grow your transaction log massively. To free up space in the transaction log, you have to run a log backup, which is why those are super important.
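As a quick reference (database name and backup path are hypothetical):

-- How full are the transaction logs across all databases?
DBCC SQLPERF (LOGSPACE);

-- Backing up the log is what allows space inside it to be reused
BACKUP LOG [YourDatabase] TO DISK = N'D:\Backups\YourDatabase_log.trn';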

Fun tip: when creating a table you can put DEFAULT ‘some value’ at the end of a column definition to give it a default value when one is not provided. Pretty cool.
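A minimal sketch of what that looks like (table and column names are made up):

CREATE TABLE dbo.Orders
(
    OrderId INT IDENTITY(1,1) PRIMARY KEY,
    OrderStatus VARCHAR(20) NOT NULL DEFAULT 'Pending',      -- default used when no value is provided
    CreatedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- This row picks up both defaults
INSERT INTO dbo.Orders DEFAULT VALUES;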

You can use filegroup or piecemeal restores to restore hot data much faster, then go back and restore older, cold data afterward. To restore, you must have zero locks on the db. While restoring, the database is not online. Note: if you do a filegroup restore, you cannot query data that is in an unrestored filegroup, so queries like SELECT * will not work.
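A hedged sketch of that piecemeal pattern (filegroup names and paths are hypothetical, and the exact sequence depends on your recovery model and log backups):

-- Bring the primary and "hot" filegroups online first
RESTORE DATABASE [YourDatabase]
    FILEGROUP = N'PRIMARY', FILEGROUP = N'HotData'
    FROM DISK = N'D:\Backups\YourDatabase_full.bak'
    WITH PARTIAL, RECOVERY;

-- Cold filegroups get restored later; until then, queries that touch them (like SELECT *) will fail
RESTORE DATABASE [YourDatabase]
    FILEGROUP = N'ColdData'
    FROM DISK = N'D:\Backups\YourDatabase_full.bak'
    WITH RECOVERY;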

Tales from the Field has a ton of YouTube videos on these subjects as well.

Lessons Learned using Fabric Data Factory dataflow by Belinda Allen

What are dataflows? Dataflows are a low-code interface tool for ingesting data from hundreds of data sources, transforming your data using 300+ data transformations. The goal is to allow for more people to manipulate and use data within your organization. At the heart, it’s Power Query.

Why start with dataflows as a citizen developer? It’s power query and you know that. It’s low-code data transformation. Excellent for migrating Power BI reports to Fabric.

Lots of great discussion about when it makes sense to use a dataflow gen2.

You can copy and paste Power Query from Power BI by going into the advanced editor, OR you can hold Shift and select all the queries you want, hit Ctrl+C, then go to Power Query for a Dataflow Gen2 in the online service and hit Ctrl+V, and it will populate with all your tables! Pretty neat. You can also build your relationships within the online portal.

DBA Back to Basics: Getting Started with Performance Tuning by John Sterrett

For the code visit: https://johnsterrett.com/blog-series/sql-server-performance-root-cause-analysis/

Goal of today’s session – arm anyone who is new to performance tuning with processes and skills to solve common problems.

Basic query runtime. SQL Server has something called wait stats that tells you what caused a query to be slow. When you run a massive query, it goes into a suspended state while it waits on resources, such as reading from disk instead of from the memory cache (T1). After that, you’re in a runnable state waiting for the CPU (T2). Finally, you get to run (T3).
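A quick sketch of peeking at the accumulated wait stats (this is the standard DMV, not John’s exact script):

-- Top waits by total wait time, with the scheduler (signal) portion broken out
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;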

Basic bottlenecks = memory, disk, CPU, network, locking blocking & deadlocks. Adding memory is typically the fastest way to improve performance.

Identify performance problems happening right now:

EXEC sp_whoisactive. This is an open-source script that gives you insight into who’s running what right now. You can get it from https://whoisactive.com. The cool thing is there are more ways to run it than just EXEC sp_whoisactive. Identify what’s consuming the most CPU from the CPU column. There are also parameters you can use, like @sort_order: EXEC sp_WhoIsActive @sort_order = '[CPU] DESC', @get_task_info = 2. The task info parameter gives more information in a wait_info column. The best command is EXEC sp_WhoIsActive @help = 1. This provides ALL the documentation on what it does. Adam (the creator) also has a 30-day blog series on everything it can do for you! One option to make things run faster is to kill the process causing the issue lol.
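Pulling the commands above together (assumes sp_WhoIsActive is installed from whoisactive.com):

-- Who is running what right now, sorted by CPU, with extra wait detail
EXEC sp_WhoIsActive @sort_order = '[CPU] DESC', @get_task_info = 2;

-- Full documentation of every parameter and output column
EXEC sp_WhoIsActive @help = 1;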

How to handle blocking. You can do explicit transactions with BEGIN TRANSACTION, which will lock the table. At the end, you need to either COMMIT or ROLLBACK, or else that lock holds. SQL Server uses pessimistic locking by default, so it won’t let you read data that’s locked – it will simply wait and spin until that lock is removed. You can use EXEC sp_whoisactive @get_plans = 1 to get the execution plan. Be careful: the wait_info can be deceptive, since the thing that takes the most time may not be the problem – it may be blocked by something else, so check the blocking_session_id to be sure. Also check the status and open_tran_count to see if something is sleeping and not committed. Keep in mind that the sql_text will only show you the last thing that ran in that session. So if you run a SELECT in the same session (query window) as the original UPDATE script, it won’t be blocked, and THAT query will show up in who is active, which can be super confusing. To resolve this, you can run ROLLBACK in that session to drop the UPDATE statement.
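A tiny repro of that blocking scenario, using a hypothetical table:

-- Session 1: explicit transaction left open (no COMMIT or ROLLBACK yet)
BEGIN TRANSACTION;
UPDATE dbo.SomeTable SET SomeColumn = 'new value' WHERE Id = 1;

-- Session 2: this read waits on the lock held by session 1
SELECT SomeColumn FROM dbo.SomeTable WHERE Id = 1;

-- Back in session 1: release the lock
ROLLBACK TRANSACTION;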

To find blocking queries use EXEC sp_whoIsActive @find_block_leaders = 1, @sort_order = ‘[blocked_session_count] DESC’.

Identifying top offenders over the long term:

There’s a feature from SQL Server 2016 forward called Query Store which persists performance data for you even after you restart the server. It’s essentially a black box for SQL. Query Store is on by default in SQL Server 2022 and in Azure SQL Database. It’s available for Express edition as well. Be sure to triple check this is on, because if you migrated servers it will keep the original settings from the old server. If you right click on the DB, you can navigate to Query Store and turn it on by setting Operation Mode (Requested) to Read Write. Out of the box is pretty good, but you can adjust how often it refreshes and how much history it keeps. To see if it’s enabled, you should see Query Store as a folder under the db in SSMS.
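The same thing can be done in T-SQL (database name is hypothetical):

ALTER DATABASE [YourDatabase] SET QUERY_STORE = ON;
ALTER DATABASE [YourDatabase] SET QUERY_STORE (OPERATION_MODE = READ_WRITE);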

Under Query Store, you can select Top Resource Consuming Queries. There are lots of configuration options, including the time interval and the metric. SQL Server 2017 and newer also have a Query Wait Statistics report to see what was causing pain. It’ll show you what queries were running in the blocking session. You won’t get who ran the query from Query Store, but you can log sp_whoisactive output to a table on a loop (very cool). This adds overhead on top of your db, so be mindful of that.
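One hedged sketch of that logging pattern – sp_WhoIsActive can write to a destination table, and the table name here is made up (use @help = 1 or @return_schema = 1 to generate the matching table definition):

-- Capture current activity, including plans, into a logging table on a schedule
EXEC sp_WhoIsActive @get_plans = 1, @destination_table = 'dbo.WhoIsActiveLog';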

Intro to execution plans:

Keep in mind, SQL’s goal is to get you a “good enough” plan, not necessarily the best plan. Follow the thick lines. That’s where things are happening. Cost will tell you the percentage of the total time taken.

Key lookups. A non-clustered index lets you skip the table and go straight to the data you have indexed. BUT if there’s a nested loop, there are additional columns in the SELECT statement, so it’s doing a key lookup back to the table for every value. More indexes can make your SELECT statements worse if the optimizer uses an index that isn’t the best one for your query.

Index tuning process.
1. Identify tables in query
2. Identify columns being selected
3. Identify filters (JOIN and WHERE)
4. Find total rows for each table in the query
5. Find selectivity (rows with filter/table rows)
6. Enable statistics io, time, and the actual execution plan
7. Run the query and document your findings
9. Review existing indexes for filters and columns selected
9. Add index for lowest selectivity adding the selected columns as included columns
10. Run the query again and document findings
11. Compare findings with baseline (step 7)
12. Repeat last 5 steps as needed

To see existing indexes, you can run sp_help ‘tableName’. In the example, there’s an index key on OnlineSalesKey, but that field is not used in our filter context (joins and where statements) in the query. The order of fields in an index does matter because SQL Server searches in that order.

Brent Ozar made a stored procedure you can use called sp_BlitzIndex that will give you a ton of info on the indexes for a table, including stats, usage, and compression. It also includes the Create TSQL and Drop TSQL for each index so you can alter the table.
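A hedged example call for a single table (database, schema, and table names are placeholders):

EXEC dbo.sp_BlitzIndex @DatabaseName = 'YourDatabase', @SchemaName = 'dbo', @TableName = 'FactOnlineSales';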

To turn on stats, use SET STATISTICS IO, TIME ON at the beginning of the query. Be sure to also include the actual execution plan (estimated doesn’t always match what actually happened). Now we can benchmark. Afterward, use SET STATISTICS IO OFF and SET STATISTICS TIME OFF. Then create a non-clustered index on our filter context columns.
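Putting the benchmarking flow together as a rough sketch (the table, filter column, and index name are hypothetical):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Baseline run: note logical reads and CPU time
SELECT SalesAmount
FROM dbo.FactOnlineSales
WHERE StoreKey = 42;

-- Index the filter column and include the selected column
CREATE NONCLUSTERED INDEX IX_FactOnlineSales_StoreKey
ON dbo.FactOnlineSales (StoreKey)
INCLUDE (SalesAmount);

-- Re-run the same query and compare the stats against the baseline
SELECT SalesAmount
FROM dbo.FactOnlineSales
WHERE StoreKey = 42;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;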

First Time Speaking with a Mask at an Event

Presentation team at the event

Over the past couple of months, I have started losing my ability to talk without a mask. One of the effects of my disease, ALS, is that I am losing the ability to breathe without support, which also impacts my ability to talk without my ventilator. This past weekend I did my first speaking engagement at Derby City Data Days in Louisville, KY. As the saying goes, “It takes a village.”

As many of you know, I have been producing a video series called Fabric 5, which contains five-minute snippets about various Microsoft Fabric architectural topics. I was going to use some of that content to create a slide deck about medallion architectures in Fabric. As the day got closer, I was concerned I would not be able to maintain the ability to speak during an entire presentation in my condition. My wife and I thought that we could use my videos for the presentation, so we recruited my son, Alex, who does video editing, to put together the videos for the presentation. That only left the intro slides and the ending slides to be covered by me or someone else.

My daughter, Kristyna, was presenting right before me in the same room. She and my wife managed most of the introduction including the sponsor slides and the user groups. I was able to kick off with a microphone that was provided by the venue through Redgate. Then we kicked off the video and took questions during the session. I was able to answer questions with the help of my wife and using the mic provided in the room. We wrapped up with a final Q&A and a couple of references to help people in Fabric.

A quick video from the session

Overall, this was a wonderful experience for all of us, and I appreciated the patience of everybody in the room as we worked through this process for the first time. Here are the Fabric 5 videos that were used during the presentation so you can follow up with them later, and the references from the end of the presentation are down below. I would like to thank the organizers – John Morehouse, Josh Higginbotham, Matt Gordon – once again for the support and the help as I presented masked up for the first time!

Fabric 5 Videos Used in the Presentation

Fabric 5 – Introduction to the Series and OneLake

Fabric 5 – Why Capacity Matters

Fabric 5 – Medallion Lakehouses

Fabric 5 – Medallion in a Lakehouse

Fabric 5 – Medallion Workspaces

Fabric 5 – Medallion with Compute Workspace

Fabric 5 – Centralized Data Lake

Fabric 5 – Dimensions and Master Data

Fabric 5 – Workspace Sources

Fabric 5 – ADF to Bronze

References

Microsoft Fabric Career Hub Page

Data On Wheels YouTube Channel – Fabric 5 Playlist

After Party Fun – Stayin’ Alive

During the after party my ventilator ran out of power and my car charger did not work. This presented a problem because the trip home was over an hour. When we went out to the car, we had a moment of panic because the ventilator had run completely out of power. We went back to the bar where the after party was held to plug in, but the charger wouldn’t work. I was without the ventilator for 15 to 20 minutes while they troubleshot the issue. We started to think that they might have to call 911! However, the issue turned out to be a loose connection. I must thank all of the individuals there who helped, including John Morehouse, who went to get a battery backup system to make sure I could make it home, and my son-in-law, who went to our house to bring us our backup battery generator as well to help us get home.

Working with ALS is not always easy and all the help from everyone around me is genuinely appreciated!

SQLBits 2024 Recap

Had an absolutely amazing time at SQLBits this year! It was lovely to see all my data friends again, and I had the opportunity to introduce my husband to everyone as well! In case you missed a session, or are curious about what I learned at these conferences, below are my notes from the various sessions I attended.

Here is a link to my GitHub that contains the slides and code from my presentation on creating a M-agical Date Table. Thank you to everyone who attended! https://github.com/Anytsirk12/DataOnWheels/tree/main/SQLBits%202024

What’s new in Power Query in Power BI, Fabric and Excel? by Chris Webb

Power query online is coming to PBI desktop and will include diagram & schema views, the query plan, and step folding indicators! Currently in private preview, coming to public preview later this year.

Power Query in Excel. Nested data types now work (very cool, check this out if you don’t know what data types are in Excel). Python in Excel can reference Power Query queries now, which means we don’t have to load the data into Excel first! That allows analysis of even more data because you can transform it upstream of the worksheet itself. Very cool.

Example:

import seaborn as sns
mysales = xl("Sales")
sns.barplot(mysales, x="Product", y="sales", ci=None)

Limitations – can’t use data in the current worksheet.

Power query is in Excel online! Later this year, you will be able to connect to data sources that require authentication (like SQL Server!).

Power query can be exported as a template! This will create a .pqt (power query template). You can’t import a template file in Excel, but you can in Dataflows Gen2 in Fabric.

Fabric Dataflow Gen2 – data is not stored in the dataflow itself. We now have Power Query Copilot! It’s in preview, but it’s coming. You will need an F64 to use it.

Paginated reports are getting Power Query!! It will be released to public preview very soon. All the data sources available in Power BI Power Query will be available, which will allow Snowflake and BigQuery connections annnddd Excel! You can even pass parameters to the Power Query get data step by creating parameters within the Power Query editor. Then you go to the paginated report parameters and create the parameter there. Finally, go to dataset properties and add a parameter statement that sets those two equal to each other.

Rapid PBI Dev with ChatGPT, Tips & Tricks by Pedro Reis

Scenario 1: DAX with SVG & CSS.
SVG – The problem is to create an SVG that can have conditional formatting and doesn’t fill the entire cell. ChatGPT does best with some working code, so give it a sample of the SVG you want and ask ChatGPT to make it dynamic. The demo uses GPT-4, so we can include images if needed. He uses Kerry’s repo to get the code sample to put into GPT (KerryKolosko.com/portfolio/progress-callout/).
CSS – uses the HTML Content (lite) visual (it’s a certified visual). The prompt asks GPT to update the measure to increase and decrease the size of the visual every 2 seconds. Pretty awesome looking. He’s also asking it to move in certain directions for certain soccer players (Ronaldo right and Neymar zigzag).

Scenario 2: Data modeling and analysis. He uses Vertabelo to visualize a data model, and it creates a SQL script to generate the model. He passed a screenshot of the data model schema to ChatGPT, and it was able to identify that a PK was missing and that a data type was incorrect (int for company name instead of varchar). Super neat. The image does need to be a good resolution. He also put data from Excel into GPT to have it identify issues like blanks and data quality problems (mismatched data types in the same column). It could grab invalid entries and some outliers, but it’s not very reliable right now.

Scenario 3: Report design assessment. He grabbed four screenshots from a Power BI challenge and asked GPT to judge the submissions with a 1-10 rating based on the screenshots and to create the criteria to evaluate the reports. Very neat! Then he asked it to suggest improvements. It wasn’t very specific at first, so he asked for specific items like alternatives to donut charts and other issues that lowered the score.

Scenario 4: Troubleshooting. He asks it to act as a Power BI expert. It was able to take questions he found on the forum, but sometimes it hallucinates answers, and it’s not trained on visual calculations yet.

Scenario 5: Knowledge improvement. He asked GPT to create questions to test his knowledge of Power BI models. It even confirmed the correct answers. It also created a Word doc as a shareable document that can be used to test others in the future.

Streamlining Power BI with Powershell by Sander Stad and Linda Torrang

Data-Masterminds/PSPowerBITools is an open-source module available in the PowerShell Gallery as well as on GitHub. GitHub link: https://github.com/Data-Masterminds/PSPowerBITools.

There’s a module called MicrosoftPowerBIMgmt that is used with a service account to wrap around the REST API and allow easier calls of those APIs through PowerShell. When running PowerShell, to see results in a grid, add " | Out-GridView" to the end of your command and it creates a very nice view of the data. All demos are available on GitHub. The first demo grabs all the orphaned workspaces and then assigns an admin. Their module includes more information like the number of users, reports, datasets, etc. They also have a much easier command that will grab all the activity events from a given number of days (it will handle the loop for you).

Another great use case is moving workspaces from a P SKU capacity to an F SKU capacity automatically. The admin management module doesn’t work for this, but the PSPowerBITools module can! To see help, use "Get-Help" in front of the command to get details and examples. For more details, add "-Detailed" at the end. Adding "-ShowWindow" at the end will pop open a separate window to see the response (note – it doesn’t work with -Detailed but does contain the details). To export, use this script: $workspaces | Export-Csv -NoClobber -NoTypeInformation -Path C:\Temp\export_workspaces.csv.

The goal is to use the module for more ad hoc admin tasks like remove unused reports, personal workspaces, data sources, and licenses.

New to Powershell like me? Check out this article on how to automate Powershell scripts to run on your laptop: https://www.technewstoday.com/how-to-automate-powershell-scripts/.

Using ALLEXCEPT or ALL-VALUES by Marco Russo

ALL: removes all the filters from expanded table/columns specified.
REMOVEFILTERS: like ALL (it’s the same function). We use this for code readability.
ALLEXCEPT: removes filters from expanded table in the first argument, keeping the filter in the following table/columns arguments.

The following produce the same result:
ALLEXCEPT(customer, customer[state], customer[country])
REMOVEFILTERS(customer[customerkey], customer[name], customer[city])
The nice thing about ALLEXCEPT is that it’s future proof against additional fields being added to the table.

But is it the correct tool?

Demo – goal is to get % of continent sales by getting country or state or city sales / total continent sales

To get that total continent sales, we need sales amount in a different filter context. The simplest way to do this is using ALLEXCEPT since we always only want the continent field from the customer table.
CALCULATE([Sales Amount], ALLEXCEPT( customer, customer[continent]))

This works if we include fields from the same table – customer. If we pull in fields from another table, it won’t work as we want, because it will be filtered by a different table’s filter context. To get rid of that, you can add REMOVEFILTERS(‘table’) for each one, OR we can change the ALLEXCEPT to use the fact table.
CALCULATE([Sales Amount], ALLEXCEPT( Sales, customer[continent]))

One problem is if a user drops the continent from the visual, the filter context no longer includes continent so the ALLEXCEPT will no longer work properly as it will function as an ALL.

To fix this, use REMOVEFILTERS and VALUES. Using VALUES adds the continent value into the current filter context, then REMOVEFILTERS removes the filter context on customer. Lastly, the value is applied back into the filter context, so we end up with what we want, which is just continent being filtered. This works quickly because we know each country will only have one continent.

CALCULATE( [Sales Amount], REMOVEFILTERS( Customer ), VALUES ( Customer[Continent] ) )

Performance difference? Yes – ALLEXCEPT will be faster, but the REMOVEFILTERS and VALUES should not be too much slower.

Analytics at the speed of direct lake by Phillip Seamark and Patrick LeBlanc

Phil did a 100 minute version of this session on Wednesday.

What is Direct Lake? We will evaluate it over three categories – time to import data, model size, and query speed. Direct Lake mode takes care of the bad things from both import and Direct Query modes. Direct Lake only works with one data source. Data is paged into the semantic model on demand, triggered by a query. Tables can have resident and non-resident columns. Column data can get evicted for many reasons. Direct Lake falls back to SQL Server for suitable sub-queries. “Framing” provides data consistency in Power BI reports. A Direct Lake model starts life with no data in memory. Data is only pulled in as triggered by a DAX query. Data stays in the model for hours once pulled. We only grab the columns we need, which saves a ton of memory.

Limitations – no calculated columns or tables. No composite models. Calc groups & field parameters are allowed. Can’t be used with views, can only be used with security defined in the semantic model (you can have RLS on the semantic model itself). Not all data types are supported. If any of these are true, Power BI will fall back to Direct Query mode.

SKU requirements: only PBI premium P and F skus. No PPU nor pro license.

Why Parquet? A column-oriented format is optimised for data storage and retrieval, with efficient data compression and encoding for data in bulk. Parquet is also available in languages including Rust, Java, C++, Python, etc. It is the lingua franca for data storage formats. It’s enhanced with V-Order inside of Fabric, which gives you extra compression. Analysis Services can read (and even prefers) that form of compression, so we don’t need to burn compute on compression. V-Order is still within open-source standards, so this doesn’t lock you into Fabric only. Microsoft has been working on V-Order since 2009 as part of Analysis Services. A row group within a Parquet file corresponds directly to the partitions/segments in a semantic model. Power BI only keeps one dictionary, but Parquet has a different dictionary id for each row group/partition. When falling back to Direct Query, low cardinality is key, because most of the compute will go to the dictionary values that need to be called across the various partitions. This is why it’s VITAL to only keep data you need – for example, drop or separate times from datetimes and limit decimals to 2 places whenever possible.

You can only use a warehouse OR a lakehouse as the source. So if you need tables from both, make a shortcut from the lakehouse to the warehouse and just use the warehouse. You can only create a Direct Lake model in the online service model viewer or in Tabular Editor. You cannot create it in PBI Desktop today.

Microsoft data demo. One file was 880 GB in CSV > 268 GB in Parquet > 84 GB with Parquet with V-ORDER.

Sempy is a Python library that allows you to interact with semantic models in Python.

In the demo, Phil is using the sempy and semantic-link libraries. There are lots of advantages to using the online service for building the semantic model – no loading of data onto our machines, which makes authoring much faster. After building this, you can run a DMV in DAX Studio. Next, create a measure/visual that has a simple MAX of a column. That will create a query plan that reaches out to Parquet and gets the entire column of data needed for that action, then filters as needed. After that, the column will be in memory. Unlike Direct Query, Direct Lake will load data into the model and keep it for a while until eviction. Direct Lake will also evict pages of columns as needed when other fields are queried. You can run DAX queries against the model using a Python notebook, and you can see via DMV what happened.

Framing. Framing is a point-in-time way of tracking what data can be queried by Direct Lake. Why is this important? Data consistency for PBI reports, since delta lake data is transient for many reasons. ETL process – ingest data to delta lake tables. Framing creates one version of truth of the data at that point in time. You can do time traveling because of this, but keep in mind that default models reframe pretty regularly. If you have a big important report to manage, you need to make sure to include reframing in the ETL process so it reloads the data properly. This can be triggered via notebook, the service, an API, etc. You can also do this in TMSL via SSMS, which gives you more granular table control.

DAX to SQL fallback. Each capacity has guard rails around it. If you have more than a certain number of rows in a table, it will fall back to Direct Query automatically. Optimization of Parquet files will be key to stop it from falling back.

Deep (Sky)diving into Data Factory in Microsoft Fabric by Jeroen Luitwieler

Data Pipeline Performance – How to copy scales.

Pipeline processing – less memory & less total duration. There’s no need to load everything into memory and then write. Reads and writes are done in parallel.

Producer and consumer design. Data is partitioned for multiple concurrent connections (even for single large files). Full node utilization. Data can be partitioned differently between source and sink to avoid any starving/idle connections.

Partitions come from a variety of sources including physical partitions on the DB side, dynamic partitions from different queries, multiple files, and multiple parts from a single file.

Copy Parallelism concepts: ITO (Intelligent Throughput Optimization), Parallel copy, max concurrent connections. ITO – A measure that represents the power used which is a combo of CPU, memory, and network resource allocation used for a single copy activity. Parallel copy – max number of threads within the copy activity that read from source and write to sink in parallel. Max concurrent connections – the upper limit of concurrent connections established to the data store during the activity run (helps avoid throttling).

Metadata-driven copy activity. Build large-scale copy pipelines with metadata-driven approach in copy inside Fabric Data pipeline. You can’t parameterize pipelines yet, but it is coming. Use case is to copy data from multiple SQL tables and land as CSV files in Fabric. Control Table will have the name of the objects (database, schema, table name, destination file name, etc). Design has a lookup to get this control table and pass that through a foreach loop and iterate through these to copy the tables into Fabric.
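As a hedged illustration, the control table might look something like this (all names are hypothetical):

CREATE TABLE dbo.CopyControl
(
    SourceDatabase  SYSNAME      NOT NULL,
    SourceSchema    SYSNAME      NOT NULL,
    SourceTable     SYSNAME      NOT NULL,
    DestinationFile VARCHAR(200) NOT NULL,
    IsEnabled       BIT          NOT NULL DEFAULT 1
);

-- The Lookup activity reads this list, and a ForEach loop runs a copy activity per enabled row
SELECT SourceDatabase, SourceSchema, SourceTable, DestinationFile
FROM dbo.CopyControl
WHERE IsEnabled = 1;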

Dataflows Gen2 – performance optimization and AI-infused experiences. Four performance principles – delegate to the most capable resource, sometimes you have to do the most expensive thing first, divide and conquer, and be lazy (do as little work as possible).

Query folding – known as query delegation, push down, remote/distributed query evaluation. Wherever possible, the script in Power Query Editor should be translated to a native query. The native query is then executed by the underlying data source.

Staging – load data into Fabric storage (the staging lakehouse) as a first step. The staged data can be referenced by downstream queries that benefit from SQL compute over the staged data. The staged data is referenced using the SQL endpoint on the lakehouse. Staging tips – data sources without query folding, like files, are great candidates for staging. For faster authoring, have one dataflow for staging and another for transformations.

Partitioning – a way to break down a big query into smaller pieces and run in parallel. Fast copy does this in the background.

Lazy evaluation – power query only evaluates and executes the necessary steps to get the final output. It checks step dependencies then applies query optimization to make the query as efficient as possible.

AI-infused experiences – column by example, table by example, fuzzy merge, data profiling, and column pair suggestions. Table by example from the web is pretty awesome. You can extract data from any HTML page and have it figure out how to turn that HTML into the table you want to see. It basically allows you to screen-scrape as part of your ETL process, and it will refresh if any changes are made on the website. Table by example for a text/CSV file is also amazing – it can figure out how to take a human-readable file and split it into multiple columns. Game changers!

Website Analytics in my pocket using Microsoft Fabric by Catherine Wilhelmsen

Catherine built her own website and blog hosted on Cloudflare. Cloudflare has an API that pulls all the stats. To build this, she worked with Cloudflare GraphQL API. They have one single endpoint for all API calls and queries are passed using a JSON object.

Data is returned as a json object. She built the orchestration inside of Fabric Data Factory.

The first portion sets the date; she has a check for whether she wants to use today’s date or a custom date.

After that, she accesses the data using a copy data activity. Last step is to copy that data into a lakehouse. The file name is dynamic so she can capture the date that the data was pulled from.

The destination is a Lakehouse where she can build a report off of it. The best part of the copy data is the ability to map that nested JSON response into destination columns within the lakehouse. As a bonus for herself, she made a mobile layout so she can open it on her phone and keep track of it on the go.

Next steps for her project – see country statistics, rank content by popularity, compare statistics across time periods.

Calculation Groups & C# the perfect couple by Paulina Jedrzejewska

Love story between calculation groups and C#.

All slides are available on her github: https://github.com/pjedrzejewska

What makes it a perfect couple? We can add C# to our calc group dev to make it much more flexible.

Calc groups don’t allow you to create a visual with customizable sorting of the measures, or to mix in measures that don’t use the calc group within the same visual. For example, if I have previous year and current year in my calc group, it would display all of my current year measures together and all the prior year measures together, but I may want to see current year sales right next to prior year sales instead of having current year sales, revenue, cost, etc. and then the prior year versions much further away. To solve this, we will create separate measures for each time intelligence option. But why do that by hand when we can automate it?

What can we automate? Creating & deleting measures, renaming objects, moving objects to folders, formatting objects.

Why automate? Works through a bunch of objects in seconds, reusable, homogeneous structure and naming, and no more tedious tasks for you!

For this demo, we will use Tabular Editor 2, PBI Desktop, Code Editor (visual studio code). She uses a .cs file to script out the code. All the columns in the fact table that will be turned into a measure have a prefix of “KPI”.

column.DaxObjectFullName will give you the full table, column reference which allows you to reference that dynamically within C# code.

Calculation groups are essentially a template for your measures. It’s a feature in data modeling that allows you to take a calculation item and use it across multiple measures. These are very useful for time intelligence, currency conversion, dynamic measure formatting, and dynamic aggregations.

When creating a calc group, always be sure to include one for the actual measure value by using SELECTEDMEASURE(). You can also refer to other calculation items within others as a filter.

Now that we have a calc group, the C# can loop through every base measure and through all the calculation group items and generate our new measures off of that. Remember all the code is in her github 🙂

Another great use case for C# is to add descriptions to all the measures.

This allows for self-documented code.

2023 in Review – Steve’s Perspective

2023 was another transition year as I continue to navigate my ALS with work. It has become harder to write, or at least more work is required. I use Voice Access day to day, which allows me to operate my system and dictate everything from messaging in Teams to documents and blog posts such as this. While these technologies allow me to continue to contribute in a variety of ways, they are not without their flaws. I find myself doing a lot of correction or relying on friends and family to correct and modify documents and PowerPoints in particular. Despite these limitations, I continue to find ways to contribute and be productive at work and in the community when I am able.

Where I Am Today

At the beginning of the year, I was using a smaller wheelchair but was able to get around well and was still able to go to user group meetings and travel the country. My wife and I made a trip to Disney in May of last year. This was a great opportunity to experience what Disney has to offer for those of us with accessibility issues. I will tell you they do a great job! As the year progressed, I started to lose more functionality in my legs. It was particularly bad after surgery in June. At that point it became increasingly difficult for me to participate in events farther away in the region.

Steve and Sheila at Disney World

I now have a great new power chair which allows me to get around easily and we can use a mobility access van as well. There are still limitations on my ability to travel but we were able to make it out to Kristyna’s wedding in October using these tools.

Kristyna And Steve at her wedding

The other big change for me has been the effect on my breathing. This has limited my ability to talk in long sentences or for long periods of time without requiring a breathing break. Because of this new limitation, I have begun limiting my external speaking and webinars because I would not be able to maintain a long session over 15 minutes or so. As you can imagine, this is very disappointing for me as I spent much of my career in the public speaking arena.

I mention all this not so that you can feel sorry for me, but so that you can understand why you may see me in different circumstances or contributing in different ways I’ve never done before. I have always loved working in the community and wish to do so where I am able to. And more about that next.

Contributing to YouTube

This year we launched the new YouTube channel for Data on Wheels. We launched this channel to give me a voice to support some of the Working with ALS content I have been creating. My first content was related to using Voice Access. This is a series of videos that introduces users to how to use Voice Access in practical ways, with a particular focus on enabling users like me with the capabilities of Voice Access. I use Voice Access for most of my navigation and dictation needs in my day-to-day work. I do, however, lean on Microsoft 365 dictation for longer content creation such as this blog post. Someday I hope to do all of it in one tool.

Fabric 5 video series logo

One of the big contributions I wanted to make was a series of discussions on Microsoft Fabric architectural decision points. That was how the Fabric 5 was born! I was able to maintain about 5 minutes of good conversation regarding Fabric and the various architectural decisions customers should make. This is allowing me to continue to contribute to the community despite the disabilities in front of me. I look forward to contributing even more this upcoming year as Fabric continues to change the landscape of data analytics as we know it.

Spirit of 3Cloud

This year was capped off with my company awarding me the Spirit of 3Cloud award. This award reflects the contributions I have made to the company while battling my disability but at the same time providing support for other team members and growing our organization. I believe this award represents my ability to continue to encourage others to give their best in their lives and in work. Thank you to all of those who continue to support me in this journey, and I hope to continue to represent 3Cloud well throughout it all.

Steve with the Spirit of 3Cloud Award

What’s Next for Data on Wheels

This year I am handing the primary reins of the Data on Wheels blog and YouTube channel over to my daughter, Kristyna, as well as our Data on Rails program for new bloggers. I look forward to seeing great things from her as she continues to grow in her experience and community involvement. You should continue to look for great content from her as she takes the primary role and voice for Data on Wheels. I have truly enjoyed contributing to the blog for over 12 years and will continue to contribute as I’m able to both the blog and the YouTube channel.

I am not totally getting out of all of this but want to express my sincere thanks for all the support you all have given us through the years. As I continue to work through my ever-changing disease, your support continues to be appreciated and I will keep you all up to date.

Thank you and Happy New Year!

About 5 Minutes with Microsoft Fabric

Introducing the Fabric 5 from Data on Wheels

Microsoft Fabric went to public preview this past summer. A lot of content has been created around using the various cool features that the product brings together. As I watched and learned from so many great presenters from Microsoft and the community, I really wanted to take a few minutes to get people thinking about how they would implement on Microsoft Fabric. For example, what would be the best way to implement a medallion architecture in the Fabric ecosystem? What are the options we could look at? Why would we pick a lakehouse over a data warehouse? I hope to give questions like these a few minutes of thought and spur on a conversation.

This will be a weekly video series posted on our YouTube channel. There may be a couple of spot additions here and there if it makes sense and is relevant. Overall, the goal is to strike up the conversation that will hopefully lead to all of us thinking through the best implementations of our data estates on Microsoft Fabric. We plan to drop a video on the fifth day of the week at 5 o’clock somewhere that is about 5 minutes long.

A Quick Word about Our YouTube Channel

The Data on Wheels YouTube channel has a blend of technical and personal content. The technical content will consist of things like the Fabric 5 as well as content related to Working with ALS. I love to talk about technology in the data and analytics space, where I have been working for years. I hope that you will enjoy this content as much as I had fun creating it.

As many of you know, I am battling ALS. I continue to work, and I share the hardware and software that I use to be able to keep working, as well as the technology I love. While this may not be interesting to you, if you know someone who could benefit from learning about these tools, please share that content as well.

I have also chosen to share personal content here. This is primarily content related to my ALS journey that is not directly related to working with ALS. Frankly, this is just a great place to share this with family and friends and you can count yourselves among them if you choose to watch it.

Wrapping it up

I hope you enjoy this Fabric 5 video series. The plan is to provide 5 minutes of content on Fabric on the fifth day of the week around 5:00 PM somewhere. As always, I am happy to share and hope that you find some value in the content we created.