Working with ALS – Insights from the Ability Summit

The 14th annual Ability Summit is a global event that I attended a few weeks ago. It is hosted by Microsoft, and it presents the latest technology innovations and best practices for accessibility and inclusion. The event has three main session tracks: Imagine, Build, and Include. Each track examines different aspects of how technology can enable people with disabilities and make the world more inclusive. The event is free, and anyone can register online to attend. All sessions are recorded and can be watched at any time on demand.

Ability Summit 2024 Highlights

As we think about our enduring commitment and goal at Microsoft, which is to build that culture of accessibility and embed it into everything we do, grounded always by the insights of people with disabilities. – Jenny Lay-Flurrie

In the first keynote, Microsoft CEO Satya Nadella and Chief Accessibility Officer Jenny Lay-Flurrie talked about how AI can remove obstacles and create more accessible experiences, while also addressing the challenges and concerns of responsible AI. The keynote showed several examples of how AI can help people with disabilities, such as voice banking for people with ALS, descriptive audio for people with low vision, and Copilot for people with diverse speech patterns. It was very impressive to see Team Gleason featured as a partner with Microsoft to work on AI to help the ALS community preserve their voice.

Team Gleason and Microsoft Team Up to Give an ALS Person His Voice Back

As a platform company, we have to absolutely lean into that and make sure that everything we’re doing, whether it’s Copilot and Copilot Extensibility or the Copilot stack in Azure is all ultimately helping our ISVs, our customers, our partners, all achieve their own goals around innovation, around accessibility. – Satya Nadella

Build Session: Bridging the Disability Divide with AI

The conference had many sessions and keynotes, but this one about the disability divide and AI was very interesting to me. These are three main points I learned from this session: 1) how people with disabilities are benefiting from AI in their personal and professional lives; 2) advice on how to begin and advance the AI journey with accessibility as a priority; 3) the significance of accessibility as a basic value for developing technologies that enable everyone.

This session also provided some resources and opportunities for us to learn more about AI and accessibility, such as the Accessibility Bot, which is a chatbot that can answer questions about Microsoft’s products and services regarding accessibility topics; the AI Studio, which is a platform that allows users to explore and build AI applications using various cognitive services and SDKs; and the AI Platform Team, which is a group of developers and researchers who work on making AI more accessible and inclusive.

In Real Life

I belong to the ALS community (I have ALS), and I rely on a lot of accessible technology, both hardware and software, to accomplish my work. I used a combination of Voice Access in Windows 11, a Stream Deck foot pedal, a foot pedal joystick on my wheelchair, and Microsoft 365 Copilot to write this blog post. Voice Access helps me with dictation and specific commands like selecting paragraphs or capitalization. The Stream Deck handles backspace and delete. The foot pedal joystick acts as a mouse. Copilot assists me with summarizing and rewriting content. As you can tell, we need a whole set of tools to suit our needs; there is no single tool or method that works for us. I’m excited to see how AI will enhance accessibility for all of us. My goal is to keep sharing the tools and techniques I use to live and work with ALS through my blog and YouTube channel.

SQLBits 2024 Recap

Had an absolutely amazing time at SQLBits this year! It was lovely to see all my data friends again, and I had the opportunity to introduce my husband to everyone as well! In case you missed a session, or are curious about what I learned at these conferences, below are my notes from the various sessions I attended.

Here is a link to my GitHub that contains the slides and code from my presentation on creating a M-agical Date Table. Thank you to everyone who attended! https://github.com/Anytsirk12/DataOnWheels/tree/main/SQLBits%202024

What’s new in Power Query in Power BI, Fabric and Excel? by Chris Webb

Power Query Online is coming to Power BI Desktop and will include diagram & schema views, the query plan, and step folding indicators! It’s currently in private preview and coming to public preview later this year.

Power Query in Excel: nested data types now work (very cool, check this out if you don’t know what data types are in Excel). Python in Excel can now reference Power Query queries, which means we don’t have to load data into Excel! That allows analysis of even more data because you can transform it upstream of the worksheet itself. Very cool.

Example:

import seaborn as sns
mysales = xl("Sales")
sns.barplot(mysales, x="Product", y="sales", ci=None)

Limitations – can’t use data in the current worksheet.

Power query is in Excel online! Later this year, you will be able to connect to data sources that require authentication (like SQL Server!).

Power query can be exported as a template! This will create a .pqt (power query template). You can’t import a template file in Excel, but you can in Dataflows Gen2 in Fabric.

Fabric Dataflow Gen2 – data is not stored in the dataflow itself. We now have Power Query Copilot! It’s in preview, but it’s coming. You will need an F64 capacity to use it.

Paginated reports are getting Power Query!! It will be released to public preview very soon. All data sources available in Power BI’s Power Query will be supported, which allows Snowflake and BigQuery connections and even Excel! You can even pass parameters to the Power Query get-data step by creating parameters within the Power Query editor. Then you go to the paginated report parameters and create the parameter there. Finally, go to dataset properties and add a parameter statement that sets those two equal to each other.

Rapid PBI Dev with ChatGPT, Tips & Tricks by Pedro Reis

Scenario 1: DAX with SVG & CSS.
SVG – the problem is to create an SVG that has conditional formatting and doesn’t fill the entire cell. ChatGPT does best with some working code, so give it a sample of the SVG you want and ask ChatGPT to make it dynamic. The demo uses GPT-4, so we can include images if needed. He uses Kerry’s repo to get the code sample that we put into GPT (KerryKolosko.com/portfolio/progress-callout/).
CSS – uses the HTML Content (lite) visual (it’s a certified visual). The prompt asks GPT to update the measure to increase and decrease the size of the visual every 2 seconds. Pretty awesome looking. He also asked it to move certain directions for certain soccer players (Ronaldo right and Neymar zigzag).

Scenario 2: Data modeling and analysis. He uses Vertabelo to visualize a data model, and it creates a SQL script to generate the model. He passed a screenshot of the data model schema to ChatGPT, and it was able to identify that a PK was missing and a data type was incorrect (int for company name instead of varchar). Super neat. The image does need to be good resolution. He also put data from Excel into GPT to have it identify issues like blanks and data quality problems (mismatched data types in the same column). It could grab invalid entries and some outliers, but it’s not very reliable yet.

Scenario 3: Report design assessment. He grabbed four screenshots from a Power BI challenge and asked GPT to judge the submissions with a 1-10 rating based on the screenshots and to create the criteria to evaluate the reports. Very neat! Then he asked it to suggest improvements. It wasn’t very specific at first, so he asked for specific items like alternatives to donut charts and other issues that lowered the score.

Scenario 4: Troubleshooting. He asks it to act as a Power BI expert. It was able to take questions he found on the forum, but sometimes it hallucinates some answers, and it’s not trained on visual calculations yet.

Scenario 5: Knowledge improvement. He asked GPT to create questions to test his knowledge of Power BI models. It even confirmed the correct answers. It also created a Word doc as a shareable document that can be used to test others in the future.

Streamlining Power BI with Powershell by Sander Stad and Linda Torrang

Data-Masterminds/PSPowerBITools is an open-source module available in the PowerShell Gallery as well as on GitHub. GitHub link: https://github.com/Data-Masterminds/PSPowerBITools.

There’s a module called MicrosoftPowerBIMgmt that is used with a service account to wrap around the REST API and allow easier calls of those APIs through PowerShell. When running PowerShell, to see the results in a grid, add " | Out-GridView" at the end of your command and it creates a very nice view of the data. All demos are available on GitHub. The first demo grabs all the orphaned workspaces and then assigns an admin. Their module includes more information like the number of users, reports, datasets, etc. They also have a much easier module that will grab all the activity events from a given number of days (it will handle the loop for you).

Another great use case is moving workspaces from a P SKU capacity to an F SKU capacity automatically. The admin management module doesn’t work for this, but the PSPowerBITools module can! To see help, put "Get-Help" in front of the command to get details and examples. For more details, add "-Detailed" at the end. Adding "-ShowWindow" at the end of a script will pop open a separate window to see the response (note – it doesn’t work with -Detailed but does contain the details). To export, use this script: $workspaces | Export-Csv -NoClobber -NoTypeInformation -Path C:\Temp\export_workspaces.csv.

The goal is to use the module for more ad hoc admin tasks like removing unused reports, personal workspaces, data sources, and licenses.

New to PowerShell like me? Check out this article on how to automate PowerShell scripts to run on your laptop: https://www.technewstoday.com/how-to-automate-powershell-scripts/.

Using ALLEXCEPT or ALL-VALUES by Marco Russo

ALL: removes all the filters from expanded table/columns specified.
REMOVEFILTERS: like ALL (it’s the same function). We use this for code readability.
ALLEXCEPT: removes filters from expanded table in the first argument, keeping the filter in the following table/columns arguments.

The following produce the same result:
ALLEXCEPT(customer, customer[state], customer[country])
REMOVEFILTERS(customer[customerkey], customer[name], customer[city])
The nice thing about ALLEXCEPT is that it’s future proof against additional fields being added to the table.

But is it the correct tool?

Demo – goal is to get % of continent sales by getting country or state or city sales / total continent sales

To get that total continent sales, we need sales amount in a different filter context. The simplest way to do this is using ALLEXCEPT since we always only want the continent field from the customer table.
CALCULATE([Sales Amount], ALLEXCEPT( customer, customer[continent]))

This works if we include fields from the same table – customer. If we pull in fields from another table, it won’t work as we want because it will be filtered by a different table’s filter context. To get rid of that, you can add REMOVEFILTERS(‘table’) for each one, OR we can change the ALLEXCEPT to use the fact table.
CALCULATE([Sales Amount], ALLEXCEPT( Sales, customer[continent]))

One problem is if a user drops the continent from the visual, the filter context no longer includes continent so the ALLEXCEPT will no longer work properly as it will function as an ALL.

To fix this, use REMOVEFILTERS and VALUES. Using VALUES adds the continent value from the current filter context, then REMOVEFILTERS removes the filter context on customer. Lastly, it applies the value back into the filter context, so we end up with what we want, which is just continent being filtered. This works quickly because we know each country will only have one continent.

CALCULATE( [Sales Amount], REMOVEFILTERS( Customer ), VALUES ( Customer[Continent] ) )

Performance difference? Yes – ALLEXCEPT will be faster, but the REMOVEFILTERS and VALUES should not be too much slower.

Analytics at the speed of direct lake by Phillip Seamark and Patrick LeBlanc

Phil did a 100 minute version of this session on Wednesday.

What is direct lake? We will evaluate it over three categories – time to import data, model size, and query speed. Direct Lake mode avoids the downsides of both import and DirectQuery modes. Direct Lake only works with one data source. Data is paged into the semantic model on demand, triggered by a query. Tables can have resident and non-resident columns. Column data can get evicted for many reasons. Direct Lake falls back to the SQL endpoint for suitable sub-queries. “Framing” provides data consistency in Power BI reports. A Direct Lake model starts life with no data in memory. Data is only pulled in as triggered by the DAX query. Data stays in the model for hours once pulled. We only grab the columns we need, which saves a ton of memory.

Limitations – no calculated columns or tables. No composite models. Calc groups & field parameters are allowed. Can’t be used with views, can only be used with security defined in the semantic model (you can have RLS on the semantic model itself). Not all data types are supported. If any of these are true, Power BI will fall back to Direct Query mode.

SKU requirements: only PBI Premium P and F SKUs. No PPU or Pro license.

Why parquet? The column-oriented format is optimised for data storage and retrieval, with efficient data compression and encoding for data in bulk. Parquet is also available in languages including Rust, Java, C++, Python, etc. It’s the lingua franca of data storage formats. It’s enhanced with V-Order inside of Fabric, which gives you extra compression. Analysis Services can read, and even prefers, that form of compression, so we don’t need to burn compute on the compression. V-Order is still within open source standards, so this doesn’t lock you into Fabric only. Microsoft has been working on V-Order since 2009 as part of Analysis Services. A row group within a parquet file corresponds directly to the partitions/segments in a semantic model. Power BI only keeps one dictionary, but parquet has a different dictionary id for each row group/partition. When calling DirectQuery, low cardinality is key because most of the compute will go to the dictionary values that need to be called across the various partitions. This is why it’s VITAL to only keep data you need; for example, drop or separate times from datetimes and limit decimals to 2 places whenever possible.

You can only use a warehouse OR a data lake. So if you need tables from both, make a shortcut from the data lake to the warehouse and just use the warehouse. You can only create a direct lake model in the online service model viewer or in Tabular editor. You cannot create it in PBI desktop today.

Microsoft data demo. One file was 880 GB in CSV > 268 GB in Parquet > 84 GB with Parquet with V-ORDER.

Sempy is a Python library that allows you to interact with semantic models in Python.

In the demo, Phil is using the sempy and semantic-link libraries. There are lots of advantages to using the online service for building the semantic model – no loading of data onto our machines, which makes authoring much faster. After building this, you can run a DMV in DAX Studio. Next, create a measure/visual that has a simple max of a column. That will create a query plan to reach out to parquet and get the entire column of data needed for that action, then filter as needed. After that, the column will now be in memory. Unlike DirectQuery, Direct Lake will load data into the model and keep it for a bit until eviction. Direct Lake will also evict pages of columns as needed when running other fields. You can run DAX queries against the model using a Python notebook and see via DMV what happened.
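To make that concrete, here is a minimal sketch (not Phil’s exact demo code) of querying a semantic model from a Fabric notebook with sempy; the dataset name and DAX are placeholders:

import sempy.fabric as fabric

# List the semantic models visible from this notebook's workspace
print(fabric.list_datasets())

# A simple MAX over one column - this is the kind of query that triggers
# Direct Lake to page just that column from parquet into memory
df = fabric.evaluate_dax(
    "Sales Model",  # placeholder dataset name
    'EVALUATE ROW("MaxAmount", MAX(Sales[Amount]))'
)
print(df)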

Framing. Framing is a point-in-time way of tracking what data can be queried by Direct Lake. Why is this important? Data consistency for PBI reports, and delta lake data is transient for many reasons. ETL process – ingest data to delta lake tables. What this does is create one version of the truth of the data at that point in time. You can do time traveling because of this, but keep in mind that default models reframe pretty regularly. If you have a big important report to manage, you need to make sure to include reframing in the ETL process so it reloads the data properly. This can be triggered via notebook, the service, the API, etc. You can also do this in TMSL via SSMS, which gives you more granular table control.

DAX to SQL fallback. Each capacity has guard rails around it. If you have more than a certain number of rows in a table, it will fall back to DirectQuery automatically. Optimization of parquet files will be key to stop it from falling back.

Deep (Sky)diving into Data Factory in Microsoft Fabric by Jeroen Luitwieler

Data Pipeline Performance – how copy scales.

Pipeline processing – less memory & less total duration. There’s no need to load everything into memory and then write. Read and write are done in parallel.

Producer and consumer design. Data is partitioned for multiple concurrent connections (even for single large files). Full node utilization. Data can be partitioned differently between source and sink to avoid any starving/idle connections.

Partitions come from a variety of sources including physical partitions on the DB side, dynamic partitions from different queries, multiple files, and multiple parts from a single file.

Copy Parallelism concepts: ITO (Intelligent Throughput Optimization), Parallel copy, max concurrent connections. ITO – A measure that represents the power used which is a combo of CPU, memory, and network resource allocation used for a single copy activity. Parallel copy – max number of threads within the copy activity that read from source and write to sink in parallel. Max concurrent connections – the upper limit of concurrent connections established to the data store during the activity run (helps avoid throttling).

Metadata-driven copy activity. Build large-scale copy pipelines with a metadata-driven approach to the copy activity inside a Fabric data pipeline. You can’t parameterize pipelines yet, but it is coming. The use case is to copy data from multiple SQL tables and land them as CSV files in Fabric. A control table holds the names of the objects (database, schema, table name, destination file name, etc). The design uses a lookup to get this control table, passes it through a ForEach loop, and iterates through it to copy the tables into Fabric.

Dataflows Gen2 – performance optimization and AI-infused experiences. 4 performance principles – delegate to the most capable resource, sometimes you have to do the most expensive thing first, divide and conquer, and be lazy (do as little work as possible).

Query folding – known as query delegation, push down, remote/distributed query evaluation. Wherever possible, the script in Power Query Editor should be translated to a native query. The native query is then executed by the underlying data source.

Staging – load data into Fabric storage (the staging lakehouse) as a first step. The staged data can be referenced by downstream queries that benefit from SQL compute over the staged data. The staged data is referenced using the SQL endpoint on the lakehouse. Staging tips – data sources without query folding, like files, are great candidates for staging. For faster authoring, have one dataflow for staging and another for transformations.

Partitioning – a way to break down a big query into smaller pieces and run in parallel. Fast copy does this in the background.

Lazy evaluation – power query only evaluates and executes the necessary steps to get the final output. It checks step dependencies then applies query optimization to make the query as efficient as possible.

AI-infused experiences – column by example, table by example, fuzzy merge, data profiling, column pair suggestions. Table by example from the web is pretty awesome. You can extract data from any HTML page and have it figure out how to turn that HTML into the table you want to see. It basically allows you to screen-scrape as part of your ETL process, and it will refresh if any changes are made on the website. Table by example for a text/csv file is also amazing; it can figure out how to take a human-readable file and split it into multiple columns. Game changers!

Website Analytics in my pocket using Microsoft Fabric by Catherine Wilhelmsen

Catherine built her own website and blog hosted on Cloudflare. Cloudflare has an API that pulls all the stats. To build this, she worked with the Cloudflare GraphQL API. There is one single endpoint for all API calls, and queries are passed using a JSON object.

Data is returned as a JSON object. She built the orchestration inside of Fabric Data Factory.
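For reference, the raw call she orchestrates looks roughly like this Python sketch (the endpoint is Cloudflare’s documented GraphQL URL; the token, zone tag, and query fields are placeholders, so check Cloudflare’s GraphQL docs for the exact dataset names):

import requests

url = "https://api.cloudflare.com/client/v4/graphql"
headers = {"Authorization": "Bearer <api-token>"}  # placeholder token

# GraphQL query passed as a JSON object; field names here are illustrative
query = """
{
  viewer {
    zones(filter: {zoneTag: "<zone-id>"}) {
      httpRequests1dGroups(limit: 10, filter: {date_gt: "2024-03-01"}) {
        dimensions { date }
        sum { requests pageViews }
      }
    }
  }
}
"""

response = requests.post(url, headers=headers, json={"query": query})
print(response.json())  # the stats come back as a JSON object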

The first portion sets the date; she has a check for whether she wants to use today’s date or a custom date.

After that, she accesses the data using a copy data activity. Last step is to copy that data into a lakehouse. The file name is dynamic so she can capture the date that the data was pulled from.

The destination is a Lakehouse where she can build a report off of it. The best part of the copy data is the ability to map that nested JSON response into destination columns within the lakehouse. As a bonus for herself, she made a mobile layout so she can open it on her phone and keep track of it on the go.

Next steps for her project – see country statistics, rank content by popularity, compare statistics across time periods.

Calculation Groups & C# the perfect couple by Paulina Jedrzejewska

Love story between calculation groups and C#.

All slides are available on her github: https://github.com/pjedrzejewska

What makes it a perfect couple? We can add C# to our calc group dev to make it much more flexible.

Calc groups don’t allow you to create a table with customizable sorting of the measures and have measures not using the calc group within the same visual. For example, if I have previous year and current year in my calc group, it would display all of my current year measures together and all the prior year measures together, but I may want to see current year sales right next to prior year sales instead of having current year sales, revenue, cost, etc. then have the prior year much further away. To solve this, we will create separate measures for each time intelligence option. But why do that by hand when we can automate?

What can we automate? Creating & deleting measures, renaming objects, moving objects to folders, formatting objects.

Why automate? Works through a bunch of objects in seconds, reusable, homogeneous structure and naming, and no more tedious tasks for you!

For this demo, we will use Tabular Editor 2, PBI Desktop, Code Editor (visual studio code). She uses a .cs file to script out the code. All the columns in the fact table that will be turned into a measure have a prefix of “KPI”.

column.DaxObjectFullName will give you the full table, column reference which allows you to reference that dynamically within C# code.

Calculation groups are essentially a template for your measures. It’s a feature in data modeling that allows you to take a calculation item and use it across multiple measures. These are very useful for time intelligence, currency conversion, dynamic measure formatting, and dynamic aggregations.

When creating a calc group, always be sure to include one for the actual measure value by using SELECTEDMEASURE(). You can also refer to other calculation items within others as a filter.

Now that we have a calc group, the C# can loop through every base measure and through all the calculation group items and generate our new measures off of that. Remember all the code is in her github 🙂

Another great use case for C# is to add descriptions to all the measures.

This allows for self-documented code.

DataTune Nashville 2024 Recap

DataTune 2024 was a huge success! Thank you so much to the organizers who spent countless months planning and making this an incredible conference! It was so great to meet so many people and hang out with so many fellow data nerds!

GitHub link for my session: https://github.com/Anytsirk12/DataOnWheels/tree/main/DataTune%202024
Conference link: https://www.datatuneconf.com/sessions.html

Microsoft Fabric for Power BI Users by Alex Powers

Put your data in OneLake and do everything you want. The “OneDrive for data”. Every org has ONE OneLake. You can share across workspaces via datalakes.

There are tons of new Fabric items – what is important for PBI users?

Goal: Get data (OneDrive) > ingest (Dataflow Gen 2) > store (Lakehouse/semantic model) > deliver insights (PBI) with data pipelines to orchestrate.

Start by creating a lakehouse. Tables are in optimized delta/parquet format. Files are unstructured. You can create subfolders within those as well. There’s a OneLake file explorer that lets you drag and drop files into OneLake.

Next is a dataflow. If you want to copy and paste M queries, you can hold down Ctrl on the queries and Ctrl+C then Ctrl+V directly into a cloud dataflow! When merging tables, there’s a light bulb in the top right that will suggest the join option for you so you don’t have to scroll anymore. The visual dataflow allows you to see the lineage of the various tables. Be sure to stage the changes and then publish.

Next is to publish to the lakehouse we made before. You get a chance to double-check the sources and make sure all the data types work with the parquet format. You can also choose to append or replace on refresh. You can also allow for a dynamic schema so it’ll pick up any name changes to columns! A lightning bolt on a table means that it has been optimized. Right now this is all low code/no code, but there may be some announcements at the Vegas conference at the end of the month to allow for writing code instead of using the point-and-click experience.

Next, build a data pipeline. It has 95% parity with ADF pipelines. Create a dataflow step, then configure settings to grab the new dataflow. Add a Teams activity so that when it fails it will send a Teams notification. You just need to sign into Teams and allow access. You can send to a channel or group chat. The UI is a lot like Power Automate.

To set up a refresh, go to the pipeline and schedule it there! The schedule options are way, way easier. Soon there will be an option to refresh a semantic model.

To create shortcuts, go to the lakehouse and create a new shortcut. You can even reference items from other workspaces/lakehouses. Right now, you can also do a shortcut to AWS and ADLS gen 2.

In the lakehouse, you can create a new semantic model. XMLA editing is supported! You can create your DAX measures there as well. You can even create a new report from this semantic model! No need to import any data. Now you can do this all on a Mac, Raspberry Pi, Linux, any OS with a browser. You can also use the SQL analytics endpoint for a SQL experience for navigating the lakehouse that lets you run queries and save them as views or create stored procedures (upper right corner in lakehouse view). It is case sensitive.

Microsoft Fabric + Power Platform + Azure Open AI = Magic Mix by Alex Rostan & Gaston Cruz

The goal: how to combine these technologies together. The focus will be starting with a Power App and ending with data in Fabric.

Check out their YouTube channel! https://www.youtube.com/@PowerMatesMS

Start off by making a Power App using copilot. Prompt = let’s create a new application to audit power bi licenses. Power bi pro and premium per user licenses tied to our organization. It creates a dataverse table within the environment. You can edit this by using another prompt = add a new column to check the latest access to each user for our Power BI Tenant. It created a new column for us! Then just say create app. And boom we have a functional application!

In the tables tab, you can click Analyze and there’s an option to Link to Microsoft Fabric. Dataverse is a service, so depending on the use case it’ll make the correct configuration to optimally grab your data. That would allow you to create Power Apps that people input data into then instantly see it in Fabric in real-time.

Now we go to Fabric. Dataverse is a native connection in data pipelines. Use a dataflow for ETL and a pipeline for orchestration. Then determine where to drop the data. You can choose a Lakehouse (structured & unstructured data), KQL database (near real-time), or Warehouse (structured data).

Data activator is the way to go for alerting.

Once you create a semantic model, you can create a report on top of it. This will use Direct Lake mode (the perfect combo of import and DirectQuery). No refresh needed. The Lakehouse holds the data for the model to leverage, and the report will automatically grab data from there.

The Problem with Pipelines: Building Extensible Data Code by Matthew Kosovec

The goal is to apply software engineering principles to data pipelines.

Pipelines can turn ugly fast when the requirements change or when introducing new source systems. You can easily end up with code that isn’t used.

Software Engineering concepts:

  1. DRY (don’t repeat yourself)
  2. SRP (single responsibility principle)
  3. Pure Functions Only

DRY – Every piece of knowledge must have a single, unambiguous, authoritative representation within a system. Ways to dry off: CTEs, views/materialized tables, macros/UDFs (user defined functions), ETL packages. Challenges: encapsulation and composability (how generic is the code snippet that can be reused), dependency management, refactoring and regression testing. The smallest unit of work is a single SELECT statement (table, view, etc.), which can be the challenge for encapsulation.

DRY example – you need a feature engineering (FE) table that is the summation of a BI table. To keep this DRY, you can build the FE table on top of the BI table to avoid the code duplication. The con here is that FE is now dependent on the BI code. The additional layer can also create slower refresh cycles. This also may not always be possible because the relational logic may be different enough.

SRP – separation of concerns. Gather together the things that change for the same reasons. Separate those things that change for different reasons. Medallion architecture is a common approach. Pros: it helps separate concerns between layers and provides a baseline for architectural governance and reviews for pipeline design. Cons: it lacks structure for the most important part of pipelines (the transformation code!) and often devolves into multiple sub-layers of storage to help with DRY or performance.

Pure Functions Only – data pipelines are typically imperative: they focus on step-by-step programs and use statements that change a program’s state. The goal is to have zero side effects, ultimate composability, and DRY by default. Because you can reuse everything in small bits, you don’t need to rewrite your code over and over.

Example – Pure functions: modify a single row in a single column, with unlimited input columns/rows, and can create a new column. Examples are scalar calcs, UDFs, and rollups to the current grain. Semi-pure functions: modify multiple rows in a single column, unlimited input columns/rows, can create a new column. An example is a window function like ranking. Impure/non-functional: modify multiple rows in multiple columns, create multiple columns. Examples are CTEs/procedures, unpivot, and CDC/refresh processes.
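As a rough PySpark illustration of that classification (my own sketch, not DataForge code):

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", 10.0), ("A", 5.0), ("B", 7.0)], ["product", "amount"]
)

# Pure: one output row per input row, new column derived from the same row
pure = df.withColumn("amount_with_tax", F.col("amount") * 1.07)

# Semi-pure: still a single new column, but each value depends on other rows
w = Window.partitionBy("product").orderBy(F.desc("amount"))
semi_pure = pure.withColumn("rank_in_product", F.rank().over(w))

# Impure/non-functional: reshapes multiple rows and columns at once
impure = semi_pure.groupBy("product").pivot("rank_in_product").sum("amount")
impure.show()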

Making code DRY and Functional:

  1. Use SRP to break down SQL into small, functional, composable parts (SELECT, FROM, etc.)
  2. Build a system to check for DRY-ness and ensure referential integrity between objects
  3. Develop code generator to create runnable SQL from objects
  4. Use SRP to define processing steps to separate functional from non-functional (bonus step but has some serious benefits by repeating the process)

DataForge example:

Step 1 – identify the composable parts. Examples: aliasing (columnA as A), scalar calculation (ie CAST (A as int)), aggregation (existing grain aka rollup), aggregation (new grain), Filter (WHERE), all joins (including cardinality), etc. Sorting those into object groups. Rules = aliasing, scalar calcs, aggregation rollup. Channel mapping = filter, aggregations (new grain). Relation = joins. Not comprehensive, there’s tons.

Step 2 – design and build a meta-repo database to store and relate objects. Build monitoring services to check for duplicates on save and compilation. This gets stored.

Step 3 – Take the broken down model and use a SQL Code Generator which can write DDL (change tables) then Pure SQL (do all the SQL at once, no need for intermediate storage) then Impure SQL (everything else). The development is no longer the entire table, you write a function that can create the pieces you need. The system can do a bunch of checks to see if there’s any downstream effects or if it already exists. Then it gets orchestrated.

There are vendors who build data mappings from source data to target system and build adaptors that can work with DataForge. Matthew says this is fairly easy to do on your own as well. The goal of this is that when you make changes, you aren’t too worried since there are checks for duplication and reuse.

Step 4 – blow up the medallion architecture. Non-functional/imperative (ingest – parse – capture data changes) > pure (enrich) > semi-pure (refresh recalc – output) > imperative (custom post output). Everything goes through every step. You end up not needing orchestration. Grouping pure functions together removes that need and helps prevent the eventual chaos/spaghetti code.

This isn’t about building the first thing. For that, pipelines are pretty straightforward. This is more about adding to it. You end up building your lineage through the metadata which allows for increased observability.

DataForge has an out-of-the-box solution to this – www.dataforgelabs.com. This solution works with existing pipelines. Under the hood, this is databricks so it can integrate with a lot of different code bases. They have two ways to do this – an IDE/UI (tons of checks with real-time feedback) and a YAML config language which then is compiled in their system through an import process where the checks are then performed. On the roadmap is a CLI to compile locally. There’s a private cloud version and a SAAS version.

SQL Saturday ATL – BI Recap

First of all, wow! What an incredible event! Loved seeing everyone and meeting so many wonderful new people. Thank you to everyone who organized, spoke, and attended this event! Below are my notes from the sessions I attended as well as a link to my github with all the resources from my talk. Thank you again to all those who came to my session, you were all an amazing crowd and I really enjoyed learning with you all!

Link to the event and speakers: https://sqlsaturday.com/2024-02-10-sqlsaturday1071/#schedule

Introduction to Fabric Lakehouse by Shabnam Watson

For SQL Server users, a lakehouse will be used for landing zone and the data warehouse will be used for silver and gold layers. Currently, no demonstrated difference in price/performance between lakehouse and warehouse. Raw/Bronze = likely a lot of small files. Use with a lakehouse. Lots of ways to do (or not do) medallion architecture.

Cost is now capacity based. If you go over your capacity, there’s some smoothing but if you’re still over then they will throttle performance until smoothing catches up with your existing capacity. There is no per user license for Fabric items. F SKUs can be purchased on Azure portal, but P SKUs are through M365 licenses. F2 jobs will not be killed in the short term and the experience will match a F64. Allows you to play around for a cheaper price.

Parquet files are immutable, so Delta tables are key. Delta builds on top of parquet to soft delete (it toggles a parquet file as inactive). You can time travel because of that, but there is a cost since it requires a lot of storage. Fabric engines write Delta with V-Order (VertiPaq ordering). Data is automatically optimized for the PBI VertiPaq engine. For example, Databricks writing to a data lake will run slower in PBI than a Fabric notebook writing to that data lake, since the latter will be optimized.

Lakehouse works directly with files in the lake (workspace). Data ingestion options: Low Code/No Code (pipelines, dataflows gen2, shortcuts) and code first (notebooks, pyspark, sparkSQL). Data transformation uses the same tools. Data orchestration is through pipelines or notebooks.

Data load considerations – use case (recommendation):
Small file upload from local machine (use local file upload)
Small data (dataflows)
Large data (Copy tool in pipelines)
Complex data transformations (notebook code)

Use pipelines to orchestrate, not to transform. Use dataflows or notebooks to transform data. Notebooks can optimize your data files into larger files for faster runs.

Power BI import mode will still be fastest and direct query will still be more real time, but direct lake will allow for near real time and nearly as fast as import since there are no more scan operations. Limitations with direct lake – web modeling only (desktop is coming), it will fall back to direct query when capacity runs low which will slow your reports.

You can tie down security for the SQL objects that are queryable via the SQL endpoint. You can query cross-workspace, so you can lock down the editing of data sources in a different workspace from where people consume it. Spark runs much faster in Fabric than in Synapse because the clusters are always up in Fabric, so it doesn’t need to spin up a cluster to run your workload. The lakehouse side requires notebooks to interact with it. You can toggle to the SQL analytics endpoint within the service to build relationships, create measures, stored procedures, views, etc. Visual query allows you to drag and drop tables to generate SQL behind the scenes and lets you save that SQL as a view. You can also manipulate the data using M queries. You can currently use CI/CD with semantic models, reports, and notebooks.

How to Weave DataOps into Microsoft Fabric by John Kerski

The goal is to make production easier by combining principles from DevOps, Agile, and Lean manufacturing (every transformation step needs to be tested for quality). If you check early, you’ll catch issues before the customer does. John has a video on this on YouTube as well.

GitHub: https://github.com/kerski
GitHub for this subject: https://github.com/kerski/fabric-dataops-patterns

Principles:

  1. Make it reproducible using version control
  2. Quality is paramount (TEST)

Pattern 1 – DAX Query View Testing Pattern

Prerequisites: DAX query view, PBIP format, Azure DevOps/Git integration (Visual Studio Code)

Use a naming convention for the tabs because we can pull these out of the PBIP file later. It’s recommended to use Measure.Tests and Column.Tests. You’ll also want to apply a schema like TestName, ExpectedValue, ActualValue, Passed. John has a template for these tests; it’s essentially a union of rows. When you build the test cases, use the performance analyzer to grab the DAX from a KPI card visual that has your testing values.

Good examples of this are making sure your date table has the correct number of days or your yearly sales are within a threshold. This only really works if you have expected values. This can be achieved by having a static dataset that eliminates variables of error to ensure your DAX is or is not the issue.

You can also test if your table still has unique values, or the % of null values in a table. DAX query view has a quick query feature to get the column features. AMAZING. You can now use the INFO.TABLES() DAX function that came out in December to grab the schema and see if it’s changing (which can break visuals). This is also vital if a data type changes and it breaks the relationship. You do need to hard code and define the schema of the table (easily create this using another script).

If you save this in the PBIP file, all your DAX queries are stored within the .dataset folder in a folder called DAX Queries! That means you can easily see if a test has changed during the PR and also easily add these tests to other models. It may be wise to have a sample set of tests with instructions for setup that you can easily copy/paste into other existing dataset folders.

You’ll need your workspace governance set up (dev, test, prod). Standardize your schema and naming conventions for the tests (name.environment.test). Build tests – test calcs, content, and schema.

Pattern 2 – Gen2, Notebook Automated Testing Pattern

Devops stores DAX queries > Dataflow Gen 2 can call those queries > store results in Lakehouse. Leverage notebook to orchestrate and output back into Lakehouse for PBI report to consume.

Push the PBIP into the DevOps git repo so DevOps has the DAX queries. You do need to have environment variables (workspace GUID, branch) called out in a config file (John has instructions in his GitHub). You can use the Azure DevOps Services REST APIs to grab the items in a project/repo from a dataflow. The dataflow is called API – Azure DevOps DAX Queries – Gen 2. The lakehouse consumes the data from the dataflow, then a notebook uses Semantic Link to query the dataset using those DAX queries. You do need to have the public library for semantic link preloaded in the environment instead of using pip install. John’s sample contains easy variables to set for the branches.
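The notebook step boils down to something like this hedged sketch (dataset, workspace, and query names are placeholders; John’s repo has the real implementation):

import sempy.fabric as fabric

# In the real pattern the DAX test queries come from the DevOps repo via the
# Gen2 dataflow; here one test is hard-coded as a placeholder
test_queries = {
    "Measure.Tests": 'EVALUATE ROW("TestName", "Total Sales > 0", "Passed", [Total Sales] > 0)',
}

for name, dax in test_queries.items():
    result = fabric.evaluate_dax("Sales Model", dax, workspace="Dev Workspace")
    print(name)
    print(result)  # inspect the returned columns (e.g. Passed) and land them in the lakehouse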

Not using data factory in Fabric to call the APIs since there are a number of APIs missing for that product at the moment, but you can use a pipeline to run the dataflow and notebook in a scheduled fashion. You can work with DevOps pipeline to call this as part of a PR requirement.

Data activator could be used to trigger alerts. Be sure someone is actually monitoring the report tool since email is not super reliable.

Modernizing ETL: Leveraging Azure Databricks for Enhanced Data Integration by Joshua Higginbotham

GitHub: https://github.com/Jhiggin/Common_ETL_Patterns_Using_DataBricks/tree/main

Not everything here is transferable to Fabric, but a lot of it is.

What is Azure Databricks? Unified data analytics platform. Contains data engineering, analytics, ML, AI.

Common Architecture: ingest using ADF > store in ADLS > prep and train in Databricks > Azure Synapse and AAS to model and serve.
Microsoft recommends that ADF does extraction (E) and Databricks does transformation (T), but there are reasons to do more of the E within Databricks.

Delta Lake architecture: bronze/raw > silver/filtered > gold/business-level aggregates. Delta lake allows you to incrementally improve the quality of data until it’s ready for consumption. Silver could contain a data lake; there are lots of ways to do this medallion process.

New to Databricks? Check out DataCamp and MS learning pathways. Josh can provide some resources for his favorite people to follow.

You can deploy Databricks to a set region, which can make it easy to avoid the egress charges or latency that would be caused by having the tenant in a region different from the data.

You have to set up access to the workspace and the items within it. Using security groups for this is recommended. You can link this to source control (GitHub and Azure DevOps) and do everything, including PRs, within the Databricks UI.

Compute has lots of options. All-purpose can do everything, but may not be the fastest or cheapest option. The summary tells you how many workers, the runtime, memory, cores, and features that are enabled. Photon-enabled is much faster. Tags are super useful for assigning accounting details so you can assign spend to teams. Don’t mess with your configuration unless you know what you’re doing. One useful config is Azure Data Lake Storage credential passthrough to pass creds through, but now it’s recommended to use Unity Catalog. SQL warehouses now have a serverless option that allows quick spin-ups and auto shutdown when not in use to save money. Policies set up limits for scale or who can use what compute.

A mount point tells Databricks where the data is stored within ADLS. You can mount it once and make your code much more readable and secure without the full ADLS path. dbutils.secrets.get allows you to grab information from Key Vault without exposing the actual values.
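A rough sketch of what those two pieces look like in a Databricks notebook (all names are placeholders, the exact extra_configs depend on how your storage authentication is set up, and dbutils/spark are predefined by the notebook runtime):

# Pull a secret from an Azure Key Vault-backed secret scope; the value itself
# is never displayed, only passed along
storage_key = dbutils.secrets.get(scope="kv-scope", key="adls-account-key")

# Mount the container once, then reference the short mount path everywhere
dbutils.fs.mount(
    source="abfss://raw@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs={
        "fs.azure.account.key.mystorageaccount.dfs.core.windows.net": storage_key
    },
)

df = spark.read.parquet("/mnt/raw/sales/")  # readable path, no full ADLS URL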

Recommendation – explicitly set your schema, don’t allow databricks to guess.

Recommendation – don’t worry about partitioning delta lake until you reach 7 million records since the gains are minimal.

When you use delta tables, you can look at the delta log to see changes and time travel.
You can use %sql DESCRIBE DETAIL table to see metadata for a table.

You can use Partner Connect to open data in Power BI Desktop. Keep in mind that if it’s too large it will connect with DirectQuery, which means your cluster will always be up and costing you a lot of money.

%sql DESCRIBE HISTORY will let you see what the data looked like as of a certain date. This is very dependent on how much data you’re storing/pruning.

You can create CLONEs of a delta table, so you can clone the table as of a certain time. This can also create snapshots.
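Putting those last few notes together, a quick sketch of history, time travel, and clones in a Databricks notebook (the table names are placeholders):

# Table metadata and the per-version change log
spark.sql("DESCRIBE DETAIL sales").show()
spark.sql("DESCRIBE HISTORY sales").show()

# Time travel: read the table as of an earlier version
old_df = spark.sql("SELECT * FROM sales VERSION AS OF 0")

# Clone as of a point in time (shallow = metadata only; deep is the default)
spark.sql("CREATE TABLE sales_snapshot SHALLOW CLONE sales VERSION AS OF 0")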

Does a clone persist through vacuum? A shallow clone is a metadata-only copy; deep is the default, which also copies the full data. Unsure if either is impacted by vacuum, but the idea is that a clone typically isn’t persisted.

Configuration_Driven_Development_Demo is another notebook he walked us through (see GitHub). He has a YAML config file that holds the configuration for a certain client (the file path, the schema for the table, and transformations). It does a lot of things, including mapping SQL data types to Spark data types. This pipeline is for more repeatable ETL patterns.

Power BI and External Tools: This is the way! by Jason Romans

Goal is to make you a jedi of Power BI.

History of tools – back in the day you would start with a project in Visual Studio. Behind the scenes it was mostly editing the JSON file. Tabular Editor came out as a glorified JSON editor.

SSMS (SQL Server Management Studio): dependable and very useful for SQL. No DAX formatting, and no column names displayed. It can process (full refresh, recalculate, etc.) premium datasets, and you can schedule that using SQL jobs. It can’t connect to PBI Desktop. Great for a DBA.

Azure Data Studio (ADS): shiny but not the tool you’re looking for. Cannot connect to analysis services nor pbi desktop.

DAX Studio: the analyst. It has IntelliSense, links to the DAX guide, DAX formatter, and performance tuning. You can see hidden date tables. Select “Server Timings” in DAX Studio, then run a query to get all the server timings and understand where it’s spending its time. You can also use DAX Studio to compare two versions of a measure to see which runs faster.

Tabular Editor: the builder or tinkerer. There is a free and a paid version. Create and edit measures & calculation groups. You can also do C# scripting and automation (my favorite!) and command line integration! It can help with source control. The Tabular Editor GitHub has tons of easy-to-use scripts for common BI tasks (Tabular Editor GitHub and Useful Scripts Blog). The Best Practice Analyzer can easily apply fixes and script out a fix.

PBI Explorer – the navigator. It’s able to explore changes. The PBIP format is new and lets you compare projects using VS Code. A gitignore file is included in PBIP to tell git to ignore the data itself and only bring in metadata. PBI Explorer can visualize the report differences. You don’t need the project file format to use PBI Explorer. To run it, you have to run the executable – it’s not located in the External Tools tab in Power BI Desktop. It can actually visualize the changes, and there’s additional functionality to even explore and analyze the visuals.

Bravo – https://bravo.bi/. The jack of all trades. This one seems to fill in the gaps. It can help optimize the data model by looking at unused columns, size, and cardinality. It allows the adding of custom date tables. It exports data. The first screen analyzes the model for you. It also allows you to format DAX. You’ll get a warning for two things on date tables – 1 is if you still have auto date/time enabled, 2 is if you already have a Date or Holiday table in the model. If you give the Bravo date table a new name, it will resolve issue number two. The Bravo date table is a DAX calculated table.

Semantic Link – A Magic Cauldron to Unlock Power BI with Python by Stephanie Bruno & Stacey Rudoy

Blog – https://data-witches.com/

Content at – bit.ly/ILDpres

Check out fabric.guru for the inspiration behind this session. Here’s the specific blog post relating to semantic link and a data catalog (https://fabric.guru/fabric-semantic-link-and-use-cases#heading-what-about-power-bi-developers).

Semantic link is a Fabric feature announced in Dec 2023. https://learn.microsoft.com/en-us/fabric/data-science/semantic-link-overview

Features – we can evaluate DAX queries, get metadata, and take actions like kicking off a refresh.

Two ways to install – pip install semantic-link or create an environment. When you create an environment, you can add libraries and configure Spark properties. Once created, you can set a notebook to use it.

Why use it? Document the tenant (semantic model catalog), save snapshot of semantic model, download usage metrics.

Data Quality – there’s a find dependency script that allows you to see what columns rely on others. You can also use the Great Expectations library.

For REST APIs, use the FabricRestClient to call the Fabric REST endpoints.
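For example, a minimal sketch of using it (the relative path below is the standard list-workspaces endpoint; authentication is handled by the client):

from sempy.fabric import FabricRestClient

client = FabricRestClient()
response = client.get("/v1/workspaces")  # any Fabric REST endpoint path works here
print(response.json())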

Usage metrics! With semantic link, you can loop through all the workspaces and query the model to dump the data directly into a Datalake! You cannot connect to the Admin monitoring workspace using semantic link (not supported currently). So excited to be able to loop through all the workspaces one has access to!

Before running a notebook in Fabric, you need to create an environment. Add the Python library for semantic-link. Then save and publish, and you can use this within the notebook. If you don’t use an environment, you’ll have to pip install semantic-link. There is a Learn module about the semantic link library (sempy fabric read table). If you call datasets() without anything in the parentheses, it will grab the datasets within the workspace that the notebook lives in. You can also plot relationships.

You can also query the DMVs if needed.

Ooo, there’s an option for read_table so you don’t need to evaluate a DAX expression to return a Fabric dataframe. To send to the lakehouse, all you need to use is the to_lakehouse_table function and it creates or appends data to a table in the lakehouse. Keep in mind, you’ll want to remove spaces from column names for lakehouse functionality. You can only use to_lakehouse_table if your dataframe is a Fabric dataframe. Otherwise, you’ll need to use write.format("delta").mode("overwrite").
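A tiny sketch of that flow (the dataset and table names are placeholders):

import sempy.fabric as fabric

# Read a table from a semantic model into a Fabric dataframe
customers = fabric.read_table("Sales Model", "Customer")

# Strip spaces from column names before landing in the lakehouse
customers.columns = [c.replace(" ", "") for c in customers.columns]

# Create or append to a lakehouse table
customers.to_lakehouse_table("customer_snapshot", mode="overwrite")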

Phil Seamark has a great blog about this as well: https://dax.tips/2023/12/05/visualize-power-bi-refresh-using-sempy/.

Power BI – Performing Data Quality Checks Using Python & SQL by me 🙂

Abstract:

Picture this, you have a report in Power BI that someone passes off to you for data quality checks. Simple enough until they clarify they need every measure checked against the source database.
There are a few ways to make sure the measures match what is in the source data system, but you need something easily repeatable and self-documenting. In this presentation, I walk through how to use Python and Excel to perform the data quality checks in one batch and document any differences. In order to do that, we build a Python script that can call the Power BI REST APIs, connect to a SQL Server, and read/write to Excel.
By the end of this session, attendees will have a plug and play tool to check data from SQL against Power BI.

Link to GitHub with Power Point and final scripts: https://github.com/Anytsirk12/DataOnWheels/tree/main/SQL%20Saturday%20ATL%20-%20BI%202024
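For anyone skimming, the gist of the approach looks something like this highly simplified sketch (the complete, working scripts are in the GitHub link above; the server, database, dataset ID, measure, and response parsing here are placeholders):

import requests
import pandas as pd
import pyodbc

# 1. Run a DAX query through the Power BI REST API executeQueries endpoint
token = "<access token>"       # placeholder - acquired via Azure AD
dataset_id = "<dataset guid>"  # placeholder
url = f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/executeQueries"
body = {"queries": [{"query": 'EVALUATE ROW("Total Sales", [Total Sales])'}]}
pbi_json = requests.post(url, json=body, headers={"Authorization": f"Bearer {token}"}).json()
rows = pbi_json["results"][0]["tables"][0]["rows"]
pbi_value = list(rows[0].values())[0]

# 2. Get the same number straight from the source SQL Server
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")
sql_value = pd.read_sql("SELECT SUM(SalesAmount) AS TotalSales FROM dbo.Sales", conn)["TotalSales"][0]

# 3. Document both values and whether they match in Excel
pd.DataFrame(
    {"measure": ["Total Sales"], "power_bi": [pbi_value], "sql": [sql_value]}
).assign(match=lambda d: d["power_bi"] == d["sql"]).to_excel("data_quality_checks.xlsx", index=False)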

How to Call REST API using VS Code

Sometimes you need a fast way to test your API calls and responses, and VS Code is the perfect tool for that. For today’s demo, we will use a free API to test with.

https://catfact.ninja/fact

Keep in mind, this is a very very simple REST API. For more complicated REST API calls, make sure to refer to the documentation around the API itself to know the required and optional parameters.

To start, we will need to install a REST API extension in Visual Studio Code. To find extensions, navigate to the block-looking icon and search for “rest” in the search bar. When picking an extension, I recommend picking one with a lot of installs and high star reviews. For our demo, I’m going to install the first option – “REST client”.

Now that we have a REST extension, we can build a rest file to run our code in. To set up a file to run HTTP requests (aka REST API calls), we need to make a text file with the extension “.http” or “.rest”. To do this, create a new file in VS Code and save it as “api_demo.rest”. You can name it whatever you’d like, but the key here is to have .rest at the end.

Make sure the file extension is set to “No Extension” so that the .rest can change the text file to a rest file. Now that we are inside a rest enabled file, we can build out our API call. There are a few standard REST API call types such as GET and POST. For this API, we will use GET to get information. It’s typically as simple as it sounds, but definitely triple check documentation if you’re unsure what call to use.

Add the following code to your file: GET https://catfact.ninja/fact

The GET should be colored, and a small “Send Request” link should appear that allows you to send the request.

Select Send Request, and boom! You’ve run your first REST API! The results will pop up on the right. Happy coding!