Consumption Based Architecture for Modern Data Analytics

Throughout many years of working with BI solutions and data warehouse solutions, we have strived to put all the data in one location so it could be easily consumed by reporting and analysis tools from SQL Server Reporting Services to Microsoft Excel. We have followed the dimensional modeling processes promoted by Ralph Kimball and others. These techniques were developed to turn relational data platforms into viable and well-performing reporting platforms. They worked.

Throughout the years, I have built many star schemas, enterprise data warehouses, and reporting databases based on these techniques. However, they were not without their flaws. In the early days, it would take years to create the enterprise data warehouse. When it was done, the business had already moved on. So, we started creating data marts which were departmentally focused. This allowed us to shorten the development lifecycle to more quickly meet the needs of the business. But the speed of business continued to outpace IT’s ability to deliver effective BI solutions when they were needed.

During this same time, the world’s most ubiquitous BI tool, Microsoft Excel, ran more and more businesses, both large and small. Why was Excel so popular? It allowed the user, not IT, to do the analysis and produce results when the business needed them. Self-service BI is not new; we just refuse to accept Excel as a “real” BI tool in the industry. Inevitably, no matter how good your BI or reporting tool is, users want to know, “How do I export this to Excel?” I found it humorous that Microsoft suffered the same problem with Power BI. The ability to extract the data is hugely important to users. Why? So they can do the analysis on their own. They feel empowered with Excel.

In today’s world, we are also witnessing a shift to a more mobile, tech-savvy group of users. As my teenage and young adult children begin to enter the workforce, I still see a disconnect with enterprise BI solutions in most cases. They want the data at their fingertips and easily consumable to answer the question they have now, not in three weeks. That brings us to the architecture I have been promoting for a few years. It’s not new, but I needed a way to talk about modern data and BI solutions that focused on one of the most significant needs in the business: consumable data.

The Consumption Based Architecture is based on the following key concepts:

  1. Keep the data close to the source
  2. Data interfaces should be easy to use
  3. Modern, in-memory tools make this possible

Keep the data close to the source

In the normal enterprise data warehouse solution, we process data to clean it up, reshape it, and generally make it “better”. However, the reality is that the users don’t see it as better if it does not match the source. The only “source of truth” that matters in reality is the one that is as close to the data entry as possible. If that data is wrong, it needs to be corrected there. Why? Because users will always verify that the data warehouse or any BI solution is correct by checking the system of record.

Besides the source of truth issue, any time data is transformed or shaped differently than the source, documentation and maintenance are required. In most cases, documentation is lacking in data warehouse solutions. It’s hard and changes often. Furthermore, the amount of maintenance required to support a traditional data warehouse solution is a problem. When the business needs to change a field or the source changes, how do you measure how long it will take to get that in the data warehouse? Days? Weeks? Months? Years? By the time we have the change in place, tested and ready for use, the business has moved on.

In Consumption Based Architecture, the goal is to move the data as needed and transform it as little as possible. Operational Data Stores, which are replicas or copies of the data from the source systems, are the best mechanisms to move the data to a reporting area. Only transform if absolutely necessary. In most cases, our transforms will involve data type cleanup or, in cases where fields have changed use over time, adding a column to clarify the data. Each time you move or transform data, it should be to make the data more consumable.
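As a rough illustration of how small such a transform can be, here is a Python sketch. The field names (`order_date`, `region`), the text date format, and the cutover date are all made up for the example; the point is only that the source values pass through untouched apart from a type fix and one added clarifying column.

```python
from datetime import date, datetime

# Hypothetical cutover: assume the source system started reusing the
# "region" field for shipping zones on this date.
REGION_CUTOVER = date(2015, 1, 1)


def to_reporting_row(source_row: dict) -> dict:
    """Copy a source row, changing only what consumers need fixed."""
    row = dict(source_row)  # every source field is carried over unchanged

    # Data type cleanup: the (assumed) source stores dates as MM/DD/YYYY text.
    row["order_date"] = datetime.strptime(source_row["order_date"], "%m/%d/%Y").date()

    # Clarifying column: say what "region" means for this row instead of
    # rewriting the value itself.
    row["region_meaning"] = (
        "sales_territory" if row["order_date"] < REGION_CUTOVER else "shipping_zone"
    )
    return row


source_rows = [
    {"order_id": 1, "order_date": "03/15/2014", "region": "NE"},
    {"order_id": 2, "order_date": "07/04/2016", "region": "Z4"},
]
reporting_rows = [to_reporting_row(r) for r in source_rows]
```

Because `region` itself is never altered, a user who checks the reporting copy against the system of record will still see matching values.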

The primary exception to this rule is dimensional data. Dimensional data has a special place in the architecture. While it is not necessary to transform this data, the overall solution is improved by combining similar dimensions and using natural keys for relationships. For example, if you have a customer record in CRM and shipping data, you can create a “golden record” or conformed dimension which contains both natural keys. This will allow you to cross reference data easily in both solutions. Master data solutions help with this but are not required to be successful in this architecture.
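To make the “golden record” idea concrete, here is a small sketch using Python’s built-in `sqlite3` module. All table and column names are hypothetical, and matching CRM and shipping customers on a case-insensitive email address is purely an assumption for the example; a real solution would use whatever matching rule (or master data tool) fits the business.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical copies of two source systems, each with its own natural key.
    CREATE TABLE crm_customer (crm_id TEXT PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE shipping_customer (ship_acct TEXT PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE crm_opportunity (crm_id TEXT, amount REAL);
    CREATE TABLE shipment (ship_acct TEXT, packages INTEGER);

    INSERT INTO crm_customer VALUES
        ('CRM-1', 'Acme', 'ops@acme.example'),
        ('CRM-2', 'Globex', 'buy@globex.example');
    INSERT INTO shipping_customer VALUES
        ('SH-77', 'ACME Corp', 'OPS@acme.example'),
        ('SH-88', 'Globex', 'buy@globex.example');
    INSERT INTO crm_opportunity VALUES ('CRM-1', 1000.0), ('CRM-1', 500.0), ('CRM-2', 250.0);
    INSERT INTO shipment VALUES ('SH-77', 3), ('SH-88', 1), ('SH-88', 4);

    -- Conformed dimension: one row per customer, carrying BOTH natural keys.
    -- Matching on email is an assumption made for this sketch.
    CREATE TABLE dim_customer AS
    SELECT c.crm_id, s.ship_acct, c.name
    FROM crm_customer AS c
    JOIN shipping_customer AS s ON lower(c.email) = lower(s.email);
""")

# Cross-reference both systems through the dimension, using each system's
# own natural key to reach its data.
cross_reference = conn.execute("""
    SELECT d.name,
           (SELECT SUM(o.amount) FROM crm_opportunity AS o WHERE o.crm_id = d.crm_id),
           (SELECT SUM(s.packages) FROM shipment AS s WHERE s.ship_acct = d.ship_acct)
    FROM dim_customer AS d
    ORDER BY d.name
""").fetchall()
```

Neither source table is reshaped here; the dimension only records which keys refer to the same customer.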

Data Interfaces Should Be Easy

Having done a lot of work in the application development field where interfaces are used to simplify programming, I think we should have a similar concept in working with data. We have dabbled in the area for some time. I know that I have used views with schemas (or users in Oracle) that were specifically designed to support SQL Server Analysis Services. We used views to support a consistent data set to the cubes for processing and allow changes in the back end data as needed. The views operated as an interface between the data warehouse and the cube.

In a similar fashion, these interfaces need to be identified and used throughout the Consumption Based Architecture. If you are referencing a relational data structure, views continue to make sense. They allow consumers to interact with the data in a known fashion. They are also logical constructs which can be deprecated with a timeline for users to move off of them to the newer versions.
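Here is a minimal sketch of a view acting as an interface, again using Python’s built-in `sqlite3` module with made-up table, column, and view names. The back-end table is replaced with a differently shaped one, and the consumer’s query keeps working because it only ever touches the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical replica of a source-system table.
    CREATE TABLE ods_orders (
        order_id INTEGER PRIMARY KEY,
        cust_no TEXT,
        order_total_cents INTEGER
    );
    INSERT INTO ods_orders VALUES (1, 'C-100', 12500), (2, 'C-200', 990);

    -- The interface: consumers query this view, never the table behind it.
    CREATE VIEW v1_orders AS
    SELECT order_id,
           cust_no AS customer_number,
           order_total_cents / 100.0 AS order_total
    FROM ods_orders;
""")


def consumer_report(connection):
    """A consumer's query, written once against the view."""
    return connection.execute(
        "SELECT customer_number, order_total FROM v1_orders ORDER BY order_id"
    ).fetchall()


before = consumer_report(conn)

# Back-end change: the data moves to a new table with different column names
# and units. Only the view definition is updated to match.
conn.executescript("""
    CREATE TABLE ods_orders_new (
        order_id INTEGER PRIMARY KEY,
        customer_number TEXT,
        order_total REAL
    );
    INSERT INTO ods_orders_new
    SELECT order_id, cust_no, order_total_cents / 100.0 FROM ods_orders;

    DROP VIEW v1_orders;
    DROP TABLE ods_orders;

    CREATE VIEW v1_orders AS
    SELECT order_id, customer_number, order_total FROM ods_orders_new;
""")

after = consumer_report(conn)
```

Putting a version in the view name (`v1_orders`) is one possible convention for the deprecation timeline mentioned above: a `v2_orders` could be published alongside it while consumers migrate.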

Analytic models such as those found in Power BI, Qlik, and SQL Server Analysis Services create a similar tool for consumers. Usually those models can be used in the tool itself or even in other tools to produce reports and dashboards. Depending on how the model is created, they are often a table-based view of data. For instance, Power BI can turn a folder of files into a table structure for easy consumption in Power BI reports. Power Pivot models created in Excel can be shared in SharePoint and in Power BI. These are just a few examples of using analytic models as interfaces.

The third option is data virtualization. This tends to be fairly expensive. The two products that I am aware of are Cisco Data Virtualization and Red Hat JBoss Data Virtualization. The concept of data virtualization is perfect for Consumption Based Architecture; however, I have not seen this used much due to cost implications. Microsoft may be changing the landscape of data virtualization. During PASS Summit 2016, they announced expanding the use of PolyBase to reference other data sources such as Oracle and Teradata. Currently, PolyBase supports Hadoop data, but this change could allow it to become a virtualization tool that is cost effective (included with SQL Server) and simple to use. I will be reviewing these tools in a later post as I get more information on them.

Modern, In-Memory Tools

The most significant technological improvement in the past few years to make this architecture really possible is in-memory data tools. While this revolution has not been limited to Microsoft technologies, Microsoft tools are what I am most familiar with. In the Microsoft arena, this started with Power Pivot in Excel. The ability to mash up various data sources using in-memory models is awesome.

When Power Pivot came on the scene, I was doing a lot of work with SQL Server Analysis Services cubes. While these cubes provided a great analytic layer, they were very fragile in my experience. However, Power Pivot allowed us to create better performing models more quickly. At that point, I knew a shift was coming. Two of the biggest pain points with cube design were tackled: speed of development and ease of use. We could solve problems quicker and easier with Power Pivot.

Since the release of Power Pivot, Microsoft has also improved their overall in-memory solution set by adding SQL Server Analysis Services Tabular Models, as well as in-memory OLTP and columnstore functionality in SQL Server. These improvements continue to make it easier than ever to build out consumable models in memory. You can now use columnstore in SQL Server without an analytics model, simplifying your architecture without performance penalties. With the release of SQL Server 2016, Microsoft has created a data engine capable of in-memory OLTP to improve transactional loads while also supporting columnstore indexes for reporting loads in the same database. While your mileage may vary on implementation, it is easy to see that Consumption Based Architectures are best able to take advantage of these advances and make data more easily accessible and consumable to our business users.

What’s Next

Over the next few weeks, I will be digging in on some of the topics that support Consumption Based Architecture. The goal is to help you begin to take advantage of this architecture in your business and build out a flexible, easily consumed data and analytics platform.

Upcoming Topics:

  • Interface Layers
  • Data Dictionaries
  • Dimensional or Master Data
  • Moving and Transforming Data
  • Modeling in MSBI
  • Reporting with Consumption Based Architecture