Using Power BI with HDInsight Part 1: Power Query and Files

With the rise of HDInsight and other Hadoop-based tools, it is valuable to understand how Power BI can help you take advantage of those big data investments. If you need to set up a cluster to work with, check out my previous posts on Setting Up an HDInsight Cluster and Loading Data Into Your New HDInsight Cluster. These posts show how to do this with no scripting required. If you prefer to script, there are a number of resources with sample scripts for doing the same work.

In this article, I will focus on using Power Query to get data from the Hadoop file structure in HDInsight. I will be using Excel 2013 with the Power Query Add-In, along with the restaurant data I loaded as described in the previous posts. If you need to create a cluster and load data, those posts walk through the whole process.

Connecting to HDInsight

First, open a new Excel workbook and click the Power Query tab. Once there, you can find the Azure HDInsight source in the From Other Sources dropdown. Select that option to open the following dialog:

[Image: Microsoft Azure HDInsight dialog in Power Query]

You will need your storage account name in order to continue, followed by the storage account key. Once you have added the key, the Navigator opens on the right side of Excel.

[Image: Navigator pane showing the cluster and default container]

It should show the name of your cluster and the default container name. Double-click the container name to open the Power Query window, which shows all the files available in the container. Even though the data is organized into folders, the view lists every file. If you have a large number of files and don't want to scroll to find them, you can click the down arrow on the Folder Path column and use the text filter to find the folder you are looking for.

[Image: Folder Path text filter in the Power Query window]

Now I have the files I want to use in Power Query. If you click a Binary link, it will open a copy of that single file; however, this is not how we want to work with the data since we have multiple files. (If you did this, remove the steps after the Filtered Rows step in the Applied Steps section.) At this point, the files I uploaded are showing.

[Image: Uploaded files listed in Power Query]

In order to work with all the files together, we need to use the Combine Binaries command.

[Image: Combine Binaries option]

This merges all the files into a single dataset. You can now do any data shaping you would like using standard Power Query methods, such as updating the column names and the data types. You can even add columns, such as a Total Amount column. Here is what my final query looks like, including the steps I applied. Be sure to give your query a meaningful name. (Note that Power Query realized my Transaction Date column was a date data type and changed it for me.)

[Image: Final query with applied steps]
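
For reference, here is a hedged sketch of what the M behind a query like this might look like. The storage URL, folder name, and column names are placeholders based on my data, and the exact function names Power Query generates (such as HdInsight.Files) may differ by version:

let
    // Connect to the cluster's storage container (placeholder URL)
    Source = HdInsight.Files("https://mystorageaccount.blob.core.windows.net/mycontainer"),
    // Keep only the files in the restaurant-data-files folder
    FilteredRows = Table.SelectRows(Source, each Text.Contains([Folder Path], "restaurant-data-files")),
    // Combine Binaries: merge the daily files into one stream and parse as CSV
    Combined = Csv.Document(Binary.Combine(FilteredRows[Content]), null, ","),
    Renamed = Table.RenameColumns(Combined,
        {{"Column1", "Ticket Number"}, {"Column2", "Transaction Date"},
         {"Column3", "Hour of Day"}, {"Column4", "Seat Number"},
         {"Column5", "App Amount"}, {"Column6", "Entree Amount"},
         {"Column7", "Non Alcoholic Amount"}, {"Column8", "Alcoholic Amount"}}),
    Typed = Table.TransformColumnTypes(Renamed,
        {{"Ticket Number", Int64.Type}, {"Transaction Date", type date},
         {"Hour of Day", Int64.Type}, {"Seat Number", Int64.Type},
         {"App Amount", type number}, {"Entree Amount", type number},
         {"Non Alcoholic Amount", type number}, {"Alcoholic Amount", type number}}),
    // Add the Total Amount column mentioned above
    AddedTotal = Table.AddColumn(Typed, "Total Amount",
        each [App Amount] + [Entree Amount] + [Non Alcoholic Amount] + [Alcoholic Amount], type number)
in
    AddedTotal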

Now click Close & Load and your data will be loaded into Excel. You can also have the query load the data into the Power Pivot data model to do additional work with it.

The next post will walk through using Power BI Desktop to load data using a Hive query.

Uploading Files to an HDInsight Cluster (No Scripting Required)

As I noted in my first post, I am not a fan of scripting. In that post we set up a cluster without using scripts to do so. Now we are going to look at how to upload files without scripts. While this will work for our demo and learning purposes, I would encourage you to use scripting for production-level loads or if you need to upload a large number of files. Not being a fan of scripting does not mean it isn't often the better overall tool. However, when I am trying to learn new functionality or work with a system using other tools (in this case Power BI), I find that methods such as these help me become productive sooner.

Prepping to Load Data Into Your New HDInsight Cluster

A key difference between standard Hadoop and HDInsight is file management. With HDInsight, you can load files into Azure Storage and have them consumed by the HDInsight cluster. Keeping with the No Scripting Required mantra, we will be using a graphical interface to load files into Azure Storage. There are a number of options out there; you will need one of them installed. For our example, we will be using the freeware version of CloudBerry Explorer for Azure Blob Storage. Once you have your tool of choice installed, you are ready to get some files.

At this point, you need some files to load. I am using some data I created for another demo. My data is in seven files of daily receipts for my restaurant for a week in March. Once you have your data, we can load it into the cluster.

Loading Data Into Your New HDInsight Cluster

As noted above, the next steps use CloudBerry Explorer to load our data. In this case, I simply copied the folder containing my files over to Azure Storage once I had connected the tool to Azure.

[Image: CloudBerry Explorer copying the folder to Azure Storage]

Once that is done, we will look at working with the data in Hadoop and with Hive.

Creating an External Hive Table and Querying It

You can create two types of tables using Hive – internal and external. An internal table loads the data into a Hive database. An external table applies a schema to the data without moving it. I will be creating an external table. I like this concept because it applies schema to the files that have been uploaded and allows other tools to interact with that data using HiveQL. When you drop an external table, the data remains because the table represents structure only.
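
As a minimal sketch of the difference (hypothetical table and folder names; the real DDL for my data appears in the next section):

-- Managed (internal) table: Hive owns the storage; DROP TABLE removes the data too
CREATE TABLE managed_sales (id INT);
-- External table: schema only; DROP TABLE leaves the underlying files in place
CREATE EXTERNAL TABLE external_sales (id INT) LOCATION 'wasb:///some-folder';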

In order to help everyone through this (in particular me), the next sections walk through the steps I took to create my table and select data from it. (This is not a detailed look at Hive, but rather a focus on the process of making HDInsight data available using HiveQL.)

Understanding the Files

The first step was to document the structure of the data in the files. Here is the data that I had in each of the files in column order:

  • Ticket Number – int
  • Ticket Date – date
  • Hour of the Day – int
  • Seat Number – int
  • App Amount – int
  • Entrée Amount – int
  • Non Alcoholic Amount – int
  • Alcoholic Amount – int

My structure was fairly simple, and each file represented a single day.
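
To make the format concrete, a hypothetical line from one of the daily files might look like this (comma-delimited, no header row, columns in the order listed above):

1001,2015-03-02,11,1,5,18,3,7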

Creating the Table

Now that I had the structure, I needed to work out the table DDL. (Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/TruncateTable) Here is the syntax that I came up with.

DROP TABLE RestaurantSales;
CREATE EXTERNAL TABLE RestaurantSales (ticketnumber int, ticketdate string, hourofday int, seat int, appamount int, entreeamount int, nonalcoholamount int, alcoholamount int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 'wasb:///restaurant-data-files';

The first statement drops the existing table if it exists. Unlike SQL Server, no error is thrown if the table does not already exist, so there is no need to check for existence.

The second statement creates the table. One quick note on the data types: in my first attempt, the ticket date was declared as a date, but the query returned NULLs in that column, so I changed it to string. As noted earlier, this is an external table, which means it only applies schema; with schema-on-read, values that cannot be interpreted as the declared type come back as NULL rather than raising an error. The fields are terminated by a comma (','). The next part is significant because it is HDInsight-specific syntax. The location is prefixed with wasb, which tells HDInsight we are using Azure Blob Storage for the files. The three forward slashes mean we are using the default container. When I uploaded the data, I added a folder called restaurant-data-files which holds the files for HDInsight. The schema will be applied to all the files in that folder, and if new files are uploaded, they become part of the query results as well.

You can run this in the Hive Editor in the HDInsight Console discussed in the previous blog post. You can track your jobs there and see when it completes.
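
For reference, if your files lived in a container other than the default, the fully qualified wasb location includes the container and storage account names. A hedged example with placeholder names:

STORED AS TEXTFILE LOCATION 'wasb://mycontainer@mystorageaccount.blob.core.windows.net/restaurant-data-files';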

Querying the Table

The final step is checking to see if it worked. I opened a new Hive Editor window and executed the following statement:

select * from RestaurantSales;

Voila! Once the job completed, I was able to click the session link and review the results.
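
With the table in place, you can go beyond SELECT *. As a quick sketch using the columns from the DDL above, this query totals receipts by day:

SELECT ticketdate,
       SUM(appamount + entreeamount + nonalcoholamount + alcoholamount) AS totalsales
FROM RestaurantSales
GROUP BY ticketdate;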

Hopefully you were equally successful creating your first Hive table and query with your data. The next two posts will talk about using Power BI to interact with HDInsight data.

Exploring the Microsoft Azure HDInsight Query Console (No Scripting Required)

As I noted in my previous post, I am not a fan of scripting. In that post we set up a cluster without using scripts to do so. Before we get to the steps to load our own data, I wanted to take a moment and explore the HDInsight Query Console, or Dashboard (according to the link on the Azure Portal). This console is part of an HDInsight cluster; if you need to create a cluster, check out my previous post. In keeping with my desire not to script, this console provides some insights into my cluster as well as some tools to explore its contents.

I am going to highlight the areas of most interest to me, so this is by no means exhaustive. I encourage you to explore more and comment on anything compelling that I may have missed. Also, with the rate of change in Azure, it is likely some of this will change sooner rather than later. Hopefully it will only get better.

Getting Started Gallery

I would guess that this is the section most likely to change since it already has “Coming Soon” buttons. Each of these scenarios uses a step-by-step approach to help you set up a job common to HDInsight or Hadoop. When you click a scenario button, you get a list of steps to walk through to complete the solution. Here is the start page from Azure Website Log Analysis.

[Image: Azure Website Log Analysis start page]

In this scenario, you load log files and eventually use Hive to organize and use the data in Excel. This is a great way to learn about using HDInsight to solve some specific problems. Currently the scenarios are grouped into solutions with Azure data and solutions using sample data. When this article was created, the following scenarios were available:

  • Microsoft Azure Website Log Analysis Solution (uses Azure Data)
  • Microsoft Azure Storage Analytics Solution (uses Azure Data)
  • Sensor Data Analysis (uses sample data)
  • Twitter Trend Analysis (uses sample data)
  • Website Log Analysis (uses sample data)
  • Mahout Movie Recommendation (uses sample data)

Hive Editor

The next section is an online Hive editor which lets you create Hive queries. It comes with a quick query against the hivesampletable, which is included in your cluster when it is created. Go ahead and submit this query. You will see the editor create the job and track its status.

[Image: Hive Editor showing job status]

Click View Details to see the job information and the results of the query. You can download the results and logs from the details page. You can also create additional queries. For instance, if you want to see the structure of the table, you can use the following Hive query:

SHOW CREATE TABLE hivesampletable;
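
If you only want the column names and types without the full DDL, DESCRIBE provides a more compact view:

DESCRIBE hivesampletable;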

You can find more info on the Hive Query Language at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

Job History

The Job History page shows all of the jobs you have run. You can also get to the Job Details from previously run jobs here like you did from the Hive Editor window.

File Browser

The File Browser page lets you explore the files and structure in your HDInsight cluster. You click the name to drill down into the contents of your cluster. When you first open the page, your cluster name is shown in the window. Here is the basic structure from a new cluster:

  • Cluster
    • Containers (e.g. $logs)
      • Directories (e.g. example, app-logs)
        • Files (these may be multiple directories deep)

Once you get to the file location, you can download the file.

Hadoop UI and Yarn UI Pages

The Hadoop UI page shows information about the Hadoop cluster you have created. It includes an overview and some specifics on datanodes and snapshots.

The Yarn UI gives some more information about jobs and cluster metrics related to your MapReduce jobs. If you are interested in learning more about Yarn, check out this site: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.

This console is a great way to become more familiar with Hadoop, HDInsight, and Hive. While it does not expose every capability of the platform, it is a good no-scripting starting point. The next post will discuss how to load a simple set of data into Hadoop for analysis beyond the samples.

Excel BI Tip #28: Using member expressions in the CUBEVALUE Function

As I mentioned in my original post, Exploring Excel 2013 as Microsoft’s BI Client, I will be posting tips regularly about using Excel 2013 and later.  Much of the content will be a result of my daily interactions with business users and other BI devs.  In order to not forget what I learn or discover, I write it down … here.  I hope you too will discover something new you can use.  Enjoy!

CUBEVALUE and CUBEMEMBER

If you have not looked at the cube functions in Excel, check out two of my previous posts: Excel BI Tip #18: Using Cube Functions to Break Out of Pivot Tables and Excel BI Tip #27: CUBESET and CUBESETCOUNT Functions. Cube functions allow us to apply MDX functionality in Excel beyond pivot tables and charts. In this post, I will show how you can expand the CUBEVALUE function by using member expressions or CUBEMEMBER functions directly in its syntax. Tip #18 demonstrates how to use cell references in these functions instead.

Typically, you would use these functions separately. For example, CUBEMEMBER is used in column and row headers in a typical solution. Because of this, CUBEVALUE usually refers to a series of cells and slicers.

CUBEMEMBER used as a column header:

=CUBEMEMBER("AdvWorks","[Measures].[Customer Count]")

CUBEVALUE referring to this member:

=CUBEVALUE("AdvWorks",$D8,F$7,Slicer_Commute_Distance)

The snapshot below shows how it looks in Excel when the cell with the CUBEVALUE function is selected:

[Image: CUBEVALUE formula selected in Excel]

Using Member Expressions and CUBEMEMBER

The method illustrated above is how the cells are built when a pivot table has been converted to formulas. However, if you look closely, the CUBEVALUE function also accepts member expressions and CUBEMEMBER functions directly. The next few examples show how you can use members inline to get more functionality.

=CUBEVALUE("AdvWorks","[Measures].[Customer Count]",CUBEMEMBER("AdvWorks","[Customer].[Gender].[Female]"))

Or, using the member expression directly as text:

=CUBEVALUE("AdvWorks","[Measures].[Customer Count]","[Customer].[Gender].[Female]")

You can use any MDX expression that returns a member as one of the expressions in the function. This includes functions like StrToMember, which allows you to create dynamic expressions such as one based on the current date. Using member expressions allows for even more customization without dependencies on reference cells.
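
As an illustration, here is a hedged sketch that builds the year member by concatenating text in Excel; the AdvWorks connection and the Adventure Works date hierarchy are assumptions carried over from the examples above, and since Adventure Works data is static you would substitute a year that actually exists in the cube:

=CUBEVALUE("AdvWorks","[Measures].[Customer Count]","[Date].[Calendar].[Calendar Year].&[" & YEAR(NOW()) & "]")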

Adding a Slicer

Finally, you can add a slicer reference to the CUBEVALUE function.

=CUBEVALUE("AdvWorks","[Measures].[Customer Count]","[Customer].[Gender].[Female]",Slicer_State_Province)

Excel BI Tip #27: CUBESET and CUBESETCOUNT Functions

As I mentioned in my original post, Exploring Excel 2013 as Microsoft’s BI Client, I will be posting tips regularly about using Excel 2013 and later.  Much of the content will be a result of my daily interactions with business users and other BI devs.  In order to not forget what I learn or discover, I write it down … here.  I hope you too will discover something new you can use.  Enjoy!

CUBE Function Overview

In Tip #18, I talked about the CUBEMEMBER and CUBEVALUE functions, which can be used to refer to cube data directly. In this tip, we will look at CUBESET and CUBESETCOUNT. These functions return sets that can be supplied to the CUBEVALUE function, just like the CUBEMEMBER function. They allow you to create custom sets for calculations such as trailing 12 months, top 10 performers, or the number of customers with sales in 2009. Keep in mind that the advantage of using the cube functions with SSAS data is that they let you go beyond the limitations of pivot tables by working with MDX directly.

CUBESET Function

The CUBESET function is designed to return a valid MDX set from the cube or tabular model. The construction of the formula is CUBESET("connection","set expression","caption","sort order","sort by"). Only the connection and set expression are required; however, I highly recommend using the caption. The caption is what will show in the cell containing the formula. If you don't include one, the cell will appear empty, so you will need to mark it in some way. Before I started using captions, I color coded the cell to make sure I would not overwrite it. In a nutshell, use a caption to make your solution more usable.

Simple Set Example

In this example, we are going to pick a list of college level Education attribute members from the Adventure Works sample cube. Education is an attribute hierarchy that is part of the Customer dimension. Here is the MDX for that set:

{[Customer].[Education].&[Bachelors], [Customer].[Education].&[Graduate Degree]}

Here is how I used that in Excel with the CUBESET function (AdvWorks is my connection name).

=CUBESET("AdvWorks","{[Customer].[Education].&[Bachelors], [Customer].[Education].&[Graduate Degree]}","Completed College")

While this is a simple example, any MDX set will work in the expression. If you need a dynamic set such as Trailing 12 Months or a Top 10 list, it will work just as well. Work out the set in SQL Server Management Studio first, then put it into a CUBESET formula in a cell and refer to that cell when building your CUBEVALUE.
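
For example, here is a hedged sketch of a Top 10 set built with the MDX TOPCOUNT function; the hierarchy and measure names assume the Adventure Works sample cube:

=CUBESET("AdvWorks","TOPCOUNT([Product].[Product].[Product].Members,10,[Measures].[Internet Sales Amount])","Top 10 Products")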

CUBESETCOUNT Function

The CUBESETCOUNT function counts the members in a set returned by a CUBESET function. In MDX, we use .Count or Count() to determine the number of members in a set; that syntax does not work with the CUBESET function, and thus we have CUBESETCOUNT. This is a fairly simple function to use, but it is very powerful if you are building dynamic set counts based on reference data in your spreadsheet.

Counting a Set of Customers with Sales for the Selected Period

In this example, we have created a set which returns customers with orders in 2010. We will count this set using the CUBESETCOUNT function. In Excel, we could use the NOW() function to get the current year or some other dynamic value to build the set; however, Adventure Works is a static data set, so I am using a fixed year in the example.

nonempty( {([Customer].[Customer].[Customer],[Measures].[Internet Order Quantity])},[Date].[Calendar].[Calendar Year].&[2010])

This can be put into a CUBESET function as follows:

=CUBESET("AdvWorks","nonempty( {([Customer].[Customer].[Customer],[Measures].[Internet Order Quantity])},[Date].[Calendar].[Calendar Year].&[2010])")

Next, we put the CUBESET function into the CUBESETCOUNT function.

=CUBESETCOUNT( CUBESET("AdvWorks","nonempty( {([Customer].[Customer].[Customer],[Measures].[Internet Order Quantity])},[Date].[Calendar].[Calendar Year].&[2010])"))

This can now be used as a value for other calculations. For example, you could show the average sale amount for customers who purchased something this year or even the average purchase amount this year for those who ordered last year to see if your repeat customers are buying more.
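
As a hedged sketch of that idea, the formula below divides 2010 Internet sales by the count of customers who ordered in 2010, reusing the set from the earlier example (the measure names assume the Adventure Works sample cube):

=CUBEVALUE("AdvWorks","[Measures].[Internet Sales Amount]","[Date].[Calendar].[Calendar Year].&[2010]")/CUBESETCOUNT(CUBESET("AdvWorks","nonempty( {([Customer].[Customer].[Customer],[Measures].[Internet Order Quantity])},[Date].[Calendar].[Calendar Year].&[2010])","Customers with 2010 Orders"))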

Wrap Up

Using these functions with other cube functions increases the flexibility of your dashboard solution and allows you to create many more creative visualizations. Remember to use captions with any CUBESET functions you plan to reference in your formulas, and remember that CUBESETCOUNT needs a CUBESET result to count.
