Tag Archives: Hadoop

Using Power BI with HDInsight Part 1: Power Query and Files

With the rise of HDInsight and other Hadoop based tools, it is valuable to understand how Power BI can help you take advantage of those big data investments. If you need to set up a cluster to work with, check out my previous posts on Setting Up an HDInsight Cluster and Loading Data Into Your New HDInsight Cluster. These posts show how to do this with no scripting required. If you prefer to script, there are a number of resources with sample scripts on doing the same work.

In this article, I will focus on using Power Query to get data from the Hadoop file structure in HDInsight. I will be using Excel 2013 with the Power Query Add-In. I will also be using the restaurant data I loaded as noted in the three previous posts. If you need to create a cluster and load data I encourage you to check the following blog posts:

These posts walk through the process of creating a cluster and loading up data.

Connecting to HDInsight

First, open a new Excel workbook and click the Power Query tab. Once there, you can find the Azure HDInsight source in the From Other Sources dropdown. Select that option to open the following dialog:


You will need your storage account in order to continue. Then you will need the storage account key. Once you have added the key you will see that the Navigator opened in Excel on the right.


It should show the name of your cluster and the default container name. Double click the container name and it will open the Power Query window. It will show all the files available in the container. Even though we have it organized in folders, the view shows all the files. If you have a large amount of files and you don’t want to scroll to find them, you can click the down arrow on the Folder Path column and use the text filter to find the folder you are looking for.


Now I have the files I want to use in Power Query. If you click the binary link it will open a copy of the file. However, this is not how we want to work with the data as we have multiple files. (If you did this, remove steps up to the Filtered Rows step in the Applied Steps section.) I now have the files I uploaded showing.


In order to work with them all together we need to Combine Binaries.


This now merges all the files into a single dataset. You can now do any data shaping you would like using standard Power Query methods such as updating the column names and the data types. You can even add columns such as a Total Amount column. Here is what my final query looks like including the steps I did. Be sure to give your query a meaningful name. (Note that Power Query realized my Transaction Date column was a date dataype and changed it for me.)


Now click Close & Load and your data will be loaded into Excel. You can change the query to load a model in Power Pivot as well to do additional work with the data.

The next post will walk through using Power BI Desktop to load data using a Hive query.

Uploading Files to an HDInsight Cluster (No Scripting Required)

As I noted in my first post, I am not a fan of scripting. In that post we set up a cluster without using scripts to do so. Now we are going to look at how to upload files without scripts. While this will work for our demo and learning purposes, I would encourage you to use scripting to handle production level loads or even if you want to upload a lot of files. While I am not a fan, it does not mean the scripting may not be a better overall tool. However, when I am trying to learn the functionality or work with system using other tools (in this case Power BI), I find that methods such as these help me be more productive sooner.

Prepping to Load Data Into Your New HDInsight Cluster

A key difference between standard Hadoop and HDInsight is file management. With HDInsight, you can load files into Azure Storage and they can be consumed by the HDInsight cluster. Keeping with the No Scripting Required mantra, we will be using a graphical interface to load files into Azure storage. There are a number of options out there, you need one of them installed. For our example, we will be using the freeware version of CloudBerry Explorer for Azure Blob Storage. Once you have your tool of choice installed you are ready to get some files.

At this point, you need some files to load. I am using some data I created for another demo. My data is in 7 files of daily receipts for my restaurant for a week in March. Once you have the data, we can load that into the cluster.

Loading Data Into Your New HDInsight Cluster

A noted above, the next steps for use will use CloudBerry Explorer to load our data. In this case, I just copied the folder with my files over to the Azure Storage once I connected the tool to Azure.


Once that is done, we will look at working with the data in Hadoop and with Hive.

Creating an External Hive Table and Querying It

You can create two types of tables using Hive – internal and external. An internal table loads the data into a Hive database. An external table applies a schema to the data without moving it. I will be creating an external table. I like this concept because it applies schema to the files that have been uploaded and allows other tools to interact with that data using HiveQL. When you drop an external table, the data remains because the table represents structure only.

In order to help everyone through this (in particular me), the next sections walk through the steps I took to create my table and select data from it. (This is not a detailed look at Hive, but rather a focus on the process of making HDInsight data available using HiveQL.)

Understanding the Files

The first step was to document the structure of the data in the files. Here is the data that I had in each of the files in column order:

  • Ticket Number – int
  • Ticket Date – date
  • Hour of the Day – int
  • Seat Number – int
  • App Amount – int
  • Entrée Amount – int
  • Non Alcoholic Amount – int
  • Alcoholic Amount – int

My structure was fairly simplistic. Each file represented a day.

Creating the Table

Now that I had the structure, I needed to work out the table DDL. (Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/TruncateTable) Here is the syntax that I came up with.

DROP TABLE RestaurantSales;
CREATE EXTERNAL TABLE RestaurantSales (ticketnumber int, ticketdate string, hourofday int, seat int, appamount int, entreeamount int, nonalcoholamount int, alcoholamount int)
STORED AS TEXTFILE LOCATION ‘wasb:///restaurant-data-files’;

The first statement drops the existing table if it exists. Unlike SQL Server, no error is thrown if the table does not already exist, so there is no need to check for existence.

The second statement creates the table. One quick note on the data types. In my case, my date was not recognized as a date in my first attempt. I changed it to string. When I ran my query it returned NULLs in that column. As noted earlier, this is an external table which means it is applying schema. The fields are terminated by a comma (‘,’). The next part is significant because it is part of HDInsight syntax. The location is prefixed by wasb which tells HDInsight we are using Azure Blob Storage for the files. The three forward slashes mean we are using the default container for files. When I uploaded the data, I added a folder called restaurant-data-files which holds the files for HDInsight. It will now apply the schema to all the files in that folder. If new files get uploaded, they would be part of the query as well.

You can run this in the Hive Editor in the HDInsight Console discussed in the previous blog post. You can track your jobs there and see when it completes.

Querying the Table

The final step is checking to see if it worked. I opened a new Hive Editor window and executed the following statement:

select * from RestaurantSales;

Voila! Once the job completed, I was able to click the session link and review the results.

Hopefully you were equally successful creating your first Hive table and query with your data. The next two posts will talk about using Power BI to interact with HDInsight data.

Exploring the Microsoft Azure HDInsight Query Console (No Scripting Required)

As I noted in my previous post, I am not a fan of scripting. In that post we set up a cluster without using scripts to do so. Before we get to the steps to load our own data, I wanted to take a moment and explore the HDInsight Query Console or Dashboard (according to the link on the Azure Portal). This console is part of a HDInsight Cluster, if you need to create a cluster, check out my previous post. As part of my desire to not script, this console provides some insights into my cluster as well of some tools to explore the contents the cluster.

I am going to highlight areas of interest to me on this, so this is by no means exhaustive. I encourage you to explore more comment on what you find compelling that I may have missed. Also, with the rate of change in Azure, it is likely some of this will change sooner rather than later. Hopefully it will only get better.

Getting Started Gallery

I would guess that this is the section most likely to change since they have “Coming Soon” buttons. Each of these scenarios use a  step-by-step approach to helping you set up a job common to HDInsight or Hadoop. When you click a scenario button, you will get a list of steps to walk through to complete the solution. Here is the start page from Azure Website Log Analysis.


In this scenario, you load log files and eventually use Hive to organize and use the data in Excel. This is a great way to learn about using HDInsight to solve some specific problems. Currently the scenarios are grouped into solutions with Azure data and solutions using sample data. When this article was created, the following scenarios were available:

  • Microsoft Azure Website Log Analysis Solution (uses Azure Data)
  • Microsoft Azure Storage Analytics Solution (uses Azure Data)
  • Sensor Data Analysis (uses sample data)
  • Twitter Trend Analysis (uses sample data)
  • Website Log Analysis (uses sample data)
  • Mahout Movie Recommendation (uses sample data)

Hive Editor

The next section is an online Hive editor which lets you create Hive queries. It comes with a quick query for the hivesampletable which is included in your cluster when it is created. Go ahead and submit this query.  You will see the editor create the jobs and track the status.


Click View Details to see the Job information and the results of the query. You can download the results and logs from the details page. You can create additional queries. For instance if you want to see the structure of the table you can use the following Hive query:

SHOW CREATE TABLE hivesampletable;

You can find more info on the Hive Query Language at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

Job History

The Job History page shows all of the jobs you have run. You can also get to the Job Details from previously run jobs here like you did from the Hive Editor window.

File Browser

The File Browser page lets you explore the files and structure in your HDInsight cluster. You click the name to drill down into the contents of your cluster. When you first open the page, your cluster name is shown in the window. Here is the basic structure from a new cluster:

  • Cluster
    • Containers (e.g. $logs)
      • Directories (e.g. example, app-logs)
        • Files (these may be multiple directories deep)

Once you get to the file location, you can download the file.

Hadoop UI and Yarn UI Pages

The Hadoop UI page shows information about the Hadoop cluster you have created. It includes an overview and some specifics on datanodes and snapshots.

The Yarn UI gives some more information about jobs and cluster metrics related to your MapReduce jobs. If you are interested in learning more about Yarn check out this site: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.

This console is a great way to become more familiar with Hadoop, HDInsight, and Hive. While it does not give you all the capabilities it is a good no scripting starting point. The next post will discuss how to load a simple set of data into Hadoop for analysis beyond the samples.

Setting Up an HDInsight Cluster (No Scripts Required)

Let me start by saying, I am not a fan of scripting. It definitely has its place and a lot of my peers really like it. It is the easiest way to get functionality out from software vendors such as Microsoft. PowerShell is an incredibly powerful tool which can do just about anything. However, therein lies the problem for me. Scripting solves a lot of problems, however, I just wanted to set up and use a basic HDInsight cluster to create some Power BI demos (posts coming soon). So I started the journey to find the scripts and try to understand the syntax and so on. Then I went to the Azure Portal, here is what I did to set up my cluster and load data with no scripting required. My goal was to go to get a working demo platform up. Would I necessarily recommend this path for production work, not sure yet. But now I can work with HDInsight with considerably less work required to set up the environment.

HDInsight Cluster No Script Setup Requirements

You need an Azure account. You can go to http://azure.microsoft.com to sign up for a free account if you like. If you have an MSDN subscription you should have some time available as well.

HDInsight Cluster No Script Setup

Once you have your account created, you should go to http://portal.azure.com. We will be doing our setup from here. During the process we will be creating a storage account (if this is your first run in azure, you may choose to set up a Resource Group as well) and the HDInsight cluster. Be aware that the cluster has compute costs and the storage has storage costs. At the end we will remove the cluster to save your compute time.

Create the Storage Account

This step can be done during the HDInsight cluster creation, but this limits your ability to share data across clusters. If you are just trying it for fun, you can do this during the cluster set up.

Click the + symbol on the portal, then Data + Storage, then Storage Account. This will open a blade with the set up instructions for a storage account.


When you create your account you will have some options to fill in:

  • Name: this name will need to be a unique name, e.g., joescoolhdinsight
  • Pricing tier: The pricing tier is really important if you are using a limited plan or if you plan to keep the data for a long time. If you are planning to use this as a demo, I would select Locally Redundant as that is the lower cost plan.
  • Resource Group: The resource group lets organize your Azure assets. This is for your benefit, so if you want to keep all of the HDInsight components together, you could create a group for that or stick with the default.
  • Subscription: This lets you choose the subscription you want to use.
  • Location: Be sure to select a location close to you that supports HDInsight. Check http://azure.microsoft.com/en-us/regions/ to see what Azure services are supported in each region.
  • Diagnostics: This is optional. If you are looking into the diagnostics or need to prep for production, you will find this useful. In most cases, we would not turn this on for demos.

Click Create and it will create your storage account. This may take a few minutes. The notifications section on the portal will alert you when this has been completed. Once that is complete, we will continue with setting up the cluster.

Create a SQL Database for a Metastore

This is an optional section. If you would like to use Hive or Oozie and want to create a metastore so you can reuse your work when you recreate the cluster you need to create a SQL Database for this.


Here are the settings needed to create the database:

  • Name: Something easy for you to track on, e.g. HDInsightMetastore.
  • Server: You can use an existing server if you have it, otherwise you can create a new server. I recommend you create the server in the same location you plan to create your HDInsight cluster.
  • Pricing Tier: The default is S0. If you plan to use this for demos and don’t need the additional features, you can choose Basic.
  • Optional Configuration: no changes.
  • Resource Group: Use the Resource Group you have in place for this example.
  • Subscription: Select your subscription.

Click Create to create your database. You will work with this during the setup of your cluster.

Setting Up the HDInsight Cluster

Click the + symbol on the portal, then Data + Analytics, then HDInsight. This will open a blade with the set up instructions for a storage account.


Like with the storage account setup, this will open a blade with options for creating the cluster. Let’s walk through the settings on this.

  • Cluster Name: Like the storage account, this name needs to be unique.
  • Cluster Yype: Select Hadoop for this walk through.
  • Cluster Operating System: Select Windows Server for this walk through.
  • Subscription: Choose the same subscription as your storage account.
  • Resource Group: Choose the same Resource Group as your storage account.
  • Cluster Credentials: Here you select a login name and password for your cluster. You can also choose to enable Remote Desktop, but we are not using that feature for this setup. (Note: be sure to click Select at the bottom when you are done. If you don’t, you will be prompted by IE about unsaved settings.)
  • Data Source: Here is where you select your storage account. If you chose not to create a storage account, you can create a new account here as well.
  • Node Pricing Tiers: This section determines the capability and the associated computing costs of your cluster. By default, 4 worker nodes and 2 head nodes will be created with recommended servers (D12 at the time of this writing). Expand the pricing tier, to change the server type or node count. Unless you are sure you need to change, keep the default settings (you can recreate the cluster later). You will see the current hourly pricing based on your selections. This cost is incurred while the service is running. The only way to stop charges is to delete the cluster, so be sure to do this when you are done if you do not wish to pay for it to keep running.
  • Optional Configuration: You do not need to change any setting here if you choose not to. However, if you plan to delete your cluster and you want to retain the metadata, it is recommended that you set up an External Metastore using the database you created previously.
    • Select the database you want to use for the metadata in each case and update the credentials. You can use the same database for both metastores.

Next, you create the cluster. This will take a few minutes. You can track progress in the notifications section on the portal page.

Exploring Your New HDInsight Cluster

Once the cluster has been created, you will see the information page with the settings and other usage information. At the top of that area, you will see some icons. These will help you explore your cluster some more.


The gear will open up a settings page and you can review your settings in detail and change some if needed.

The icon with a square and an arrow will open up a dashboard with more options. We will dig into the dashboard more in the next post.

The last three icons are shortcuts to specific actions – remote desktop, scale cluster, and delete.

Once you are done, you should delete your cluster. You can always go through these steps again to recreate your cluster. In my next article we will go through what you can see and do with your cluster using the dashboard.

Building My HDInsight Server Cluster

After all the hype about Big Data, Hadoop, and now HDInsight, I decided to build out my own big data cluster on HDInsight. My overall goal is to have a cluster I can use with Excel and Data Explorer.  After all, I needed more data in my mashups. I am not going to get into the details or definitions of Big Data, there are entire books on the subject.  I will discuss any issues or tidbits during the process while I am here.

Setting Up the Environment

I am actually doing this on a VM on my Windows 8 laptop.  I created a Windows 2012 VM with 1 GB of RAM and 50 GB of storage.  (Need some help creating a VM in Windows 8, check out my post on the subject.

Installing the HDInsight Server

First, this product is still in Preview at the time of this writing, so mileage will vary and likely change over the next few months.  You will find the installer at http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW.  This uses the Microsoft Web Platform Installer.  When prompted I just ran the installer.  This took about one hour to complete on my VM setup. Once it completed, it opened up the dashboard view in IE.


At this point we have installed a cluster called “local (hdfs)”.

Exploring My Local Cluster

Well, things did not go well at first.  Whenever I clicked the big gray box to view my dashboard, I received the following error: “Your cluster ‘local (hdfs)’ is not responding.  Please click here to navigate to cluster.”  I clicked “here” and ended up on a IIS start page.  Not really effective.  Let the troubleshooting begin.

Based on this forum issue response, I opened the services window to find that none of my Apache Hadoop services were running after a restart AND they were set to manual.  To resolve this I took two steps.  First, I changed all of my services to run automatically.  This makes sense for my situation because the VM would be running when I wanted to use HDInsight.  Second, I used the command line option to restart all of the services as also noted in the forum post above.

From a command prompt execute the following code to restart all Hadoop services:


And, VOILA!, my cluster is now running.


Maybe we can get a better error message next time.

At this point I walked through the Getting Started option on the home screen and proceeded to do “Hello World”.  I used these samples as intended to get data in my cluster and start working with the various tools.  Stay tuned for more posts in the future on my Big Data adventures.

Why Not HDInsight Service on Azure?

The primary reason I did not use the HDInsight Service on Azure was that I did not want to risk the related charges.  Once I have a good understanding of how HDInsight Server works, I will be more comfortable working with HDInsight Service.

Other Resources

Here are some of the resources I used throughout the build.

HDInsight Service Quick Start and Tutorials

Getting Started With Microsoft HDInsight