Category Archives: Hadoop

VS Live 2016 – Las Vegas Follow Up


I spoke at Visual Studio Live in Vegas on two topics. While the presentations have been uploaded to the site and are available to attendees, the code has not been distributed yet, which was an oversight on my part. In this post, I will give a quick summary of the sessions and make the sample code available. I will be writing more on these topics throughout the year and will tag VS Live in the notices.


JSON & SQL Server Finally Together

JSON is now part of SQL Server 2016. SQL Server now includes functions to generate and shred JSON. Here are the basics:

  • OPENJSON: Used to convert JSON data into a tabular format
  • FOR JSON: Used to create JSON from tabular data
  • ISJSON: Determines if the data in question is JSON
  • JSON_VALUE: Returns scalar values from JSON data
  • JSON_QUERY: Returns JSON formatted arrays or objects
  • JSON_MODIFY: Used to modify JSON data and properties

Even with all of this support, JSON is not a native data type in SQL Server the way XML is. JSON is stored as text in NVARCHAR columns.
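To make the difference between shredding and generating JSON more concrete, here is a rough Python analogy. It is not T-SQL and says nothing about how SQL Server implements these functions; the sample document and its field names are made up for illustration.

```python
import json

# Hypothetical sample document; the field names are invented for this sketch.
order_json = (
    '{"id": 1, "customer": {"name": "Ann"}, '
    '"items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}'
)

# Parsing raises an error for invalid text, loosely like an ISJSON check.
doc = json.loads(order_json)

# Loosely what JSON_VALUE does: pull a single scalar out by path.
customer_name = doc["customer"]["name"]

# Loosely what OPENJSON does: turn an array of objects into tabular rows.
rows = [(item["sku"], item["qty"]) for item in doc["items"]]

# Loosely what FOR JSON does: turn tabular rows back into JSON text.
rows_as_json = json.dumps([{"sku": sku, "qty": qty} for sku, qty in rows])

print(customer_name)
print(rows)
print(rows_as_json)
```

The point of the analogy is only the direction of each operation: OPENJSON goes from JSON to rows, and FOR JSON goes from rows to JSON.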

You can download supporting files and code here.


Using Hive and Hive ODBC with HDInsight and Power BI

During this session I went through the process of setting up HDInsight and loading data into the cluster. Once the cluster was in place, I created Hive tables and wrote queries that were used with Power BI to analyze the results.

You can find the details here.

Minnesota BI User Group – Powering Up HDInsight with Power BI (December 2015)

On Wednesday, December 16, I presented on this topic at the Minnesota BI User Group.  This session is based on five blog posts that I created in August 2015.

You can find the presentation here: Powering Up HDInsight with Power BI (pdf).

The details can be found in the blog posts noted below:


Setting Up an HDInsight Cluster (No Scripts Required)

Exploring the Microsoft Azure HDInsight Query Console (No Scripting Required)

Uploading Files to an HDInsight Cluster (No Scripting Required)

Using Power BI with HDInsight Part 1: Power Query and Files

Using Power BI with HDInsight Part 2: Power BI Desktop and Hive

My goals for this series

1. Document using Power BI with HDInsight

2. Prove that you can set up an HDInsight Cluster with no scripts

Other References from the Session

Azure: http://azure.microsoft.com/en-us/

Cloud Berry: http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx

 

Thanks for attending my session.

Powering Up HDInsight with Power BI

On Tuesday, September 15, I presented on this topic for Pragmatic Works. You can find that session here. This session is based on five blog posts that I created in August 2015.

You can find the presentation here: Powering Up HDInsight with Power BI (pdf), https://dataonwheels.files.wordpress.com/2016/02/powering-up-hdinsight-with-power-bi.pdf. The details can be found in the blog posts noted below:


Setting Up an HDInsight Cluster (No Scripts Required)

Exploring the Microsoft Azure HDInsight Query Console (No Scripting Required)

Uploading Files to an HDInsight Cluster (No Scripting Required)

Using Power BI with HDInsight Part 1: Power Query and Files

Using Power BI with HDInsight Part 2: Power BI Desktop and Hive

My goals for this series

1. Document using Power BI with HDInsight

2. Prove that you can set up an HDInsight Cluster with no scripts

Other References from the Session

Azure: http://azure.microsoft.com/en-us/

Cloud Berry: http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx

Wrap Up from the Session

A few questions were asked during the session and I wanted to handle some of them here.

Why did you not use Azure Resource Manager to deploy storage?

I kept the demos as simple as possible and did not need the Resource Manager for them. However, if you need to rebuild the cluster quickly, the Azure Resource Manager would be a good option. Find out more here: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-provision-clusters/. This site also walks through scripts and other options for setting up HDInsight clusters.

Why didn’t the table structure show up in the Power Query demo?

The Power Query demo worked with the data using a file-based approach. This approach is more “raw”. The files did not have column headers, so no headers were created in the table. However, when working with the Power BI Desktop demo, I used Hive. The table was defined in Hive, so its columns were easily seen. This is another case for using Hive or something similar to define the schema for ease of use.

What are the differences between Hadoop, Hortonworks, and HDInsight?

Starting from the top, Hadoop is the Apache open source project. All of the products listed above are based on Hadoop.

Hortonworks and Cloudera are examples of Hadoop distributions. These companies have worked with the various versions of open source technologies around Hadoop and created a supported distribution as a result.


Finally, HDInsight is Microsoft’s cloud-based Hadoop implementation. They continue to add functionality including Spark, R, Giraph, and Solr. You can expect Microsoft to continue to grow the capabilities of HDInsight as part of their cloud-based analytics solutions.

Thanks for attending my session.

Using Power BI with HDInsight Part 2: Power BI Desktop and Hive

With the rise of HDInsight and other Hadoop based tools, it is valuable to understand how Power BI can help you take advantage of those big data investments. If you need to set up a cluster to work with, check out my previous posts on Setting Up an HDInsight Cluster and Loading Data Into Your New HDInsight Cluster. These posts show how to do this with no scripting required. If you prefer to script, there are a number of resources with sample scripts on doing the same work.

In this article, I will focus on using Power BI Desktop to get data from the Hadoop file structure in HDInsight using a Hive query. I will also be using the restaurant data I loaded as noted in the previous posts. If you need to create a cluster and load data I encourage you to check the following blog posts:

These posts walk through the process of creating a cluster and loading up data.

Connecting to HDInsight Using the Hive ODBC Driver

Before you can connect using a Hive query, you need to download the Hive ODBC driver from Microsoft. You can find the driver here: http://www.microsoft.com/en-us/download/details.aspx?id=40886. Once you have the driver installed, the connection can be created.

Open Power BI Desktop and click Get Data on the splash screen. This will open the Get Data dialog. Scroll down until you see the ODBC option. (Do not select a Hadoop or HDInsight option. See my previous post on connecting using HDInsight.)


Click Connect to start the process.


Here is where the “fun” begins. You get no help creating a Hive connection string. It took some searching and trial and error to figure out what was needed to make this happen. Here are the properties you need:

  • Driver:  Driver={Microsoft Hive ODBC Driver}
  • Host: Host=yourHDInsightservername.azurehdinsight.net (Your HDInsight server name)
  • Port: Port=443
  • Schema: Schema=default (default Hive database schema)
  • RowsFetchedPerBlock: RowsFetchedPerBlock=10000 (This is the default)
  • HiveServerType: HiveServerType=2 (This is the default)
  • AuthMech: AuthMech=6
    • This is the Authentication Mechanism which is Windows Azure HDInsight Service.
  • DefaultStringColumnLength: DefaultStringColumnLength=200 (Default is 32767; this should always be set lower)

Each property is separated by a semicolon. My completed connection string looked like this (Note: I added spaces to fit better in the post.):

Driver={Microsoft Hive ODBC Driver}; Host=hugheshdinsight.azurehdinsight.net; Port=443;Schema=default; RowsFetchedPerBlock=10000; HiveServerType=2; AuthMech=6; DefaultStringColumnLength=200;
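If you would rather assemble the string than type it by hand, the properties listed above can be joined programmatically. The following is a minimal Python sketch, not part of the original demo; the host is the placeholder from the list and the function name is hypothetical.

```python
# Connection properties from the list above. The host is a placeholder;
# replace it with your own HDInsight server name.
hive_odbc_properties = {
    "Driver": "{Microsoft Hive ODBC Driver}",
    "Host": "yourHDInsightservername.azurehdinsight.net",
    "Port": 443,
    "Schema": "default",
    "RowsFetchedPerBlock": 10000,
    "HiveServerType": 2,
    "AuthMech": 6,
    "DefaultStringColumnLength": 200,
}


def build_connection_string(properties):
    """Emit each property as key=value followed by a semicolon, with no spaces."""
    return "".join(f"{key}={value};" for key, value in properties.items())


connection_string = build_connection_string(hive_odbc_properties)
print(connection_string)
```

The result contains no spaces between properties, which matches the form of the string before the spaces were added for readability in this post.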

Enter the connection string into the dialog and you will be prompted for credentials. Use the Database option and set the Username and Password. Then click Connect. In my case, I see three tables in the result set, including the sample table. We have connected to our HDInsight cluster using Hive.

image

Retrieving Data from HDInsight Using HiveQL

So, getting a list of tables is not really helpful. As you can see, this is the Power Query portion of the Power BI Desktop. Let’s add a HiveQL statement to return only our sales data.

In Applied Steps, click the gear next to Source. This will reopen the From ODBC dialog. Expand the SQL Statement portion and add a SELECT * FROM yourtable statement to get our desired result set. Click OK and check the results again. You should see the tablename.fieldname format for column headers. At this point, you can proceed with more data shaping and prep the data for other analytics. Click Close & Load when you are done and it will load the data into the Power Pivot designer in Power BI Desktop.
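One common shaping step is dropping the table prefix from those tablename.fieldname headers. In Power BI Desktop you would do this by renaming columns in the query editor; the Python sketch below only illustrates the transformation itself. The function name and the sample headers are hypothetical.

```python
def strip_table_prefix(column_names):
    """Drop everything up to and including the first dot in each header."""
    return [name.split(".", 1)[-1] for name in column_names]


# Hypothetical headers, as they might come back from SELECT * FROM sales.
headers = ["sales.transactiondate", "sales.storeid", "sales.amount"]
clean_headers = strip_table_prefix(headers)
print(clean_headers)
```

Headers without a dot pass through unchanged, so the function is safe to apply to a mix of prefixed and unprefixed names.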

I hope you enjoyed this series through HDInsight and Power BI. It was a great learning experience for me.

Using Power BI with HDInsight Part 1: Power Query and Files

With the rise of HDInsight and other Hadoop based tools, it is valuable to understand how Power BI can help you take advantage of those big data investments. If you need to set up a cluster to work with, check out my previous posts on Setting Up an HDInsight Cluster and Loading Data Into Your New HDInsight Cluster. These posts show how to do this with no scripting required. If you prefer to script, there are a number of resources with sample scripts on doing the same work.

In this article, I will focus on using Power Query to get data from the Hadoop file structure in HDInsight. I will be using Excel 2013 with the Power Query Add-In. I will also be using the restaurant data I loaded as noted in the three previous posts. If you need to create a cluster and load data I encourage you to check the following blog posts:

These posts walk through the process of creating a cluster and loading up data.

Connecting to HDInsight

First, open a new Excel workbook and click the Power Query tab. Once there, you can find the Azure HDInsight source in the From Other Sources dropdown. Select that option to open the following dialog:

image

You will need your storage account name to continue, followed by the storage account key. Once you have added the key, the Navigator will open on the right side of Excel.


It should show the name of your cluster and the default container name. Double click the container name and it will open the Power Query window. It will show all the files available in the container. Even though we have it organized in folders, the view shows all the files. If you have a large number of files and you don’t want to scroll to find them, you can click the down arrow on the Folder Path column and use the text filter to find the folder you are looking for.


Now I have the files I want to use in Power Query. If you click the binary link it will open a copy of the file. However, this is not how we want to work with the data as we have multiple files. (If you did this, remove steps up to the Filtered Rows step in the Applied Steps section.) I now have the files I uploaded showing.


In order to work with them all together we need to Combine Binaries.


This now merges all the files into a single dataset. You can now do any data shaping you would like using standard Power Query methods, such as updating the column names and the data types. You can even add columns such as a Total Amount column. Here is what my final query looks like, including the steps I did. Be sure to give your query a meaningful name. (Note that Power Query realized my Transaction Date column was a date data type and changed it for me.)

image
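For readers who think better in code, the combine-and-shape steps can be sketched in Python. This is not what Power Query runs internally; it is a minimal illustration of merging header-less files, naming the columns, setting data types, and adding a Total Amount column. The column names and sample rows are assumptions, since the actual file layout is not shown in this post.

```python
import csv
import io

# Hypothetical column names; the uploaded files have no header row.
COLUMN_NAMES = ["transaction_date", "restaurant", "quantity", "unit_price"]


def combine_files(file_objects):
    """Read every header-less CSV file and return one list of row dicts."""
    combined = []
    for file_object in file_objects:
        for values in csv.reader(file_object):
            if not values:
                continue  # skip blank lines
            row = dict(zip(COLUMN_NAMES, values))
            # Set the numeric data types, then add the derived total column.
            row["quantity"] = int(row["quantity"])
            row["unit_price"] = float(row["unit_price"])
            row["total_amount"] = row["quantity"] * row["unit_price"]
            combined.append(row)
    return combined


# Two in-memory stand-ins for files in the container (made-up data).
file_a = io.StringIO("2015-08-01,Diner,2,4.50\n2015-08-01,Cafe,1,3.00\n")
file_b = io.StringIO("2015-08-02,Diner,3,4.50\n")

dataset = combine_files([file_a, file_b])
for row in dataset:
    print(row)
```

Because the files carry no headers, the column names have to come from somewhere outside the data, which is the same gap that defining the table in Hive fills.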

Now click Close & Load and your data will be loaded into Excel. You can change the query to load a model in Power Pivot as well to do additional work with the data.

The next post will walk through using Power BI Desktop to load data using a Hive query.