Uploading Files to an HDInsight Cluster (No Scripting Required)

As I noted in my first post, I am not a fan of scripting. In that post we set up a cluster without using scripts to do so. Now we are going to look at how to upload files without scripts. While this will work for our demo and learning purposes, I would encourage you to use scripting to handle production level loads or even if you want to upload a lot of files. While I am not a fan, it does not mean the scripting may not be a better overall tool. However, when I am trying to learn the functionality or work with system using other tools (in this case Power BI), I find that methods such as these help me be more productive sooner.

Prepping to Load Data Into Your New HDInsight Cluster

A key difference between standard Hadoop and HDInsight is file management. With HDInsight, you can load files into Azure Storage and they can be consumed by the HDInsight cluster. Keeping with the No Scripting Required mantra, we will be using a graphical interface to load files into Azure storage. There are a number of options out there, you need one of them installed. For our example, we will be using the freeware version of CloudBerry Explorer for Azure Blob Storage. Once you have your tool of choice installed you are ready to get some files.

At this point, you need some files to load. I am using some data I created for another demo. My data is in 7 files of daily receipts for my restaurant for a week in March. Once you have the data, we can load that into the cluster.

Loading Data Into Your New HDInsight Cluster

A noted above, the next steps for use will use CloudBerry Explorer to load our data. In this case, I just copied the folder with my files over to the Azure Storage once I connected the tool to Azure.


Once that is done, we will look at working with the data in Hadoop and with Hive.

Creating an External Hive Table and Querying It

You can create two types of tables using Hive – internal and external. An internal table loads the data into a Hive database. An external table applies a schema to the data without moving it. I will be creating an external table. I like this concept because it applies schema to the files that have been uploaded and allows other tools to interact with that data using HiveQL. When you drop an external table, the data remains because the table represents structure only.

In order to help everyone through this (in particular me), the next sections walk through the steps I took to create my table and select data from it. (This is not a detailed look at Hive, but rather a focus on the process of making HDInsight data available using HiveQL.)

Understanding the Files

The first step was to document the structure of the data in the files. Here is the data that I had in each of the files in column order:

  • Ticket Number – int
  • Ticket Date – date
  • Hour of the Day – int
  • Seat Number – int
  • App Amount – int
  • Entrée Amount – int
  • Non Alcoholic Amount – int
  • Alcoholic Amount – int

My structure was fairly simplistic. Each file represented a day.

Creating the Table

Now that I had the structure, I needed to work out the table DDL. (Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/TruncateTable) Here is the syntax that I came up with.

DROP TABLE RestaurantSales;
CREATE EXTERNAL TABLE RestaurantSales (ticketnumber int, ticketdate string, hourofday int, seat int, appamount int, entreeamount int, nonalcoholamount int, alcoholamount int)
STORED AS TEXTFILE LOCATION ‘wasb:///restaurant-data-files’;

The first statement drops the existing table if it exists. Unlike SQL Server, no error is thrown if the table does not already exist, so there is no need to check for existence.

The second statement creates the table. One quick note on the data types. In my case, my date was not recognized as a date in my first attempt. I changed it to string. When I ran my query it returned NULLs in that column. As noted earlier, this is an external table which means it is applying schema. The fields are terminated by a comma (‘,’). The next part is significant because it is part of HDInsight syntax. The location is prefixed by wasb which tells HDInsight we are using Azure Blob Storage for the files. The three forward slashes mean we are using the default container for files. When I uploaded the data, I added a folder called restaurant-data-files which holds the files for HDInsight. It will now apply the schema to all the files in that folder. If new files get uploaded, they would be part of the query as well.

You can run this in the Hive Editor in the HDInsight Console discussed in the previous blog post. You can track your jobs there and see when it completes.

Querying the Table

The final step is checking to see if it worked. I opened a new Hive Editor window and executed the following statement:

select * from RestaurantSales;

Voila! Once the job completed, I was able to click the session link and review the results.

Hopefully you were equally successful creating your first Hive table and query with your data. The next two posts will talk about using Power BI to interact with HDInsight data.