Tag Archives: Big Data

Setting Up an HDInsight Cluster (No Scripts Required)

Let me start by saying, I am not a fan of scripting. It definitely has its place and a lot of my peers really like it. It is the easiest way to get functionality out from software vendors such as Microsoft. PowerShell is an incredibly powerful tool which can do just about anything. However, therein lies the problem for me. Scripting solves a lot of problems, however, I just wanted to set up and use a basic HDInsight cluster to create some Power BI demos (posts coming soon). So I started the journey to find the scripts and try to understand the syntax and so on. Then I went to the Azure Portal, here is what I did to set up my cluster and load data with no scripting required. My goal was to go to get a working demo platform up. Would I necessarily recommend this path for production work, not sure yet. But now I can work with HDInsight with considerably less work required to set up the environment.

HDInsight Cluster No Script Setup Requirements

You need an Azure account. You can go to http://azure.microsoft.com to sign up for a free account if you like. If you have an MSDN subscription you should have some time available as well.

HDInsight Cluster No Script Setup

Once you have your account created, you should go to http://portal.azure.com. We will be doing our setup from here. During the process we will be creating a storage account (if this is your first run in azure, you may choose to set up a Resource Group as well) and the HDInsight cluster. Be aware that the cluster has compute costs and the storage has storage costs. At the end we will remove the cluster to save your compute time.

Create the Storage Account

This step can be done during the HDInsight cluster creation, but this limits your ability to share data across clusters. If you are just trying it for fun, you can do this during the cluster set up.

Click the + symbol on the portal, then Data + Storage, then Storage Account. This will open a blade with the set up instructions for a storage account.

image

When you create your account you will have some options to fill in:

  • Name: this name will need to be a unique name, e.g., joescoolhdinsight
  • Pricing tier: The pricing tier is really important if you are using a limited plan or if you plan to keep the data for a long time. If you are planning to use this as a demo, I would select Locally Redundant as that is the lower cost plan.
  • Resource Group: The resource group lets organize your Azure assets. This is for your benefit, so if you want to keep all of the HDInsight components together, you could create a group for that or stick with the default.
  • Subscription: This lets you choose the subscription you want to use.
  • Location: Be sure to select a location close to you that supports HDInsight. Check http://azure.microsoft.com/en-us/regions/ to see what Azure services are supported in each region.
  • Diagnostics: This is optional. If you are looking into the diagnostics or need to prep for production, you will find this useful. In most cases, we would not turn this on for demos.

Click Create and it will create your storage account. This may take a few minutes. The notifications section on the portal will alert you when this has been completed. Once that is complete, we will continue with setting up the cluster.

Create a SQL Database for a Metastore

This is an optional section. If you would like to use Hive or Oozie and want to create a metastore so you can reuse your work when you recreate the cluster you need to create a SQL Database for this.

image

Here are the settings needed to create the database:

  • Name: Something easy for you to track on, e.g. HDInsightMetastore.
  • Server: You can use an existing server if you have it, otherwise you can create a new server. I recommend you create the server in the same location you plan to create your HDInsight cluster.
  • Pricing Tier: The default is S0. If you plan to use this for demos and don’t need the additional features, you can choose Basic.
  • Optional Configuration: no changes.
  • Resource Group: Use the Resource Group you have in place for this example.
  • Subscription: Select your subscription.

Click Create to create your database. You will work with this during the setup of your cluster.

Setting Up the HDInsight Cluster

Click the + symbol on the portal, then Data + Analytics, then HDInsight. This will open a blade with the set up instructions for a storage account.

image

Like with the storage account setup, this will open a blade with options for creating the cluster. Let’s walk through the settings on this.

  • Cluster Name: Like the storage account, this name needs to be unique.
  • Cluster Yype: Select Hadoop for this walk through.
  • Cluster Operating System: Select Windows Server for this walk through.
  • Subscription: Choose the same subscription as your storage account.
  • Resource Group: Choose the same Resource Group as your storage account.
  • Cluster Credentials: Here you select a login name and password for your cluster. You can also choose to enable Remote Desktop, but we are not using that feature for this setup. (Note: be sure to click Select at the bottom when you are done. If you don’t, you will be prompted by IE about unsaved settings.)
  • Data Source: Here is where you select your storage account. If you chose not to create a storage account, you can create a new account here as well.
  • Node Pricing Tiers: This section determines the capability and the associated computing costs of your cluster. By default, 4 worker nodes and 2 head nodes will be created with recommended servers (D12 at the time of this writing). Expand the pricing tier, to change the server type or node count. Unless you are sure you need to change, keep the default settings (you can recreate the cluster later). You will see the current hourly pricing based on your selections. This cost is incurred while the service is running. The only way to stop charges is to delete the cluster, so be sure to do this when you are done if you do not wish to pay for it to keep running.
  • Optional Configuration: You do not need to change any setting here if you choose not to. However, if you plan to delete your cluster and you want to retain the metadata, it is recommended that you set up an External Metastore using the database you created previously.
    • Select the database you want to use for the metadata in each case and update the credentials. You can use the same database for both metastores.

Next, you create the cluster. This will take a few minutes. You can track progress in the notifications section on the portal page.

Exploring Your New HDInsight Cluster

Once the cluster has been created, you will see the information page with the settings and other usage information. At the top of that area, you will see some icons. These will help you explore your cluster some more.

image

The gear will open up a settings page and you can review your settings in detail and change some if needed.

The icon with a square and an arrow will open up a dashboard with more options. We will dig into the dashboard more in the next post.

The last three icons are shortcuts to specific actions – remote desktop, scale cluster, and delete.

Once you are done, you should delete your cluster. You can always go through these steps again to recreate your cluster. In my next article we will go through what you can see and do with your cluster using the dashboard.

Intro to Data Factory–Training on the T’s Follow Up Post

PragmaticWorks-LogoThis is a follow up blog post based on the Intro to Data Factory session I gave on the Training on the T’s with Pragmatic Works. Find more free training from the past and upcoming here. I did my session on January 13, 2015.

 Intro To Data Factory

In this session, I gave a simple introduction to new Azure Data Factory using a CopyActivity pipeline between Azure Blob Storage and Azure SQL Database. Below is a diagram illustrating the factory that is created in the demo.

image

I have published my presentation materials here. This includes the sample JSON files, the Movies.csv, and PowerShell scripts.

Q & A

Here are a few questions that were answered during the session.

1. Does Availability refer to when data that has been transferred will be available? Or when the data source is actually available for query?

Availability refers to when the datasets will make a slice available. This is the when the dataset can be consumed as an input or be targeted as an output. This means you can consume data hourly but choose to push it to its final destination on a different cadence to prevent issues on the receiving end.

2. What pre-requisites are must haves?…e.g.(Azure account, HDInsight, Blob Storage Accounts, etc.)

    • An Azure Account is the only real must have. You could use two on premise SQL Server instances.
    • HDInsight if you want to use the HDInsight activitities
    • An Azure Storage account to use blob or table storage

3. How do you decide to use a Factory or Warehouse?

The factory is more of a data movement tool. A warehouse could be a source or target of a factory pipeline.

4. Is this similar to SSIS in SQL Server?

Yes and no. SSIS is definitely more mature and has more tooling available such as data sources and transformations. SSIS also have a good workflow constructor. The focus of the Data Factory initially was to load HDInsight tables from a variety of sources with more flexibility. The other note here is that Data Factory is being built from the ground up to support the scale of the cloud or Azure.

5. Can this be used for Big Data?

Absolutely. I would say that it is one of the primary reasons for the tool. In reference to the previous question, it will likely be the tool of choice for big data operations because it will be able to scale with Azure.

Links to Additional Resources on Data Factory or tools that were used in the presentation:

Azure Data Factory on Azure’s Website

Azure Data Factory Documentation

Azure Data Factory Pricing

Azure Storage Explorer

Azure PowerShell Documentation

Thanks for joining me for this presentation. We look forward to seeing you at the next Free Training on the T’s.

Building My HDInsight Server Cluster

After all the hype about Big Data, Hadoop, and now HDInsight, I decided to build out my own big data cluster on HDInsight. My overall goal is to have a cluster I can use with Excel and Data Explorer.  After all, I needed more data in my mashups. I am not going to get into the details or definitions of Big Data, there are entire books on the subject.  I will discuss any issues or tidbits during the process while I am here.

Setting Up the Environment

I am actually doing this on a VM on my Windows 8 laptop.  I created a Windows 2012 VM with 1 GB of RAM and 50 GB of storage.  (Need some help creating a VM in Windows 8, check out my post on the subject.

Installing the HDInsight Server

First, this product is still in Preview at the time of this writing, so mileage will vary and likely change over the next few months.  You will find the installer at http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW.  This uses the Microsoft Web Platform Installer.  When prompted I just ran the installer.  This took about one hour to complete on my VM setup. Once it completed, it opened up the dashboard view in IE.

image

At this point we have installed a cluster called “local (hdfs)”.

Exploring My Local Cluster

Well, things did not go well at first.  Whenever I clicked the big gray box to view my dashboard, I received the following error: “Your cluster ‘local (hdfs)’ is not responding.  Please click here to navigate to cluster.”  I clicked “here” and ended up on a IIS start page.  Not really effective.  Let the troubleshooting begin.

Based on this forum issue response, I opened the services window to find that none of my Apache Hadoop services were running after a restart AND they were set to manual.  To resolve this I took two steps.  First, I changed all of my services to run automatically.  This makes sense for my situation because the VM would be running when I wanted to use HDInsight.  Second, I used the command line option to restart all of the services as also noted in the forum post above.

From a command prompt execute the following code to restart all Hadoop services:

c:\hadoop\start-onebox

And, VOILA!, my cluster is now running.

image

Maybe we can get a better error message next time.

At this point I walked through the Getting Started option on the home screen and proceeded to do “Hello World”.  I used these samples as intended to get data in my cluster and start working with the various tools.  Stay tuned for more posts in the future on my Big Data adventures.

Why Not HDInsight Service on Azure?

The primary reason I did not use the HDInsight Service on Azure was that I did not want to risk the related charges.  Once I have a good understanding of how HDInsight Server works, I will be more comfortable working with HDInsight Service.

Other Resources

Here are some of the resources I used throughout the build.

HDInsight Service Quick Start and Tutorials

Getting Started With Microsoft HDInsight

Are You Signed Up for 24 Hours of PASS–Business Analytics?

If you have not signed up for the 24 Hours of PASS-Business Analytics you should be.  This is a great chance to hear 12 speakers (they will be repeated in the following 12 hours).  Topics are varied from Big Data to Strategy to Collaboration.  Most importantly you24 Hours of PASS Business Analytics can’t beat the price to hear speakers like Denny Lee, Peter Meyers, and Stacia Misner to name a few.

I get the privilege of moderating two of the sessions: Session 8:  What Is Big Data? by Mark Whitehorn and Session 10: Visualizing Data with Power View by Sean Boon.

Finally, I heard Marc Reguera talk about how Microsoft Finance uses Power View at a different event.  If you want to see Power View put into practical use by a business user, I highly recommend you check out his session.  I think it is the final piece of the puzzle to join the technology with the business.

I hope you all take the opportunity to join us for this compelling and free event preview to the PASS Business Analytics Conference in Chicago on April 10-12, 2013.

Why I am excited about SQL Server 2012 (Part 2)

Earlier this month I published a blog entry on this same sumagenic-custom-soltionsbject.  In honor of the local Minneapolis launch event, I decided to expand the list.  You can find five more reasons I am excited out on Magenic’s blog.

Here is the link and enjoy SQL Server 2012.

http://magenic.com/Blog/WhyIAmExcitedaboutSQLServer2012Part2.aspx