Tag Archives: Hadoop

Building My HDInsight Server Cluster

After all the hype about Big Data, Hadoop, and now HDInsight, I decided to build out my own big data cluster on HDInsight. My overall goal is to have a cluster I can use with Excel and Data Explorer.  After all, I needed more data in my mashups. I am not going to get into the details or definitions of Big Data; there are entire books on the subject.  I will discuss any issues or tidbits I run into along the way.

Setting Up the Environment

I am actually doing this on a VM on my Windows 8 laptop.  I created a Windows Server 2012 VM with 1 GB of RAM and 50 GB of storage.  (Need some help creating a VM in Windows 8? Check out my post on the subject.)

Installing the HDInsight Server

First, this product is still in Preview at the time of this writing, so mileage will vary and likely change over the next few months.  You will find the installer at http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW.  This uses the Microsoft Web Platform Installer.  When prompted I just ran the installer.  This took about one hour to complete on my VM setup. Once it completed, it opened up the dashboard view in IE.


At this point we have installed a cluster called “local (hdfs)”.

Exploring My Local Cluster

Well, things did not go well at first.  Whenever I clicked the big gray box to view my dashboard, I received the following error: “Your cluster ‘local (hdfs)’ is not responding.  Please click here to navigate to cluster.”  I clicked “here” and ended up on an IIS start page.  Not really effective.  Let the troubleshooting begin.

Based on this forum issue response, I opened the services window to find that none of my Apache Hadoop services were running after a restart AND they were set to manual.  To resolve this I took two steps.  First, I changed all of my services to run automatically.  This makes sense for my situation because the VM would be running when I wanted to use HDInsight.  Second, I used the command line option to restart all of the services as also noted in the forum post above.
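As a rough illustration of the first step (switching the services to automatic start), here is a minimal Python sketch that only builds the Windows `sc` command lines and does not run them. The service name in the example is a hypothetical placeholder, not an actual HDInsight service name; substitute the Apache Hadoop service names listed in your own services window. This is not the command from the forum post.

```python
def build_service_commands(service_names):
    """Build 'sc' commands that set each service to automatic start and start it.

    The commands are returned as argument lists (suitable for subprocess.run)
    and are NOT executed here.
    """
    commands = []
    for name in service_names:
        # "sc config <name> start= auto": sc.exe expects "start=" and "auto"
        # as two separate tokens.
        commands.append(["sc", "config", name, "start=", "auto"])
        # "sc start <name>" starts the service right away.
        commands.append(["sc", "start", name])
    return commands


# Hypothetical placeholder name, used for illustration only.
for command in build_service_commands(["example-hadoop-service"]):
    print(" ".join(command))
```

To actually apply the commands, each list could be passed to `subprocess.run` from an elevated prompt on the VM.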

From a command prompt, run the restart command given in the forum post to restart all Hadoop services.


And, VOILA!, my cluster is now running.


Maybe we can get a better error message next time.

At this point I walked through the Getting Started option on the home screen and proceeded to do “Hello World”.  I used these samples as intended to get data in my cluster and start working with the various tools.  Stay tuned for more posts in the future on my Big Data adventures.

Why Not HDInsight Service on Azure?

The primary reason I did not use the HDInsight Service on Azure was that I did not want to risk the related charges.  Once I have a good understanding of how HDInsight Server works, I will be more comfortable working with HDInsight Service.

Other Resources

Here are some of the resources I used throughout the build.

HDInsight Service Quick Start and Tutorials

Getting Started With Microsoft HDInsight

Why I am excited about SQL Server 2012 (Part 2)

Earlier this month I published a blog entry on this same subject.  In honor of the local Minneapolis launch event, I decided to expand the list.  You can find five more reasons I am excited out on Magenic’s blog.

Here is the link and enjoy SQL Server 2012.


SQL PASS Summit–Day 4 (BI Sessions)

Here is the summary of the sessions I attended on my 4th day at the Summit. (Technically, this is the second day of the Summit.)  I went to four sessions, all of which were in the BI tracks.

Top 10 Things You Didn’t Know about SSIS by Matt Masson, Microsoft (@mattmasson)

Be sure to check out the Tech Ed Video with CTP 3 enhancements.

Two New Concepts in SQL Server 2012 Integration Services:  Projects and Parameters.  SSIS projects now compile down to a single ispac file (“ice packs”).  This will allow you to deploy packages in a project.  Parameters have been added to better support configurations and they are exposed by the package execution UIs.

The SSIS Catalog.  A SQL Server User database called “SSISDB” is where the catalog is stored.  This provides a central place for configuration including new parameters and connection managers.  Security is also much improved as encryption is now managed at the server level.  Logs and reporting will also be recorded in the catalog.

Top 10 List:

1. Change Data Capture and 2. ODBC Support through Attunity partnership.

CDC for Oracle databases will result in better integration similar to SQL Server CDC.  CDC components in SSIS will work with both SQL Server and Oracle.  Three new tasks and components have been added to SSIS: CDC Control Task, CDC Source (reads changes), and CDC Splitter (handles insert/update/delete).
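To make the splitter idea concrete, here is a minimal Python sketch (not SSIS code) that routes change rows into insert, update, and delete outputs. It assumes each row carries the `__$operation` code used by SQL Server CDC change tables (1 = delete, 2 = insert, 3 = update with old values, 4 = update with new values). The function name and the sample rows are made up for illustration.

```python
# Operation codes as used in SQL Server CDC change tables.
DELETE, INSERT, UPDATE_BEFORE, UPDATE_AFTER = 1, 2, 3, 4


def split_changes(rows):
    """Route change rows into insert/update/delete lists by operation code."""
    outputs = {"insert": [], "update": [], "delete": []}
    for row in rows:
        operation = row["__$operation"]
        if operation == INSERT:
            outputs["insert"].append(row)
        elif operation == UPDATE_AFTER:
            outputs["update"].append(row)
        elif operation == DELETE:
            outputs["delete"].append(row)
        # UPDATE_BEFORE rows (old values) are ignored in this sketch.
    return outputs


# Made-up sample change rows.
changes = [
    {"__$operation": INSERT, "id": 1, "name": "new row"},
    {"__$operation": UPDATE_BEFORE, "id": 2, "name": "old value"},
    {"__$operation": UPDATE_AFTER, "id": 2, "name": "new value"},
    {"__$operation": DELETE, "id": 3, "name": "removed row"},
]
result = split_changes(changes)
print({output: [row["id"] for row in rows] for output, rows in result.items()})
```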

ODBC components have also been added to support many of the non-Microsoft data sources including bulk load capabilities with the ODBC destination object.

3. Connection Manager

We now have connection managers which are shared at the project level.  Connection managers get marked as offline if the data source is not available or you want to bring it offline.  The capability to share in-memory cache across packages at the project level has been added.

4. Flat File Source Improvements

Varying number of columns and embedded qualifiers are now supported.

5. Package Format Changes

Updated the XML format using attributes.  Layout information can now be ignored.  Better support for merging.  Lineage IDs have been replaced with refid to support better merge. Merging packages is still not recommended, but it is better supported.

6. Visual Studio Designer Changes

Parameters are in a Visual Studio design tab which allows you to set up defaults for various environments.

7. Script Components

Uses compiled IL in .NET for better performance.  Script component debugging is new in 2012, with the ability to debug script components and see the values of the data in the pipeline.  All script components support debugging.

8. Troubleshooting and Logging

Because all packages use the SSIS catalog, all error and performance logging, including task execution timing and row counts, is recorded in the SSISDB.  Much of this information will be available in built-in reports.

9. Data Taps

Allows data in a pipeline to be dumped to a flat file.  This allows for troubleshooting packages with errors in production.  You can create data taps in SSMS using T-SQL.

10.  Server Management using PowerShell

The whole SSIS catalog can be managed or accessed through PowerShell or T-SQL.  This will allow you to automate or script out deployment and frequently used tasks.

You can do a lot of these tasks right in SSMS as well using menu and shortcut functionality.  You can query the SSISDB for performance counters while the package is executing.  You can query SSISDB for package execution progress at the task and component levels.

There is built-in reporting.  You start from the Integration Services Dashboard with a bunch of stats on the front page.  It includes a number of built-in reports with performance data and other relevant topics.  Jamie Thomson has a CodePlex report pack (http://sqlblog.com/blogs/jamie_thomson/archive/2011/09/04/ssis-reporting-pack-v0-2-now-available.aspx).  Check out the post from Jamie on the SSIS blog about tables as well (http://sqlblog.com/blogs/jamie_thomson/archive/2011/07/16/ssis-logging-in-denali.aspx).

The reports bubble up errors related to connections.  Drill-down reports are available from the error messages in the tool. Logs have a configurable retention value. Versions of projects are also maintained and the number of versions is configurable.

11.  Project Barcelona Metadata Discovery

This is a project that is coming soon.  Demos available online.  This is not a metadata repository.  This is zero touch.  Targets get crawled and added to an Index Server.  Check out the project blog at http://blogs.msdn.com/b/project_barcelona_team_blog/.

Tier-1 BI in the World of Big Data – Denny Lee and Thomas Keiser, SQLCAT

There are two models to work with: Dimensional (UDM) and Tabular (formerly BISM).  UDM is the current large-scale engine. Tabular puts all of the data in memory, so it is not very suitable for large cubes.

New themes for workload scale:

  • Cheap storage at scale
  • Massive query scale (both size and concurrent)
  • Scaling ETL another order of magnitude
  • Scaled and integrated reporting/BI

NoSQL ecosystems (e.g. Facebook, Yahoo, Twitter, Google, Amazon data offerings) are being used frequently.  Not only are people using these systems, they are building solutions on top of them.  This is why Microsoft has announced the Hadoop integration.

MapReduce systems (e.g. Hadoop) and RDBMS systems are actually complementary, not competing, in most cases.  Hadoop is about dynamic schemas.  Getting the data somewhere is the important piece.  Hadoop / NoSQL: move compute to the data. BASE: Basically Available, Soft state, Eventually consistent. Each row may have a different schema.

Traditional RDBMS systems move data to compute.  They answer the question quickly and are ACID compliant.

The balance between scale and performance is getting more important.  What do you want? Guaranteed Response, but get it slow.

Hive Connector is the first step in integration with the BI Platform.  Hive is a data warehouse framework for Hadoop.  This is the starting point, not the end game for adding structure to Big Data for consumption by other tools such as SSAS.  HiveODBC Driver will allow the cube to be loaded directly from Hadoop.

Yahoo – “Fast” Oracle Load – 24 TB cube. This is the largest known cube in the world.  Yahoo moves data from Hadoop into Oracle and processes the cube from there.

Check out the new SSAS 2008 R2 Performance Guide from SQLCAT at http://sqlcat.com/sqlcat/b/whitepapers/archive/2011/10/10/analysis-services-2008-r2-performance-guide.aspx.

What’s New in StreamInsight?  Isaac Kunen, Microsoft

Stream Data Processing – unbounded streams of data with queries lasting forever.  This engine handles the data coming through.  Temporal Semantics can be added to the StreamInsight data to make it more understandable.

Event processing varies greatly based on business function – latency versus aggregated data rate.  StreamInsight is designed to handle high volumes of low latency data. Common use cases – Fraud Detection, Shipping, Sales, Clickstream Analysis, Security, Building Management.  Common scenario is pattern analysis.

StreamInsight Releases:

  • StreamInsight 1.2 is available today on premise.
  • StreamInsight on Azure is currently in private CTP and should open up later this year.

Temporal Semantics

  • 3 event types: interval (begin and end), point (“one tick”), and edge (only the beginning, with an end to be defined later).
  • CTI = Current Time Increment.  Up to this point in time, all events are processed.  The period from one CTI to the next is the period reported.  CTIs have to increment into the future, but the periods between them do not need to be equal.  After the CTI is issued, the data is gone from memory.
  • You write declarative queries against StreamInsight in LINQ.  It is the actual query language for SI, not just a wrapper.
  • Select and Project Operators return events from a stream.
  • Join Operator returns events that overlap across streams.
  • Snapshot Window: any time the event stream changes, a new window starts.  This window is completely data driven.
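The snapshot window idea can be sketched in a few lines of Python. This is a simplified illustration, not StreamInsight or LINQ code: it assumes events are half-open intervals `[start, end)` with a payload, and it starts a new window at every point where some event begins or ends. The function name and sample events are hypothetical.

```python
def snapshot_windows(events):
    """Return (window_start, window_end, payloads) for each snapshot window.

    events: list of (start, end, payload) tuples, treated as [start, end).
    A window boundary is created wherever any event starts or ends.
    """
    boundaries = sorted({t for start, end, _ in events for t in (start, end)})
    windows = []
    for window_start, window_end in zip(boundaries, boundaries[1:]):
        # An event belongs to the window if its lifetime overlaps the window.
        active = [payload for start, end, payload in events
                  if start < window_end and end > window_start]
        if active:
            windows.append((window_start, window_end, active))
    return windows


# Two overlapping interval events produce three data-driven windows.
sample_events = [(1, 5, "a"), (3, 8, "b")]
windows = snapshot_windows(sample_events)
for window in windows:
    print(window)
```

Because the boundaries come only from the events themselves, the windows vary in length, which is what "completely data driven" means here.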

Resiliency through checkpointing has been added to StreamInsight.  This allows you to restart from specific points.  Performance Counters and Admin Logs have also been added to StreamInsight 1.2.  Check out his blog for more information: http://blogs.msdn.com/b/isaac/archive/2011/07/25/introducing-checkpointing-in-streaminsight-1-2.aspx.

Building the Perfect BI Semantic Model for Power View – Carl Rabeler (SQLCAT), Kasper de Jonge – Microsoft SSAS

You can import a PowerPivot workbook into SQL Server Data Tools as a Tabular SSAS project.  This will create a Tabular Model in SSAS.

Changes in SQL Server Data Tools with Tabular models change the deployed Tabular SSAS model.  You should use a small set of data when developing the model because the data needs to be loaded in memory. You can add data mining models to the tabular model by creating a connection to an SSAS Multidimensional data source.  When you add a data source to the tabular model in design, it will add the data to the tabular model (the workspace database).  Columns can now be added into the model in any position.

There is an “effective user name” property in the connection string to handle double-hop issues without Kerberos.  The “Set execution context to this account” option sets this property in the Reporting Services Data Source (RSDS) connection in SharePoint.  This functionality is currently available in SQL Server 2008 R2. In SQL Server 2012, this can be seen in the Profiler trace in the Discover events.

To add images to the Tabular model, you can use an embedded image, stored image (e.g. in the database), or link.  If you use a link, be sure to set the ImageURL property to “true” for this column so Power View knows that it is an image.

The tabular model has a wealth of metadata options that will be carried forward to Power View.  For instance, if you set the Default Label and Default Image in the tabular model table, Power View will display these in many visualizations including Card View.

Be aware that in Power View, the refresh button in the tool refreshes data, whereas an IE refresh will reload the metadata.

Time intelligence functions in DAX will allow you to create calculations such as Year To Date.  However, if your tabular model uses surrogate keys to map to the Date table, you need to mark the table as a Date table and specify the date column to use.  This is needed to make sure the Time Intelligence functions work correctly.
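For readers new to the concept, a Year To Date calculation can be sketched in plain Python. This is only an illustration of the arithmetic, not DAX; the function name and the sample sales figures are made up. It also shows why a real date column matters: the filter compares actual dates, which a surrogate key alone cannot provide.

```python
from datetime import date


def year_to_date(sales, as_of):
    """Sum amounts from January 1 of as_of's year through as_of (inclusive).

    sales: iterable of (date, amount) pairs.
    """
    return sum(amount for day, amount in sales
               if day.year == as_of.year and day <= as_of)


# Made-up sample data.
sample_sales = [
    (date(2011, 12, 31), 5),
    (date(2012, 1, 15), 10),
    (date(2012, 3, 1), 20),
    (date(2012, 6, 1), 40),
]
print(year_to_date(sample_sales, date(2012, 3, 31)))
```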

The RANKX(Table, Expression, [Value], [Order], [Ties]) function was illustrated to show how to rank items within a table. Ranking:=RANKX(all(Products), Sum(Margin)).  Watch out for situations where you are using a complex key.
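To show what a ranking like this computes, here is a small Python sketch, not DAX. It sums margin per product and ranks the totals in descending order, with tied products sharing a rank and the next rank skipped, which I believe matches the RANKX defaults. The function name and sample rows are hypothetical.

```python
from collections import defaultdict


def rank_by_margin(sales):
    """Rank products by total margin, highest first; ties share a rank.

    sales: iterable of (product, margin) pairs.
    """
    totals = defaultdict(float)
    for product, margin in sales:
        totals[product] += margin
    ranks = {}
    for product, total in totals.items():
        # Rank = 1 + number of products with a strictly larger total,
        # so ties share a rank and the following rank is skipped.
        ranks[product] = 1 + sum(1 for other in totals.values() if other > total)
    return ranks


# Made-up sample rows.
sample_rows = [
    ("Bike", 100.0), ("Helmet", 30.0), ("Bike", 50.0),
    ("Lock", 30.0), ("Pump", 10.0),
]
ranking = rank_by_margin(sample_rows)
print(ranking)
```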

DirectQuery functionality will allow you to directly query SQL Server relational data.  There is a list of limitations to using DirectQuery.  Model with it carefully.

SQL PASS Summit–Final Day (Keynote)

Got to David DeWitt’s Keynote a bit late so this will not cover as much as normal.  (For the record, I missed a lot in 5 minutes.)  Here are some of the notes starting at the NOSQL discussion I was able to get.

NoSQL does not mean NO to SQL.
NoSQL should mean Not Only SQL

Two major NOSQL systems:
1. Key/Value stores – NOSQL OLTP
2. Hadoop – NOSQL data warehousing

Two Universes are the New Reality:
1. Relational DB Systems
2. NOSQL Systems

Relational DB Systems are no longer the only game in town. The world has changed.  However, this is not a paradigm shift; SQL Server and other relational database systems will NOT go away.  (I call this job security.)  These systems have different purposes, so the question is when to use each one, not which one is the only one I should use.

The next part of the discussion is about Hadoop.  It all started with Google.  Hadoop = HDFS (file system) + MapReduce (programming framework).  It already has a huge ecosystem.  Here is some discussion on those components:

  • HDFS – underpins the entire Hadoop ecosystem.  Scalable to 1000s of nodes, assumes failures are common.  Replication factors are used to handle multiple failures by distributing the same block of data to 3+ nodes.  No use of mirroring or RAID in order to reduce cost and complexity.  On the negative side, you have no idea where your data is, making it hard to get good performance.
  • MapReduce – This is the programming framework to support analyzing data sets in HDFS.  Essentially it takes a large problem and divides it into subproblems (map), next performs the same function on all subproblems (map), and finally combines the output (reduce). The JobTracker tracks the tasks in MapReduce; the NameNode tracks the data in the blocks.  Core value of MapReduce:  Divide and Conquer.  It is highly distributed, making it fault tolerant.  One of the cons is a lack of schema, which makes sharing data and optimizing difficult.
  • HiveQL and Pig – Facebook produced a SQL-like language called HiveQL and Yahoo produced a more procedural language called Pig. Both were developed to hide the complexity of building MapReduce functions.  HiveQL reduced 4 pages of MapReduce code to about 10 lines.  Awesome demo.  HiveQL takes the best features of SQL and combines them with MapReduce.
  • Sqoop – Command line load utility for Hadoop to RDBMS data loads.  Data has to be moved from structured to unstructured because the unstructured data has not been organized or cleansed.  It is limited because each map query requires a table scan on the relational system, which performs poorly.
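The divide-and-conquer flow described in the MapReduce bullet above can be sketched with a single-process word count in Python. This is only a toy illustration of the map, group, and reduce steps; it is not Hadoop code, nothing here is distributed, and the function names and sample text are made up.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Each chunk stands in for a block of data handled by a separate map task.
chunks = ["big data big cluster", "data cluster data"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In a real cluster the map calls would run in parallel on the nodes holding each block, which is where the fault tolerance and scale come from.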

In summary, we will live in two worlds – Parallel DB systems and Hadoop.  Relational databases and Hadoop are complementary systems not competing systems.

I was unable to do this keynote justice.  The amount of information that he covered was immense and worthwhile.  Check out this slide deck at http://pages.cs.wisc.edu/~dewitt/includes/passtalks/passtalks.html. I plan to watch this keynote a few more times to understand more of what was said.  Hopefully this taste of Hadoop and NoSQL was helpful.