Building My HDInsight Server Cluster

24 04 2013

After all the hype about Big Data, Hadoop, and now HDInsight, I decided to build out my own big data cluster on HDInsight. My overall goal is to have a cluster I can use with Excel and Data Explorer.  After all, I needed more data in my mashups. I am not going to get into the details or definitions of Big Data; there are entire books on the subject.  I will, however, share any issues or tidbits I ran into along the way.

Setting Up the Environment

I am actually doing this on a VM on my Windows 8 laptop.  I created a Windows Server 2012 VM with 1 GB of RAM and 50 GB of storage.  (Need help creating a VM in Windows 8? Check out my post on the subject.)

Installing the HDInsight Server

First, this product is still in Preview at the time of this writing, so your mileage may vary and the details will likely change over the next few months.  You will find the installer at http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW.  This uses the Microsoft Web Platform Installer.  When prompted, I just ran the installer.  It took about an hour to complete on my VM setup. Once it completed, it opened up the dashboard view in IE.

At this point we have installed a cluster called “local (hdfs)”.

Exploring My Local Cluster

Well, things did not go well at first.  Whenever I clicked the big gray box to view my dashboard, I received the following error: “Your cluster ‘local (hdfs)’ is not responding.  Please click here to navigate to cluster.”  I clicked “here” and ended up on an IIS start page.  Not really effective.  Let the troubleshooting begin.

Based on this forum issue response, I opened the Services window to find that none of my Apache Hadoop services were running after a restart AND they were all set to Manual.  To resolve this I took two steps.  First, I changed all of the services to start automatically.  This makes sense for my situation because the VM would be running when I wanted to use HDInsight.  Second, I used the command line option to restart all of the services, as noted in the forum post above.

From a command prompt execute the following code to restart all Hadoop services:

c:\hadoop\start-onebox

And, VOILA!, my cluster is now running.

Maybe we can get a better error message next time.

At this point I walked through the Getting Started option on the home screen and proceeded to do “Hello World”.  I used these samples as intended to get data into my cluster and start working with the various tools.  Stay tuned for more posts in the future on my Big Data adventures.

Why Not HDInsight Service on Azure?

The primary reason I did not use the HDInsight Service on Azure was that I did not want to risk incurring the related charges.  Once I have a good understanding of how HDInsight Server works, I will be more comfortable working with the HDInsight Service.

Other Resources

Here are some of the resources I used throughout the build.

HDInsight Service Quick Start and Tutorials

Getting Started With Microsoft HDInsight





PASS Summit 2012 Wrap Up

13 11 2012

Wow, what a week. Once again, PASS put on a great event that provided much in the way of events and training for the SQL Server community. If you followed my countdown, you know some of what I love about PASS. Last year I blogged every day, but I did not do that this week. So, what was different for me? Well, for one, I volunteered much more this year than last, and I was privileged to speak twice. I spent more time meeting new people and catching up with friends, which was great as well. Enjoy my wrap-up of the week.

Tuesday – Leadership Meetings, Welcome Reception, and some Karaoke

Before the event officially kicked off, I joined community leaders from around the world for a series of leadership meetings. First, we had a meeting on SQL Saturdays, which was an opportunity to see the immense growth of these free training events around the U.S. and throughout the world. What a great opportunity for SQL Server professionals to improve their skills and for those passionate about the community to improve their abilities by leading these events. Many ideas were shared among the team, including a panel on how to effectively run a SQL Saturday on a tight budget.

Once that was completed, the Regional Mentors enjoyed a lunch together and an opportunity to share what we do to support the user groups in our regions. I particularly enjoyed the fact that I was able to spend some time with Regional Mentors from Germany, Holland, and Portugal. This highlighted further the international scope and reach of PASS. This was followed by the Chapter Leaders meeting. That meeting was held as a series of round tables that the chapter leaders could move through. I was working at the table focused on leadership with Ryan Adams ( B | T ) from the North Texas SQL Server User Group – NTSSUG. We had a number of good conversations around building leadership teams for user groups and what is needed to have an effectively led user group. Check out the NTSSUG site for the by-laws sample we discussed multiple times.

All of these meetings were followed by the Welcome Reception, of which I only caught a small portion as I was trying to drop my backpack at my hotel and work my way back. After the reception, I headed out to Bush Garden with a number of others. During that time, Jes Borland ( T ) managed to get a microphone in my hand and I had my first round of karaoke. Yes, I actually did sing and had fun doing it. All in all, it was a good time had by all.

Wednesday – SQL Around the World, Microsoft Announcements, Tabular Models, and Magenic Team Dinner

This was the true kickoff to the event. For many, the keynote was the kickoff. Before that even began, I was working in the Community Zone encouraging people to participate in the SQL Around the World community activity. It was a great game: you needed to find 10 people from 10 different countries and learn something interesting about them or their country. I found a dancer and someone who had ridden a cheetah as a kid. I also surprised someone from the Czech Republic when she mentioned her home town, only to have me let her know I had been to her home town many years ago. It was a fun conversation. If you did this and have other cool stories, let me know. It was amazing that well over 50 different countries were represented at PASS.

Next, Ted Kummert delivered the first keynote session of the day. His keynote was filled with announcements concerning SQL Server, including the following:

  • Hekaton: the project code name for a new in-memory OLTP engine
  • The Columnstore Index will be updateable
  • Next version of PDW will be out in H1 2013
  • Polybase: allows you to query across multiple types of data sources such as SQL Server and Hadoop with T-SQL
  • DAX Queries will be able to query SSAS Cubes

He also highlighted some recent announcements related to the SQL Server stack:

  • Microsoft HDInsight Server CTP: Hadoop for Windows Server
  • Windows Azure HDInsight Service Preview: Hadoop for Azure
  • Power View and PowerPivot fully implemented in Excel 2013

After the keynote, I hit a session on Big Data and Hive put on by SQLCAT, which was very informative. My big takeaway was to use EXTERNAL tables, not INTERNAL tables, when working with Hive (see the sketch below). I then went to do final prep for my tabular model session. In this session, “Building a Tabular Model Database”, I present on what tabular and in-memory are, and then proceed to open up a Visual Studio project and create a database. I think it went well, and the attendees seemed to enjoy the upbeat nature of an end-of-day session. The night wrapped up with dinner with the Magenic team (7 of us). It was a good chance to grow relationships across offices from around the country.
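For anyone curious what that Hive takeaway looks like, here is a minimal HiveQL sketch under stated assumptions: the table name, columns, and HDFS path are all hypothetical. An EXTERNAL table leaves the underlying files where they are, so dropping the table removes only the metadata, not the data.

-- Hypothetical web log table over files already sitting in HDFS
CREATE EXTERNAL TABLE web_logs (
    log_date  STRING,
    user_name STRING,
    url       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/weblogs';

With an INTERNAL (managed) table, DROP TABLE would also delete the files under the table's warehouse directory, which is usually not what you want when other tools share the data.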

Thursday – Community Zone and DAX as a Query Language

Thursday was a fairly low key day for me. Once again I spent time in the Community Zone. I had the opportunity to talk with a few people on creating a user group in their area. As always, I like to see people interested in growing their local community.

I also attended Alberto Ferrari’s session on DAX. I think the biggest surprise to me was that you can now query DAX directly from SSMS. I am not sure that I am convinced that it is a full query language yet, but it is definitely closer. The key to it all is the EVALUATE expression, which allows you to create the DAX query, ironically, in the MDX query window. Here is just a taste of DAX as a query:

EVALUATE
    'DimCurrency'
ORDER BY
    'DimCurrency'[CurrencyAlternateKey]

What I found interesting is that you can create columns, build measures, and perform many other operations against the tabular model using DAX. In the end, it will not increase the memory used for storage because it is all calculated at query time. Look for more on this in later blog posts as I delve into in-memory storage and usage when working with DAX.

Friday – More Community Zone, HDInsight, Paul White, and Window Functions

Last day. I spent more time in the Zone. I really did enjoy my time there as I continued to meet more people. I was even present when a contract was completed for the Shanghai user group. Very cool indeed. I then attended a session on HDInsight by Mike Flasco from Microsoft. This is very cool stuff, as you can create a simple Hadoop cluster on your desktop to test the technology. Microsoft and Hortonworks have done a great job of bringing Hadoop data into the Microsoft stack.

On my way to present my final session of the day and the conference, I stopped in for the second half of Paul White’s ( B ) optimization presentation. In a word (or two), mind-blowing! Wow, who knew that the optimizer did all those things? I was highly impressed and think he should look at a precon on the subject next year. Unlike some three-hour presentations, he could have gone longer as he was not stretching his content out. Nice work, Paul. So, I got to follow that with a presentation on window functions in T-SQL. For the second time, I had the last slot of the last day. I think this presentation went well even though we were all worn out from a content-filled week. It was fun to try some ideas from the audience in the demos. That always makes for a more interesting demo. I will be doing a follow-up post on what I learned from some of the attendees on the subject as well, proving once again this is a user community event. We all have something to contribute! (If you attended this session, you will find links to the blogs on the subject here.)
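For a flavor of what the window function session covered, here is a small T-SQL sketch of a running total using the window frame support added in SQL Server 2012. The AdventureWorks table is just an assumed example; any table with a date and an amount works the same way.

SELECT
    SalesOrderID,
    OrderDate,
    TotalDue,
    -- Running total ordered by date; the ROWS frame clause is new in SQL Server 2012
    SUM(TotalDue) OVER (
        ORDER BY OrderDate, SalesOrderID
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS RunningTotal
FROM Sales.SalesOrderHeader;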

What’s Next?

Coming in April is the new Business Analytics conference in Chicago, followed by the PASS Summit in Charlotte, North Carolina. Of course, your local user groups will continue to meet, with regional SQL Saturdays sprinkled throughout the year as well. How will you participate and contribute in 2013? We look forward to seeing you all again soon.





Why I am excited about SQL Server 2012 (Part 2)

28 03 2012

Earlier this month I published a blog entry on this same subject.  In honor of the local Minneapolis launch event, I decided to expand the list.  You can find five more reasons I am excited over on Magenic’s blog.

Here is the link and enjoy SQL Server 2012.

http://magenic.com/Blog/WhyIAmExcitedaboutSQLServer2012Part2.aspx





SQL PASS Summit–Day 4 (BI Sessions)

16 10 2011

Here is the summary of the sessions I attended on my 4th day at the Summit. (Technically, this is the second day of the Summit.)  I went to four sessions, all of which were in the BI tracks.

Top 10 Things You Didn’t Know about SSIS by Matt Masson, Microsoft (@mattmasson)

Be sure to check out the Tech Ed Video with CTP 3 enhancements.

Two New Concepts in SQL Server 2012 Integration Services: Projects and Parameters.  SSIS projects now compile down to a single .ispac file (pronounced “ice pack”).  This will allow you to deploy packages as a project.  Parameters have been added to better support configurations, and they are exposed by the package execution UIs.

The SSIS Catalog.  A SQL Server user database called “SSISDB” is where the catalog is stored.  This provides a central place for configuration, including the new parameters and connection managers.  Security is also much improved, as encryption is now managed at the server level.  Logs and reporting will also be recorded in the catalog.
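As a taste of how the catalog is driven from T-SQL, here is a hedged sketch of running a cataloged package with a parameter override. The folder, project, package, and parameter names are hypothetical; the stored procedures are the ones that ship with the SSISDB catalog in SQL Server 2012.

DECLARE @execution_id BIGINT;

-- Stage an execution for a package deployed to the catalog
EXEC SSISDB.catalog.create_execution
    @folder_name  = N'ETL',
    @project_name = N'DataWarehouse',
    @package_name = N'LoadDimCustomer.dtsx',
    @execution_id = @execution_id OUTPUT;

-- Override a package parameter for this run (object_type 30 = package parameter)
EXEC SSISDB.catalog.set_execution_parameter_value
    @execution_id    = @execution_id,
    @object_type     = 30,
    @parameter_name  = N'SourceServer',
    @parameter_value = N'SQLPROD01';

EXEC SSISDB.catalog.start_execution @execution_id;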

Top 10 List:

1. Change Data Capture and 2. ODBC Support through the Attunity partnership.

CDC for Oracle databases will result in better integration, similar to SQL Server CDC.  CDC components in SSIS will work with both SQL Server and Oracle.  Three new tasks have been added to SSIS: the CDC Control Task, the CDC Source (reads changes), and the CDC Splitter (handles inserts/updates/deletes).
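For context, change data capture has to be enabled on the source SQL Server database before these components have anything to read. A minimal sketch, assuming a hypothetical SourceDB database and dbo.Customer table:

USE SourceDB;
GO

-- Enable CDC at the database level
EXEC sys.sp_cdc_enable_db;
GO

-- Track changes on one table; the SSIS CDC Source reads from the change table this creates
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customer',
    @role_name     = NULL;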

ODBC components have also been added to support many of the non-Microsoft data sources including bulk load capabilities with the ODBC destination object.

3. Connection Manager

We now have connection managers that are shared at the project level.  Connection managers get marked as offline if the data source is not available or if you want to take them offline.  The capability to share an in-memory cache across packages at the project level has also been added.

4. Flat File Source Improvements

Varying numbers of columns and embedded qualifiers are now supported.

5. Package Format Changes

The XML format has been updated to use attributes.  Layout information can now be ignored, and there is better support for merging: lineage IDs have been replaced with a refid to make merges cleaner.  Merging packages is still not recommended, but it is better supported.

6. Visual Studio Designer Changes

Parameters are in a Visual Studio design tab, which allows you to set up defaults for various environments.

7. Script Components

Script components use compiled .NET IL for better performance.  Script component debugging is in 2012, with the ability to debug script components and see the values of the data in the pipeline.  All script components support debugging.

8. Troubleshooting and Logging

Because all packages use the SSIS catalog, all error and performance logging, including task execution timing and row counts, is recorded in SSISDB.  Much of this information will be available in built-in reports.
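As a hedged example of what that logging enables, a query along these lines pulls the error messages for one operation out of the catalog; the operation ID is hypothetical, and message_type 120 is the error level in catalog.event_messages.

SELECT message_time,
       message_source_name,
       message
FROM SSISDB.catalog.event_messages
WHERE operation_id = 12345      -- hypothetical operation ID
  AND message_type = 120        -- 120 = error
ORDER BY message_time;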

9. Data Taps

Data taps allow data in a pipeline to be dumped to a flat file, which makes it possible to troubleshoot packages with errors in production.  You can create data taps in SSMS using T-SQL.
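Here is a hedged sketch of adding a tap to a staged execution, reusing an @execution_id from a catalog.create_execution call like the one sketched earlier. The package path and data flow path string are hypothetical, and the tap must be added before catalog.start_execution runs.

EXEC SSISDB.catalog.add_data_tap
    @execution_id            = @execution_id,
    @task_package_path       = N'\Package\Data Flow Task',
    @dataflow_path_id_string = N'Paths[OLE DB Source.OLE DB Source Output]',
    @data_filename           = N'CustomerTap.csv';

The file lands in the DataDumps folder under the SSIS installation on the server.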

10.  Server Management using PowerShell

The whole SSIS catalog can be managed or accessed through PowerShell or T-SQL.  This will allow you to automate or script out deployments and frequently used tasks.

You can do a lot of these tasks right in SSMS as well, using menu and shortcut functionality.  You can query SSISDB for performance counters while a package is executing, and for package execution progress at the task and component levels.
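For example, a quick status check against the catalog might look like this sketch (the execution ID is hypothetical; status 2 means running):

-- Currently running executions
SELECT execution_id,
       folder_name,
       project_name,
       package_name,
       start_time
FROM SSISDB.catalog.executions
WHERE status = 2;   -- 2 = running

-- Task-level timings for one execution
SELECT e.executable_name,
       s.start_time,
       s.end_time
FROM SSISDB.catalog.executable_statistics AS s
JOIN SSISDB.catalog.executables AS e
    ON s.executable_id = e.executable_id
WHERE s.execution_id = 12345;   -- hypothetical execution ID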

There is built-in reporting.  You start from the Integration Services Dashboard, which has a bunch of stats on the front page.  It includes a number of built-in reports with performance data and other relevant topics.  Jamie Thomson has a CodePlex report pack (http://sqlblog.com/blogs/jamie_thomson/archive/2011/09/04/ssis-reporting-pack-v0-2-now-available.aspx).  Check out the post from Jamie on the SSIS blog about the tables as well (http://sqlblog.com/blogs/jamie_thomson/archive/2011/07/16/ssis-logging-in-denali.aspx).

Errors related to connections bubble up, and there are drill-down reports that take you to the error messages in the tool. Logs have a configurable retention value. Versions of projects are also maintained, and the number of versions kept is configurable.
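Those retention settings live in the catalog properties. A hedged sketch of trimming the log retention window (RETENTION_WINDOW is the catalog property; 60 days is an arbitrary example):

-- Keep 60 days of operational history instead of the default 365
EXEC SSISDB.catalog.configure_catalog
    @property_name  = N'RETENTION_WINDOW',
    @property_value = 60;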

11.  Project Barcelona Metadata Discovery

This is a project that is coming soon, with demos available online.  This is not a metadata repository, and it is zero touch: targets get crawled and added to an index server.  Check out the project blog at http://blogs.msdn.com/b/project_barcelona_team_blog/.

Tier-1 BI in the World of Big Data – Denny Lee and Thomas Kejser, SQLCAT

There are two models to work with: Dimensional (UDM) and Tabular (formerly BISM).  UDM is the current large-scale engine. Tabular puts all of the data in memory, so it is not very suitable for large cubes.

New themes for workload scale:

  • Cheap storage at scale
  • Massive query scale (both size and concurrent)
  • Scaling ETL another order of magnitude
  • Scaled and integrated reporting/BI

The NoSQL ecosystem (e.g., the Facebook, Yahoo, Twitter, Google, and Amazon data offerings) is being used frequently.  Not only are people using these systems, they are building solutions on top of them.  This is why Microsoft has announced the Hadoop integration.

MapReduce systems (e.g., Hadoop) and RDBMS systems are actually complementary, not competing, in most cases.  Hadoop is about dynamic schemas; getting the data somewhere is the important piece.  Hadoop/NoSQL moves compute to the data and follows BASE: Basically Available, Soft state, Eventually consistent. Each row may have a different schema.

Traditional RDBMS systems move data to compute.  They answer the question quickly and are ACID compliant.

The balance between scale and performance is getting more important.  What do you want: a guaranteed response, even if you get it slowly?

The Hive connector is the first step in integration with the BI platform.  Hive is a data warehouse framework for Hadoop.  This is the starting point, not the end game, for adding structure to Big Data for consumption by other tools such as SSAS.  The HiveODBC driver will allow a cube to be loaded directly from Hadoop.

Yahoo – “Fast” Oracle Load – 24 TB cube. This is the largest known cube in the world.  Yahoo moves data from Hadoop into Oracle and processes the cube from there.

Check out the new SSAS 2008 R2 Performance Guide from SQLCAT at http://sqlcat.com/sqlcat/b/whitepapers/archive/2011/10/10/analysis-services-2008-r2-performance-guide.aspx.

What’s New in StreamInsight?  Isaac Kunen, Microsoft

Stream data processing means unbounded streams of data with queries lasting forever.  This engine handles the data coming through.  Temporal semantics can be added to the StreamInsight data to make it more understandable.

Event processing varies greatly based on business function – latency versus aggregated data rate.  StreamInsight is designed to handle high volumes of low latency data. Common use cases – Fraud Detection, Shipping, Sales, Clickstream Analysis, Security, Building Management.  Common scenario is pattern analysis.

StreamInsight Releases:

  • StreamInsight 1.2 is available today on premise.
  • StreamInsight on Azure currently private CTP, should open up later this year.

Temporal Semantics

  • 3 event types: interval (begin and end), point (“one tick”), and edge (only the beginning, with the end defined later) events.
  • CTI = Current Time Increment.  Up to this point in time, all events are processed.  The period reported runs from one CTI to the next.  CTIs can increment into the future, and the periods do not need to be equal lengths of time.  After the CTI is issued, the data is gone from memory.
  • You write declarative queries against StreamInsight in LINQ.  It is the actual query language for SI, not just a wrapper.
  • Select and Project Operators return events from a stream.
  • Join Operator returns events that overlap streams.
  • Snapshot Window: any time the event stream changes, a new window starts.  This window is completely data driven.

Resiliency through checkpointing has been added to StreamInsight.  This allows you to restart from specific points.  Performance Counters and Admin Logs have also been added to StreamInsight 1.2.  Check out his blog for more information: http://blogs.msdn.com/b/isaac/archive/2011/07/25/introducing-checkpointing-in-streaminsight-1-2.aspx.

Building the Perfect BI Semantic Model for Power View – Carl Rabeler (SQLCAT), Kasper de Jonge – Microsoft SSAS

You can import a PowerPivot workbook into SQL Server Data Tools as a Tabular SSAS project.  This will create a Tabular Model in SSAS.

Changes made in SQL Server Data Tools with tabular models change the deployed tabular SSAS model.  You should use a small set of data when developing the model because the data needs to be loaded in memory. You can add data mining models to the tabular model by creating a connection to an SSAS Multidimensional data source.  When you add a data source to the tabular model in design, it will add the data to the tabular model (the workspace database).  Columns can now be added into the model in any position.

There is an “effective user name” property in the connection string to handle the double-hop issue without Kerberos.  Setting the execution context to an account in the Reporting Services Data Services (RSDS) connection in SharePoint sets this property.  This functionality is currently available in SQL Server 2008 R2. In SQL Server 2012, this can be seen in the Profiler trace in the Discover events.

To add images to the tabular model, you can use an embedded image, a stored image (e.g., in the database), or a link.  If you use a link, be sure to set the ImageURL property to “true” for the column so Power View knows that it is an image.

The tabular model has a wealth of metadata options that will be carried forward to Power View.  For instance, if you set the Default Label and Default Image in the tabular model table, Power View will display these in many visualizations including Card View.

Be aware that in Power View, the refresh in the tool refreshes data, whereas an IE refresh will reload the metadata.

Time intelligence functions in DAX allow you to create calculations such as Year To Date.  However, if your tabular model uses surrogate keys to map to the Date table, you need to mark the table as a Date table and specify the date column to use.  This is needed to make sure the time intelligence functions work correctly.

The RANKX(Table, Expression, [Value], [Order], [Ties]) function was illustrated to show how to rank items within a table: Ranking:=RANKX(ALL(Products), SUM(Margin)).  Watch out for situations where you are using a complex key.

DirectQuery functionality will allow you to directly query SQL Server relational data.  There is a list of limitations to using DirectQuery.  Model with it carefully.





SQL PASS Summit–Final Day (Keynote)

14 10 2011

I got to David DeWitt’s keynote a bit late, so this will not cover as much as normal.  (For the record, I missed a lot in 5 minutes.)  Here are the notes I was able to get, starting at the NoSQL discussion.

NoSQL does not mean NO to SQL.
NoSQL should mean Not Only SQL

Two major NoSQL systems:
1. Key/Value stores – NoSQL OLTP
2. Hadoop – NoSQL data warehousing

Two Universes are the New Reality:
1. Relational DB Systems
2. NoSQL Systems

Relational DB systems are no longer the only game in town. The world has changed.  However, this is not a paradigm shift; SQL Server and other relational database systems will NOT go away.  (I call this job security.)  These systems have different purposes, so the question is when to use each one, not which one is the only one I should use.

The next part of the discussion was about Hadoop.  It all started with Google.  Hadoop = HDFS (file system) + MapReduce (programming framework).  It already has a huge ecosystem.  Here is some discussion of those components:

  • HDFS – underpins the entire Hadoop ecosystem.  It is scalable to 1000s of nodes and assumes failures are common.  Replication factors are used to handle multiple failures by distributing the same block of data to 3+ nodes.  There is no use of mirroring or RAID, in order to reduce cost and complexity.  On the negative side, you have no idea where your data is, making it hard to get good performance.
  • MapReduce – This is the programming framework that supports analyzing data sets in HDFS.  Essentially, it takes a large problem and divides it into subproblems, performs the same function on each subproblem (map), and finally combines the output (reduce). The JobTracker tracks the tasks in MapReduce; the NameNode tracks the data in the blocks.  The core value of MapReduce is divide and conquer.  It is highly distributed, making it fault tolerant.  One of the cons is the lack of a schema, which makes sharing data and optimizing difficult.
  • HiveQL and Pig – Facebook produced a SQL-like language called HiveQL, and Yahoo produced a more procedural language called Pig. Both were developed to hide the complexity of building MapReduce functions.  HiveQL reduced 4 pages of MapReduce code to about 10 lines (see the sketch after this list).  Awesome demo.  HiveQL takes the best features of SQL and combines them with MapReduce.
  • Sqoop – A command-line utility for data loads between Hadoop and an RDBMS.  Data has to be moved between the structured and unstructured systems because the unstructured data has not been organized or cleansed.  It is limited by the fact that each map query requires a table scan on the relational system, which performs poorly.
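To make the HiveQL point concrete, here is a hedged sketch of the kind of aggregation that would otherwise take pages of MapReduce code; the web_logs table is hypothetical:

SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;

Hive compiles this into MapReduce jobs behind the scenes: the GROUP BY becomes the shuffle key and the COUNT becomes the reduce function.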

In summary, we will live in two worlds: parallel DB systems and Hadoop.  Relational databases and Hadoop are complementary systems, not competing systems.

I was unable to do this keynote justice.  The amount of information that he covered was immense and worthwhile.  Check out this slide deck at http://pages.cs.wisc.edu/~dewitt/includes/passtalks/passtalks.html. I plan to watch this keynote a few more times to understand more of what was said.  Hopefully this taste of Hadoop and NoSQL was helpful.







