Microsoft Data Appliances Help Simplify MSBI Projects

As some of you know, I am really excited about the data appliances Microsoft and HP have released this year.  I really believe that they make it even easier to get MSBI projects up and running while minimizing the complexity of building out servers. Read more about my thoughts on this in an article I wrote for Magenic:  Microsoft Data Appliances Lower the Entry Bar for MSBI Adoption.


SQL PASS Summit–Final Day (Keynote)

I got to David DeWitt’s keynote a bit late, so this will not cover as much as usual.  (For the record, I missed a lot in 5 minutes.)  Here are the notes I was able to get, starting at the NoSQL discussion.

NoSQL does not mean NO to SQL.
NoSQL should mean Not Only SQL.

Two major NOSQL systems:
1. Key/Value stores – NOSQL OLTP
2. Hadoop – NOSQL data warehousing
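To make the first category concrete, here is a toy sketch of the key/value model behind NoSQL OLTP stores. The class and method names are hypothetical illustrations, not any real product's API; the actual systems DeWitt described add replication, persistence, and partitioning across many nodes.

```python
# Toy key/value store illustrating the NoSQL OLTP model:
# no schema, no joins -- just key -> value reads and writes.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # OLTP-style write: the value is opaque to the store.
        self._data[key] = value

    def get(self, key, default=None):
        # Single-key lookup, the core operation these systems optimize.
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42", {"name": "Ada", "cart": ["book"]})
print(store.get("user:42")["name"])  # Ada
```

The point of the model is that every operation touches exactly one key, which is what lets real stores shard the keyspace across nodes without coordination.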

Two Universes are the New Reality:
1. Relational DB Systems
2. NOSQL Systems

Relational DB systems are no longer the only game in town. The world has changed.  However, this is not a paradigm shift; SQL Server and other relational database systems will NOT go away.  (I call this job security.)  These systems have different purposes, so the question is when to use each one, not which one is the only one I should use.

The next part of the discussion was about Hadoop.  It all started with Google.  Hadoop = HDFS (file system) + MapReduce (programming framework).  It already has a huge ecosystem.  Here is some discussion on those components:

  • HDFS – underpins the entire Hadoop ecosystem.  Scalable to 1000s of nodes, it assumes failures are common.  Replication factors are used to handle multiple failures by distributing the same block of data to 3+ nodes.  There is no use of mirroring or RAID, in order to reduce cost and complexity.  On the negative side, you have no idea where your data is, making it hard to get good performance.
  • MapReduce – This is the programming framework to support analyzing data sets in HDFS.  Essentially, it takes a large problem, divides it into subproblems, performs the same function on each subproblem (map), and finally combines the output (reduce).  The JobTracker tracks the tasks in MapReduce; the NameNode tracks the data in the blocks.  Core value of MapReduce: divide and conquer.  It is highly distributed, making it fault tolerant.  One of the cons is a lack of schema, which makes sharing data and optimizing difficult.
  • HiveQL and Pig – Facebook produced a SQL-like language called HiveQL and Yahoo produced a more procedural language called Pig. Both were developed to hide the complexity of building MapReduce functions.  HiveQL reduced 4 pages of MapReduce code to about 10 lines.  Awesome demo.  HiveQL takes the best features of SQL and combines them with MapReduce.
  • Sqoop – Command-line load utility from Microsoft for Hadoop-to-RDBMS data loads.  Data has to be moved from the unstructured side to the structured side because the unstructured data has not been organized or cleansed.  It is limited by the fact that each map query requires a table scan on the relational system, which performs poorly.
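The map/reduce flow described above can be sketched in plain Python. This is only a single-process illustration of the divide-and-conquer idea (the function names are mine, not Hadoop's); real Hadoop jobs implement mapper and reducer classes that run across many nodes, with HDFS and the JobTracker handling distribution.

```python
from collections import defaultdict

def map_phase(document):
    # Map: turn one subproblem (a document) into (key, value) pairs.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key so each reducer sees one key's values.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine each key's values into the final output.
    return {key: sum(values) for key, values in grouped.items()}

# Each document is mapped independently (and, in Hadoop, in parallel),
# then the outputs are combined.
documents = ["SQL is not dead", "NoSQL is not only SQL"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["sql"])  # 2
```

Because every map call is independent, node failures can be handled by simply re-running the lost subproblems elsewhere, which is where the fault tolerance mentioned above comes from.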

In summary, we will live in two worlds – parallel DB systems and Hadoop.  Relational databases and Hadoop are complementary systems, not competing systems.

I was unable to do this keynote justice.  The amount of information that he covered was immense and worthwhile.  Check out this slide deck at http://pages.cs.wisc.edu/~dewitt/includes/passtalks/passtalks.html. I plan to watch this keynote a few more times to understand more of what was said.  Hopefully this taste of Hadoop and NoSQL was helpful.