Category Archives: Parallel Data Warehouse

Azure Data Warehouse and 60 Servers

In today’s segment of Azure Every Day, I’d like to talk about Azure Data Warehouse and to help you understand some of the underlying architecture and its impact on the workloads you’re planning to put into Azure. Azure Data Warehouse is a parallel data warehousing solution in the cloud, useful for large data workloads.

When you turn on Azure Data Warehouse for a workload, it will instantiate 60 SQL Server databases underneath, which means it’s looking at how to parallelize your workload across 60 databases. This is the standard Platform as a Service (PaaS) implementation. Unlike an appliance or APS, you don’t have control over the number of servers that you run. It is designed to handle this without you having to do anything to manage that workload.

So, when starting to use Azure Data Warehouse, there are 3 aspects that you should understand as to how it will affect how you plan and decide if the workload makes sense in Azure DW.

1. Tables Use Clustered Columnstore Indexes by Default

The first thing to realize is that all the tables going into Azure DW are, by default, cluster columnstore indexes. These indexes are great for compression and performance, but typically need a million rows in them before they become compressed. So, for example, if you think about a distributed workload across 60 databases, you need 60 million rows before you can see an effective compression solution or take advantage of the compression utilities in a cluster columnstore.

2. Tables Are Distributed, Use Partitions Sparingly

Be aware when you’re partitioning a table, that the table is already distributed. Hence, if you have a scenario where a table already has 100 partitions, that creates 6000 partitions across the Azure instance. To break this down further, you’re going to need about 6 billion rows before you can take advantage of the columnstore indexes, including compression. Therefore, if you’re not running a large system, partitioning may not give you the same benefit as on a SQL Server or Azure SQL database.

3. Parallelized Data Loads

Lastly, realize how the data is loaded into the data warehouse. The Azure Data Warehouse, with its 60 databases, makes it a great platform for loading data because you can parallelize the data load across those 60 databases. One tool that helps you is CTAS, CREATE TABLE AS,  SQL statements that were designed for use within Azure DW and can distribute workloads across partitions and load in parallel, resulting in a very fast load.

Used in conjunction with PolyBase, you can pull data from storage, like Azure Blob Storage or Azure Data Lake, and pull that data through and use CTAS to write that data out. A final thought on this scenario is if you use Azure Data Factory and its capabilities to further use that parallelism. With this, you can load a staging table or make sure it sends 60 files at a time so every server is being used.


PASS Summit 2012 Wrap Up

Wow, what a week. Once again, PASS put on a great event that provided much in the way of events and training for the SQL Server community. If you followed my countdown you know some of what I love about PASS. Last year I blogged everyday but I did not do that this week. So, what was different for me? Well, for one I volunteered much more this year than last and I was privileged to speak twice. I spent more time meeting new people and catching up with friends and that was great as well. Enjoy my wrap up from my week.

Tuesday – Leadership Meetings, Welcome Reception, and some Karaoke

Before the event officially kicked off, I joined community leaders from around the world for a series of leadership meetings. First we had a meeting on SQL Saturdays which was an opportunity to see the immense growth of these free training events around the U.S. and throughout the world. What a great opportunity for SQL Server professionals to improve their skills and for those passionate about the community to improve their abilities by leading these events. Many ideas were shared among the team including a panel on how to effectively run a SQL Saturday on a tight budget.

Once that was completed, the Regional Mentors enjoyed a lunch together and an opportunity to share what we do to support the user groups in our regions. I particularly enjoyed the fact that I was able to spend some time with Regional Mentors from Germany, Holland, and Portugal. This highlighted further the international scope and reach of PASS. This was followed by the Chapter Leaders meeting. That meeting was held as a series of round tables that the chapter leaders could move through. I was working at the table focused on leadership with Ryan Adams ( B | T ) from the North Texas SQL Server User Group – NTSSUG. We had a number of good conversations around building leadership teams for user groups and what is needed to have an effectively led user group. Check out the NTSSUG site for the by-laws sample we discussed multiple times.

All of these meetings were followed up with the Welcome Reception, which I made a small portion of as I was trying to drop my backpack at my hotel and work my way back there. After the reception, I headed out to Bush Gardens with a number of others. During that time, Jes Borland ( T ) managed to get a microphone in my hand and I had my first round of karaoke. Yes, I actually did sing and had fun doing it. All-in-all, it was a good time had by all.

Wednesday – SQL Around the World, Microsoft Announcements, Tabular Models, and Magenic Team Dinner

This was the true kick off to the event. For many, they looked at the key note as the kick off. Before that even began, I was working in the Community Zone encouraging people to participate in the SQL Around the World community activity. It was a great game. You needed to find 10 people from 10 different countries and find out something interesting about them or their country. I found a dancer and someone who had ridden a cheetah as a kid. I also surprised someone from the Czech Republic when she mentioned her home town only to have me let her know I had been to her home town many years ago. It was a fun conversation. If you did this and have other cool stories let me know. It was amazing as well over 50 different countries were represented at PASS.

Next, Ted Kummert had the first keynote session of the day. His keynote was filled with announcements concerning SQL Server including the following:

  • Hekaton: the project code name for a new in-memory OLTP engine
  • The Columnstore Index will be updateable
  • Next version of PDW will be out in H1 2013
  • Polybase: allows you to query across multiple types of data sources such as SQL Server and Hadoop with T-SQL
  • DAX Queries will be able to query SSAS Cubes

He also highlighted some recent announcements related to the SQL Server stack:

  • Microsoft HDInsight Server CTP: Hadoop for Windows Server
  • Windows Azure HDInsight Service Preview: Hadoop for Azure
  • Power View and PowerPivot fully implemented in Excel 2013

After the keynote, I hit a session on BigData and Hive which was put on by SQL CAT and very informative. My big takeaway was to use EXTERNAL tables not INTERNAL tables when working with Hive. I then went to do final prep for my Tabular Model session. In this session, “Building a Tabular Model Database”, I present on what tabular and in-memory is, and then proceed to open up a Visual Studio project and create a database. I think it went well and the attendees seemed to enjoy the upbeat nature for an end of day session. The night wrapped up with dinner with the Magenic team (7 of us). Good chance to grow relationships across offices from around the country.

Thursday – Community Zone and DAX as a Query Language

Thursday was a fairly low key day for me. Once again I spent time in the Community Zone. I had the opportunity to talk with a few people on creating a user group in their area. As always, I like to see people interested in growing their local community.

I also attended Alberto Ferrari’s session on DAX. I think the biggest surprise to me was that you can now query DAX directly from SSMS. I am not sure that I am convinced that it is a full query language yet, but it is definitely closer. The key to it all is the EVALUATE expression which allows you to create the DAX query ironically in the MDX window. Here is just a taste of DAX as a query:



What I found interesting is that you can create columns, build measures, and perform many other operations against the tabular model using DAX. In the end, it will not increase the memory used as storage as it is all calculated. Look for some more on this in later blog posts as I delve more into the in-memory storage and usage when working with DAX.

Friday – More Community Zone, HDInsight, Paul White, and Window Functions

Last day. I spent more time in the Zone. I really did enjoy my time there as I continued to meet more people. I was even present when a contract was completed for the Shanghai user group. Very cool indeed. I then attended a session on HDInsight by Mike Flasco from Microsoft. This is very cool stuff as you can create simple Hadoop cluster on your desktop to test the technology. Microsoft and Hortonworks have done a great job of bringing Hadoop data into the Microsoft stack.

On my way to present my final session of the day and the conference, I stopped in for the second half of Allen White’s ( B ) optimization presentation. In a word (or two), mind-blowing! Wow, who knew that the optimizer did all those things? I was highly impressed and think he should look at a precon on the subject next year. Unlike some three hour presentations, he could have went longer as he was not stretching his content out. Nice work Paul. So, I got to follow that with a presentation on Window Functions in T-SQL. For the second time, I had the last slot of the last day. I think this presentation went well even though we were all worn out from a content-filled week. It was fun to try some ideas from the audience in the demos. That always makes for a more interesting demo. I will be doing a follow up post on what I learned from some of the attendees on the subject as well, proving once again this is a user community event. We all have something to contribute! (If you attended this session, you will find links to the blogs on the subject here.)

What’s Next?

Coming in April is the new Business Analytics conference in Chicago followed by the PASS Summit in Charlotte, North Carolina. Of course, your local user groups will continue to meet with regional SQL Saturdays sprinkled throughout the year as well. How will you participate and contribute in 2013? We look forward to seeing you all again, soon.

Microsoft Data Appliances Help Simplify MSBI Projects

As some of you know, I am really excited about the data appliances Microsoft and HP have released this year.  I really believe that they make it even easier to get MSBI projects up and running while minimizing the complexity of building out servers. Read more about my thoughts on this in an article I wrote for Magenic:  Microsoft Data Appliances Lower the Entry Bar for MSBI Adoption.

SQL PASS Summit Final Day (Sessions)

Here is a summary of the final day of sessions I attended at PASS.

Vertipaq vs OLAP – Change Your Data Modeling Approach by Marco Russo

This session discussed the role of each data model type in your Business Intelligence platform.

The Many-to-Many Revolution 2.0 is being released which will cover BISM multidimensional and BISM Tabular.  It is available now from

The relational model for the multidimensional cube is a star schema.  It provides the best performance and functionality.

OLAP (Multidimensional) Vertipaq (Tabular)
Dimensional modeling Relational modeling
   Facts, dimensions    Tables
   Complex Relationships    Basic Relationships (1:N)
   Script    Calculated columns
   Powerful, Complex    Measures

OLAP and MDX is much more complex than Vertipaq and DAX.  That makes the barrier to entry or learning curve with DAX and Vertipaq much lower.

The dimensional model is designed to be faster and to store aggregations in the relational world.  For solutions with Vertipaq, is the datamart still required?  Can Vertipaq replace the dimensionally modeled datamart?

Marco went through a number of scenarios which show the difference in the performance between DAX, MDX, and SQL for aggregating and analyzing data.  At this point, it is too early to say that OLAP is old and will die.  We have two tools with different purposes and areas of strengths.  For instance, Vertipaq can handle distinct count operation much easier and faster than OLAP.  Marco keeps noting that Vertipaq and DAX are “fast enough” which seems to mean that expensive operations are expensive in both models Vertipaq and OLAP.

Key take away: Adapt the technology to the data model, not the data model to the technology.

PDW Architecture and Internals – Brian Walker

Only place you can put your software on PDW is on the Landing Zone.

Flow of data in PDW:
Source – Landing Zone Files – Data Loader – Compute Nodes

You can get up to 654 TB of backup space on the Backup Node.  Yes, that is 654 Terabytes!

The bread and butter of PDW is the Compute Nodes and Storage Nodes. Each compute node is running a SQL Server 2008 EE instance. SQL is the primary interface to the compute nodes.  TempDB lives on the Compute Node as well because it serves instance specific functions.

In the storage node, SQL Server file groups are created by the appliance to support a compute node.  This includes filegroups for distributed data, replicated data, and logs.

The Admin console is available through the web browser with tabs for Alerts, Sessions, Queries, Loads, Backup/Restores, Appliance State, Locks and Performance Monitor.

There are now add ins for System Center management with PDW.

PDW supports a hub and spoke architecture.  You can connect SQL Server and Analysis Services to PDW so you can offload data to different consumers.  You can treat the data warehouse on PDW as the single source of truth that can support analytics and data marts and related functionality in a spoke.

Integration with PDW:

  • SQL Server – Remote table copy from MPP to SMP
  • PDW ADO.NET drivers for SSAS, SSIS, .NET, SSRS

Data layout approaches that support Shared Nothing querying which is the ability for the appliance to request discreet query subsets from an originating query.

  • Replicated – full copy on each discrete PDW node (good for dimensions)
  • Distributed – uniformly distributed across all nodes (good for fact tables).  Uses hash keys to distribute the data across the storage nodes.
Data Warehousing: From SMP to MPP using SQL Server Parallel Data Warehouse – Dandy Weyn (Sr. Tech Project Mgr) and Brian Mitchell (Sr Premier Field Engineer)

Simplified Data Warehouse Offerings:

  • Enterprise Edition (software only)
  • Fast Track Data Warehouse (reference architecture) – includes software, configuration guidelines, and hardware;
  • BDW Appliance (integrated architecture)
  • Parallel Data Warehouse (DW appliance)

SQL Server Appliances:

  • HP Enterprise Data Warehouse Appliance (PDW)
  • Dell Parallel Data Warehouse Appliance (PDW)
  • HP Business Data Warehouse Appliance
  • HP Business Business Application Appliance (inc SharePoint)

DWLoader or SQL Server Integration Services can be used to load data into the PDW.  DWLoader will usually be the faster method because it has been optimized for PDW.

This is a Shared Nothing Architecture unlike Oracle RAC which is a shared everything architecture.

You need to choose your distribution key very well.  It will make sure that the table is properly distributed across nodes.

Largest SQL Azure Projects in the World & How Do They Scale – Kevin Cox, Michael Thomassy (SQLCAT)

Top stats:

  • Largest Sharded Database is 20 TB
  • Largest number of databases for 1 app is 500
  • Most users on 1 app is 3 million
  • 2k block Read, 8 node end-point – 34,000 ops/sec

SQL Server is the core DB Engine for SQL Azure.

Case Studies

Flavorous – Ticket Sales Vendor

This is a truly elastic solution.  They only keep databases up for the duration of the ticket sale.  (NOTE: Checkout Chaos Monkey).  All of the applications needed to be built with SQL Azure capabilities in mind.  They were able to handle 3 million tickets in 6 minutes including 150,000 tickets in 10 seconds.

Bing Games

Bing Game tracking and scoring system.  398 SQL Azure databases supporting 30 million unique users per month, 200,000 concurrent users with 99.975% uptime.

Why SQL Azure?

  • Faster than Azure Table Storage – for entities with large number of attributes
  • Very low learning curve – prototyped and written in less than 4 days with 1 developer
  • Testability – mostly compatible with SQL Server and easy to prepopulate with millions of records.

Sepia Labs –

They are now in the process of closing their on-premise data center and they are moving everything to the cloud.  This is a private social network for mobile phones.  (Side note from them: “Windows 7 is by far the easiest mobile platform to build on.”)  They picked Azure table storage and not SQL Azure except for some analytical data.

Why Table Storage?

  • Less expensive than SQL Azure
  • Speed approximately the same for small entities (very few columns; i.e. < 10)
  • High size limit – 100 TB total between Table, Blob store, Queue per storage account.

MYOB – Account Software ISV

They service small businesses with accounting solutions.  They chose the cloud because of the types of offerings they have.  This allowed them to provide massive scalability and high availability.  They scale by a sharding technique built by the developers.  They are considering Federations in SQL Azure for the future.

Federations in SQL Azure will allow you to do less development to support sharding within SQL Azure.  It is designed to help customers take care of the following:

  • Unlimited Scalability – distribute workload equally to all nodes, scale out which is not scale up.  Scale applications horizontally to many nodes and allow applications to dynamically deliver data.
  • Rich Elasticity – expand and contract without downtime.
  • Simplified Tenancy Model – whether you have multiple tenants per database, one tenant per database, or multiple databases per tenant.

Thus concludes my week at PASS Summit 2011.

SQL PASS Summit–Final Day (Keynote)

Got to David DeWitt’s Keynote a bit late so this will not cover as much as normal.  (For the record, I missed a lot in 5 minutes.)  Here are some of the notes starting at the NOSQL discussion I was able to get.

NoSQL does not mean NO to SQL.
NoSQL should mean Not Only SQL

Two major NOSQL systems:
1. Key/Value stores – NOSQL OLTP
2. Hadoop – NOSQL data warehousing

Two Universes are the New Reality:
1. Relational DB Systems
2. NOSQL Systems

Relational DB Systems are no longer the only game in town. The world has changed.  However, this not a paradigm shift, SQL Server and other relational databases systems will NOT go away.  (I call this job security.)  These systems have different purposes, so the question is when to use them not which one is the only one I should use.

The next part of the discussion is about Hadoop.  It all started with Google.  Hadoop = HDFS (file system) + MapReduce (prorgamming framework).  It already has a huge Ecosystem.  Here is some discussion on those components:

  • HDFS – underpins the entire Hadoop ecosystem.  Scalable to 1000s of nodes, assumes failures are common.  Replication factors are used to handle multiple failures by distributing the same block of data to 3+ nodes.  No use of mirroring or RAID in order to reduce cost and complexity.  On the negative side, you have know idea where your data is making it hard to get good performance.
  • MapReduce – This is the programming framework to support analyzing data sets in HDFS.  Essentially it take’s a large problem and divides it into subproblems (map), next perform the same function on all subproblems (map), finally combining the output (Reduce). The JobTracker tracks the tasks in MapReduce, the NameNode tracks the data in the blocks.  Core value of MapReduce:  Divide and Conquer.  It is highly distributed making it fault tolerant.  One of the cons, is a lack of schema which makes sharing data and optimizing difficult.
  • HiveQL and Pig – Facebook produced a SQL-like language called HiveQL and Yahoo produced a more procedural language called Pig. Both were developed to hide the complexity of building MapReduce functions.  HiveQL reduced 4 pages of MapReduce code to about 10 lines.  Awesome demo.  HiveQL takes the best features of SQL and combines them with MapReduce.
  • Sqoop – Command line load utility from Microsoft for Hadoop to RDBMS data loads.  Data has to be moved from structured to unstructured because the unstructured data has not been organized or cleansed.  Limited due to the fact that each map query requires a table scan on the relational system which performs poorly.

In summary, we will live in two worlds – Parallel DB systems and Hadoop.  Relational databases and Hadoop are complementary systems not competing systems.

I was unable to do this keynote justice.  The amount of information that he covered was immense and worthwhile.  Check out this slide deck at I plan to watch this keynote a few more times to understand more of what was said.  Hopefully this taste of Hadoop and NoSQL was helpful.