Category Archives: Azure Data Warehouse

Azure Data Relational Services

Today I’d like to talk about the Azure Relational Data Services Platform. This is an important foundational component for many things that are being built on Azure Platform as a Service related to databases.

One of the key PaaS offerings when Microsoft started with Azure was Azure SQL Database. Moving forward, changes were made to this and Azure SQL DW was released. Recently, Microsoft released a preview of the Azure SQL Database Managed Instance option. This is significant as it is a v-core plus storage option and intended to have parity with the on premises version of SQL Server, plus is a key step to separating compute and storage for Azure SQL Databases as well.

See you at Azure Data Week in a few days!

This is important since it allows Microsoft to standardize their relational database support pattern for other databases as well. This has existed for Azure DW for some time and was also improved in Gen 2. Check out more about this in some previous posts in this series.

Azure’s Relational Database platform supports Azure DW’s MPP platform, Azure SQL Database or SQL Server as PaaS, Azure Database for MySQL and PostgreSQL. So, open source databases are supported on the same relational data services platform. Azure Database for MariaDB is coming by the end of 2018.

You may be thinking, why is all this important and what does a common platform include?

  • First, Azure storage services as a foundation for all databases and all the data on the Azure platform. All data stored here, as well as Azure Databases, whether open source or SQL, are encrypted at rest.
  • Manages high availability of a solution by keeping free copies of data available for the platform at all times. So high availability built in and encryption at rest—secure and available.
  • Azure compute is the VMs supporting the compute needs of the databases. This is where you pick the cores that you want to provide scale up function. However, you’re not managing VMs, you’re managing capacity. Microsoft has taken on the task of understanding what you need from a capacity standpoint, like how do you want to scale up or down or how many v-cores do you want to set aside.
  • A key component of many things in Azure is that we can scale compute separate from storage. The database services platform sits on top of Azure storage and compute, so its strength is that the core of the solution lives in those 2 platforms. It allows support of MPP, open source and SQL databases with PaaS.
  • Databases services is where the next tier happens (or all the cool stuff). On top of the foundation, Microsoft adds a set of common components that are used across all these databases.
  • It’s a trusted platform with things like backup and restore, security, audit and isolation all managed in this service. This allows you to trust the platform and build databases with confidence in the security.
  • It’s flexible, enabling scalability and resource management within the platform. This includes features like scaling up or down on demand and adding storage as you need, giving flexibility to the platform. This is hard to do if you build this for yourself or use an IaaS solution.
  • It’s intelligent. We see big benefits in the fact that it provides monitoring, automated tuning and advisors to the platform. These are built in to make your databases better, so you can rely on good performance and know what is happening in your database when you need to.
  • Think of the third tier (after storage/compute and database services) as each unique database platform and the features each brings to your application. Whether you’re using an open source product that’s using MySQL or a SQL Server, their feature sets come forward in PaaS.

Another advantage to mention is by supporting standard SQL and managed instances, and MySQL and PostgreSQL Community Edition, it makes moving to the cloud so much easier. This open opportunities for you to migrate in clean fashion using all the capabilities of a system you’re familiar with.

Advertisements

Do You Need a Relational Data Warehouse?

Are you looking to do a major update to your data warehouse or looking to modernize? Many technologies have come about that are changing the landscape of what data warehouses are made of. In this Azure Every Day session, I’d like to talk about 3 new technologies in Azure and HDInsight that break the rules.

1. HDInsight (with Spark in particular)

You don’t have to use SSIS to get ETL into your big data storage. HDInsight with Spark can give us ETL, as well as bring to light a lot of machine learning and other technologies, so we no longer have a dependency on SSIS. Earlier this year, Microsoft released Azure Databricks which improves of what is in HDInsight with expanded Spark capabilities.

2. Azure Data Lake

This is a great place to store your data, and no, it won’t cost you a lot of money and it’s not hard to work with. Azure Data Lake gives you the ability to store all your data, regardless of where it comes from or how it looks, in that space – including real time data.

Have you ever considered putting real time data into your data warehouse? Streaming data into your data warehouse would break everything we consider conventional. So, don’t put it in a data warehouse, move it into a Hadoop structure in Azure Data Lake, and avoid the issue in a structured data warehouse of: I got it in there, how do I get it out?

Check out Azure Data Week coming in October 2018

3. Interactive Hive and Spark SQL (also part of HDInsight)

The everyday report writers and users are not going to learn how to do map reduce or all those other technologies that make us the cool big data/data science people. These users just need to build a report.

These technologies bring us to a place where we can write SQL against those data structures and not care where the data came from, how it got there, or the type of data format it’s in. We apply the schema after the fact and it means the same data that we put in our data lake, can be used in multiple scenarios.

Azure Data Warehouse and 60 Servers

In today’s segment of Azure Every Day, I’d like to talk about Azure Data Warehouse and to help you understand some of the underlying architecture and its impact on the workloads you’re planning to put into Azure. Azure Data Warehouse is a parallel data warehousing solution in the cloud, useful for large data workloads.

When you turn on Azure Data Warehouse for a workload, it will instantiate 60 SQL Server databases underneath, which means it’s looking at how to parallelize your workload across 60 databases. This is the standard Platform as a Service (PaaS) implementation. Unlike an appliance or APS, you don’t have control over the number of servers that you run. It is designed to handle this without you having to do anything to manage that workload.

So, when starting to use Azure Data Warehouse, there are 3 aspects that you should understand as to how it will affect how you plan and decide if the workload makes sense in Azure DW.

1. Tables Use Clustered Columnstore Indexes by Default

The first thing to realize is that all the tables going into Azure DW are, by default, cluster columnstore indexes. These indexes are great for compression and performance, but typically need a million rows in them before they become compressed. So, for example, if you think about a distributed workload across 60 databases, you need 60 million rows before you can see an effective compression solution or take advantage of the compression utilities in a cluster columnstore.

2. Tables Are Distributed, Use Partitions Sparingly

Be aware when you’re partitioning a table, that the table is already distributed. Hence, if you have a scenario where a table already has 100 partitions, that creates 6000 partitions across the Azure instance. To break this down further, you’re going to need about 6 billion rows before you can take advantage of the columnstore indexes, including compression. Therefore, if you’re not running a large system, partitioning may not give you the same benefit as on a SQL Server or Azure SQL database.

3. Parallelized Data Loads

Lastly, realize how the data is loaded into the data warehouse. The Azure Data Warehouse, with its 60 databases, makes it a great platform for loading data because you can parallelize the data load across those 60 databases. One tool that helps you is CTAS, CREATE TABLE AS,  SQL statements that were designed for use within Azure DW and can distribute workloads across partitions and load in parallel, resulting in a very fast load.

Used in conjunction with PolyBase, you can pull data from storage, like Azure Blob Storage or Azure Data Lake, and pull that data through and use CTAS to write that data out. A final thought on this scenario is if you use Azure Data Factory and its capabilities to further use that parallelism. With this, you can load a staging table or make sure it sends 60 files at a time so every server is being used.