Do You Need a Relational Data Warehouse?

Are you looking to do a major update to your data warehouse, or looking to modernize? A number of new technologies are changing what data warehouses are made of. In this Azure Every Day session, I’d like to talk about 3 new technologies in Azure and HDInsight that break the rules.

1. HDInsight (with Spark in particular)

You don’t have to use SSIS to get ETL into your big data storage. HDInsight with Spark can give us ETL, as well as open the door to machine learning and other technologies, so we no longer have a dependency on SSIS. Earlier this year, Microsoft released Azure Databricks, which improves on what’s in HDInsight with expanded Spark capabilities.
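
As a sketch of what that ETL can look like, here’s a minimal Spark SQL example of the kind you could run from a notebook on an HDInsight Spark cluster. All table names, paths, and columns are hypothetical:

```sql
-- Register raw CSV files landed in the data lake as a table
-- (path and layout are illustrative).
CREATE TABLE IF NOT EXISTS raw_sales
USING csv
OPTIONS (
  path 'adl://mydatalake.azuredatalakestore.net/raw/sales/',
  header 'true'
);

-- A simple ETL step: cleanse and type the data, then persist it as Parquet.
CREATE TABLE cleansed_sales
USING parquet
AS
SELECT CAST(sale_id AS INT)           AS sale_id,
       TRIM(customer_name)            AS customer_name,
       CAST(amount AS DECIMAL(18, 2)) AS amount
FROM raw_sales
WHERE sale_id IS NOT NULL;
```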

2. Azure Data Lake

This is a great place to store your data, and no, it won’t cost you a lot of money and it’s not hard to work with. Azure Data Lake gives you the ability to store all your data, regardless of where it comes from or how it looks, in that space – including real time data.

Have you ever considered putting real-time data into your data warehouse? Streaming data into your data warehouse would break everything we consider conventional. So don’t put it in a data warehouse; move it into a Hadoop structure in Azure Data Lake and avoid the classic structured data warehouse problem: I got the data in there, now how do I get it out?


3. Interactive Hive and Spark SQL (also part of HDInsight)

The everyday report writers and users are not going to learn how to do MapReduce or all those other technologies that make us the cool big data/data science people. These users just need to build a report.

These technologies bring us to a place where we can write SQL against those data structures without caring where the data came from, how it got there, or what format it’s in. We apply the schema after the fact (schema-on-read), which means the same data we put in our data lake can be used in multiple scenarios.
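
As a quick illustration of that schema-on-read idea, here’s a sketch in Spark SQL; the same pattern works in Interactive Hive. The path and fields are hypothetical:

```sql
-- Project a schema onto JSON files already sitting in the data lake;
-- nothing about the files themselves had to change.
CREATE TEMPORARY VIEW clicks
USING json
OPTIONS (path 'adl://mydatalake.azuredatalakestore.net/streams/clicks/');

-- Now an everyday report writer can just write SQL against it.
SELECT device, COUNT(*) AS hits
FROM clicks
GROUP BY device
ORDER BY hits DESC;
```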


Azure Hybrid and Security

Data Security. These words are at the forefront of many people’s minds, and just hearing them may bring on an instant tension headache. Whether you’re in Europe dealing with the GDPR (General Data Protection Regulation) that went into effect in May, or here in the US, security and compliance are hot topics today.

The GDPR handles privacy much differently than the regulations before it, and, yes, if you’re in the US but have business or clients in the EU, you need to be ready as well. With these added regulations, many people are talking about cloud or hybrid scenarios.

Many of us are more comfortable with our data being handled locally, but that does not mean it’s more secure. Location is only one aspect of compliance. Today, I’d like to tell you 4 ways to take advantage of the cloud to help you become more secure.

1. Every time you provision a new service or add more business value, you need to take care of security.

Cloud providers take care of this for you. They have a vested interest in making sure the security of their platform is in place for you. As these platforms add new resources, you can continue to grow your business with security already taken care of.

2. Cloud facilities are physically more secure.

Cloud providers, with their multiple layers of security, their processes, and their constant auditing and regulation, provide more physical security than your own data center, period.

3. Patching.

Cloud providers have very specific requirements to patch on a regular basis, to maintain their security, and to stay compliant with the numerous regulations in the industry, which takes this responsibility off your team.

4. Cloud providers hire the best in the business.

They invest money to have the best and brightest specialists to ensure your environment is secure. They want to be sure you have everything you need to be secure and compliant.

If you are serious about the security of your environment and your data, you should be considering the cloud. Cloud providers are likely doing security better than you can, and they are constantly improving.

Azure Data Warehouse and 60 Servers

In today’s segment of Azure Every Day, I’d like to talk about Azure Data Warehouse and help you understand some of the underlying architecture and its impact on the workloads you’re planning to put into Azure. Azure Data Warehouse is a massively parallel processing (MPP) data warehousing solution in the cloud, useful for large data workloads.

When you turn on Azure Data Warehouse for a workload, it instantiates 60 SQL Server databases underneath, which means it’s looking at how to parallelize your workload across those 60 databases. This is the standard Platform as a Service (PaaS) implementation. Unlike an appliance such as the Analytics Platform System (APS), you don’t have control over the number of servers you run; the service is designed to manage that workload without you having to do anything.
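
You can see those fixed distributions for yourself. Assuming I have the catalog view right, something like this should come back with 60 rows regardless of the service tier:

```sql
-- Azure SQL Data Warehouse exposes its distributions in a catalog view.
SELECT COUNT(*) AS distribution_count
FROM sys.pdw_distributions;
```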

So, when starting to use Azure Data Warehouse, there are 3 aspects you should understand, because they affect how you plan and whether a workload makes sense in Azure DW.

1. Tables Use Clustered Columnstore Indexes by Default

The first thing to realize is that all tables going into Azure DW are, by default, clustered columnstore indexes. These indexes are great for compression and performance, but they typically need about a million rows in a rowgroup before the data gets compressed. So, with a workload distributed across 60 databases, you need roughly 60 million rows in a table before you see effective compression or can take advantage of the compression benefits of a clustered columnstore.
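
Here’s a sketch of what such a table definition looks like; the CLUSTERED COLUMNSTORE INDEX option is what you get even if you leave it out. Table and column names are made up for illustration:

```sql
-- Clustered columnstore is the default storage for Azure DW tables,
-- so spelling it out here is optional but makes the intent explicit.
CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT NOT NULL,
    CustomerKey INT    NOT NULL,
    SaleDate    DATE   NOT NULL,
    Amount      DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```

Each of the 60 distributions compresses its rowgroups independently, which is where the 60-million-row rule of thumb comes from.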

2. Tables Are Distributed, Use Partitions Sparingly

Be aware when you’re partitioning a table that the table is already distributed. Hence, if a table has 100 partitions, that creates 6,000 physical partitions across the Azure instance. To break this down further, you’re going to need about 6 billion rows before you can take full advantage of the columnstore indexes, including compression. Therefore, if you’re not running a large system, partitioning may not give you the same benefit it does on SQL Server or Azure SQL Database. A hedged sketch of the syntax follows.
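
To make the multiplication concrete, here’s an illustrative partitioned table; even these few monthly boundaries multiply out across all 60 distributions:

```sql
-- Every partition exists in each of the 60 distributions, so counts
-- multiply: 100 partitions x 60 distributions = 6,000 partitions,
-- each wanting ~1 million rows before columnstore compression kicks in.
CREATE TABLE dbo.FactSalesPartitioned
(
    SaleKey  BIGINT NOT NULL,
    SaleDate DATE   NOT NULL,
    Amount   DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(SaleKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES
        ('2018-01-01', '2018-02-01', '2018-03-01'))
);
```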

3. Parallelized Data Loads

Lastly, realize how the data is loaded into the data warehouse. Azure Data Warehouse, with its 60 databases, is a great platform for loading data because you can parallelize the load across those 60 databases. One tool that helps is CTAS (CREATE TABLE AS SELECT), a SQL statement designed for use within Azure DW that distributes the workload across the distributions and loads in parallel, resulting in a very fast load.
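
A minimal CTAS looks like this; because the new table is created with its own distribution, the load fans out across all 60 databases at once. Names are illustrative:

```sql
-- CTAS creates and loads the target table in one parallel operation.
CREATE TABLE dbo.FactSales_New
WITH
(
    DISTRIBUTION = HASH(SaleKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT SaleKey, SaleDate, Amount
FROM stage.Sales;
```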

Used in conjunction with PolyBase, you can read data from storage, like Azure Blob Storage or Azure Data Lake, and use CTAS to write that data out as a distributed table. A final thought on this scenario: use Azure Data Factory and its capabilities to push that parallelism further. With it, you can load a staging table or make sure it sends 60 files at a time so every one of the 60 databases is being used.
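
Putting the two together, the pattern is roughly: define an external table over the files with PolyBase, then CTAS it into the warehouse. This sketch assumes the external data source (AzureBlobStore), the file format (CsvFormat), and the ext schema already exist in your environment; all identifiers are illustrative:

```sql
-- An external table is just metadata over files in Blob Storage or
-- Azure Data Lake; no data moves until you query it.
CREATE EXTERNAL TABLE ext.Sales
(
    SaleKey  BIGINT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH
(
    LOCATION    = '/sales/',
    DATA_SOURCE = AzureBlobStore,   -- assumed to exist
    FILE_FORMAT = CsvFormat         -- assumed to exist
);

-- CTAS pulls the files through PolyBase and writes a distributed,
-- columnstore table in parallel.
CREATE TABLE dbo.FactSales_Load
WITH (DISTRIBUTION = HASH(SaleKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT SaleKey, SaleDate, Amount
FROM ext.Sales;
```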

External Access to Azure Storage

There are multiple ways to allow external access to Azure storage accounts, some better (and more secure) than others. Today, I’d like to share with you 3 methods to access your storage accounts externally, as well as the preferred methods for doing so.

1. Access Keys

This is one way to allow access, but I don’t recommend it. Why? Only 2 access keys exist for your storage account, and each grants full access to it. If you hand one out, it will be a problem later: when you have to change it, you’ll have to change everything referring to that storage account. These keys should be reserved for applications or special use cases that you can manage accordingly.

If you share these access keys outside of the organization, you create a problem: if you ever feel a key has been compromised in some way, you’ll have to go in and change it, along with everything that uses it.

2. Shared Access Signatures (SAS)

This is a much better option than sharing the keys. A SAS is designed to limit access to specific containers or other resources within your storage account, which lets you establish security at a more granular level than access keys allow. With this method, you can use one storage account, create multiple signatures, and grant each one specific access.

This works great when a company is sending you data and you want to store it somewhere and then give them secure access to your Azure environment. Signatures have expiration properties, so you can allow access for a designated amount of time, and if things change, it’s easy to revoke the signature and stop access.

3. Shared Access Policies

This is your best option for letting external entities get data into a storage account, and it’s the easiest to manage. Shared access policies (Azure calls them stored access policies) build on shared access signatures and are created with a tool such as PowerShell rather than in the portal. Because the policy lives on the container, you can adjust or revoke access without creating a new SAS every time.

How to Get Developers Using Azure

You know the talk out there: everyone is moving to the cloud, and people are looking at Azure to get there. But many have concerns about going to the cloud or about the unknowns around Azure. That’s why we created this blog series. Today, I want to tell you why I think it’s critical to help your development teams leverage Azure, which will help you as an organization.

First, there are 3 things Azure brings to you as a business and for your teams:

1. You can learn new things quickly.

You can go into Azure, turn on a few features, and work with them to try them out. By contrast, to learn to work with Hadoop infrastructure, for example, you’d have to take the time to set up a bunch of clusters. If your team is looking to take advantage of something new like containers or NoSQL, they can try these out in Azure without investing in infrastructure.

2. It’s good prep for when you want to go to the cloud.

Bottom line: it’s different in the cloud. Whether you plan to move soon or sometime in the future, you’re going to need to understand how subscriptions and components work and gain experience with how things interact in the cloud. Your team having that understanding is key to your ability to make the move successfully and effectively.

3. Breeding creativity within your IT organization and development teams.

Creativity can lead to new business opportunities and improvements to your business. Giving your team the opportunity to jump in and work with new tools can help them come up with new ideas.


Managing the $

The risk, as always, is cost. No one wants to spend money. But here are 3 things you can do to help manage your risk and cost:

  • Dev/Test environments are available inside Azure. They let you set resources to shut down automatically, a great feature for managing costs.
  • If your company uses MSDN or developer licenses, you can get a free amount of spend in Azure (from $25 to $200 per month).
  • Free trials. It’s easy to spin up multiple free trials, so you can test something out and reduce the risk of spend. This is a great opportunity to do a POC for things with high cost risk, like Azure Data Warehouse or HDInsight clusters. Just be sure to use them effectively, as they do expire after a certain point.

Check out the current free offering and services from Microsoft here: https://azure.microsoft.com/en-us/free/.
