
Cosmos DB for the Data Professional

Cosmos DB is one of the fastest growing Azure services in 2018. As its popularity grows, data professionals are faced with a changing reality in the world of data. Data is no longer contained in relational databases as a general rule. We saw the start of this with Hadoop data storage, but no one ever referred to Hadoop as a database. Sure, Hive and other Hadoop-based technologies made the data look like a database, but we (data professionals) were able to keep our distance. What’s changed?

The Cloud, Data, and Databases

As the cloud reaches more and more businesses, traditional data stores are being reconsidered. We now have data stored in Azure – Azure Data Lake, Azure Storage, Azure Database Services (SQL, PostgreSQL, MySQL), Azure SQL Data Warehouse, and now Cosmos DB. Cosmos DB is the globalized version of Azure Document DB (more about that later). If we are to grow our skill sets and careers as cloud data professionals, we need to know more about the other ways data is stored and used. I want to summarize some things we need to be aware of with Cosmos DB. If your business uses it or plans to, and you are a data pro, you will need to know this.

Introducing Cosmos DB

Azure Cosmos DB is Microsoft’s globally distributed, multi-model database.

[Image: Cosmos DB overview diagram, April 2018]

Source: https://docs.microsoft.com/en-us/azure/cosmos-db/introduction 

I will break down key components of Cosmos DB with a data professional in mind. There are a lot of aspects of Cosmos DB that make it very cool, but you will want to understand this when you get the call to fix the database.

Multi-model Database Service

Currently Cosmos DB supports four database models. This is like having four different database servers in one. I liken it to having SQL Server Database Engine and SQL Server Analysis Services using the same underlying engine; each one only “looks different.” Cosmos DB refers to these as APIs. The API is chosen when the database is created, which optimizes the portal and database for use with that API. Other APIs can be used to query the data, but it is not optimal. Here are the four models supported and the APIs that support them.

[Image: the four Cosmos DB models]

  • Key Value Pair: This is exactly as it sounds. The API is implemented with the Azure Table Storage APIs.
  • Wide Column or Column Family: This stores data similarly to relational storage, but there is no row consistency (each row can look different). Cosmos DB uses the Cassandra API to support this model.
  • Documents: This model is based on JSON document storage. Cosmos DB currently supports two APIs for this model: SQL, which is the original Document DB API, and MongoDB. These are the most common models used in Cosmos DB today. Document DB is the “parent” of Cosmos DB, which was rebranded from it. A sample SQL API query appears below.
  • Graph: Graph databases are used to map relationships in data and were made popular by social networks such as Facebook. Microsoft uses the open source Gremlin API to support the graph database model.

None of these databases are traditional row/column stores. They are all variations of NoSQL databases.
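To make the Documents model concrete, here is a minimal sketch of a SQL API query. The container, alias, and document shape are hypothetical; the SQL API uses a SQL-like grammar over JSON documents, with dotted paths reaching into nested properties.

```sql
-- Hypothetical "orders" container queried through the SQL (Document DB) API.
-- By convention, c aliases each JSON document in the container.
SELECT c.id,
       c.customer.name,    -- dotted path into a nested JSON object
       c.total
FROM c
WHERE c.customer.city = 'Minneapolis'
ORDER BY c.total DESC
```

Note that the query returns JSON documents, not rows; the tabular look of the SELECT list is simply shaping the JSON output.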

Turnkey Global Distribution

This is a key attribute for Cosmos DB. Cosmos DB can be easily distributed around the world. Click the data center you want to replicate to and Cosmos DB takes care of the rest. Cosmos DB uses a single write node and multiple read nodes. However, because Cosmos DB was built with global distribution in mind, you can easily and safely move the write node as well. This allows you to “chase the sun” and keep write operations happening “locally”.

Data Consistency

Data consistency is a primary concern for any data professional. The following table compares Cosmos DB consistency levels with SQL Server isolation levels. These are not a one-for-one match, but the comparison demonstrates the different concerns between the two systems.


| Cosmos DB Consistency Level | Cosmos DB Guarantees | SQL Server Isolation Level | Dirty Read | Non-repeatable Read | Phantom |
| --- | --- | --- | --- | --- | --- |
| Strong | Reads are guaranteed to return the most recent version of an item. | Serializable | No | No | No |
| Bounded Staleness | Consistent prefix (read order); reads lag behind writes by at most K versions or a time interval t. | Snapshot | No | No | No |
| Session | Consistent prefix; monotonic reads, monotonic writes, read-your-writes, write-follows-reads. | Repeatable Read | No | No | Yes |
| Consistent Prefix | Updates returned are some prefix of all the updates, with no gaps; reads are not out of order. | Read Committed | No | Yes | Yes |
| Eventual | Out-of-order reads. | Read Uncommitted | Yes | Yes | Yes |

As you can see, there are some similarities. These options are important to understand. In Cosmos DB, the more consistency you require, the higher the latency in the distributed reads. As a result, most Cosmos DB solutions start with Session consistency, which gives a good, consistent user experience while reducing latency in the read replicas.
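For reference, the SQL Server half of that comparison is set per session or transaction in T-SQL. This is a hedged sketch with a made-up database and table; Snapshot isolation also has to be enabled at the database level first.

```sql
-- Hypothetical SalesDb/dbo.Orders. Snapshot isolation must be enabled once:
-- ALTER DATABASE SalesDb SET ALLOW_SNAPSHOT_ISOLATION ON;

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;  -- closest analog to Bounded Staleness above

BEGIN TRANSACTION;
    SELECT OrderId, OrderStatus
    FROM dbo.Orders
    WHERE CustomerId = 42;   -- reads a consistent snapshot: no dirty or phantom reads
COMMIT TRANSACTION;
```

Cosmos DB, by contrast, sets the default consistency level on the account, and individual requests can only relax it, not strengthen it.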

Throughput

I am not going to dig into this much, but you need to understand that Request Units (RUs) are used to guarantee throughput in Cosmos DB. As a baseline, Microsoft recommends estimating that reading a 1 KB JSON document requires 1 RU. Capacity is reserved for each second, and you pay for what you reserve, not for what you use. For example, reserving 400 RU/s lets you read roughly 400 one-kilobyte documents per second; if you exceed the reserved capacity within a second, your requests are throttled. RUs are provisioned by region and can vary by region as a result, but they are not shared between regions. This will require you to understand usage patterns in each region where you have a replica.

Scaling and Partitions

Within Cosmos DB, partitions are used to distribute your data for optimal read and write operations. It is recommended to choose a granular key with highly distinct values. The partitions are managed for you: Cosmos DB will split or merge partitions to keep the data properly distributed. Keep in mind that your key needs to support distributed writes and distributed reads, as the sketch below illustrates.
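To see why the key matters, assume documents carry a userId property that was chosen as the partition key (an assumption for illustration). A query that filters on it is routed to a single partition; one that does not must fan out to all of them.

```sql
-- Single-partition query: the filter pins the assumed partition key /userId,
-- so Cosmos DB routes the request to one partition.
SELECT c.id, c.orderDate, c.total
FROM c
WHERE c.userId = 'user-12345'

-- Cross-partition query: no partition key in the filter, so every partition
-- is consulted and the query costs more RUs.
-- SELECT c.id FROM c WHERE c.total > 100
```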

Indexing

By default, everything is indexed. It is possible to use index policies to influence index operations. Index policies are tuned for storage, write performance, and read or query performance, and you need to understand your data very well to make these adjustments. You can include or exclude documents or paths, configure the index type, and configure the index update mode. You do not have the same level of index flexibility found in traditional relational database solutions.

Security

Cosmos DB is an Azure data storage solution which means that the data at rest is encrypted by default and data is encrypted in transit. If you need RBAC, Azure Active Directory (AAD) is supported in Cosmos DB.

SLAs

I think that the SLAs Microsoft provides with Cosmos DB are a key differentiator for them. Here is the short summary of guarantees Microsoft provides:

  • Latency: 99.99% of P99 Latency Attainment (based on hours over the guarantee)
    • Reads under 10 ms
    • Writes under 15 ms
  • Availability
    • All up – 99.99% by month
    • Read – 99.999% by month
  • Throughput – 99.99% based on reserved RUs (number of failures to meet reserved amount)
  • Consistency – 99.99% based on setting

These are financially backed SLAs from Microsoft. Imagine providing these SLAs for your own databases. This is very impressive.

Wrap Up

For more information, check out Microsoft’s online documentation on Cosmos DB.

I presented this material at the April 2018 PASS MN User Group Meeting. The presentation can be found here.


Power BI Is Finally in the Azure Trust Center

With the most recent announcement of Power BI’s inclusion in the Azure Trust Center, it is a good time to review where we are today with Power BI security and compliance as it relates to various customer needs. I do a lot of work with financial, energy, and medical customers, groups that carry a large number of compliance and regulatory requirements. I wanted to understand where we are today, and this announcement is significant.

What’s in the Announcement?

One of the primary roadblocks to adopting the Power BI service has been the lack of compliance certifications and concerns around security. Microsoft has been making a number of enterprise-level improvements to the Power BI service and desktop. Power BI now has the following compliance certifications:

[Image: Power BI compliance certifications as of 2016]

This announcement shows Microsoft’s continued commitment to security and compliance in its cloud based products. While Power BI is not yet to the level of Office 365, some key compliance areas are now covered.

I think the most significant compliance certification is HIPAA/HITECH, which removes barriers for the medical industry. As hospitals, insurance companies, and providers scramble to meet reporting demands from their customers and the government, Power BI gives them a flexible reporting and visualization platform to meet those needs. It will empower self-service in these organizations as well as departmental or enterprise collaboration with data. The HIPAA/HITECH certification will allow them to use the platform with more confidence and security.

Beyond medical, more institutions will be able to rely on Power BI in a manner that is compliant and safe. As Microsoft continues this journey with Power BI and its other Azure based offerings, customers will be able to react more quickly to the changing business and regulatory environments with confidence in the security and management of their data.

The Reality – You Are as Secure as You Choose to Be

Even with this significant move by Microsoft, you are still responsible for implementing a secure, compliant solution. Microsoft is merely providing tools that are secure and will comply with regulations if implemented correctly. The key to a secure environment will always be you. The data you use and analyze with Power BI is ultimately your responsibility.

I encourage you to review Microsoft’s security and compliance resources, in addition to the ones above, as you determine your security and compliance posture within the Power BI product.


VS Live 2016 – Las Vegas Follow Up


I spoke at Visual Studio Live in Las Vegas on two topics. While the presentations were uploaded to the site and available for attendees, the code had not been distributed yet, an oversight on my part. In this post, I will give a quick summary of the sessions and make the sample code available. I will be writing more on these topics throughout the year and will tag VS Live in the notices.


JSON & SQL Server Finally Together

JSON is now part of SQL Server 2016. SQL Server now includes functions to generate and shred JSON. Here are the basics:

  • OPENJSON: Used to convert JSON data into a tabular format
  • FOR JSON: Used to create JSON from tabular data
  • ISJSON: Determines if the data in question is JSON
  • JSON_VALUE: Returns scalar values from JSON data
  • JSON_QUERY: Returns JSON formatted arrays or objects
  • JSON_MODIFY: Used to modify JSON data and properties

Even with all of this support, JSON is not a native data type in SQL Server the way XML is.
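Here is a minimal sketch of the functions above; the sample JSON, variable, and column names are invented for illustration.

```sql
-- SQL Server 2016+. A small made-up JSON array to exercise each function.
DECLARE @json NVARCHAR(MAX) =
    N'[{"id": 1, "name": "Widget", "price": 9.99}]';

SELECT ISJSON(@json) AS IsValidJson;                        -- 1 = valid JSON

SELECT JSON_VALUE(@json, '$[0].name')  AS FirstName;        -- scalar value: Widget
SELECT JSON_QUERY(@json, '$')          AS WholeArray;       -- returns the array itself
SELECT JSON_MODIFY(@json, '$[0].price', 12.49) AS Updated;  -- rewrite one property

-- OPENJSON: shred the JSON into rows and columns
SELECT *
FROM OPENJSON(@json)
    WITH (id INT '$.id', name NVARCHAR(50) '$.name', price DECIMAL(9,2) '$.price');

-- FOR JSON: generate JSON from tabular data
SELECT 1 AS id, N'Widget' AS name
FOR JSON PATH;                                              -- [{"id":1,"name":"Widget"}]
```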

You can download supporting files and code here.


Using Hive and Hive ODBC with HDInsight and Power BI

During this session, I went through the process of setting up HDInsight and loading data into the cluster. Once the cluster was created, Hive tables were built and queries written, which were then used with Power BI to analyze the results.

You can find the details here.

Minnesota BI User Group – Powering Up HDInsight with Power BI (December 2015)

On Wednesday, December 16, I presented on this topic at the Minnesota BI User Group.  This session is based on five blog posts that I created in August 2015.

You can find the presentation here: Powering Up HDInsight with Power BI (pdf).

The details can be found in the blog posts noted below:


Setting Up an HDInsight Cluster (No Scripts Required)

Exploring the Microsoft Azure HDInsight Query Console (No Scripting Required)

Uploading Files to an HDInsight Cluster (No Scripting Required)

Using Power BI with HDInsight Part 1: Power Query and Files

Using Power BI with HDInsight Part 2: Power BI Desktop and Hive

My goals for this series

1. Document using Power BI with HDInsight

2. Prove that you can set up an HDInsight cluster with no scripts

Other References from the Session

Azure: http://azure.microsoft.com/en-us/

Cloud Berry: http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx


Thanks for attending my session.

TechFuse Minnesota Follow Up

[Image: TechFuse Fall Edition banner]

My Session: Using Azure SQL Database for Enterprise Needs

On 10/6/2015, I presented on Azure SQL Database at TechFuse Minnesota. Some of the highlights from the session included a comparison of the various SQL Server offerings and how Azure SQL Database fits into the overall picture. During the session we also discussed the importance of V12 and elastic databases (still in preview) for the enterprise environment. I also raised and discussed cloud concerns including data safety, data compliance, data privacy, and business continuity. With the ongoing changes Microsoft makes in this space, expect more changes to come. However, even today you can be successful using Azure SQL Database for your enterprise needs.

The presentation can be found in PDF format here.

Microsoft has a wealth of documentation which can be found here: https://azure.microsoft.com/en-us/documentation/services/sql-database/.

Questions from the Session

How are upgrades handled in Azure SQL Database?

Most upgrades to Azure SQL Database are seamless to us. However, Microsoft has made changes over the years that required some intervention. One such change was the move from the Web and Business service tiers to Basic, Standard, and Premium. In that case, Microsoft provided a lot of guidance around the process, and it was very easy to do in most cases. The most recent instance was the move to V12. Once again Microsoft provided the upgrade path in documentation. Both upgrades could be handled in the portal. However, V12 was a significant change, so groups needed to do more testing prior to the upgrade.

Microsoft’s updates for Azure SQL Database over the past couple of years: https://azure.microsoft.com/en-us/updates/?service=sql-database.

Can we use Point in Time Restore to recover a table or to recover to a new database?

Point in Time Restore allows us to recover databases to specific points in time. Each service tier retains a different amount of history: Basic, 7 days; Standard, 14 days; Premium, 35 days. Azure SQL Database always recovers the database to a new database. This means you can either “swap” the restored database into place or move the items you need to recover out of it. One word of caution: you need to have the DTU capacity to run both databases on the server to support the restore. Be sure to plan the recovery process and clean up when you have completed your recovery.
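As a sketch of the “swap” step (the restore itself is driven from the portal, PowerShell, or the REST API), assume the service restored your database as mydb_restored; a rename from the logical server’s master database puts it back in place. The database names here are hypothetical.

```sql
-- Run while connected to the logical server's master database.
-- Assumes point-in-time restore created a new database named mydb_restored.
ALTER DATABASE [mydb] MODIFY NAME = [mydb_old];       -- move the damaged copy aside
ALTER DATABASE [mydb_restored] MODIFY NAME = [mydb];  -- swap the restored copy in

-- After verifying the recovery, clean up to reclaim the extra DTU capacity:
-- DROP DATABASE [mydb_old];
```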

Thanks again to everyone who could attend this year.

Until the next session …