Category Archives: Windows Azure

Cosmos DB for the Data Professional

Cosmos DB LogoCosmos DB is one of the fastest growing Azure services in 2018. As its popularity grows, data professionals are faced with a changing reality in the world of data. Data is no longer contained in relational databases as general rule. We saw the start of this with Hadoop data storage, but no one ever referred to Hadoop as a database. Sure Hive and other Hadoop based technologies made the data look like a database, but we (data professionals) were able to keep our distance. What’s changed?

The Cloud, Data, and Databases

As cloud reaches more and more businesses, traditional data stores are being reconsidered. We now have data stored in Azure – Azure Data Lake, Azure Storage, Azure Database Services (SQL, PostgreSQL, MySQL), Azure Data Warehouse, and now Cosmos DB. Cosmos DB is the globalized version of Azure Document DB (more about that later). If we are to grow our skillset and careers to a cloud data professional, we need to know more about other ways the data is stored and used. I want to summarize some things that we need to be aware of about Cosmos DB. If your business uses it or plans to and you are a data pro, you will need to know this.

Introducing Cosmos DB

Azure Cosmos DB is Microsoft’s globally distributed, multi-model database.

Cosmos DB Overview 201804

Source: https://docs.microsoft.com/en-us/azure/cosmos-db/introduction 

I will break down key components of Cosmos DB with a data professional in mind. There are a lot of aspects of Cosmos DB that make it very cool, but you will want to understand this when you get the call to fix the database.

Multi-model Database Service

Currently Cosmos DB supports four database models. This is like having for different database servers in one. I liken it to having SQL Server Database Engine and SQL Server Analysis Services using the same underlying engine and it only “looks different.” Cosmos DB refers to these as APIs. The API is chosen when the database is created. This optimizes the portal and database for use with that API. Other APIs can be used to query the data, but it is not optimal. Here are the four models supported and the APIs that support them.

Cosmos DB models

  • Key Value Pair: This is exactly as it sounds. The API is implemented with the Azure Table Storage APIs.
  • Wide Column or Column Family: This stores data similar to relational, but there is no row consistency (each row can look different). Cosmos DB uses the Cassandra API to support this model. (For more information on Cassandra click here.)
  • Documents: This model is based on JSON document storage. Cosmos DB currently supports two APIs for this model: SQL which is the Document DB API and Mongo DB. These are the most common models used in Cosmos DB today. Document DB is the “parent” to Cosmos DB which was rebranded.
  • Graph: Graph databases are used to map relationships in data and were made popular with Facebook for instance. Microsoft uses the open source Gremlin API to support the Graph Database Model.

None of these databases are traditional row/column stores. They are all variations of NoSQL databases.

Turnkey Global Distribution

This is a key attribute for Cosmos DB. Cosmos DB can be easily distributed around the world. Click the data center you want to replicate to and Cosmos DB takes care of the rest. Cosmos DB uses a single write node and multiple read nodes. However, because Cosmos DB was built with global distribution in mind, you can easily and safely move the write node as well. This allows you to “chase the sun” and keep write operations happening “locally”.

Data Consistency

Data consistency is a primary concern of any data professional. The following tables compare Cosmos DB Consistency Levels with SQL Server Isolation Levels. These are not a one for one match, but demonstrate the different concerns between the systems.

 

Cosmos DB

SQL Server

Consistency Level Guarantees Isolation Level Dirty Read Non- repeatable Read Phantom
Strong Reads are guaranteed to return the most recent version of an item. Serializable No No No
Bounded Staleness Consistent Prefix or read order. Reads lag behind writes by prefixes (K versions) or time (t) interval. Snapshot No No No
Session Consistent Prefix. Monotonic reads, monotonic writes, read-your-writes, write-follows-reads. Repeatable Read No No Yes
Consistent Prefix Updates returned are some prefix of all the updates, with no gaps. Reads are not read out of order. Read Committed No Yes Yes
Eventual Out of order reads. Read Uncommitted Yes Yes Yes

As you can see, there are some similarities. These options are important to understand. In the Cosmos DB, the more consistent you need the data, the higher the latency in the distributed data. As a result, most Cosmos DB solutions usually start with Session Consistency as this gives a good, consistent user experience while reducing latency in the read replicas.

Throughput

I am not going to dig into this much. But you need to understand that Request Units (RU) are used to guarantee throughput in Cosmos DB. As a baseline, Microsoft recommends thinking that a 1 KB JSON file will require 1 RU. The capacity is reserved for each second. You will pay for what you reserve, not what you use. If you exceed capacity in a second your request will be throttled. RUs are provisioned by region and can vary by region as a result. But they are not shared between regions. This will require you to understand usage patterns in each region you have a replica.

Scaling and Partitions

Within Cosmos DB, partitions are used to distribute your data for optimal read and write operations. It is recommended to create a granular key with highly distinct values. The partitions are managed for you. Cosmos DB will split or merge partitions to keep the data properly distributed. Keep in mind your key needs to support distributed writes and distributed reads.

Indexing

By default, everything is indexed. It is possible to use index policies to influence the index operations. Index policies are modified for storage, write performance, and read or query performance. You need to understand your data very well to make these adjustments. You can include or exclude documents or paths, configure the index type, and configure the index update mode.  You do not have the same level of flexibility in indexes found in traditional relational database solutions.

Security

Cosmos DB is an Azure data storage solution which means that the data at rest is encrypted by default and data is encrypted in transit. If you need RBAC, Azure Active Directory (AAD) is supported in Cosmos DB.

SLAs

I think that the SLAs Microsoft provides with Cosmos DB are a key differentiator for them. Here is the short summary of guarantees Microsoft provides:

  • Latency: 99.99% of P99 Latency Attainment (based on hours over the guarantee)
    • Reads under 10 ms
    • Writes under 15 ms
  • Availability
    • All up – 99.99% by month
    • Read – 99.999% by month
  • Throughput – 99.99% based on reserved RUs (number of failures to meet reserved amount)
  • Consistency – 99.99% based on setting

These are financially backed SLAs from Microsoft. Imagine you providing these SLAs for your databases. This is very impressive.

Wrap Up

For more information, check out Microsoft’s online documentation on Cosmos DB.

I presented this material at the April 2018 PASS MN User Group Meeting. The presentation can be found here.

Advertisements

Minnesota BI User Group – Powering Up HDInsight with Power BI (December 2015)

On Wednesday, December 16, I presented on this topic at the Minnesota BI User Group.  This session is based on five blog posts that I created in August 2015.

You can find the presentation here: Powering Up HDInsight with Power BI (pdf).

The details can be found in the blog posts noted below:

HDInsight-Series-Featured-Pic_thumb

Setting Up and HDInsight Cluster (No Scripts Required)

Exploring the Microsoft Azure HDInsight Query Console (No Scripting Required)

Uploading Files to an HDInsight Cluster (No Scripting Required)

Using Power BI with HDInsight Part 1: Power Query and Files

Using Power BI with HDInsight Part 2: Power BI Desktop and Hive

My goals for this series

1. Document using Power BI with HDInsight

2. Prove that you can set up a HDInsight Cluster with no scripts

Other References from the Session

Azure: http://azure.microsoft.com/en-us/

Cloud Berry: http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx

 

Thanks for attending my session.

Powering Up HDInsight with Power BI–

On Tuesday, September 15, I presented on this topic for Pragmatic Works. You can find that session here. This session is based on five blog posts that I created in August 2015.

Powering Up HDInsight with Power BI (pdf)https://dataonwheels.files.wordpress.com/2016/02/powering-up-hdinsight-with-power-bi.pdffoundin the

HDInsight Series Featured Pic

Setting Up and HDInsight Cluster (No Scripts Required)

Exploring the Microsoft Azure HDInsight Query Console (No Scripting Required)

Uploading Files to an HDInsight Cluster (No Scripting Required)

Using Power BI with HDInsight Part 1: Power Query and Files

Using Power BI with HDInsight Part 2: Power BI Desktop and Hive

My goals for this series

1. Document using Power BI with HDInsight

2. Prove that you can set up a HDInsight Cluster with no scripts

Other References from the Session

Azure: http://azure.microsoft.com/en-us/

Cloud Berry: http://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx

Wrap Up from the Session

A few questions were asked during the session and I wanted to handle some of them here.

Why did you not use Azure Resource Manager to deploy storage?

I did this as simple as possible and did not need to use the Resource Manager for my demos. However, if you need to rebuild the cluster quickly, the Azure Resource Manager would be a good option. Find out more here: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-provision-clusters/. This site will also walk through scripts and other options for setting up HDInsight clusters.

Why didn’t the table structure show up in the Power Query demo?

The Power Query demo worked with the data from a file approach. This approach is more “raw”. The files did not have column headers, so no headers were created in the table. However, when working with the Power BI Desktop demo, I used Hive. The table was defined in Hive and were easily seen. This is another case for using Hive or something similar to define the schema for ease of use.

What are the differences between Hadoop, Hortonworks, and HDInsight?

Starting from the top, Hadoop is the Adobe open source specification. All of the products listed above are based on Hadoop. Hadoop

Hortonworks and Cloudera are examples of Hadoop distributions. These companies have worked with the various versions of open source technologies around Hadoop and created a supported distribution as a result.

image 

Finally, HDInsight is Microsoft’s cloud-based Hadoop implementation. They continue to add functionality including Spark, R, Giraph, and Solr. You can expect Microsoft to continue to grow the capabilities of HDInsight as part of their cloud-based analytics solutions.

Thanks for attending my session.

Intro to Data Factory–Training on the T’s Follow Up Post

PragmaticWorks-LogoThis is a follow up blog post based on the Intro to Data Factory session I gave on the Training on the T’s with Pragmatic Works. Find more free training from the past and upcoming here. I did my session on January 13, 2015.

 Intro To Data Factory

In this session, I gave a simple introduction to new Azure Data Factory using a CopyActivity pipeline between Azure Blob Storage and Azure SQL Database. Below is a diagram illustrating the factory that is created in the demo.

image

I have published my presentation materials here. This includes the sample JSON files, the Movies.csv, and PowerShell scripts.

Q & A

Here are a few questions that were answered during the session.

1. Does Availability refer to when data that has been transferred will be available? Or when the data source is actually available for query?

Availability refers to when the datasets will make a slice available. This is the when the dataset can be consumed as an input or be targeted as an output. This means you can consume data hourly but choose to push it to its final destination on a different cadence to prevent issues on the receiving end.

2. What pre-requisites are must haves?…e.g.(Azure account, HDInsight, Blob Storage Accounts, etc.)

    • An Azure Account is the only real must have. You could use two on premise SQL Server instances.
    • HDInsight if you want to use the HDInsight activitities
    • An Azure Storage account to use blob or table storage

3. How do you decide to use a Factory or Warehouse?

The factory is more of a data movement tool. A warehouse could be a source or target of a factory pipeline.

4. Is this similar to SSIS in SQL Server?

Yes and no. SSIS is definitely more mature and has more tooling available such as data sources and transformations. SSIS also have a good workflow constructor. The focus of the Data Factory initially was to load HDInsight tables from a variety of sources with more flexibility. The other note here is that Data Factory is being built from the ground up to support the scale of the cloud or Azure.

5. Can this be used for Big Data?

Absolutely. I would say that it is one of the primary reasons for the tool. In reference to the previous question, it will likely be the tool of choice for big data operations because it will be able to scale with Azure.

Links to Additional Resources on Data Factory or tools that were used in the presentation:

Azure Data Factory on Azure’s Website

Azure Data Factory Documentation

Azure Data Factory Pricing

Azure Storage Explorer

Azure PowerShell Documentation

Thanks for joining me for this presentation. We look forward to seeing you at the next Free Training on the T’s.

Oracle Tips for MSBI Devs #6: Supporting SSAS Tabular Development

As SQL Server Analysis Services Tabular Models become more popular, models will use Oracle databases as sources. One of the key issues whenever you work with Oracle is understanding how to properly configure the necessary components to enable development.

Getting Started

If you have worked with Oracle before, you are very aware of a few things you need to be successful. First, you need to install the Oracle client. Here is where the details get messy. When you are working with MSBI tools, you will be using SQL Server Data Tools in Visual Studio which is still only 32 bit. Of the BI tools in SSDT, only SSIS has run modes to support 32 bit and 64 bit configurations. As a result, you need to install the 32 bit Oracle client in order to develop your tabular model.

Once that has been installed you will need to update the TNSNAMES.ORA file with the servers you will be targeting during development. Ideally, your Oracle DBAs have a file for you to use so you don’t need to create one. One nice thing is that the Oracle 12c client updates the PATH environment variable with the location of the bin folder. (Yes, Oracle still uses environment variables.) I would also recommend adding or using the TNS_ADMIN variable to specify the location of the TNSNAMES.ORA file. (See http://www.orafaq.com/wiki/TNS_ADMIN for details.)

NOTE: It took me many hours to work through a variety of configuration issues related to working with the Oracle client install. A couple of reinstalls, reboots, TNSNames.ORA tweaks, and lots of fruitless searching were all required to get this working. Be warned, working with Oracle clients are neither fun nor simple.

The Issue

Now that you have the 32 bit client installed you can connect to the Oracle database through the tabular model designer. As shown below, you can connect to Oracle through the Table Import Wizard.

image

You will be able to successfully test the connection as noted here.

image

And you will be able to execute a query and get results. You can also use the option to select tables and views.

image

However, once you decide to import the data you will encounter the following error:

image

The issue is that while you can do most of your work within Visual Studio using the 32 bit client, the import process targets the SQL Server tabular instance you specified when you created the project. While the 32 bit version of SQL Server is still available, most of us would not install that, even in our development environments. If you do not encounter this error, you are either using the 32 bit client of SQL Server or you have the 64 bit Oracle client installed (more on that next). As long as Visual Studio is only 32 bit compliant and you choose to use the 64 version of SQL Server you will see this issue.

The Resolution

The resolution is fairly simple. You need to download and install the 64 bit Oracle client. I would recommend that you get it installed, then reboot your development PC. While this may not be required, it seems to have helped me with a number of connectivity issues. You will need to be prepared for some “interesting” issues as you will have more than one Oracle home installed and you have the potential of many types of ORA-XXXXX errors. Once you are up and running you should be able to develop tabular models built on Oracle databases.

Some Parting Thoughts

First, I want to be clear that I think that Oracle is a solid database platform. However, I have never been at a client site or on a project where the connectivity or client installs were totally correct or functional without some work between the Oracle team and the BI development team. I think that the .NET driver is supposed to better and I may try that out for a later post (when I have the hours to spare).

I did the testing for this completely on Azure (and my Surface). I set up an Oracle VM and a SQL Server VM on Azure. The Microsoft team put together a great reference on setting up your Oracle VM. Check it out. I also did a previous post on setting up Oracle in an Azure VM. Both VM types can be pricey, but in a testing environment all was not too bad. I encourage you to use Azure to for these types of scenarios. But be sure to turn it off when you are done.