Azure Integration Options and Overview

Today I’d like to talk about the Azure integration services you can leverage within the platform. Integration services allow your applications and businesses to exchange data and let you move data around within the platform. I’d like to discuss what is currently available and where it might fit, as well as give you an overview of the capabilities of Azure integration.

Visual Integration Tools in Azure

Let’s start with the “business” or user-friendly integration tools.

Flow

This workflow engine comes with Office 365. It’s a lightweight data integration engine that gives business users the ability to integrate different types of applications. For example, you can use Flow to take data from SharePoint and drop it into email, or take data from an email and drop it into Azure Storage. It’s designed to be simple to use, and it doesn’t really allow custom coding. Power users who are used to moving data around or have worked with SharePoint workflows will find it an easy transition.

Logic Apps

Logic Apps is available in Azure as opposed to Office 365. It does allow you to get at the underlying code and build within the context of Visual Studio. Its graphical interface is almost identical to Flow, with almost the same capabilities, but those capabilities can be expanded. It also gives you more options around security and other features that make it more IT-friendly.


Developer Integration Tools in Azure

The other integration services get a bit more sophisticated and require development, but they give you better options around source control and other areas:

Service Bus

This is about moving messages around; it’s a message broker service. It handles data from all kinds of sources, on and off premises, cloud based, mobile, etc. It’s a highly scalable, high volume service and one of the most mature services in the Azure platform.

Azure Data Factory

This integration tool does a lot of the same types of workloads as SQL Server Integration Services (SSIS) or Informatica. With Azure Data Factory V2, you have the ability to take SSIS and put it in as an integration runtime and run SSIS packages in the context of Azure Data Factory, thus making its capability even greater.

This awesome V2 release has a lot of capabilities that are not available in V1 and offers much more control. It enhances our ability to integrate data and build ETL-style functionality and workflows.

Event Grid

This tool manages events coming off different devices and applications and simplifies the whole process of anything event driven. So, if you’re tracking events off an app or an IoT device and trying to figure out what you’re going to do with that data or how you’re going to manage it, Event Grid simplifies that infrastructure for you.

WebJobs and Functions

WebJobs are part of the App Service; Functions are not, they are serverless. The idea behind both is that they let you deploy your own code and handle other interactions.

So, you have all these options available for integration. The key is to understand what makes the most sense for you and your business. Each fits best in a certain space; some are more extensible than others, and some require more coding.

What you need to do is find the right tool for the job. In some cases, you may need more of a toolbox scenario, where you put pieces together, so you can get the best of all of them. You need to determine the best way to integrate the apps you have, from everywhere your apps and users exist, across your enterprise.

Power Testing ETL with Power BI – Creating the Tests with Power Pivot

This is the second deep dive into Power Testing ETL with Power BI. At this point, we have created the source table which will be used in our testing. The next step is to bring in the destination table and create the tests that will be “run” against the data. In its simplest form, the tests are created using logical conditions based on whether source data matches destination data and whether calculations applied to those data sets also match. When they don’t match, you have a data load error, which results in a failed test.

How to Calculate Success and Failure

The basic idea of the testing is to turn the results into numbers and calculate whether, and by how much, we succeeded or failed. Typically, every test will result in a 1 or 0. Whether you assign 1 to success or failure is largely dependent on how you plan to display your results. If you plan to use KPIs built into the Power Pivot model, you will be comparing the number of successful tests against the number of rows expected to be imported. The primary reason for this is that you cannot target zero when using KPIs. In this scenario, successful tests result in 1 and are therefore easily compared to the number of expected rows, which would be 100% successful if they matched.

The other scenario is to measure failures. In this case, we assign 1 to each failed test and count the number of failed tests. This can easily be handled in visualizations such as conditional formatting, where 0 can be displayed as green and the number of failures changes the state from green to yellow, then red. This helps identify the most commonly failed tests.

The method you choose is up to you and how you prefer to see the results. We will cover using both variations in visualizations, but for the sake of brevity here, we will measure success against our row count. Success = 1; Failure = 0.
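To make the two conventions concrete, here is a minimal DAX sketch; the table name ‘Destination’ and the 1/0 column [Test Passed] are placeholders for illustration, not names from this solution.

// Success convention: sum a 1/0 "test passed" column and compare it to the expected row count
Successful Tests:=SUM('Destination'[Test Passed])

// Failure convention: count the rows that did not pass; anything above zero signals a load error
Failed Tests:=COUNTROWS('Destination') - [Successful Tests]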

Creating the Power Pivot Tests

In order to create the tests, you need to open the Power Pivot window and add the destination table to the model. In our case we have created a table in the HughesMediaLibrary database called books that is our target. Here is the syntax for the target table.

CREATE TABLE dbo.Books(
BookID int IDENTITY(1,1) NOT NULL
CONSTRAINT pk_Books PRIMARY KEY CLUSTERED,
BookName varchar(100) NOT NULL,
Publisher varchar(100) NULL,
Genre varchar(50) NULL,
CopyrightYear smallint NULL,
AuthorFName1 varchar(100) NULL,
AuthorLName1 varchar(100) NULL,
AuthorFName2 varchar(100) NULL,
AuthorLName2 varchar(100) NULL,
AuthorFName3 varchar(100) NULL,
AuthorLName3 varchar(100) NULL,
AuthorFName4 varchar(100) NULL,
AuthorLName4 varchar(100) NULL,
AuthorFName5 varchar(100) NULL,
AuthorLName5 varchar(100) NULL,
PageCount int NULL
)

While I realize this is not a good normalized table, it serves our purposes well to build out the tests. This table needs to be added to the Power Pivot model before we can do the next steps.

Relating the Source and Destination

The next step is to relate the source and destination. In our case, the only data that will work is the book name. We will use the Source table as the primary table in this relationship. The idea is that all the data in the source table should exist in the target. Whether or not that turns out to be true, the source is the “source of truth” for the testing scenario.


Building the Tests

The tests are composed of calculated columns that handle the row-by-row data analysis and calculated measures that summarize the results.

Validating Data Field by Field, Row by Row

This is the primary reason that we worked with Power BI. One of the most common testing scenarios is whether the data came over correctly. In the previous post, we shaped the data with Power Query. Now we will compare it with the results from our ETL process in SSIS. We will use Book Name as the example. Every field you wish to test can follow this pattern. The test consists of a calculated column and a calculated measure.

We create a column in the destination table called Book Name Matches. (Remember, we are tracking success, not failures.) In each row of the data, we need to determine whether the book name in the destination is an exact match for the book name in our source. We used the following DAX for that calculation:

=IF(RELATED('Booklist Source Fixes'[BookName])='Media Library – Books'[BookName],1,0)

It looks at the related table to determine whether the field values match. If they match, the test is assigned a 1 for that row. If they do not match, a 0 is assigned. (The table names are how I named the source and destination. They may not match your solution if you are following along.) Once we have the rows evaluated, we will sum the values with a Book Name Matches measure:

Book Name Matches (34):=SUM('Media Library – Books'[Book Name Matches])

We will use the Book Name Matches (34) measure to compare with the book count. If they match, all tests passed. If they do not, then some or all rows have failed.

The number after the measure, 34, is the test key from TFS. I added this into the measure to make it easier to identify which test case is being evaluated with this measure. In some cases, you may have multiple measures that are required to complete a test. You can either evaluate them independently or create an additional measure that summarizes them for your use.
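As a hypothetical example (these measure names are not part of the demo model), a summary measure for an author name test might look like this:

// Test passes only when both underlying measures equal the expected row count
Author Name Matches (35):=IF(
    [AuthorFName1 Matches] = [Expected Row Count] &&
    [AuthorLName1 Matches] = [Expected Row Count],
    1, 0)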

Other Validations or Tests

Some other basic validations can be created as well. A common one would be the book count. In my scenario, I return the book count and then evaluate it using a KPI. Another way to do this is to add another measure that checks for equality between the two book count measures in the source and destination. If they match, success. If not, failure.
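A sketch of that equality check, reusing the table names from the DAX example above (adjust them to match your own model):

Source Book Count:=COUNTROWS('Booklist Source Fixes')
Destination Book Count:=COUNTROWS('Media Library – Books')

// 1 when the counts agree, 0 when rows were dropped or duplicated
Book Count Matches:=IF([Source Book Count] = [Destination Book Count], 1, 0)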

You can also use measures to validate expected totals the same way we were working with counts. This is particularly helpful in financial data loads where you would want to verify a sum of balances to make sure the results match. The point is that you can add any other measures that you want to compare in order to meet the unique needs of your situation. It is also possible to compare to entered values. If you know that 100 widgets are to be imported, you can have the measure evaluate against 100 instead of a measure in the source.
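Here is a rough sketch of both ideas. The PageCount column exists in the destination table above; assuming the source query exposes the same column is an illustration, as is the widget example.

// Compare a summed value between source and destination
Page Count Total Matches:=IF(
    SUM('Booklist Source Fixes'[PageCount]) = SUM('Media Library – Books'[PageCount]),
    1, 0)

// Or evaluate against a known, entered value instead of a source measure
Expected Widget Count Met:=IF(COUNTROWS('Media Library – Books') = 100, 1, 0)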

Recording the Results in TFS

In order to bring the process full circle, we enter test results into TFS or Visual Studio Online. This gives us the ability to track test results, bugs, and fixes in a development lifecycle tool. It is also the best way to track testing history. One caveat here is that the query results from TFS do not permit you to set test results in Excel. Ideally, we should be able to link in the tests with the results. We could then update the results in the query and push them back. This is NOT supported at the moment. As a result, you will need to open the tests in TFS to update your results. This is not a significant issue because you should also create bugs for failed tests. It’s primarily a nuisance.

An added side effect of using this method to test is that we are able to collaborate with developers to determine what the bug actually is. Because all the data is loaded into Excel, reviewing results is fairly simple and may actually be easier than trying to look at the destination system.

Quick Look at SSIS

Up to this point, we have focused on how a non-developer can set up the source and destination and proceed to test. I wanted to call out the author name work done in Power Query to highlight why Power BI is a great choice. In Power Query, splitting the author names was done using right-click operations. By contrast, here is an example of the SSIS expression code used to split out the second author name column:

(DT_STR,200,1252)TRIM((FINDSTRING(AuthorNames,",",1) == 0 ? NULL(DT_WSTR,200) : TRIM(SUBSTRING(AuthorNames,FINDSTRING(AuthorNames,",",1) + 1,FINDSTRING(AuthorNames,",",2) == 0 ? LEN(AuthorNames) : 1 + LEN(AuthorNames) - FINDSTRING(AuthorNames,",",2)))))

Compared to Power Query, this is complex and not intuitive. While Power Query is not intended for enterprise ETL use, its simplicity helps test complex scenarios such as our author name split without having to create an equally complex SQL statement or expression.

The next post will take a look at some of the visualization options for the test results.

Power Testing ETL with Power BI – The Process

This is a short blog series on using Power BI tools to support testing ETL processes. I have presented on this subject at a few SQL Saturdays over the past few years and am finally succumbing to multiple requests to turn it into a blog post. Realizing the amount of content is more than I would typically put into a single post, I will be putting together this short series to cover the material. The first post is this one. It will walk through the entire process at a high level. I will follow this post with a deeper look at Power Query’s role in the process. The third post will cover Power Pivot and building out test cases. Finally, we will wrap the series up with some visualization ideas for Excel and Power View. You can find all the posts as they come online here. Let’s get started.

The Problem Area

Why use Power BI to test ETL? While working as the architect on an ETL project for moving data from a third-party web service to an on-premises financial solution, we needed to put together a testing strategy that could be implemented by non-developers on the project. Our situation was that our project was “too small” to engage our QA team, but the requirement for reusable testing still needed to be fulfilled. Our project team consisted of a BI architect (that would be me), an ETL developer, and a business analyst (Chuck Whittemore).

NOTE: We are testing the data transformations and data load. This is not intended for auditing or performance. There are other tools for reviewing those, including the built-in reporting in SSIS and Pragmatic Works’ BI xPress tool. If you are tracking whether a package fails or succeeds, you should use either of those options, not this process.

The Big Idea

The BA and I were discussing options for testing, and we theorized that we could use a new add-in for Excel (Power Query, still in preview at the time) with Power Pivot to build out tests. The key to success on this project was that we needed to be able to test with non-developer tools; no SQL Server Management Studio or SSIS could be involved in the testing. The primary reason for this is that he would be doing the testing. We also did not want to recreate every step in the ETL process the same way. So, time to put theory into practice. We determined that we would create test cases in Visual Studio and then build out tests to match those cases in Excel using the Power BI add-ins. He would do the work in Excel, and we, the developer and I, would provide technical support as needed.

The Recommended Tools

Before we dig into the process, I want to lay out the tools used for development and for testing. While this solution can use other tools, it is important to know what we used in practice to create our solution.

ETL Development Tools

The ETL development was done using SQL Server Integration Services (SSIS). At the time, we needed to use Script tasks to consume the web service content. The financial system used a custom load process, so we dumped formatted data into a file for the system to pick up and load.

In the examples I use in the presentations and will lay out here, I will be using a text-file-to-SQL-Server implementation. While complex ETL problems are common and hard to test, this simplified version is easier to follow in examples. You should be able to apply the principles used here to test any solution.

Testing Tools

The testing development for the referenced project consisted of Excel with Power Query and Power Pivot. Power Query was in preview at the time, so we had some of the performance issues and early bugs to work through. None of these issues prevented us from completing the project.

The presentation solution relies on the latest version of Power Query (which changes every month) and Power Pivot in Excel 2013. Most of the examples are easy to follow, and you should be able to solve most transformation tests with the combination of Power Query and Power Pivot. Definitely do not discount the capabilities of Power Query and the fact that new functionality is being added each month.

Team Foundation Server/Visual Studio Online

Both projects use the online version of TFS. If you are currently not using a source control and work tracking solution, I highly recommend you look at the online version of TFS. It allows up to 5 users for free and gives you the ability to use source control, create test plans, create test cases, log bugs, and track changes. These are key features necessary to complete a good solution that can be managed and tracked.

The Process


I am going to walk through my demo to build out the process steps. This will allow you to see examples. I will call out anything of relevance related to the project here as well.

1. Business Rules

The first part of any project, especially in ETL, is to understand the business rules. If you are working with a data warehouse project, this may be fairly well documented in a dimensional model. In both of our cases here, we are moving data from one system to another. The transformations and business rules are primarily driven by the target system. Here are some examples of business rules in the media library sample project.

  • Author names are stored in separate columns – FirstName and LastName
  • If an author’s name includes a middle name, an initial, or some variation, this combination should be stored in the first name column. For example, J.R.R. Tolkien would be stored as follows:
    — FirstName: J.R.R.
    — LastName: Tolkien
  • Copyright year should be stored as a 4 digit value
  • Page numbers should not exceed 1000

Every project has some type of business rules. It is hard to build out transformations and create test cases without these rules.

2. Source to Target Map

This is the single most important document for the tester. It tells the tester how the developer is getting from source to destination and what type of data massaging needs to be handled. Typically, people use some variation of the example created by the Kimball Group over the years.

3. Developing SSIS

The developer begins the process of creating the SSIS package. He will be using the Source to Target Map as his guide and will update that document to handle special cases in the data as needed. Ideally he is working in a development environment that will allow for test build outs as well.

4. Creating Test Plans and Test Cases

The tester creates test plans and test cases in TFS. These tests are based on the business rules and the source to target map. Depending on both the complexity of the solution and the time to develop, some test cases could be as simple as “did the table move the correct data, field for field?” along with a row count check. This method can be particularly useful when working with large tables or simple data flows. However, you should have a test case for every transformation that massages the data. This will ensure that the data is being transformed as expected.


Keep in mind, this solution will support test cases for each field in a data load if required. The tester and architect should evaluate what is the appropriate amount of coverage to guarantee the highest level of quality in the data transform. As always, there is a diminishing rate of return if you “test everything” at the lowest level. It will be expensive in terms of cost of development when the chance for error is minimal. It will also take substantially longer to test everything. You need to understand and be able to articulate how the testing was accomplished and your level of confidence in the results.

5. Building the Tests

This is the most extensive part of the process besides the SSIS development. I will not go into all the details here, but will walk through the overall process and principles. I will provide detailed examples in the follow up posts as noted above.

Let’s start with the end result. Chuck and I were able to determine that we could use DAX to create comparative formulas on data that could be brought into Power Pivot from both the source and the destination. Essentially, we wanted to use math to determine the results of the tests. So in our example, we use a formula like “if Source.CopyrightYear = Destination.CopyrightYear, then it passes, else it fails.” Depending on how you want to measure, pass could be 0 or 1. Then we add the values up to determine if data passed or failed the test. We can even tell you failure rates.
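As a minimal sketch of that idea (using generic ‘Source’ and ‘Destination’ table names rather than the demo tables), the copyright year test would be a calculated column on the destination plus a measure that rolls it up:

// Calculated column on the destination table, named CopyrightYear Test: 1 when the years agree
=IF(RELATED('Source'[CopyrightYear]) = 'Destination'[CopyrightYear], 1, 0)

// Measure that rolls the column up so it can be compared to the row count
CopyrightYear Matches:=SUM('Destination'[CopyrightYear Test])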

In order to get the data into a comparative state, we needed each table in the destination to have a matching table from the source. However, it is very common that sources and destinations are not one-for-one table matches. This is where Power Query comes in. Using Power Query in our example set, we bring in the text file and massage or shape the data to look like the destination. Most importantly, we need to apply all business rules and transformations to the source. Once this is done, we do no massaging on the destination data. This allows us to compare what the ETL process did with what our tests say it should have done.

A key part of being able to compare is the ability to relate the two tables in Power Pivot. You need to be able to match natural keys or derived keys between the two sources. The relationship should be from the destination table to the source table. Without this relationship, you will not be able to build the calculations for the tests. Keep in mind the goal is to get our source to look like expected results. Any data in the destination should match the source in our scenario.


Once both tables are created and loaded into Power Pivot, we can complete the tests using DAX. In some cases, we create calculations on both tables to be compared. A classic example is row count. We count the number of rows in the source table and the destination table. Then we create a calculation on the destination to compare values. This meets the requirements of a row count test case (e.g. all data was successfully imported).
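A sketch of that row count test in DAX, again with placeholder table names:

Source Row Count:=COUNTROWS('Source')
Destination Row Count:=COUNTROWS('Destination')

// Created on the destination; 1 means all rows were imported, 0 means a discrepancy
Row Count Matches:=IF([Destination Row Count] = [Source Row Count], 1, 0)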

Another example of a test is to compare the content of a field from source to destination. This is where we use a lot of conditional logic to verify that the contents of a field in a row are the same in both tables. Calculated columns (not measures) are used to create the comparison results. The conditional statement should result in a number. This is important in order to create a measure that sums up the results to determine if errors exist or not. If you choose success to be 1, then you will check your results against the row count to determine if there are errors. If you choose failure to be 1, then a nonzero count means you have errors. There is no right or wrong way to handle this; choose based on your visualization techniques. Most of the time, using 1 for failures is fine. However, if you want to create KPIs, you will likely need success to be 1 so you have a good target to work with.

6. Testing the Initial Load

Once you have created the tests, you are ready to test the initial load. You will connect to both sources. Ideally, your source will not change so you can redo the test multiple times, but this will work regardless. Refresh the data, which may require rerunning the Power Query query. Once you have refreshed the data, you should be able to check the calculations in a simple pivot table to determine which tests have succeeded or failed. This is the beauty of this solution. After each subsequent execution of SSIS, you will be able to refresh your data and review your results to determine how successful the ETL is.


A side effect of this work is that the developer can review the test results in Excel and Power Pivot with you to more easily find the discrepancy in the data transform. In some cases, the tests themselves are in error. It is important that the developer and tester work together to determine the cause. A good team will be able to work through issues rather quickly.

7. Recording Bugs and Issues

You will need to go back to Visual Studio to change the pass/fail status for each test. If a test fails, you can log a bug for the developer and use that information to determine whether it was fixed prior to a subsequent run. It is likely that multiple sprints will be required to complete the work, so you can work with your team to determine the best ways to communicate what is ready. If you track the work in TFS, queries are available to help you see what work has been completed.

You can determine if the fix worked and then set the test results accordingly. This will help show progress on the project as well.

8. Visualizing the Results

You can visualize your results using KPIs, conditional formatting, and even Power View. If you have a project that needs to be easily evaluated, you can publish your results to SharePoint and use charts and graphs to show how accurate the process is so far.


We will dig into visualization options more in a following blog post.

Tracking Test History

No solution is perfect and that is true here as well. One of the most common questions is how do we see the historical results? This solution does not easily provide for that. I am looking at options, but for the moment the idea is that the history will be tracked through TFS. However, you could save the workbook after each iteration. This will give you some history, but you would want to make sure that you don’t refresh data on a historical workbook or the results would be overwritten.

Some final thoughts.

Power Query is not an ETL tool. Its target destination is always the same: Power Pivot. While its ease of use makes it appear to be a tool for ETL, it is not there yet. However, it is that same ease of use that gives it a place in this solution.

My plan is to have some deeper technical dives into parts of the solution in the future.

Oracle Tips for MSBI Devs #5: Working with Oracle on Windows Azure

As you have likely noticed in my series, Oracle Tips for MSBI Devs, I have done a lot of work with Oracle through the years while delivering BI solutions. One of the pain points of working with Oracle in development is setting up an Oracle development server. Even though I have installed Oracle a number of times, it never seems to be an easy process.

So, I decided to try out the Oracle Virtual Machine template in Windows Azure. I will walk through the setup process here. I need to use Oracle as a data source for some SSIS development.

Setting Up the VM

From the Windows Azure portal, select the Virtual Machines tab then “Create a Virtual Machine”. This will open up the variety of options available to create the VM. Select the FROM GALLERY option which will open another dialog.


On the next screen, you pick the edition of Oracle you want to use for the VM. (NOTE: at the moment, Oracle images are in preview. Microsoft recently announced that Oracle VMs will be available on March 12. You can find more information here.)


I will be using the Oracle 11g R2 Standard Edition on Windows Server 2008 R2. The next step is to name and pick the size of the VM. The minimum size for this is Small and is what I used. I then completed the setup including setting up the endpoints and creating a new user.

I had originally tried to use Oracle 12c, but significant changes have been made to support multitenancy, which makes setup considerably more tedious, with very few good examples available on the web. Most of the advice given by Oracle pros was to “Read the _____ Manual!” While “sensible”, I just needed a simple dev environment. This is one of the significant advantages of working with SQL Server: community help is abundant and usually pleasant. For instance, Microsoft recently published a document for setting up the Oracle 12c VM. I used it to work through some of the setup instructions below.

Once the initialization was complete, I used the Connect button on the Azure dashboard to open an RDP connection to the VM. One thing to keep in mind: be sure to keep track of the user name and password you created. This is your admin account, and you will need it to log in to the VM. Now you have a running VM. At this point, I went and found the Oracle tools that I typically use and pinned them to the task bar.

Creating and Loading an Oracle Schema

Because I always forget some of these steps, and I really don’t want to read the manual, I listed the steps I used to create a simple set of data for use. This is not good enough for production, but it is a starting point.

Step 1: Create a Listener. This is required before you can create a database. To do this open the Oracle Net Configuration Assistant. From here you can create your first listener. I left the default settings for the setup.

Step 2: Create the database. (This is the equivalent of an instance in SQL Server.) I used the Database Configuration Assistant for Oracle Database to create my first database on the server. This can be found in the Oracle home directory on the start menu. I chose the General Purpose template for my database. Most of the steps make some sense. I did choose to add the sample schemas as this is the easiest way to verify I can connect and work with the data. After all of this, the database will be created based on your choices.

Step 3: Using SQL*Plus, I connected to the SYSTEM schema. The user name in this case is “SYSTEM”. Now we are ready to create a user and schema of our own.

Step 4: Create a new user and schema. (This is similar to a SQL Server database, not a SQL Server schema.) This will give you a location to create tables for loading data in the next steps. In accordance with typical Oracle support, you can read about how to do this here: http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_8003.htm#i2065278. Or I can give you a good starting script and save you time.

CREATE USER someusername
IDENTIFIED BY <<add password here>>
DEFAULT TABLESPACE example
QUOTA 10M ON example
TEMPORARY TABLESPACE temp
QUOTA 5M ON system;

You’re welcome! This will create the user and its schema using existing tablespaces. This is NOT a production-level script, and it is barely good enough for development. But in my case, I am using Oracle as a source and don’t plan to do much development or work on it, so it meets my needs. If you need more insight, I guess you will need to read the documentation.

Step 5: Create a table and add rows. I continued to use the SYSTEM login and created a couple of simple tables within my new schema. I then used simple INSERT INTO statements to add data.
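For anyone following along, here is a hypothetical example of what that step could look like; the table and its columns are made up for illustration, and the schema name comes from the CREATE USER script above.

-- Create a small table in the new schema (run as SYSTEM, per the step above)
CREATE TABLE someusername.SalesRegion (
    RegionID    NUMBER(10)   NOT NULL PRIMARY KEY,
    RegionName  VARCHAR2(50) NOT NULL
);

-- Add a few rows with simple INSERT INTO statements
INSERT INTO someusername.SalesRegion (RegionID, RegionName) VALUES (1, 'North');
INSERT INTO someusername.SalesRegion (RegionID, RegionName) VALUES (2, 'South');
COMMIT;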

Now you have some basic data to work with to test connectivity with SSIS or SSAS.

Making the Oracle Database Accessible

In order to access your Oracle database on the VM you need to enable the port. You do this by going to the Azure portal and selecting the VMs tab. Once there, go to the Endpoints tab. You may recall that when you created the VM, you were asked about the Remote Desktop and PowerShell ports. Here are the steps to create the Endpoint to support Oracle.

  1. Click Add to open the Add Endpoint dialog.
  2. On the first page, leave the default, which will add a stand-alone endpoint.
  3. On the second page you need to add a name (I used “Oracle”), select the TCP protocol, and put port 1521 in both the private and public port textboxes.

Once completed you should see the new endpoint in the list of available endpoints as shown below.


Connecting SSIS to that Oracle Database

Now that we have data in the database, it is time to create the connection in SSIS so we can load data and run queries.

The first thing I needed to do was load the client tools. I used the newer Oracle Data Access Components (ODTwithODAC12012). Once that was loaded, I added the following entry to my TNSNames.ora file (look for it in a directory like the following: c:\app\<username>\product\12.1.0\client_1\Network\Admin):

ORACLEDW =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = <servername>.cloudapp.net)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = ORACLEDW)
    )
  )

The key parts to get right are the HOST, PORT, and SERVICE_NAME, as shown above.

Once TNS was in place, I was able to create an ODP.NET connection to the database and proceed to load the data.

I know that some of this has been simplistic but it is great that I don’t have to install Oracle myself. This functionality makes Azure even more appealing as a hosting solution.

O, There’s the Data: Using OData in SSIS


The Open Data Protocol (OData) is an open specification created by Microsoft to enable exposing data in a standard way from a variety of sources. OData is natively supported in many of Microsoft’s products including PowerPivot, Excel 2013, SQL Server 2012 Analysis Services Tabular Model, Windows Communication Foundation (WCF), and Entity Framework, to name a few. Furthermore, Microsoft uses OData to expose data feeds from the Windows Azure Data Marketplace as well.

I pursued adding an OData source to SSIS as a result of Mark Souza’s presentation at the Minnesota SQL Server User Group in April 2012. I posed a question about easier interaction with Oracle. He mentioned that OData would be a good way to solve that issue. This led me to put together a presentation, which I delivered for PASSMN in July 2012, entitled O, There’s My Data: The Open Data Protocol. At that presentation, I reviewed the “pain and agony” of a data pro putting together a data feed using Entity Framework in C# and WCF to expose it. For the most part, with the help of .NET pros at Magenic, including Dave Stienessen and Sergey Barskiy, I was able to create my first entity model and expose it using WCF. After that, I worked on how to consume the feed without purchasing a third-party tool. Here is the rest of the story.

Using ATOM as Shown in a Channel 9 Exercise

While looking for solutions that allowed me to implement an OData feed into an SSIS package, I came across a Hands on Lab on Channel 9.  While the focus was on Reporting Services, I was able to use the steps to create a package that would read a feed and make the data available to the ETL process.  In a nutshell, this exercise involved three tasks – creating an ATOM file, processing the ATOM file and loading the data using an HTTP connection manager pointed to the OData feed.  While you are creating this package, you should run each step after you have created it in order to use the files created in the following steps.


Task 1 – Create ATOM File (Script Task)

In the Main method, I used the following code, which was copied and adapted from the Channel 9 exercise. (NOTE: The code for this script has an error. The object declaration should be condensed to one line to work properly.)

public void Main()
{
    // Get the unmanaged connection
    object nativeObject = Dts.Connections["TestSvc"].AcquireConnection(null);

    // Create a new HTTP client connection
    HttpClientConnection connection = new HttpClientConnection(nativeObject);

    // Save the file from the connection manager to the local path specified
    string filename = "C:\\Source\\SSIS 2012 Projects\\ODataIntegration\\Departments.atom";
    connection.DownloadFile(filename, true);

    Dts.TaskResult = (int)ScriptResults.Success;
}

This task will create an ATOM file that will be used in the next step to retrieve the data.

Task 2 – Process ATOM File (XML Task)

This task will use the new ATOM file to create an XML file with the data. It uses the XSLT operation type pointing to the File Connection Manager created in the previous step as the source. This will result in another File Connection Manager to support the destination XML file with the data. Finally, in the exercise, a second operand set of XML is used to clear unsupported headers. Admittedly, I just copied this straight from the example and still am not sure of the details of what it does.

Here is a look at the XML Task Editor so you can see the settings I used.


Here is the code from the Channel 9 exercise used in the SecondOperand property:

<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no" />
  <xsl:template match="/|comment()|processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates />
    </xsl:copy>
  </xsl:template>
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()" />
    </xsl:element>
  </xsl:template>
  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="." />
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>

Task 3 – Load Data (Data Flow Task)

The final task is a straightforward data load using the XML Source Component pointed at the XML file I created. Then I created a matching table in a database, which I used as the destination.

Wrap Up on the ATOM Feed Option

This will work with SSIS 2008 and SSIS 2012. I tested most of the work in 2012, but the code in the illustration supports 2008. This option does require that the package write at least two files to the server to work correctly. In some cases, this will not work in enterprise environments, because the data will rest on the server for a period of time or the admins do not want files created on the server.

Using a Custom SSIS Source to Get the Data

NOTE: This is the preferred solution, but is not available in SSIS 2008 which uses the .NET 2.0 Framework.  This solution requires the .NET 3.5 Framework.

This version uses a custom SSIS source to connect to the OData feed and populate the data flow pipeline. I did not find this option illustrated anywhere and used help from Dave and Sergey to put this together. I spent many hours trying to solve this issue and, at the end of the day, it is fairly simple. So, hopefully, this will save you some time as well.

This package has only one task in the control flow: a Data Flow Task, which contains the rest of the code. In the data flow task, I have a Script Component implemented as a source and a Row Count with a data viewer on the pipeline to check results.


This was my first experience creating a custom source. I used a post from SSIS Talk – SSIS Using a Script Component as a Source as a reference. If you need help creating your first script source, check it out.

Be sure to set your outputs prior to creating the script or you will not have them available to map to in the code.  You also need to add the HTTP Connection Manager you are using to point to your OData feed.

Add References, Using Statements, and Declarations

Once you have the basics set up, you need to add some references including the targeted data service and System.Data.Services.Client.  These are the key references for the code we are implementing.


Once you have these references, you will need to add the following using statements to the Namespaces region.

using System.Data.Services.Client;
using SC_68e99fec2dce4cd794450383662f6ac7.TestSvc;

The SC_ reference is the internal name for your script component and will be different from mine, although it will likely be in the same format.

Next, you need to add the following declarations in the ScriptMain class as shown here.

public class ScriptMain : UserComponent
{
    private Uri svcUri = new Uri("http://localhost/ODataSQL/TestSvc1.svc");
    private AdventureWorksEntities context;

The AdventureWorksEntities is from the service reference I created. You will need to know the context name for the service reference you are using.

The Working Code: Using DataServiceQuery

In the CreateNewOutputRows method in the SSIS script you will add code that runs a DataServiceQuery which adds the data to the data flow pipeline. In my case, my Output was called Departments and created the buffer reference you see in the code.  It has the output fields I defined for my source.  Here is the code I used to implement the solution.

public override void CreateNewOutputRows()
{
    context = new AdventureWorksEntities(svcUri);
    DataServiceQuery<Department> dept = context.Departments;

    foreach (Department d in dept)
    {
        DepartmentsBuffer.AddRow();
        DepartmentsBuffer.DeptID = d.DepartmentID;
        DepartmentsBuffer.DeptName = d.Name;
        DepartmentsBuffer.GroupName = d.GroupName;
    }
}

This will query the service and return the rows. And that is all it really took to solve this problem. While this solution does not work in SSIS 2008, if you are planning to use a lot of OData, I would recommend using this as another reason to upgrade to SQL Server 2012.

SSIS Needs an OData Source Component

What I found interesting is that Microsoft does not have a native method to load OData feeds into the Data Flow Task in SSIS. I have since created a Connect item to see if we can get this added. Vote here if you agree.

Resources Used throughout the Process

Connecting to Windows Azure SQL Database Through WCF

Loading Data from an ATOM Data Feed into SQL Server

SSIS – Using a Script Component as a Source

DataServiceContext Class

Chris Woodruff – 31 Days of OData Blog Series

PASSMN Presentation – July 17, 2012

Consuming SharePoint Lists via OData and SSIS – Uses Linq