SQL PASS Summit–Day 4 (BI Sessions)

Here is the summary of the session I attended while at my 4th day at the Summit. (Technically, this is the second day of the Summit.)  I went to four sessions all of which were in the BI tracts.

Top 10 Things You Didn’t Know about SSIS by Matt Masson, Microsoft (@mattmasson)

Be sure to check out the Tech Ed Video with CTP 3 enhancements.

Two New Concepts in SQL Server 2012 Integration Services:  Projects and Parameters.  SSIS projects now compile down to a single ispac file (“ice packs”).  This will allow you to deploy packages in a project.  Parameters have been added to better support configurations and they are exposed by the package execution UIs.

The SSIS Catalog.  A SQL Server User database called “SSISDB” is where the catalog is stored.  This provides a central place for configuration including new parameters and connection managers.  Security is also much improved as encryption is now managed at the server level.  Logs and reporting will also be recorded in the catalog.

Top 10 List:

1. Change Data Capture and 2. ODBC Support through Attunity partnership.

CDC for Oracle databases will result in better integration similar to SQL Server CDC.  CDC components in SSIS will work with  both SQL Server and Oracle.  Three new tasks have been added to SSIS – CDC Control Task , CDC Source (reads changes) , CDC Splitter (handles insert/update/delete).

ODBC components have also been added to support many of the non-Microsoft data sources including bulk load capabilities with the ODBC destination object.

3. Connection Manager

We now have connection managers which are shared at the project level.  Connection managers get marked as offline if the data source is not available or you want to bring it offline.  The capability to share in-memory cache across packages at the project level has been added.

4. Flat File Source Improvements

Varying number of columns and embedded qualifiers are now supported.

5. Package Format Changes

Updated the XML format using attributes.  Layout information can now be ignored.  Better support for merging.  Lineage IDs have been replaced with refid to support better merge. Still not recommeded, but better supported.

6. Visual Studio Designer Changes

Parameters are in a Visual Studio design tab which allow you to set up defaults for various environments.

7. Script Components

Uses compiled IL in .NET for better performance.  Script component debugging is in 2012 with the ability debug script components and see the values of the data in the pipe.  All script components support debugging.

8. Troubleshooting and Logging

Because all packages use the SSIS catalog, all error and performance logging including task execution timing and row counts are recorded in the SSISDB.  Much of this information will be available in  built in reports.

9. Data Taps

Allows data in a pipeline to be dumped to a flat file.  This allows for troubleshooting packages with errors in production.  You can create data taps in SSMS using T-SQL.

10.  Server Management using PowerShell

The whole SSIS catalog can be managed or accessed through PowerShell or TSQL.  This will allow you to automate or script out deployment and frequently used tasks.

You can do a lot of these tasks right in SSMS as well using menu and shortcut functionality.  You can query the SSISDB for performance counters while the package is executing.  You can query SSISDB for package execution progress at the task and component levels.

There is built in reporting.  You start from the Integration Services Dashboard with a bunch of stats on the front page.  It includes a number of built in reports with performance data and other relevant topics.  Jamie Thompson has a CodePlex report pack (http://sqlblog.com/blogs/jamie_thomson/archive/2011/09/04/ssis-reporting-pack-v0-2-now-available.aspx).  Check out post from Jamie on the SSIS blog about tables as well(http://sqlblog.com/blogs/jamie_thomson/archive/2011/07/16/ssis-logging-in-denali.aspx).

They bubble up errors related to connections.  There are drill down reports available to the error messages in the tool. Logs have a configurable retention value. Versions of projects are also maintained and the number of versions is configurable.

11.  Project Barcelona Metadata Discovery

This is a project that is coming soon.  Demos available online.  This is not a metadata repository.  This is zero touch.  Targets get crawled and added to an Index Server.  Check out the project blog at http://blogs.msdn.com/b/project_barcelona_team_blog/.

Tier-1 BI in the World of Big Data – Denny Lee and Thomas Keiser, SQLCAT

Two models to work with Dimensional (UDM) and Tabular (formerly  BISIM).  UDM is the current large scale engine. Tabular puts all of the data in the memory, so not very suitable large cubes.

New themes for workload scale:

  • Cheap storage at scale
  • Massive query scale (both size and concurrent)
  • Scaling ETL another order of magnitude
  • Scaled and integrated reporting/BI

NoSQL ecosystem (e.g. Facebook, Yahoo, Twitter, Google, Amazon data offerings) are being used frequently.  Not only are people using these systems, they are building solutions on top of them.  This is why Microsoft has announced the HADOOP integration.

MapReduce systems (e.g. HADOOOP)  and RDBMS systems are actually complimentary, not competing in most cases.  HADOOP is about dynamic schemas.  Getting the data somewhere is the important piece.  Hadoop / NoSQL: Move compute to the Data. BASE basically available, Soft State Eventually Consistent. Each row may have differnent schema.

Tradintional RDBMS move data to compute.  They answer the question quickly and are ACID compliant.

The balance between scale and performance is getting more important.  What do you want? Guaranteed Response, but get it slow.

Hive Connector is the first Step in integraion with the BI Platform.  Hive is a data warehouse framework for Hadoop.  This is the starting point, not the end game for adding structure to Big Data for consumption by other tools such as SSAS.  HiveODBC Driver will allow the cube to be loaded directly from Hadoop.

Yahoo – “Fast” Oracle Load – 24 TB cube. This is the largest known cube in the world.  Yahoo moves data from Hadoop into Oracle and processes the cube from there.

Check out the new SSAS 2008 R2 Performance Guide from SQLCAT at http://sqlcat.com/sqlcat/b/whitepapers/archive/2011/10/10/analysis-services-2008-r2-performance-guide.aspx.

What’s New in StreamInsight?  Isaac Kunen, Microsoft

Stream Data Processing – unbounded streams of data with queries lasting forever.  This engine handles the data coming through.  Temporal Semantics can be added to the StreamInsight data to make it more understandable.

Event processing varies greatly based on business function – latency versus aggregated data rate.  StreamInsight is designed to handle high volumes of low latency data. Common use cases – Fraud Detection, Shipping, Sales, Clickstream Analysis, Security, Building Management.  Common scenario is pattern analysis.

StreamInsight Releases:

  • StreamInsight 1.2 is available today on premise.
  • StreamInsight on Azure currently private CTP, should open up later this year.

Temporal Semantics

  • 3 event types: intervals (begin end), points (“one tick”), edge (only the beginning with an end to be defined later) events.
  • CTI = Current Time Increment.  Up to this point in time, all events are processed.  From one CTI to the next CTI is the period reported.  They can have to increment into the future, but do not need to be equal time periods.  After the CTI is issued, the data is gone from memory.
  • You write declarative queries against StreamInsight in LINQ.  It is the actual query language for SI, not just a wrapper.
  • Select and Project Operators return events from a stream.
  • Join Operator returns events that overlap streams.
  • Snapshot Window is any time the event stream changes, start a new window.  This window is completely data driven.

Resiliency through checkpointing has been added to StreamInsight.  This allows you to restart from specific points.  Performance Counters and Admin Logs have also been added to StreamInsight 1.2.  Check out his blog for more information: http://blogs.msdn.com/b/isaac/archive/2011/07/25/introducing-checkpointing-in-streaminsight-1-2.aspx.

Building the Perfect BI Semantic Model for Power View – Carl Rabeler (SQLCAT), Kasper de Jonge – Microsoft SSAS

You can import a PowerPivot workbook into SQL Server Data Tools as a Tabular SSAS project.  This will create a Tabular Model in SSAS.

Changes in SQL Server Data Tools with Tabular models change the deployed Tabular SSAS model.  You should use a small set of data when developing the model because the data needs to be loaded in memory. You can add data mining models to the tabular model by creating a connection to a SSAS Muultidimensional data source.  When you add a data source to the tabular model in design, it will add the data to the tabular model (the workspace database).  Columns can now be added into the model in any position.

There is an “effective user name” property in the user string to handle the doublehop issues without Kerberos.  Set execution context to this account sets this property in the Reporting Services Data Servivces (RSDS) connection in SharePoint.  This functionality is currently available in SQL Server 2008 R2. In SQL Server 2012, this can be seen in the profiler trace in the discoverer events.

To add images to the Tabular model, you can use an embedded image, stored image (e.g. in the database), or link.  If you use a link, be sure to set the ImageURL property to “true” for this column so Power View knows that it is an image.

The tabular model has a wealth of metadata options that will be carried forward to Power View.  For instance, if you set the Default Label and Default Image in the tabular model table, Power View will display these in many visualizations including Card View.

Be aware that in Power View, the refresh on the tool refreshes data wheras an IE refresh will reload the metadata.

Time intelligence functions in DAX will allow you to create functions as Year To Date.  However, if your tabular model uses surrogate keys to map to the Date table you need to specify the table as a Date table and specify the date column to use.  This is needed to make sure the Time Intelligence functions work correctly.

The RANKX(Table, Expression, [Value], [Order], [Time]) function was illustrated to show how to rank items within a table. Ranking:=RANKX(all(Products), Sum(Margin)).  Watch out for situations where you are using a complex key.

DirectQuery functionality will allow you to directly query SQL Server relational data.  There is a list of limitations to using DirectQuery.  Model with it carefully.