Day at the Office: SSIS, DiffGrams, XML, Dimension Loading using Dapper

Part of my project is to pull configuration data from a web service provided by my client and save it locally for our app to use.  The client wrote a very simple .ASMX, which returns a DataSet. 

imageMy first thought was to use SSIS to extract the data from the web service, and then do a dimension load into my local tables.   But I hit a problem:  When the SSIS control task saved the DataSet, it saved it as a diffgram

The result of this is that when using the SSIS Data Flow task for “pull me some data from an XML File”, while it does get the schema correct, it gets 0 rows, every time.

There is a workaround to use XSLT to transform this XML so that it works correctly:  Examples here and here;  but I discovered this a bit late in the game.  I ended up writing my own code to retrieve the data from the web service, and do a simple dimension load (insert, update, delete) against my local table.  The code that calls it looks like this:

image

And the generic part of the code is this:

image

This code is using Dapper to take care of the parameter-to-SQL bits.   It took about an hour or two to get the first one working just right.. and then another 1.5 hours for the remaining 6 dimension tables.

Its still not as fast as doing the work directly in SSIS, however, with SSIS I then have connection string management and other such fussy things to deal with, which could have taken just as long.

One nice fun thing I was able to do is take the saved XML from the web service (thank you SSIS), pull it local, embed it as a resource, and write a Stub version of the web service. 

image

image

image

Good Times.

Having Fun with SSIS

My new project involves getting data from 500+ remote sites into a target database; in parallel, quickly.  Thanks to some proof of concept work done by my coworker Mike “The Devastator” Hurd (I made up the nickname), I got to learn SSIS, and I’m cursing that I had not picked it up earlier. 

What Is SSIS?

  • Imagine you wanted to write a generic program to get data from point A to point B, but modify it along the way.
  • You wanted pluggable components that could talk to each other.
  • You wanted anything that could be done in parallel to be done in parallel.  Ie, read from two tables at the same time if you have bandwidth.
  • You want it to work buffered rather than row by row.  But dynamically figure out buffer sizes so that they fit well to the page size on the computer (to minimize swapping)
  • You want a GUI that you can configure these components with.
  • You want every component to have a configurable “if things fail” state, to redirect failed rows to other components, etc.
  • You want it to be solid – no leaking memory, etc
  • You want to parameterize the stuff getting into the components
  • You want it to be runnable from a command line, with ability to override parameters and connection strings.
  • You want the whole configuration to be saveable as an XML file, or hosted in a SQL Server
  • You want to be able to schedule it to run from SQLAGent (SQL Server’s “Cron” thing)
  • You want to be able to run it directly from C# via managed code.

Then you would write SSIS.

What Does It Look Like?

image

image

image

image

Initial Gotchas

SSIS requires some level of licensing, so this does NOT work:

  • Install SQLEXPRESS
  • install Visual Studio 2012 Premium
  • Install SSIS add-on
  • Install SSDT (“database tooling for VS2012”)

It successfully loads, and you can debug packages within Visual Studio, but you cannot run them outside via dtexec and dtexecui (the command line thingies for SSIS).  Instead, you need to add:

  • Install SQL Server Developer Edition (which includes SSIS)  (or standard edition)
    • Select Integration Services & Data Tools
    • Do NOT have to install the engine.
    • Might have to install an updated SSDT to get VS2012 support

Then everything runs fine.

The other thing I quickly learned is it is very Schema-dependent.  The reason for this is it allocates row counts into buffer sizes based on the size of the rows – so a Varchar(MAX) field is very different from a Varchar(250) field – and the schema will fail to validate if the underlying data source it is reading from does not match perfectly.

The solution I had to apply was to do a select cast(x as varchar(250)) as x to extract the data to guarantee the schema coming in to SSIS.  

Conclusion

Very glad to have this tool in my arsenal.   Thanks Mike!