Articles

Tips on SSIS at SQL Saturday Albuquerque

On February 7, I was fortunate enough to be selected to speak at SQL Saturday in Albuquerque, New Mexico on Top 10 SSIS Tuning Tricks. Having worked with SSIS for a number of years, I’ve needed to research the best methods to employ to ensure my SSIS ETL was running optimally. I’ve compiled the most valuable items, with examples of course, into this presentation. I’m assuming that everyone attending has already been using SSIS for a while, so I will skip straight to more in-depth ways of tuning SSIS. One of the questions I have heard most often is “When should I do X in SQL or SSIS?” If you are able to attend this session, you will have the answer to that question.

I really enjoy the opportunity to speak on data-related topics and to meet people who may have come upon my blog in the past. Having spoken at this event last year, I know what a good job Keith, Chris, Meredith, and friends do organizing it. I want to take the time to say thank you for all of your hard work, as I really appreciate it. These events are a great place to learn and keep up with the many changes going on in the industry, and I anticipate there will be lively discussions both before and after the event. That reminds me: if you get a chance, there are two great precons scheduled on Friday, February 6th, Powershell Basics with Mike Fal and Query Tuning, Troubleshooting and Execution Plans with Jason Kassay. Having been fortunate enough to meet both of them, I know they are extremely knowledgeable in their respective topics, and if you are in Albuquerque I encourage you to sign up for either one, as I am sure both will be excellent.

I hope that you will be able to attend as I know I will enjoy seeing you there.

 

Yours Always
Ginger Grant
Data aficionado et SQL Raconteur

Upcoming and Up and Coming Topics

It’s funny the different meanings words take on when you put them in a different order, a point anyone who has imitated the dialect of Yoda can tell you. I find words fascinating, as they are not static but have meanings which change over time. For example, the Iron Maiden meant something totally different before there were electric guitars. Thinking of words and things changing, as one year closes and another begins, I start to evaluate past and future topics. Earlier this year, I held an informal poll on Twitter to find out how long people tend to talk on the same topic. The answers were quite varied. Some people keep talking about the same topic as long as there seems to be interest in hearing about it; that way you can get to be a really good speaker on that topic. Another speaker feels obligated to create a new topic each time out to give himself a challenge. The answer I personally related to was to keep talking about a topic until you are tired of hearing about it, which takes about a year.

SQL Saturday Albuquerque

My first speaking engagement for 2015 will be at SQL Saturday Albuquerque, where I will be talking about SSIS. I generally talk about things I am interested in or presently working on, and having worked on a lot of ETL recently, I thought it would be an interesting topic which most people would find helpful. As a consultant, I see a lot of code and wonder why parts of it were written that way. One big reason is that someone thought the design was a good one. Since that is a subjective decision, I thought it might be helpful to back up design decisions with facts, so that people can employ good logic when making theirs.

Technology Changes and Their Impact on Data Development

Another topic which really interests me is the change that new technologies are bringing to the database world. With the increased implementation of Hadoop and the cloud, things are really changing in the way data is both stored and used. Predictive analytics, machine learning, cloud implementations, and interactive data visualizations are changing what people expect from their data. Expectations for data professionals are increasing as the business looks away from the HiPPO (the highest paid person’s opinion) and toward the knowledge that can be gathered from their own data or integrated from public sources.

Modern Data Warehouse

I have the pleasure of assisting in a day-long session on Architecting the Modern Data Warehouse. During this one-day session we will show how to use new technology, such as HDInsight and Machine Learning, to implement a modern data warehouse. Instead of just talking about new technologies, we will put them to use to show how they can be used today. I’m really looking forward to it.

If you are able to attend any of these or any upcoming sessions, please stop by and introduce yourself as I would love to meet readers of my blog in person.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Balanced Data Distributor and Other Features in the Pack

Sometimes the basics aren’t enough.  Car manufacturers know this; that’s why they have a base model and one with all the options.  Let’s face it, most of the time you want all the options.  Well, with SQL Server you can get the options, and you don’t have to spend hours debating the cost of clear coat. But I digress. If you want all the options, you can have them right after you install SQL Server, and I am not talking about the cool CodePlex stuff, which I will have to bring up another time. The genuine Microsoft list of add-ons is available to you at no additional cost. When Microsoft releases SQL Server, just because you installed it doesn’t mean you have everything.  Versions 2008 R2, 2012, 2012 SP1, and 2014 all have Feature Packs. Looking at the SQL Server 2014 Feature Pack list, there are a number of things which are about as useful as heated seats in the desert, like the DB2 or SAP drivers, which is probably why they are not automatically included as part of the release. Still, there is a very good chance many SSIS developers will be interested in some of them.  For example, wouldn’t you want the option for the more powerful and fuel-efficient motor?  I know I would. There is an item in the feature pack which provides exactly that: the Balanced Data Distributor.

Threading the Memory since 2011

The Balanced Data Distributor (or BDD for short) was first released in 2011 as a new SSIS transform for 2008 and 2008 R2. It was designed to take advantage of multi-threaded hardware by spreading the data load, so that the data can be broken into chunks and processed at the same time rather than serially, starting at the beginning and loading a stream until it ends.  This means you can go much faster on a single tank of gas, which is awesome. As you might imagine, processing more data at one time decreases the time needed to process it.  This is really useful if you are using blocking transforms, like a Sort or Aggregate transform, which require SSIS to look at every single row before processing, or a row-by-row script component.  Another situation where you might find it useful is if you have to write to a really slow output.

DIY of BDD – Divide and Conquer

What the BDD does is split the incoming rows across multiple output paths so they can be processed on multiple threads. How many paths is determined by the developer: while the component itself needs no configuration, you do need to copy the downstream transforms once for each output you want it to feed. Let me show you what I mean through a series of SSDT screen shots.

[Screenshot: the data flow before adding the BDD]

This screen shows that I have a Data Flow task where I am reading in 121,317 rows and then doing an Aggregate and a Sort, both of which are blocking transforms. This is just meant to be an example; with 121,317 records you probably won’t see that much of a performance improvement, but you get the idea.

In this screen shot you can see that I have added the BDD task, which I’ve highlighted in the SSIS toolbox so you can see where it shows up once the component is installed. You will also see that I copied the downstream transforms so that the same tasks, from the Aggregate on, appear twice. What happens when this version is run?

[Screenshot: the data flow after adding the BDD, with row counts on each output]

Check out the outputs underneath the BDD component. The number of records was split, but it isn’t an even split: there are 62,337 records on one side and 58,980 records on the other (62,337 + 58,980 = 121,317). The record counts in each output are determined by the component as the optimal number based on the available threads.  You configure nothing, just drag it onto the screen. Pretty cool, isn’t it?

This example shows how easy it is to speed up SSIS processing without a huge amount of effort, allowing you to drive laps of code at speeds you may not have thought possible, with the help of a free download. Have fun and let me know what kind of performance gains you see.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

When IsDate Does Not Work

IsDate is an often-used function in T-SQL to determine whether or not a field contains a date. I am sure there are a number of people who have used it successfully and may wonder what I could possibly be talking about. To maintain my technical bona fides, I’ve included some code which you can run inside SQL Server Management Studio.

 Declare @ValidDateTest as table (Somefield varchar(15), Works bit)

 Insert into @ValidDateTest values ('Not a date', 0)
 Insert into @ValidDateTest values ('4-1-2-14', 0)
 Insert into @ValidDateTest values ('5-2-7', 0)
 Insert into @ValidDateTest values ('2014.2.3', 1)
 Insert into @ValidDateTest values ('08/02/10', 0)
 Insert into @ValidDateTest values ('7/3/2015', 1)
 Insert into @ValidDateTest values ('2014-3-14', 1)
 Insert into @ValidDateTest values ('12-3-1', 0)
 Insert into @ValidDateTest values ('14-3-4', 0)
 Insert into @ValidDateTest values ('20140301', 1)
 Insert into @ValidDateTest values ('201123', 1)
 Insert into @ValidDateTest values ('2011204', 0)
 Insert into @ValidDateTest values ('7/023/2015', 0)
 Insert into @ValidDateTest values ('6/02/014', 0)
 Insert into @ValidDateTest values ('003/02/014', 0)
 Insert into @ValidDateTest values ('3/010/2014', 0)
 Insert into @ValidDateTest values ('4/02/012', 0)

 Select case when IsDate(Somefield) = 0 then '1/1/1900' -- 0 = false, not a date
             else Convert(datetime, Somefield, 110)
        end as ConvertedDate, Somefield
 from @ValidDateTest
 where Works = 1

It is obvious by looking at my example that some of the values inserted into the table variable @ValidDateTest are not dates. Because of the where condition, the query below the insert statements will Convert only the values I have flagged with a Works value of 1, which are the ones that Convert without raising an error. If you run the query below, you will see the difference between the values IsDate says are dates and the values I have marked as working without returning an error.

 Select Somefield, IsDate(Somefield) as IsDateValue, Works
 from @ValidDateTest

Here are the results

[Screenshot: query results showing Somefield, IsDateValue, and Works]

Now, if you run the query listed above and take off the where condition:

 Select case when IsDate(Somefield) = 0 then '1/1/1900' -- 0 = false, not a date
             else Convert(datetime, Somefield, 110)
        end as ConvertedDate, Somefield
 from @ValidDateTest

You’ll get this error
Msg 241, Level 16, State 1, Line 21
Conversion failed when converting date and/or time from character string.

Casting Around for a Bad IsDate Solution

If you are fishing around for a solution, the simplicity of this one will probably make you smack your desk. Ideally, the source system should be fixed, as it really should be making sure that the dates are valid, but I can’t make that happen. I do have a solution to resolve this problem, though: just Cast it. Cast and Convert are on the same page in Microsoft’s online help, and a lot of the time you might think they are synonymous, but they don’t work the same way.

 Select case when IsDate(Somefield) = 0 then '1/1/1900' -- 0 = false, not a date
             else Cast(Somefield as datetime)
        end as ConvertedDate, Somefield
 from @ValidDateTest

This works with no errors whatsoever. Simple. Easy. If you ever run into this, I hope this saves you some time.
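One more option worth a mention: if you happen to be on SQL Server 2012 or later, Try_Convert returns NULL instead of raising an error when a value will not convert, so you can skip the IsDate check entirely. Here is a quick sketch against the same table variable.

 Select Somefield,
        IsNull(Try_Convert(datetime, Somefield), '1/1/1900') as ConvertedDate
 from @ValidDateTest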

Yours Always,
Ginger Grant
Data aficionado et SQL Raconteur

SSIS Tuning – What Size Is My Row? Your DIY Buffer Sizing Guide

When looking to improve the performance of an SSIS package, one of the common recommendations listed in many places is to modify your buffers so that you can fit as many rows into a single buffer as possible. According to Microsoft, for optimal performance, SSIS should be configured for maximum memory utilization by having buffers with as many rows as possible without exceeding the internal 100 MB limit. OK, now that you know what Microsoft says about the topic, how does one put it into practice? I thought it might be nice to post the how-to, as it is a little complicated.

Buffer Tuning Parameters

There are a couple of default settings used for loading data into memory, and they all need to line up correctly to keep data from being spilled from memory to disk. In order to set the sizes appropriately, it is best to apply the Price Is Right axiom: get as close as you can without going over.

DefaultMaxBufferSize – The default is 10 MB. The maximum size is 100 MB, which SSIS stores as MaxBufferSize. This value can never be any bigger than 100 MB, so size it appropriately for its environment.

DefaultMaxBufferRows – The default is 10,000 rows. This setting needs to be set the same way you win on The Price Is Right: DefaultMaxBufferRows times the row size needs to be as close to DefaultMaxBufferSize as possible without going over.
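To make the arithmetic concrete, here is a minimal sketch with a made-up row size of 200 bytes against the 10 MB default buffer; the numbers are illustrative only.

 Declare @DefaultMaxBufferSize int = 10485760 -- the 10 MB default, in bytes
 Declare @RowBytes int = 200                  -- a made-up row size, for illustration
 -- Integer division rounds down, which is the right direction: never go over.
 Select @DefaultMaxBufferSize / @RowBytes as DefaultMaxBufferRows -- returns 52428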

Size Matters

To figure out what to put in DefaultMaxBufferRows, one needs to know how big the rows are. Remember, to win, the value must be as close to DefaultMaxBufferSize as possible without going over. If the buffer is sized too large, the package will be slower, as rows will be spilled to disk, so make sure you don’t go over by even a little. How big is the row size? There are a number of ways of figuring this out. You can look at each field in the query SSIS is loading and, based on the data type, add up the sizes of all the fields. Fortunately, that isn’t the only way to figure out the row size. If your data source is SQL Server, you are in luck, as the system tables can help determine the size. Here is a sample query, assuming your table is Person.Address, which you can run on the AdventureWorks database.

Exec sp_SpaceUsed 'Person.Address'

The results are:

name       rows     reserved    data       index_size   unused
Address    19614    5960 KB     2784 KB    2688 KB      488 KB

To figure out the row size for this table, take the data value from sp_SpaceUsed, convert it to bytes, and divide by the number of rows: 2784 KB * 1024 / 19614 rows = 145.3, call it 146 bytes per row.  If you set DefaultMaxBufferSize to 100 MB, which is the maximum and what I recommend in most cases, the buffer size is 104,857,600 bytes.  Buffer size / row size = DefaultMaxBufferRows: 104,857,600 / 146 = 718,202.7, so set DefaultMaxBufferRows to 718,202, rounding down so you don’t go over.  If you would rather work from the columns, you can get the same information from the system catalog; by adding up the column lengths, it is relatively easy to figure out the appropriate buffer size, and there is a sketch of that approach below. One word of caution. I do not wish to imply that because the information is available on a per-table basis one should pick Table or View in the SSIS source. Au contraire. Always access the data by using a Select statement, as it performs better.
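As promised, here is a minimal sketch of the column-length approach, using the modern sys.columns view in place of the old syscolumns table. Keep in mind that max_length is in bytes, that it is -1 for the (max) types, and that the total ignores per-row overhead, so treat the result as an estimate.

 -- Estimate the row size of Person.Address by summing the column lengths.
 -- max_length is in bytes; a -1 (from a (max) column) would skew the sum.
 Select Sum(c.max_length) as EstimatedRowBytes
 from sys.columns c
 where c.object_id = Object_Id('Person.Address')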

Validate

To ensure that you have actually improved performance, check it. After you are done with the settings, enable logging on the data flow task and select the BufferSizeTuning event to see how many rows are contained in each buffer.

Please feel free to drop me a line if you find this helpful.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur