5 Tips for Writing Better SQL

As data science jobs explode and more people with programming backgrounds start analyzing data, more people are writing SQL for the first time. SQL works a little differently than other languages, and there are some common patterns which should be avoided. Here are 5 things, in no particular order, you can do to improve your SQL.

1. Use the correct join for the job

There are a number of different joins you can use in SQL, but there are three common ones: LEFT JOIN, RIGHT JOIN, and INNER JOIN. When joining two tables with an inner join, you will get only the records which match; any records which cannot be joined will be excluded. For this reason, many people only write left or right joins, which automatically return all of the rows from the left-hand or right-hand table, respectively. These joins are slower than inner joins and may not be necessary. Before you settle on a left or right join, check whether the record count changes when you use an inner join instead.
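
Here is a quick way to run that check, sketched against AdventureWorksDW (the same sample database used later in this post); the join to DimCustomer is my illustrative assumption, so substitute your own tables. If the two counts match, every row survives the inner join and the outer join is unnecessary.

-- Row count with a LEFT JOIN: returns every fact row, matched or not
SELECT COUNT(*) AS LeftJoinCount
FROM [dbo].[FactInternetSales] f
LEFT JOIN [dbo].[DimCustomer] c ON f.CustomerKey = c.CustomerKey

-- Row count with an INNER JOIN: returns only the rows that match
SELECT COUNT(*) AS InnerJoinCount
FROM [dbo].[FactInternetSales] f
INNER JOIN [dbo].[DimCustomer] c ON f.CustomerKey = c.CustomerKey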

2. Do not Loop through rows of data

SQL is designed to process data in sets, not one record at a time; in fact, it does not perform well processing one row at a time. Avoid looping through records one at a time with a cursor or a WHILE loop. Instead, think about how you might write the code so that it processes all of the records at once.
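
To make that concrete, here is a minimal sketch using a hypothetical @Orders table variable. Both versions apply the same tax rate; the loop touches one row per pass, while the set-based UPDATE handles every row in a single statement.

DECLARE @Orders table (OrderID int, Amount money, Tax money)
INSERT INTO @Orders (OrderID, Amount) VALUES (1, 100.00), (2, 250.00), (3, 75.00)

-- Row-by-row (avoid): a WHILE loop that updates one record per iteration
DECLARE @ID int = 1
WHILE @ID <= 3
BEGIN
    UPDATE @Orders SET Tax = Amount * 0.08 WHERE OrderID = @ID
    SET @ID = @ID + 1
END

-- Set-based (preferred): one statement processes all of the records at once
UPDATE @Orders SET Tax = Amount * 0.08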

3. Avoid Nesting SQL code

SQL is not an object-oriented language. This means that performance and readability do not improve when views, stored procedures, or cursors are called from other objects. Try to do all of the work in one place. For more information on nested views, check out my post on the topic.

4. Do not use column names which include spaces or reserved words

While you can get away with using spaces or reserved words by putting the name of the column in square brackets, most people are not used to writing code this way, and it is inevitable that someone will leave off the square brackets and the code will not work. Have pity on those who will be supporting your code and do not include spaces. If you are unsure of what the reserved words are, see if they change color in the editor.
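
A quick illustration of the difference, using made-up table and column names:

-- A space and a reserved word force square brackets everywhere, forever
SELECT [Order Date], [Index]
FROM [Sales Data]

-- Names without spaces or reserved words need no brackets at all
SELECT OrderDate, RowIndex
FROM SalesData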

5. Use Sets in SQL and Avoid Temp Tables

Someone who attended a class I taught recently told me that she didn’t think it was possible to write a stored procedure without a temp table, because all of the stored procedures at her work had them. A common pattern I see is to process a batch of records into a temp table, then process them again. This pattern should be avoided. You can process records in one query. Look at writing more complex queries which handle all of the data in one pass.
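
As a sketch of the difference, with hypothetical table and column names: the first version stages rows in a temp table and processes them again, while the second produces the same result in one query using a common table expression.

-- Temp table pattern (avoid): stage the records, then process them again
SELECT CustomerID, SUM(Amount) AS Total
INTO #Totals
FROM Sales
GROUP BY CustomerID

SELECT CustomerID, Total
FROM #Totals
WHERE Total > 1000

-- Set-based pattern: the same work in a single query
;WITH Totals AS (
    SELECT CustomerID, SUM(Amount) AS Total
    FROM Sales
    GROUP BY CustomerID
)
SELECT CustomerID, Total
FROM Totals
WHERE Total > 1000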

Resolving Common SQL Pitfalls for People who Develop Lots of Code

Sometimes all it takes to figure out how to write SQL using sets is to have someone show you how. If you are in Phoenix, I can show you on February 5th, when I am giving a talk at Galvanize on this topic. I would be happy to show you some tips and tricks which will help you gather and analyze data for an experiment. Not in the area or can’t make it? Check out my class at DataCamp. You can try the class for free to see what you think. I look forward to hearing your feedback.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Views Calling Views – A Bad Idea

Recently I talked to someone about a view calling another view, a practice which is known as nested views. TL;DR: Do not write nested views. To give this person a little credit, he had heard it was a bad idea, but he had a really good reason. There is no such thing. There is also no reason to take my word for it. Here are a number of other very good posts which detail why nested views are bad from a performance perspective.

7 sins against TSQL Performance

Nested Views Causing Query Performance Woes

Poor Little Misunderstood Views

Proving Views Can Hurt Performance

T-SQL Anti-Patterns: Nested Views

I understand that on the internet, no matter how cockamamie your idea is, you can find someone or some post which will agree with you. That is why I rely on sources I know and trust rather than the random chatter of places like Stack Overflow.

Non-Performance Reasons Not to Use Nested Views

In addition to the performance reasons, there are other reasons not to use nested views. Supporting and maintaining nested views can be a nightmare. If there is an issue with the accuracy of the data, finding the problem is that much harder. And when someone needs to investigate or fix a problem, they have to test every place the view is called. I worked with some code recently where a view was created on a set of data just to create 3 new values based upon 3 CASE statements. That view was nested many levels below where the field was actually called. I know people from a development background look at nested views as modular development. Don’t. T-SQL is not the same as object-oriented code.

Good Use Cases for Creating Views

Views are not always a bad idea. If you are creating an Analysis Services or Power BI project, it is a good idea to create views of the tables you are going to be using, even if there is no difference between the view and the table. Why? That way, if the table is changed, your project will still work. Exposing a single view to users who are going to run self-service BI against it can also be a good idea. None of these, however, are reasons to nest views; in every case the view should reference tables directly.
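
A wrapper view for this purpose can be as simple as the sketch below against AdventureWorksDW (the column list is trimmed for illustration). The key point is that it selects from a table, never from another view:

-- A straight pass-through view; the project binds to the view, so a
-- later change to the underlying table does not break the project
CREATE VIEW dbo.vDimCustomer
AS
SELECT CustomerKey, FirstName, LastName
FROM dbo.DimCustomer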

If all you have is a Hammer Everything Looks like a Nail

It takes a while to learn how to write good SQL, and I have seen people rely on a view simply because writing good SQL takes longer. Relying on a view, though, is like having one tool in the toolbox: if you don’t have a screwdriver, you may be tempted to use a hammer instead. Too often nested views are written because sometimes writing SQL is hard, and it is easier to just write a query and save it as a view rather than use a CTE or a derived table. Do not succumb to the temptation. There is a lot of code written by someone who was in a hurry and was planning on going back and fixing it someday. Someday appears with about the same frequency as a winning lottery ticket. Make it easier on the person who has to figure out or maintain the code left in your wake. Don’t write nested views. Your replacement will thank you for it.
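
If the temptation is to save an intermediate query as a view just so something can select from it, a CTE (or a derived table) gives you the same building block inside the one query that needs it. A hypothetical sketch:

-- Instead of CREATE VIEW vRecentOrders AS ... and a query on top of it,
-- fold the intermediate result into the statement that uses it
WITH RecentOrders AS (
    SELECT CustomerID, OrderDate, Amount
    FROM Orders
    WHERE OrderDate >= '2015-01-01'
)
SELECT CustomerID, SUM(Amount) AS RecentTotal
FROM RecentOrders
GROUP BY CustomerID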

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Limiting the Number of Results with TABLESAMPLE

In the world of exploding piles of data, there are times you just need less, not more. If you are working on a proof of concept, just want to test out some code, or want to pass a selection of data to a machine learning experiment, you don’t need several million records; having that much data will just keep you from getting to the answer in a reasonable period of time. There are a number of standard ways people limit the data, the most common being some kind of date filter. A date range, though, often does not provide the variability needed. This is a particular problem with data used in a machine learning experiment, which is designed to create an algorithm based on data patterns extrapolated over time. For example, if you are doing any kind of regression analysis for a retail client and you either exclude or include the Christmas shopping season, the algorithm created will not be correct. The goal is to have less data for analysis, pulled from the entire set of data. Fortunately, SQL Server has had several methods for selecting random data since 2005.

Using TABLESAMPLE

Until recently, I hadn’t used the Transact-SQL TABLESAMPLE clause, but I ran into a situation where I needed to test some things and not wait all year for a result. TABLESAMPLE to the rescue. There are a couple of situations where it won’t work: if you have derived tables, tables from a linked server (bad idea), or are writing a view and you want to return random data, you can’t use TABLESAMPLE. If those conditions do not apply, you can use it.

The random sample that TABLESAMPLE provides is based on the number of data pages, not the number of records. If you want the number of rows to be specifically limited you will need to use Top(n) as well. I’ve written all my samples based upon AdventureWorksDW so you can run them for yourself later. I’ve listed the variety of ways to call TABLESAMPLE and shown the number of records returned.

SELECT *
FROM [dbo].[FactInternetSales] -- 60398 rows returned prior to TABLESAMPLE
TABLESAMPLE(10 PERCENT)

(6073 row(s) affected)

Let’s say you want to return the same sample set multiple times. For that you will need the REPEATABLE option with a seed value. I picked 11, but you could pick any other number you like.

SELECT *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]
TABLESAMPLE(10 PERCENT)
REPEATABLE (11)

 (6489 row(s) affected)

When looking at the number of records returned, the values are not exactly 10 percent, nor are they particularly consistent from one run to the next. If you only want 6039 records returned, you can try the following code, but it doesn’t really do what it says.

SELECT *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]  
TABLESAMPLE(6039 ROWS)

 (5640 row(s) affected)

This code will provide the exact number of records desired. Notice I upped the number of rows in the TABLESAMPLE in order to get 6039 back; if the sample is set to exactly 6039 rows, you cannot guarantee that enough rows are returned to satisfy the TOP.

SELECT TOP 6039 *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales] -- 60398 rows in the table
TABLESAMPLE(6200 ROWS)

(6039 row(s) affected)

Really Random Sample

TABLESAMPLE has some really useful functionality, but for machine learning I need a truly random sample, which TABLESAMPLE does not provide. For that I need NEWID(). The following sample returns approximately 1% (0.01) of the 60398 rows.

SELECT *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), [ProductKey]) & 0x7fffffff AS float) / CAST(0x7fffffff AS int)


Just for fun I ran the same code 3 times and got a variety of rows returned.

(600 row(s) affected)
(607 row(s) affected)
(622 row(s) affected)

The ProductKey is added so that the NEWID() function calculates a value for each row. The WHERE clause turns that value into a random float between 0 and 1 and keeps only the rows below the 0.01 threshold. This will truly give me the random sample I would need for a machine learning experiment.
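
If you need an exact number of truly random rows rather than an approximate percentage, a common alternative (my addition, not from the original post) is to order by NEWID() and take the top n. Be aware that this sorts the entire table, so save it for smaller sets:

SELECT TOP 600 *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]
ORDER BY NEWID()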

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Creating a Date Dimension with a CTE

I am looking forward to SQL Server 2016, which includes an update to DAX that will generate dates for you, but since it hasn’t been released yet, the need still exists to create a date table. Since I am creating a Tabular Analysis Services instance, a table identified as the date dimension is needed for time intelligence functions. There are a number of ways you can create a date dimension; here I am going to describe a neat way of creating one using a CTE-based tally table, which decreases the SQL code required to generate it.

Date Dimension Fields

There are a number of fields which are pretty standard for date tables. The fields can vary if you need to add things like Fiscal years or Month End dates which have business specific rules like last Friday of the month. The most common fields needed for a Date Dimension are listed here.

  • DateKey – Integer key representing the date, for example 20150524
  • CalendarDate – Date time field
  • CalendarYear – Four digit year
  • QuarterNumber – Number from 1 to 4
  • QuarterName – First Quarter, Second Quarter, 1 Qtr 2015, First Quarter 2015 are some of the ways the name is listed. Pick whichever format people would like to see.
  • MonthNumber – 1 for January, 2 for February, 3 for March etc.
  • NameMonth –   January, February, March etc.
  • MonthDayNumber – Day of the month. July 15 would have the value of 15 here
  • CalendarWeekofMonth – Counts the weeks within each month
  • NameDay – Monday, Tuesday etc.
  • DayNumber – Whether you choose 1 for Sunday or 1 for Monday is a business rule you need to find out first. This example shows 1 for Sunday, 2 for Monday etc.
  • YearDay – Sometimes referred to as the Julian day number, this counts the days from 1 to 365 (366 in a leap year) and can be useful in some calculations
  • YearWeek – Counts the weeks from 1 to 52 (53 in some years)

As a best practice, remember that reserved words such as DATE or INDEX, or any other word which shows up as blue or pink in SQL Server Management Studio, should never be used as column names; always select a name other than a reserved word when creating tables. Also, friendly names containing spaces are great for exposing to users, but they are annoying in SQL Server, so leave the spaces out of the column names. Evaluate what dates you are going to need in the table so that you don’t have to go back and redo it. I am creating a smaller table here, but that is just because it is an example. Look at the dates you will be storing in your data warehouse when determining your starting date, and set the end date to probably about five years longer than you think the data warehouse will still be in use.

Using a CTE to Generate a Date Dimension

CTEs, which Microsoft added in SQL Server 2005, are a great way to generate a date table by harnessing the power of the computer to spin through your code, decreasing the amount of code you need to write. I am using a CTE-based tally table method first published by Itzik Ben-Gan to generate the number list used in the code below.

CREATE TABLE [dbo].[DimDate](
[DateKey] int NULL,
[CalendarDate] [smalldatetime] NULL,
[CalendarYear] [int] NULL,
[QuarterNumber] [int] NULL,
[QuarterName] [varchar](14) NULL,
[MonthNumber] [int] NULL,
[NameMonth] [nvarchar](30) NULL,
[MonthDayNumber] [int] NULL,
[CalendarWeekOfMonth] [int] NULL,
[NameDay] [nvarchar](30) NULL,
[DayNumber] [int] NULL,
[YearDay] [int] NULL,
[YearWeek] [int] NULL
) ON [PRIMARY]

GO

/*Make sure you change the start and end dates listed here to the dates you wish to use in your table*/
DECLARE @StartDate smalldatetime = '01/01/2014'
DECLARE @EndDate smalldatetime = '12/31/2016' ; /* don't forget the semi-colon or you will get an error*/

/*This CTE is used to create a list of numbers used to generate the calendar*/
WITH
A00(N) AS (SELECT 1 UNION ALL SELECT 1),
A02(N) AS (SELECT 1 FROM A00 a, A00 b),
A04(N) AS (SELECT 1 FROM A02 a, A02 b),
A08(N) AS (SELECT 1 FROM A04 a, A04 b),
A16(N) AS (SELECT 1 FROM A08 a, A08 b),
A32(N) AS (SELECT 1 FROM A16 a, A16 b),
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM A32),
/*Calendar dates are created here*/
CalendarBase as (
SELECT
DateKey = n
, CalendarDate = DATEADD(day, n - 1, @StartDate )
FROM cteTally
WHERE N <= DATEDIFF(day, @StartDate , @EndDate +1)
)

/*Using the list of dates created above will populate your date table here*/
insert into dbo.DimDate(DateKey
,CalendarDate
,CalendarYear
,QuarterNumber
,QuarterName
,MonthNumber
,NameMonth
,MonthDayNumber
,CalendarWeekOfMonth
,NameDay
,DayNumber
,YearDay
,YearWeek)

SELECT
DateKey       = CONVERT(char(8), CalendarDate, 112)
, CalendarDate
, CalendarYear  = YEAR(CalendarDate)
, QuarterNumber =  (DATEPART(QUARTER,CalendarDate) )
, QuarterName = 'Quarter ' + cast((DATEPART(QUARTER,CalendarDate) ) as char(1)) +' ' + cast(YEAR(CalendarDate) as char(4))
, MonthNumber = MONTH(CalendarDate)
, NameMonth     = DATENAME(Month, CalendarDate)
, MonthDayNumber = DATEPART(DAY, CalendarDate)
, CalendarWeekOfMonth = DATEDIFF(week, DATEADD(day,1, CalendarDate - DAY(CalendarDate) + 1) -1, CalendarDate) +1
, NameDay       = DATENAME (Weekday,CalendarDate )
, DayNumber       = DATEPART(Dw, CalendarDate)
, YearDay       = DATEPART(DAYOFYEAR, CalendarDate)
, YearWeek      = DATEPART(WEEK, CalendarDate)
FROM CalendarBase

After running this code you will have a date table created and loaded.
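
A quick spot check (my suggestion, not part of the original script) confirms the load looks right:

SELECT TOP 5 *
FROM dbo.DimDate
ORDER BY DateKey
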
Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

When IsDate Does Not Work

IsDate is an often-used function in T-SQL to determine whether or not a field contains a date. I am sure there are a number of people who have used it successfully and may wonder what I could possibly be talking about. To maintain my technical bona fides, I’ve included some code which you can run inside of SQL Server Management Studio.

DECLARE @ValidDateTest as table (Somefield varchar(15), Works bit)
INSERT INTO @ValidDateTest VALUES ('Not a date', 0)
INSERT INTO @ValidDateTest VALUES ('4-1-2-14', 0)
INSERT INTO @ValidDateTest VALUES ('5-2-7', 0)
INSERT INTO @ValidDateTest VALUES ('2014.2.3', 1)
INSERT INTO @ValidDateTest VALUES ('08/02/10', 0)
INSERT INTO @ValidDateTest VALUES ('7/3/2015', 1)
INSERT INTO @ValidDateTest VALUES ('2014-3-14', 1)
INSERT INTO @ValidDateTest VALUES ('12-3-1', 0)
INSERT INTO @ValidDateTest VALUES ('14-3-4', 0)
INSERT INTO @ValidDateTest VALUES ('20140301', 1)
INSERT INTO @ValidDateTest VALUES ('201123', 1)
INSERT INTO @ValidDateTest VALUES ('2011204', 0)
INSERT INTO @ValidDateTest VALUES ('7/023/2015', 0)
INSERT INTO @ValidDateTest VALUES ('6/02/014', 0)
INSERT INTO @ValidDateTest VALUES ('003/02/014', 0)
INSERT INTO @ValidDateTest VALUES ('3/010/2014', 0)
INSERT INTO @ValidDateTest VALUES ('4/02/012', 0)

SELECT CASE WHEN ISDATE(Somefield) = 0 THEN --False, not a date
        '1/1/1900'
    ELSE CONVERT(datetime, Somefield, 110)
    END AS ConvertedDate, Somefield
FROM @ValidDateTest
WHERE Works = 1

It is obvious by looking at my example that some of the values inserted into the table variable @ValidDateTest are not dates. The code below the insert statements converts only the rows where I have set Works to 1, since I included a WHERE condition in the query; those are the values that will convert without giving me an error. If you run the following query, you will see the difference between the values returned by IsDate and the values I have marked as converting without error.

SELECT Somefield, ISDATE(Somefield) AS IsDateValue, Works
FROM @ValidDateTest

Here are the results

[Screenshot of the query results, comparing IsDateValue to the Works column]

Now if you run the query listed above and take off the WHERE condition

SELECT CASE WHEN ISDATE(Somefield) = 0 THEN --False, not a date
        '1/1/1900'
    ELSE CONVERT(datetime, Somefield, 110)
    END AS ConvertedDate, Somefield
FROM @ValidDateTest

You’ll get this error
Msg 241, Level 16, State 1, Line 21
Conversion failed when converting date and/or time from character string.

Casting Around for a Bad IsDate Solution

If you are fishing around for a solution, the simplicity of this one will probably make you smack your desk. Ideally, the source system should be fixed, as it really should be making sure that the dates are valid, but I can’t make that happen. I do have a solution to resolve this problem, though: just Cast it. Cast and Convert are on the same page in Microsoft online help, and a lot of the time you might think they are synonymous, but they don’t work the same way.

SELECT CASE WHEN ISDATE(Somefield) = 0 THEN --False, not a date
        '1/1/1900'
    ELSE CAST(Somefield AS datetime)
    END AS ConvertedDate, Somefield
FROM @ValidDateTest

This works with no errors whatsoever. Simple. Easy. If you ever run into this, I hope this saves you some time.
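
One related note, which is my addition rather than part of the original post: on SQL Server 2012 and later, TRY_CONVERT offers another way around the problem, returning NULL instead of raising an error when a value will not convert.

SELECT Somefield,
       TRY_CONVERT(datetime, Somefield) AS ConvertedDate -- NULL when the value is not a valid date
FROM @ValidDateTest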

Yours Always,
Ginger Grant
Data aficionado et SQL Raconteur

SSIS Tuning – What Size Is My Row? Your DIY Buffer Sizing Guide

When looking to improve the performance of an SSIS package, one of the common recommendations listed in many places is to modify your buffers so that you can load as much data into a single buffer as possible. According to Microsoft, for optimal performance, SSIS should be configured for maximum memory utilization by having buffers with as many rows as possible without exceeding the internal 100 MB limit. OK, now that you know what Microsoft says about the topic, how do you put it into practice? I thought it might be nice to post the how-to, as it is a little complicated.

Buffer Tuning Parameters

There are a couple of default settings that are used for loading data into memory and they all need to line up correctly to keep data from being sent from memory to disk. In order to set the size appropriately, it is best to apply the Price Is Right Axiom.

DefaultMaxBufferSize – the default is 10 MB. The maximum size is 100 MB, which SSIS stores as MaxBufferSize. This value can never be any bigger than 100 MB, so size it appropriately for its environment.

DefaultMaxBufferRows – The default is 10,000 rows. This setting needs to be set the same way that you win on The Price Is Right: DefaultMaxBufferRows times the row size needs to be as close to DefaultMaxBufferSize as possible without going over.

Size Matters

To figure out what to put in the DefaultMaxBufferRows, one needs to know how big the rows are. Remember, to win, the value must be as close to the size of DefaultMaxBufferSize as possible without going over. If the buffer is sized too large, the package will be slower, as rows will be cached to disk, so make sure you don’t miss by even a little. How big is the row size? There are a number of ways of figuring this out. You can look at each field in the query SSIS is loading and, based on the data type, add up the size of all the fields. Fortunately, that isn’t the only way to figure out the row size. If your data source is SQL Server, you are in luck, as the system tables can help determine the size. Here is a sample query, assuming your table name is Person.Address, which you can run on the AdventureWorks database.

Exec sp_SpaceUsed 'Person.Address'

The Results are

name      rows     reserved    data       index_size    unused
Address   19614    5960 KB     2784 KB    2688 KB       488 KB

To figure out the row size from these results, divide the data size in bytes by the number of rows: 2784 KB * 1024 = 2,850,816 bytes, and 2,850,816 / 19614 = 145.35, or roughly 146 bytes per row. If you set DefaultMaxBufferSize to 100 MB, which is the maximum and what I recommend in most cases, the buffer size is 104,857,600 bytes. Buffer size / row size = DefaultMaxBufferRows: 104,857,600 / 146 = 718,202.7, so set DefaultMaxBufferRows to 718202, rounding down so you do not go over. If you would rather work from the columns, you can get the same information from sys.columns, as shown in the sketch below; by adding up the column lengths, it is relatively easy to figure out what the appropriate size of your buffer should be. One word of caution: I do not wish to imply that because the information is available on a per-table basis, one should pick Table or View in the SSIS source. Au contraire. Always access the data by using a SELECT statement, as it performs better.
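
Here is a minimal sketch of that sys.columns approach. Keep in mind that max_length is the declared maximum length in bytes (and -1 for (max) types), so the total is an upper bound rather than the average row size sp_spaceused implies:

-- Sum the declared column lengths, in bytes, for one table
SELECT SUM(c.max_length) AS MaxRowBytes
FROM sys.columns c
WHERE c.object_id = OBJECT_ID('Person.Address')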

Validate

To ensure that you have improved the buffer performance, check it. After you are done with the settings, enable logging on the Data Flow task and select the BufferSizeTuning event to see how many rows are contained in each buffer.

Please feel free to drop me a line if you find this helpful.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur