In the world of exploding piles of data, there are times you just need less not more. If you are working on a proof of concept, just want to test out some code, or want to pass a selection of data to a machine learning experiment, you don’t need several million records. Having that much data will just keep you from getting to the answer in a reasonable period of time. There are a number of standard ways people limit the data, the most common being some kind of a date filter. Using a date range though often times does not provide the variability needed. This is a particular problem with data used in a machine learning experiment it is designed to create an algorithm based on data pattern extrapolated over time. For example if you are doing any kind of regression analysis on a retail client and you either exclude or include the Christmas shopping season, the algorithm created will not be correct. The goal is to have less data for analysis pulled from the entire set of data. Fortunately SQL Server since 2005 has several methods for selecting random data
Until recently, I hadn’t used the Transact SQL TABLESAMPLE clause, but I ran into a situation where I needed to test some things and not wait all year for a result. TABLESAMPLE to the rescue. There are a couple of things where it won’t work. If you have derived tables, tables from a linked server (bad idea), or are writing a view and you want to return random data you can’t use TABLESAMPLE. If those conditions do not apply, you can use it.
The random sample that TABLESAMPLE provides is based on the number of data pages, not the number of records. If you want the number of rows to be specifically limited you will need to use Top(n) as well. I’ve written all my samples based upon AdventureWorksDW so you can run them for yourself later. I’ve listed the variety of ways to call TABLESAMPLE and shown the number of records returned.
SELECT * FROM [dbo].[FactInternetSales] –60398 Rows returned prior to Table Sample TABLESAMPLE(10 PERCENT)
(6073 row(s) affected)
Let’s say you want to return the same sample set multiple times. For that you will need some value. I picked 11, but you could pick any other you like.
SELECT * FROM [AdventureWorksDW2014].[dbo].[FactInternetSales] TABLESAMPLE(10 PERCENT) REPEATABLE (11)
(6489 row(s) affected)
When looking at the number of records returned, the values are not 10 percent exactly or particularly consistent in the number of rows returned. If you only want 6039 records returned, you can try the following code, but it doesn’t really do what it says.
SELECT * FROM [AdventureWorksDW2014].[dbo].[FactInternetSales] TABLESAMPLE(6039 ROWS)
(5640 row(s) affected)
This code will provide the exact number of records desired. Noticed I upped the number of rows returned in order to get 6039 rows. If the sample is 6039 you cannot guarantee that you have enough rows returned.
SELECT top 6039 * FROM [AdventureWorksDW2014].[dbo].[FactInternetSales] --60398 TABLESAMPLE(6200 ROWS)
(6039 row(s) affected)
Really Random Sample
TABLESAMPLE has some really useful functionality, but for machine learning I need a truly random sample, which TABLESAMPLE does not provide. For that I need NEWID(). The following sample returns approximately 1% (.01) of the 60398 rows.
Select * from [AdventureWorksDW2014].[dbo].[FactInternetSales] Where 0.01>= Cast(checksum(newid(), [ProductKey]) & 0x7fffffff as float) / Cast(0x7fffffff as int)
Just for fun I ran the same code 3 times and got a variety of rows returned.
(600 row(s) affected)
(607 row(s) affected)
(622 row(s) affected)
The ProductKey is added so that the NEWID() function will calculate a sample for each row. The WHERE statement calculates a random float between 0 and 1. This will truly give me the random sample I would need for a machine learning experiment.
Data aficionado et SQL Raconteur