T-SQL Tuesday #79 – Creating R Code to run on SQL Server 2016


As SQL Server 2016 was recently released, many people have not yet used R with SQL Server. I thought that T-SQL Tuesday would be a great way to introduce this topic.  This post contains everything you need to run your first R program from the UI, get data from SQL Server, and run the R code on SQL Server from the UI. If you are running open source R, this code will not work. If you are using Microsoft R Open, this code will not work. Only the version of R which Microsoft released with SQL Server 2016 will run this code.

The Two Versions of Microsoft R

Microsoft has not one version of R but two. These two versions exist because they serve two different purposes. Microsoft R Open is open source, fully R compatible, and faster than standard open source R because a number of the algorithms were rewritten to use multi-threaded math libraries. If you want to run R code on SQL Server, this is not the version you want to use. You want the non-open source version designed to run on R Server, which is included with SQL Server 2016: Microsoft RRE Open. This version can run R code not only in memory but also swap to disk, can access SQL Server data without needing to create a file, and can run code on the server from the client. The version of RRE Open included in SQL Server 2016 is 8.0.3.

Running R on SQL Server

As a handy mnemonic device, all the RRE functions start with Rx, like prescription drugs. None of these features will work in R unless you are using the Microsoft RRE Open version. For more information on how to set up Visual Studio 2015 to use the correct libraries, please read my previous post.

SQL Server R Code Walk-through

This code was created on a PC with SQL Server 2016 Developer Edition installed with the R tools, and the Community Edition of Visual Studio 2015. On my SQL Server instance, I have created a database called TestR and loaded the sample file AirlineDemoSmall.csv, which is included with R Server, as a table with the same name. If you have SQL Server 2016 installed, the sample files can be found in this directory:

C:\Program Files\Microsoft SQL Server\130\R_SERVER\library\RevoScaleR\SampleData

The table dbo.AirlineDemoSmall has 600,000 rows. Prior to running the R code, create a table on SQL Server to hold the data and load it from the csv file. The R code uses some Rx commands to load the data from SQL Server, run on the R Server, and draw a histogram.
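If you have not created the table yet, a minimal definition might look like the sketch below. The three column names come from the AirlineDemoSmall.csv sample file; the data types are my assumption, so adjust them to match your import.

CREATE TABLE dbo.AirlineDemoSmall
(
    ArrDelay   varchar(10),  -- arrival delay; type kept loose for the demo, adjust to match the csv
    CRSDepTime float,        -- scheduled departure time as a decimal hour
    DayOfWeek  varchar(10)   -- day name, e.g. 'Monday'
);

With the table created and the csv loaded, the full R program is below.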

library(RevoScaleR)


sqlConnString <- "Driver=SQL Server;Server=MyLaptop\\SQLSERVER2016;Database=TestR;Uid=ReadData;Pwd=readd@t@"
sqlsampleTable <- "AirlineDemoSmall"
# Set ComputeContext.
sqlShareDir <- paste("C:\\Ginger\\AllShare\\", Sys.getenv("USERNAME"), sep = "")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
serverside <- RxInSqlServer(connectionString = sqlConnString, shareDir = sqlShareDir,
wait = sqlWait, consoleOutput = sqlConsoleOutput)


rxSetComputeContext(serverside)

sqlPlaneDS <- RxSqlServerData(connectionString = sqlConnString, verbose = 1, table = sqlsampleTable)
rxGetInfo(data = sqlPlaneDS, getVarInfo = TRUE, numRows = 3)
rxHistogram( ~ CRSDepTime, data = sqlPlaneDS)

Detailed Description of the R Code

To better understand the program, I have provided a description of each line of code, along with some tips for resolving possible errors.

library("RevoScaleR")

If you get an error running this line, chances are R doesn’t know where to find the library, or the package may not be installed yet. If so, run this command in the interactive window:

install.packages('RevoScaleR')

If this command gives you an error, R can’t find where the library is. Resolve this issue by adding the path. Run this command in the immediate window, and notice the slashes go the opposite way from how File Explorer displays them:

.libPaths(c(.libPaths(),"C:/Program Files/Microsoft SQL Server/130/R_SERVER/library"))

After setting the path, run the previous command to resolve the package, and then run the first line again, as this should resolve any previous errors.

sqlConnString <- "Driver=SQL Server; Server=MyLaptop\\SQLSERVER2016;Database=TestR;Uid=ReadData;Pwd=readd@t@"

This line sets the value of the connection string. I am running SQL Server 2016 on my laptop, in an instance called SQLServer2016. Notice I had to put two slashes going the wrong way to set my connection. I have hard coded a user id and password in plain text. Outside of a test example like this, I would use Windows authentication, which does require an ODBC connection, so that I would not have to put the user id and password in the code in plain text.

sqlsampleTable <- "AirlineDemoSmall"

This line of code sets a variable to the name of the table created in SQL Server with the data from the csv file.

sqlShareDir <- paste("C:\\Ginger\\AllShare\\", Sys.getenv("USERNAME"), sep = "")

R needs a temporary directory to serialize the R objects when the connection is created, which I am creating here.

sqlWait <- TRUE

Setting the state to wait means that I am creating a blocking transaction which will prevent the later code from being run until this statement is complete. This is a good setting for testing, and also when later commands cannot run until the data is available, such as rxHistogram, which requires the complete dataset.

sqlConsoleOutput <- FALSE

Setting the console output to false decreases the amount of informational messages I get in the immediate window. Since the messages aren’t really that helpful as they show things like how many records were read at the time, I generally set it to false.

serverside <- RxInSqlServer(connectionString = sqlConnString, shareDir = sqlShareDir,
wait = sqlWait, consoleOutput = sqlConsoleOutput)

This line uses the Revo R function RxInSqlServer (remember, unlike SQL, case matters) to create a connection to SQL Server, using the variables created earlier, and assigns it to a variable called serverside.

rxSetComputeContext(serverside)

Setting the compute context dictates where my code is going to run. If the compute context is set to local, the code runs on my local PC. Since I set it to the variable holding my SQL Server connection, all of my R code will use the available memory on the SQL Server machine, not mine. Yes, this does mean that I can starve out the resources on the server, a topic I will address at a later time. Since I am running everything on my laptop it doesn’t matter, but it could.

sqlPlaneDS <- RxSqlServerData(connectionString = sqlConnString, verbose = 1,
table = sqlsampleTable )

This line gets the data from SQL Server, using the connection string, and specifies what data to get. I could have used a query to get data as well, but in this case I grabbed everything from the table.

rxGetInfo(data = sqlPlaneDS, getVarInfo = TRUE, numRows = 3)

To validate that some data was retrieved, rxGetInfo shows the information retrieved from three rows. Why three rows? Because numRows = 3.

rxHistogram( ~ CRSDepTime, data = sqlPlaneDS)

One of the big strengths of R is the ability to create data visualizations, so I felt compelled to include the command which creates a histogram. The ~ (tilde) goes in front of the column name CRSDepTime from the table AirlineDemoSmall, and the data comes from the variable sqlPlaneDS where all of the data was loaded.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Using R in SQL Server 2016

As SQL Server 2016 will be released soon and to follow up on the talk I gave at SQL Saturday Atlanta, this post will guide you through the steps needed to make R work. Like many features of SQL Server 2016, R is optionally installed. When you install SQL Server, make sure the option for R Services (In-Database) is checked.


Microsoft is incorporating the version of R sold by Revolution Analytics, which it calls R Server. This version, while fully compatible with open source R, has some additional features which allow the R code to run not only in memory but also use chunking technology to swap the data to disk so that the R code will not run out of memory. The commands to use this additional functionality all start with rx and are part of the proprietary ScaleR feature set. To use R Server as a separate standalone product, instead of selecting the R Services (In-Database) option, select the R Server Standalone shared feature. A standalone R Server can be useful if you want to perform large scale data analysis on a Hadoop cluster, or another non-SQL database like Teradata.

SQL Server Steps to Enable R

If SQL Server 2016 is installed, especially if it was installed by someone else, you may be wondering whether the R service really is installed. Take a look at the services running on the machine with SQL Server. If the SQL Server Launchpad service is running, the R services are installed and running.

The last thing needed to run R is to change the configuration and restart the SQL Server service. In a new query window, run the following command:
sp_configure 'external scripts enabled', 1
GO
Reconfigure
GO

After restarting the SQL Server service, SQL Server will now run R code. The following code can be run on any SQL Server 2016 instance to see if R has been configured correctly:

EXEC sp_execute_external_script @language =N'R',
@script=N'OutputDataSet <-InputDataSet',
@input_data_1 =N'SELECT 1 as CheckToSeeIfRIsWorking'
WITH result sets (([CheckToSeeIfRIsWorking] int not null));
GO

The code is executed as an external script, specifying that the language used should be R. @script contains the R code, which in this simple example just copies the input data set to the output data set. @input_data_1 contains the query providing the data to be processed. The R code could of course be more complicated, but this example is generic enough for everyone to ensure R is set up properly on SQL Server 2016.
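To see R return something it actually computed rather than just echoing the input, here is a slightly extended sketch; the input values and column names are made up purely for illustration.

EXEC sp_execute_external_script @language = N'R',
    @script = N'OutputDataSet <- data.frame(AvgValue = mean(InputDataSet$SomeNumber))',
    @input_data_1 = N'SELECT CAST(n AS float) AS SomeNumber FROM (VALUES (2), (4), (9)) AS v(n)'
WITH RESULT SETS (([AvgValue] float NOT NULL));
GO

The R script builds a one-row data frame holding the mean of the input column, and the WITH RESULT SETS clause describes the shape of that data frame back to SQL Server.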

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2016 Polybase Setup

After reading my last post you should be all ready to install SQL Server 2016 to be able to use Hadoop.  I went through all of these steps in my webinar, but I thought it might also be a good idea to include them here so you don’t have to watch the video.

Polybase Install Feature

When installing SQL Server, PolyBase Query Service for External Data must be selected in the Feature Selection list.

To check whether Polybase has been successfully installed, go to Control Panel->Administrative Tools->Services.  There are two services added for polybase, SQL Server PolyBase Data Movement and SQL Server PolyBase Engine, as well as a bunch of other new ones for SQL Server 2016. The polybase services and SQL Server will need to be restarted in a later step.  When starting SQL Server Management Studio for SQL Server 2016, it is hard to spot the differences between it and previous versions of SQL Server, as even the icon looks exactly the same. Once you create a database, you will notice a slight difference under the tables folder. When polybase is installed there are two new folders, External Tables and External Resources. Tables which access data via polybase will appear under the External Tables folder, and the External Resources folder will contain the references to the external resources: the HDFS cluster where the data is stored, as well as the file formats of the underlying data.

Configuring Polybase on SQL Server

SQL Server needs to be configured to use polybase by using the sp_configure command

EXEC sp_configure 'hadoop connectivity', 5;
GO
RECONFIGURE; 

The number 5 indicates the kind of Hadoop connectivity desired.  If you were using HDInsight’s Azure Blob storage or Hortonworks on Windows, the number listed would be 4.  The number 5 indicates connectivity to Hortonworks Data Platform on Linux.  Go ahead and exit out of SQL Server as you will be restarting it soon.
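After the restart, you can confirm the setting took effect with a quick check of the server configuration; a hedged sketch (value_in_use should show 5 once the change is in effect):

SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'hadoop connectivity';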

Configuration File Modification

Open up your favorite text editor to modify the Hadoop.config file.  The default location is

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config

Keep in mind, if you installed SQL Server with a named instance the path will include the name of that instance.  The config file contains a default Hadoop user name. The value, which is initially set to pdw_user, is probably a holdover from pre-2016 days when polybase was only available on Microsoft’s Big Data Appliance, the Analytics Platform System [APS], which was previously called Parallel Data Warehouse [PDW].  Hortonworks’ default is hue, so you will want to modify the file so that the HadoopUserName has the right value, hue.

Once the changes to the Hadoop.config file are saved, to get polybase to work, the last thing which must be done is to restart the two new polybase services and SQL Server 2016.

Setting up the External Data Source and External Data Files

At this point, we can now tell SQL Server where the Hadoop files are located. Open SQL Server Management Studio [SSMS] and create a new query to create the new external data source, which is going to appear under the External Resources folder.

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
TYPE = HADOOP,
LOCATION = 'hdfs://sandbox.hortonworks.com:8020'
)

After running this command and refreshing, the source HDP2 will appear under External Resources -> External Data Sources.

SQL Server needs to know the underlying file format of the data stored in Hadoop. Assuming the data is stored in a tab delimited format, this command will tell SQL Server how to read the tab delimited data and the format of the dates.  This will allow polybase to read a sample file which comes with Hortonworks, the file SAMPLE_07.

CREATE EXTERNAL FILE FORMAT TSV
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '\t',
DATE_FORMAT = 'MM/dd/yyyy'
)
)

SQL Server needs a schema to read the data, as it doesn’t know where the fields are without it. The following code creates a table which will appear under the External Tables folder in SSMS, and tells SQL Server where the data lives and which file format we just created to use.

CREATE EXTERNAL TABLE Sample_07
(
code nvarchar(255),
description nvarchar(255),
total_emp int,
salary nvarchar(255)
)
WITH
(
LOCATION = '/apps/hive/warehouse/sample_07',
DATA_SOURCE = HDP2,
FILE_FORMAT = TSV,
REJECT_TYPE = value,
REJECT_VALUE=0
)

After this step is complete, you can join the new external table, whose data lives in an HDFS cluster, to data stored in SQL Server.
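Querying the external table looks just like querying any other table. Here is a hedged sketch; the second query joins to dbo.LocalOccupationNotes, a hypothetical local table, purely to show the join.

-- Read the Hadoop-backed data directly
SELECT TOP (10) code, description, total_emp, salary
FROM dbo.Sample_07
ORDER BY total_emp DESC;

-- Join the external (HDFS) data to a regular SQL Server table
SELECT s.code, s.description, n.notes
FROM dbo.Sample_07 AS s
JOIN dbo.LocalOccupationNotes AS n   -- hypothetical local table
    ON n.occupation_code = s.code;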

I hope this helps you get started with using polybase with SQL Server 2016.  Let me know what you think by posting a comment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Introduction to R – The Follow Up

I really enjoyed the opportunity to share R with GDI Phoenix. Since very few people in the room were familiar with the language, it was great to be able to show something new. The audience was great, and provided some wonderful feedback. With Microsoft’s purchase of Revolution Analytics in 2015, many of the disadvantages of the open source version of R, including the inability to process huge data sets, the speed of running the code, and deployment, have been resolved with R Server, which will be released with SQL Server 2016. The integration of R in SQL Server will only increase the demand for R skills, providing a great incentive to start learning the language now.

As promised I have provided links to the items I covered in the talk. Thanks so much for inviting me and I hope to have the opportunity to present again.

All the Links You Need to Get Started Learning R

R Language

R Studio (UI)

Visual Studio Community R Tools

Microsoft R Open

Swirl – R Tutorial

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SSIS – Resolving “Failed to Deploy the Project” Messages

Have you seen this screen when trying to deploy?

Needless to say, this indicates the SSIS package didn’t load to the server. This is especially frustrating as this particular package worked fine in Visual Studio. Believe it or not this screen actually contains information which can be used to resolve the error.

Finding a Useful Error Message

How do you go about determining what to do? Click on the word Failed. This will bring up some useful information, although it may not appear that way at first.


The message Failed to deploy project isn’t very useful, but the rest of the message is. The operation_messages view lives in SSISDB, and the operation identifier number is how to determine what the error is. Run this query, using the number provided in the error message, which in this case is 173:

Select * from catalog.operation_messages where operation_id = 173

Here are the results from that query.

operation_message_id: 50719
operation_id: 173
message_time: 2016-02-29 15:02:08.2478928 -07:00
message_type: 120
message_source_type: 20
message: Failed to deploy the project. Fix the problems and try again later.:SqlDateTime overflow. Must be between 1/1/1753 12:00:00 AM and 12/31/9999 11:59:59 PM.
extended_info_id: NULL

 

Now this message is quite useful, as it provides information that I can use to fix the issue. This SSIS project contains a date parameter named BackDate, which I had not set.


I set this parameter to a date between 1/1/1753 and 12/31/9999 and deployed the project again. This time, no error.
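One more tip: if the deployment dialog is already closed and you no longer have the operation number, the catalog views can find it. The sketch below is hedged; the status and message_type codes are the values I believe indicate failures and errors, so verify them against your instance.

-- Most recent operations recorded by the SSIS catalog
SELECT TOP (10) operation_id, operation_type, status, created_time
FROM SSISDB.catalog.operations
ORDER BY created_time DESC;

-- Only the error-level messages for a given operation
SELECT message_time, message
FROM SSISDB.catalog.operation_messages
WHERE operation_id = 173      -- number from the deployment dialog
  AND message_type = 120;     -- assumed: 120 = error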

I hope that you have found this post helpful, especially if you haven’t deployed a package to SSIS in SQL Server 2012 or later.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Best Practices for the care and use of the SSISDB

Since the release of SQL Server 2012, SSIS package deployment has moved to the SSISDB database. If you are using SQL Server 2012 or later and are not using the SSISDB for deployment, my sincere condolences, as maintaining and deploying packages any other way is a maintenance headache which thankfully has been resolved. Because the SSISDB has not been in use very long, the appropriate care and feeding of this database is not well known. SSISDB is part of Integration Services Catalogs. It is not possible to create an SSISDB without first creating an SSISDB catalog, which does not happen when SQL Server is installed. Follow Microsoft’s instructions for creating an SSISDB catalog, which creates both the SSIS catalog and the SSISDB database. You cannot rename the database, as SSISDB is the name that Integration Services uses internally.

Backup the SSISDB

A client asked me recently why he should back up the SSISDB database. While you can recreate everything inside of the SSISDB, it will take time and you will have to remember exactly how all of your variables were set. Restoring a backup avoids this issue and allows a server to be redeployed quickly. When you do back up the database, make sure that you also back up the database master key created along with the SSISDB, as you will need it to do a restore. By default, the recovery model of the SSISDB is set to Full. If the packages in SSISDB were changing minute by minute, Full would make sense, but given that an SSISDB contains packages which are run on a scheduled basis, most likely the changes made are infrequent. Change the recovery model to Simple.
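A minimal sketch of those two recommendations follows; the file paths and password are placeholders only.

-- Switch SSISDB to the simple recovery model
ALTER DATABASE SSISDB SET RECOVERY SIMPLE;

-- Back up the database itself
BACKUP DATABASE SSISDB TO DISK = N'C:\Backups\SSISDB.bak';

-- Back up the master key created with the catalog, which a restore
-- on another server will need
-- (if the key is not open, run OPEN MASTER KEY DECRYPTION BY PASSWORD = '<catalog password>' first)
USE SSISDB;
BACKUP MASTER KEY TO FILE = N'C:\Backups\SSISDB_MasterKey'
    ENCRYPTION BY PASSWORD = N'PlaceholderP@ssw0rd!';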

Managing SSISDB Growth Over Time

SSISDB contains all of the data used for the reports created when SSIS packages are run. Right click on the SSISDB icon underneath the Integration Services Catalogs folder and take a look at the settings. The defaults are described below, and to decrease the size of the SSISDB over time, you may wish to change them. The Retention Period is set to 365 days. Many environments don’t look at reports older than 90 days, as information prior to that timeframe isn’t very meaningful. If that is the case, change the retention period to the number of days someone is actually going to look at the reports, which will decrease the amount of data stored in the database. For the logs to be cleaned up, the setting Clean Logs Periodically needs to be set to True, so don’t change it.
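The retention window can also be changed with T-SQL through the catalog’s configuration procedure; a hedged sketch, with 90 days as an example value only:

-- Shrink the operational history kept in SSISDB to 90 days
EXEC SSISDB.catalog.configure_catalog
     @property_name  = N'RETENTION_WINDOW',
     @property_value = 90;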

The Server-wide Default Logging Level is set to Basic by default. The information provided at this level is generally what is needed to troubleshoot any issues. Don’t set the logging to None just to save space; if the SSIS code ever crashes, the person who did this may be cursed. If you have simple jobs, though, you may find the Performance logging level adequate. Check out Microsoft’s documentation on logging to better understand the differences between the levels.

Lastly, if you are using version control in another application, such as TFS, there may not be a need to keep the Maximum Number of Versions per Project at 10. Generally speaking, most people don’t look past the last 3 versions. The number should reflect what is practical, which is nearly always less than 10.
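This setting can be changed through the same procedure; 3 versions is only an example and assumes the rest of the history lives in source control.

-- Keep only the most recent project versions in the catalog
EXEC SSISDB.catalog.configure_catalog
     @property_name  = N'MAX_PROJECT_VERSIONS',
     @property_value = 3;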

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

 

2016 Releases of SQL Server and Excel

It’s been 2016 for over a month now, so hopefully you are no longer changing your 5 into a 6. Chances are you haven’t changed your software to reflect the new year yet. Office 2016 came out in September of 2015, and SQL Server 2016 hasn’t been released yet. It’s hard to keep up with all of the version changes that have come out, especially when you throw in Power BI, which has something new every month. If you know you are going to be upgrading to one or both of these versions, or want to learn more about SQL Server or Excel so you can decide if it is worth the upgrade effort, this week I might be able to help out. I am going to be talking about the 2016 version of SQL Server on Wednesday, February 10 at the Arizona SQL Server User Group meeting, and then talk about the 2016 version of Excel on Thursday at the Excel BI SQL PASS Virtual Chapter. If you are not in Arizona right now, you are missing out, as we are having Department of Tourism weather of 80 degrees. You can get back to me in August when I am melting in the 115 degree heat.

Polybase in SQL Server 2016

Since there are many new features to talk about in SQL Server 2016, I picked Polybase. As big data matures, many places are looking to keep their structured data right where it is and create an HDFS cluster to store other data. Polybase allows SQL Server 2016 users to look at both in one place.

Excel 2016

It’s been a while since September 22, 2015, the date Office 2016 was released, but I still know very few people who have upgraded. I’ve been to a few clients that hope to upgrade to Excel 2016 this year. In this session, I will show where things got moved and renamed, what’s new and what is on the deprecated list. If you don’t have 2016 installed yet, or if you do and wonder where Power Query went, please join me to hear all about it. Generally speaking, the Virtual Chapter sessions are posted on YouTube, and when they are I will have a link available. Unfortunately for those who attended my last Excel BI Virtual Chapter meeting, due to technical difficulties that recording is not available, but hopefully this time everything will work.  When the recording is available I will make sure a link is available on my blog for those who can’t make it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Which Power BI Gateway: Personal or Enterprise?

UPDATE: Power BI now has one Gateway, with the option to use a personal gateway. Check out this post on the new gateway for more information.

Power BI has two data gateways, Personal and Enterprise.  Since I assist companies with Power BI, the name Personal Gateway made no sense to me, especially when I used a personal gateway to update an Organizational Pack. While that might be a valid reason for a name change (after all, Power Query is now called Get & Transform), why not just rename Personal Gateway? I digress. Enterprise Gateway is not a replacement for Personal Gateway. It is partially an upgrade of the Power BI Analysis Services Connector, as it contains the features of that app and more. In the future, Enterprise Gateway is going to be a way to manage all of the data connections within Power BI. I look forward to writing about that once it is available. The Enterprise Gateway, which I should mention is a Preview Release, supports three different data source types: SQL Server, SAP HANA, and Analysis Services. One enterprise feature which is already available is the ability to add users to the gateway you just created, rather than relying on one ID to grant access to everyone. For connections to SQL Server or SAP HANA, the connection to the database is made via the user you entered, so that user should have appropriately limited reporting access to the data source.

Power BI Data Access Based on User Security

Please note that right now, data access based upon user credentials only works for Analysis Services. For reports with an Analysis Services data source, the information passed to the server is the user name of the person accessing the report. Using Active Directory, this user is granted the same access to the data on the server that they have on the on-premises network. Here’s an example: let’s say Jason is the sales manager for the Eastern Region and doesn’t have access to the Western Region within Analysis Services security. Jennifer is the sales manager for the Western Region and has only been granted the ability to see Western Region information in Analysis Services. If a sales report is created in Power BI which uses Analysis Services as its data connection via the enterprise gateway, Jason will only see the information on the Power BI report for the Eastern Region, and Jennifer will see the same report with only the information for the Western Region. If Jason gets promoted to National Sales Manager and needs to see everything, once the security in Analysis Services is updated granting him access to all sales regions, he will see everything. Unfortunately, if you have SQL Server 2008 R2, or Standard Edition of SQL Server 2012 or greater, you won’t be able to connect to the server via the Enterprise Gateway.

Factors for selecting  Power BI’s Personal Gateway

The Personal Gateway takes the data and imports it into Power BI. If you want to extract data from a variety of different places, such as an Oracle database and Excel spreadsheets, the Personal Gateway will support this, and the Enterprise Gateway won’t.  Remember, the Enterprise Gateway only connects to three different data sources, and Excel and Oracle are not on that list. If you want to manage the connection and refresh of the data as the administrator, or provide access to the data to everyone who needs it, use the Personal Gateway.

When might one want to use Power BI’s Enterprise Gateway?

All of the connections via the Enterprise Gateway are live connections to the underlying server, so there is no need for a scheduled refresh; you are always connecting live to the server holding the data. The Enterprise Gateway imports nothing, so if you are reporting on really large databases it probably makes sense to use the Enterprise Gateway, as nothing gets copied. If you have a security policy which forbids storing data in the cloud, the Enterprise Gateway meets that requirement, as all of the data is stored locally and is merely accessed when needed, like a web page.

Future Plans for the Enterprise Gateway

Microsoft is planning on providing the ability to monitor and audit all of the data sources in the Enterprise Gateway. That would make it truly enterprise, as it will provide the ability to see what data is being used throughout Power BI. When that happens, it might be time to get rid of the Personal Gateway. Right now, if the Personal Gateway is working for you for connecting to SQL Server, Analysis Services or SAP HANA, and the data refreshes are working, I would hold off upgrading. It’s not terribly easy to see what refresh methods are configured in Power BI right now, which is also something that I hope gets fixed in the future as part of the Admin features. When Microsoft releases new versions of the Enterprise Gateway, I will definitely discuss them here. To get future updates, please subscribe to my blog to be notified when they happen.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

2015: Year End Wrap up for Releases and More

As 2015 draws to a close, I started thinking back about everything that has happened this year. Technically this has been a big year, as many new applications were released. Here are just some of them, with links included to provide more detail.

This short list could be a lot longer as it doesn’t count the number of updates released to Power BI, which occur several times a month, the CTP releases for SQL Server 2016, the new web version of BIML, or PowerShell. It’s really hard to keep up with everything that is changing. It’s a good thing that so many people are willing to help others learn how through speaking and blogs which make learning new things easier.

Community Involvement in 2015

Keeping up with all of these events is difficult, especially given the pace of releases.  I spend a lot of time reading various blogs, watching videos and going to hear people speak. I have also been able to talk about topics of particular interest, mainly Power BI and Machine Learning. This year I spoke at a number of different events, including Speaker Idol, two different user groups, seven webinars, five SQL Saturdays and other tech events. I’ve got a number of engagements on the books for next year, including PASS BA Con and SQL Saturday #461 – Austin. 2016 is shaping up to be busy too, and hopefully our paths will cross.  I list all of my speaking events on my Engagement Page, and I hope you might take a look at it from time to time if you are interested in catching up in person. Next year I am hoping my list of speaking engagements changes somewhat, as I plan on trying harder to get accepted to speak at events where I submitted and was turned down in 2015. On a more positive note, views of my blog are up 1000%, and the number of website subscribers has more than doubled. Thank you very much for continuing to read this site; I hope you find my thoughts helpful. I posted once a week this year, which I thought was pretty good until I talked to Ken Fischer b | t, who blogs twice a week. I’ll have to try harder next year. If you think of a topic which would make a good blog post, let me know, as I am always interested in feedback.

Keeping Up the Pace in 2016

Next year there will be no slowdown in the things to learn, as SQL Server 2016 is going to be released. Although the exact date has not been announced, my sources tell me to look for it around May-June. The next release of SQL Server is going to be huge, as it will include new tools Microsoft added to integrate Big Data and open source platforms with SQL Server. PolyBase, JSON and R are all going to be part of SQL Server. Personally, I find the R integration the most interesting. Datazen and SSRS are going to be integrated in the next release too, which should really increase the implementation of mobile reporting visualizations.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Limiting the Number of Results with TABLESAMPLE

In the world of exploding piles of data, there are times you just need less, not more. If you are working on a proof of concept, just want to test out some code, or want to pass a selection of data to a machine learning experiment, you don’t need several million records. Having that much data will just keep you from getting to the answer in a reasonable period of time. There are a number of standard ways people limit the data, the most common being some kind of date filter. Using a date range, though, often does not provide the variability needed. This is a particular problem with data used in a machine learning experiment, which is designed to create an algorithm based on data patterns extrapolated over time. For example, if you are doing any kind of regression analysis on a retail client and you either exclude or include the Christmas shopping season, the algorithm created will not be correct. The goal is to have less data for analysis, pulled from the entire set of data. Fortunately, SQL Server has had several methods for selecting random data since 2005.

Using TABLESAMPLE

Until recently, I hadn’t used the Transact SQL TABLESAMPLE clause, but I ran into a situation where I needed to test some things and not wait all year for a result. TABLESAMPLE to the rescue. There are a couple of things where it won’t work. If you have derived tables, tables from a linked server (bad idea), or are writing a view and you want to return random data you can’t use TABLESAMPLE. If those conditions do not apply, you can use it.

The random sample that TABLESAMPLE provides is based on the number of data pages, not the number of records. If you want the number of rows to be specifically limited you will need to use Top(n) as well. I’ve written all my samples based upon AdventureWorksDW so you can run them for yourself later. I’ve listed the variety of ways to call TABLESAMPLE and shown the number of records returned.

SELECT *
FROM [dbo].[FactInternetSales] --60398 rows returned prior to TABLESAMPLE
TABLESAMPLE(10 PERCENT)

(6073 row(s) affected)

Let’s say you want to return the same sample set multiple times. For that you will need some value. I picked 11, but you could pick any other you like.

SELECT *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]
TABLESAMPLE(10 PERCENT)
REPEATABLE (11)

 (6489 row(s) affected)

When looking at the number of records returned, the counts are neither exactly 10 percent nor particularly consistent from run to run. If you only want 6039 records returned, you can try the following code, but it doesn’t really do what it says.

SELECT *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]  
TABLESAMPLE(6039 ROWS)

 (5640 row(s) affected)

This code will provide the exact number of records desired. Notice I upped the number of rows in the TABLESAMPLE clause in order to get 6039 rows; if the sample were only 6039 rows, you could not guarantee that enough rows would be returned.

SELECT top 6039 *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales] --60398
TABLESAMPLE(6200 ROWS)

(6039 row(s) affected)

Really Random Sample

TABLESAMPLE has some really useful functionality, but for machine learning I need a truly random sample, which TABLESAMPLE does not provide. For that I need NEWID(). The following sample returns approximately 1% (.01) of the 60398 rows.

Select * from [AdventureWorksDW2014].[dbo].[FactInternetSales]
Where 0.01>= Cast(checksum(newid(), [ProductKey]) & 0x7fffffff as float) / Cast(0x7fffffff as int)

 

Just for fun I ran the same code 3 times and got a variety of rows returned.

(600 row(s) affected)
(607 row(s) affected)
(622 row(s) affected)

The ProductKey is added so that the NEWID() function will calculate a sample for each row. The WHERE statement calculates a random float between 0 and 1. This will truly give me the random sample I would need for a machine learning experiment.
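Another approach worth knowing, sketched below, is to order by NEWID() and take the top n rows. It returns an exact, truly random row count, but it sorts the entire table, so it becomes expensive on very large tables.

-- Exact-size random sample: every row gets a random sort value
SELECT TOP (6039) *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]
ORDER BY NEWID();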

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur