With the release of SQL Server 2017, you now have the capability to incorporate both R and Python into SQL Server. As there is a lot of material on the topic, this is the first post in the series which covers installation. In future topics we will be covering the internal components, monitoring a R and Python code to determine performance impact on SQL Server, creating and maintaining R code, creating and maintaining Python code, and other related topics.
Installing Machine Learning Services
R was first introduced in the SQL Server 2016 and it was called R Services. For SQL Server 2017, this same service was renamed to Machine Learning Services, and expanded to include Python. In the install there are three options, installing the Machine Learning Services, then selecting R and/or Python as you see in the attached picture.
Why you want to select Machine Learning Services(In-Database)
There are two installation options: In-Database or Standalone. If you are evaluating Machine Learning Services and you have no knowledge of what the load may be, start by selecting the Machine Learning Service In-Database. There are several reasons why by default you want to select the In-Database option. One of the problems that Microsoft was looking to solve by incorporating advanced data analytics was to improve performance of the native code by greatly reducing data latency. If you are analyzing a lot of data which is stored within SQL Server, the performance will be improved if the data does not need to be moved around on a network. Also, the licensing costs of installing R Server standalone also need to be evaluated with a Microsoft representative as well. An evaluation of the resource load on the network, as well as analysis of the code running on SQL Server should be performed prior to the decision to install the Machine Learning Server Standalone.
Internet Access Requirements for installing R and Python
The Machine Learning Service is an optional part of the SQL Server Install. Because R and Python are both open source applications, Microsoft cannot include the R or Python executables within the install of SQL Server. The executables must be downloaded from their respective locations on the internet, and the installation process is a little different if there is no internet access. Each language has two installs, one for the executable and one for the server. If you do not access to the internet on the server where you are installing SQL Server 2017, you can download the files needed for the install.
Here are the links for SQL Server 2017 CU2.
If you are installing a different version, use the links provided on the Offline Installation screen. These links each will download a .cab file. You will need to copy the cab files to a location where the server can access them and provide the path in the Offline Installation window.
What is Installed with Machine Learning Services
Machine Learning is an external process and communicates with launchpad.exe to access either R or Python. For a quick check to see if the Machine Learning Services were installed, look for the SQL Server Launchpad service in the list of running services. It will also create by default 20 different external users which are used to call R or Python. There will be a subdirectory created for each user, with the name of the SQL Server instance name, which is by default MSSQLSERVER, followed by a number 00-20. These subdirectories have nothing in them, as they are used temporarily when R or Python needs them and the information in them is eventually deleted by SQL Server. The default location is C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData.
The R tools are located in the C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES directory, and include RLaucher.dll which will load R. If you want to run R on the server, which can be handy when updating R libraries, run the RGUI.exe, which will load up an interface where you can run R code. In SQL Server 2017 CU2, Microsoft R Open 3.3.3 is installed along with R Server 9.2.0. Microsoft R Open is a version that is 100% Open Source and completely compatible with the standard Open Source version of R, which is commonly referred to as Comprehensive R Archive Network [CRAN] R. Microsoft rewrote some of the underlying functions so that they would be multi-threaded, which R is not, and incorporate the Intel Math Kernel libraries to improve the performance. R Server is the version of R which contains the proprietary functions which Microsoft created for SQL Server which include the ability to load code in and out of memory, which will be discussed in a future post.
The Python tools are located in the C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES directory, and include Pythonlauncher.dll which will load Python 3.5. The installation includes the Anaconda distribution for Python, which includes not only the data science components of Anaconda, but also SciKitLearn and Pandas. Microsoft has also included machine learning algorithms which were created for Python in the microsoftml and revoscalepy libraries. There is a lot of interesting content will be discussing more about these libraries in a later post.
Configuring SQL Server to Run R or Python
Once the Machine Learning components are installed, there are some configuration steps which must be completed to permit R or Python to run on SQL Server. If this is a new server, make sure to install SQL Server Management Studio, since it is not included in the SQL Server Install. From within an SSMS query window, the following script needs to be run to enable R
SP_CONFIGURE 'external scripts enabled', 1
After this step completes successfully, a restart of SQL Server Services is required. When stopping the service, you will be notified that SQL Server Launchpad also will need to be restarted. I have noticed that for some reason, Launchpad does not always restart when SQL Server is restarted, so you might want to check to make sure that it is running, as you cannot run R or Python unless the SQL Server Launchpad service is running. After the restart, to check to see if R is working properly, run the following code from within an SSMS query window.
EXEC sp_execute_external_script @language =N'R',
@input_data_1 =N'SELECT 1 as CheckToSeeIfRIsWorking'
WITH RESULT SETS (([CheckToSeeIfRIsWorking] int not null));
When run successfully, this script will return a 1. SQL Server is now ready to run R. To check to see if Python can be run successfully, run this script.
EXEC sp_execute_external_script @language =N'Python',
@script=N'OutputDataSet = InputDataSet',
@input_data_1 = N'SELECT 1 AS CheckToSeeIfPythonIsWorking'
WITH RESULT SETS ((CheckToSeeIfPythonIsWorking int not null));
In my next post I will cover the SQL Server internal components which are run when R or Python code is run. Please subscribe to my blog to be notified when the next installment will be available. If you have any questions, comments or ideas for future post topics, please leave me comments as I would really enjoy any feedback.
Data aficionado et SQL Raconteur
Could you describe performance considerations of ML services.
I have many shared SQL Servers on Vm’s.
What is the best VM config per MicrosoftML(Two-Class, Multi-Class,Regression, Anomaly detection, etc)?
The performance of ML Services has a lot to do with three things, the language which you choose to do your analysis, the complexity of the algorithm being performed and the amount of data used in the analysis. If You chose to use R, you will generally need more memory as it primarily processes in memory. If however you chose to use many of the Revo library functions, it is less memory dependent. The number of different algorithms used to create say anomaly detection matters too. If you use a number of algorithms to process the data it will take more resources than merely using pandas to do a simple PCA process. Are you going to process 100 columns of a terabyte of data in a batch or test a few hundred records with 25 columns? All of these factors will impact the resources used. To determine how much your analysis is using you will need to monitor usage over time.