Preparing for Exam 70-774 – Perform Cloud Data Science with Azure Machine Learning

There are a number of reasons why you might want to take a Microsoft cert exam. Maybe you want to focus your studies on a tangible thing, or you think it will help further your career, or you work for a Microsoft Partner and they required a certain number of people to pass the exam to maintain their current partner status.  I am not going to get into the long argument regarding whether or not a cert will help you in your career, or not, I can tell you why you might want to take the 70-774 exam. Machine Learning, or Data Science if you prefer, is an important analytic skill to have to analyze data.  I believe that it will only become more useful overtime. Azure Machine Learning is a good tool for learning the analysis process.  Once you have the concepts down, then should you need to use other tools to perform analysis it is just a matter of learning a new tool.  I talk to a number of people who are trying to learn new things, and the study them in their spare time.  It’s very easy to spend time vaguely studying something, but you may find that having a target set of items to study will focus your time, and as a bonus you get a neat badge and some measure of proof that you were spending time on the computer learning new things and not just watching cat videos.

Exam 70-774 Preparation Tips


While you could always buy the book for the exam (shameless plug as I was one of the authors), the book will not be enough and you will still need to write some code, and do some additional studying. This exam one of two needed for the MCSA in Data Science and you an take the exams in any order. The best place to start is by first looking at the 70-774 exam reference page from Microsoft.  There are four different sections in the exam, and I have created some links for each section which will help you prepare for the exam. In studying for exams in the past, the best way I have found to prepare is to look at everything on the outline and make sure that I know it.

Prepare Data for Analysis in Azure Machine Learning and Export from Azure Machine Learning

Normalizing Data
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data

TanH
https://reference.wolfram.com/language/ref/Tanh.html

ZScore
http://stattrek.com/statistics/dictionary.aspx?definition=z-score
http://howto.commetrics.com/methodology/statistics/normalization/

Min Max
https://www.quora.com/What-is-the-meaning-of-min-max-normalization

PCA
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/principal-component-analysis
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/principal-component-analysis
https://stackoverflow.com/questions/9590114/importance-of-pca-or-svd-in-machine-learning

SVD
http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/

Canonical-correlation analysis (CCA)
https://en.wikipedia.org/wiki/Canonical_correlation

Singular Value Decomposition (SVD)
http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/

Develop Machine Learning Models

Team Data Science
https://docs.microsoft.com/fi-fi/azure/machine-learning/team-data-science-process/python-data-access

K-Means
https://www.datascience.com/blog/k-means-clustering

Confusion Matrix
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
https://en.wikipedia.org/wiki/Confusion_matrix
https://en.wikipedia.org/wiki/F1_score

Ordinal Regression
https://en.wikipedia.org/wiki/Ordinal_regression

Poisson regression
https://en.wikipedia.org/wiki/Poisson_regression

Mean Absolute Error and Root Mean Squared Error
http://www.eumetrain.org/data/4/451/english/msg/ver_cont_var/uos3/uos3_ko1.htm

Cross Validation
https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f

Operationalize and Manage Azure Machine Learning Services

Connect to a published Machine Learning web service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/publish-a-machine-learning-web-service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/consume-web-service-with-web-app-template
https://docs.microsoft.com/en-us/azure/machine-learning/studio/manage-new-webservice

Use Other Services for Machine Learning

Microsoft Cognitive Toolkit
https://www.microsoft.com/en-us/cognitive-toolkit/

BrainScript
https://docs.microsoft.com/en-us/cognitive-toolkit/brainscript-basic-concepts

Streamline development by using existing resources
https://docs.microsoft.com/en-us/azure/machine-learning/studio/gallery-how-to-use-contribute-publish
Perform database analytics by using SQL Server R Services on Azure
https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/provision-vm
https://docs.microsoft.com/en-us/machine-learning-server/install/r-server-vm-data-science
https://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf
http://blog.revolutionanalytics.com/2017/07/xgboost-support-added-to-rattle.html
https://github.com/JohnLangford/vowpal_wabbit/wiki

I hope you have found this test preparation material helpful.  If you passed the exam, let me know by sending me a comment.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Azure Machine Learning Workbench

Microsoft released Azure Machine Learning Workbench at the Ignite conference on September 25, 2017 as a public preview.  This tool is a new tool which they are adding to their Azure ecosystem, which includes the machine learning tool they introduced three years ago, Azure Machine Learning Studio. Microsoft has said they plan on keeping both products. When asked about the two products, they said that the earlier tool, Azure Machine Learning Studio, is targeted to developers who wanted to add machine learning to their current applications, as it is an easy to use tool that doesn’t require a person to be a trained data scientist.  Azure Machine Learning Workbench is targeted to data scientists who want to bring in other libraries, like TensorFlow for Python, and delve deep into the data.

Microsoft Moves into Machine Learning Management

Microsoft is looking for Azure Machine Learning Workbench for more than a tool to use for Machine Learning analysis. It is part of a system to manage and monitor the deployment of machine learning solutions with Azure Machine Learning Model Management. The management aspects are part of the application installation.  To install the Azure Machine Learning Workbench, the application download is available only by creating an account in Microsoft’s Azure environment, where a Machine Learning Model Management resource will be created as part of the install. Within this resource, you will be directed to create a virtual environment in Azure where you will be deploying and managing Machine Learning models.

This migration into management of machine learning components is part of a pattern first seen on the on-premises version of data science functionality.  First Microsoft helped companies manage the deployment of R code with SQL Server 2016 which includes the ability to move R code into SQL Server.  Providing this capability decreased the time it took to implement a data science solution by providing a means for the code can be deployed easily without the need for the R code to be re-written or included in another application. SQL Server 2017 expanded on this idea by allowing Python code to be deployed into SQL Server as well.  With the cloud service Model Management, Microsoft is hoping to centralize the implementation so that all Machine Learning services created can be managed in one place.

Hybrid Cloud, Desktop, and Python

While you must have an Azure account to use the Machine Learning Workbench, the application is designed to run on a locally on either a Mac or Windows computer.  There is a developer edition of the tool so that one can learn the tool and not incur a bill, which is the case with the previous product, Azure Machine Learning.  The download of Machine Learning Workbench must be accessed within an Azure account and is installed to your local computer.  When running the application from your computer, the application will prompt to log into your Azure account to load Azure Machine Learning Workbench.

The application is designed to use and create Python code.  Azure Machine Learning Workbench does not contain any accommodation to incorporate machine learning components written in R, just Python.  If you have created machine learning components using R, they can be incorporated into the Azure Machine Learning Model Management if you create webservices which encapsulate the R code. The R code does not interface into Workbench, but can be made to be a part of the managed projectes in Azure. While it is possible to create a webservice for R with the earlier product with Azure Machine Learning, there is no direct way to include R with Azure Machine Learning Workbench.  There are a number of sample templates to get started using Python templates including the ubiquitous Iris dataset, Linear regression and several others.   Once the project is created, you can use your favorite IDE, it creates python code which can be read anywhere.

Staying within Machine Learning Workbench application allows you access to arguably one of the neatest parts of the Machine Learning Workbench, the data parser. This tool which was originally code-named project Pendleton and designed to be an intuitive way to modify the contents of data even better than the previous leader in parsing data, Power BI’s Power Query.

You can select the option “Derive column by example” or “Split Column by Example” and then start typing in a new column.  For example, if you want to separate a column which contains the date and the time, if you right click on that date column and select “Split Column by Example” then type the date in the new column provided, the application will immediately determine that you want two columns and crate them. The date column and a time column be created for you after typing in one date.  After the sample columns have been created, you can approve the change or reject it if does not work how you want to.

Like Power Query, each change made to the data is included in the window called Steps on the right side of the application window. When you are done modifying the data, right click on the Data Preparations source icon, which in my example is called UFO Clean, to and the UI changes made to the data are used to create Python code to perform the changes. The generated Python code can be used to the source data programmatically.

The next step in the process is to write the python code needed to evaluate the data and create a model which would in my case determine where and when you are most likely to see the next UFO based on the dataset I have included in my project.  Unlike it’s counterpart Azure Machine Learning, you will need to know how to write the necessary code needed to create a machine learning analysis in Python for Azure Machine Learning Workbench. One could write the Python code to create a machine learning analysis in any Python editor.  If you chose to use Azure Machine Learning, the Python library scikit-learn is installed as part of the application.  Other libraries which you may want to use, such as the common library matplot, you will need to load within Azure Machine Learning Workbench.

Web Service: How Azure Machine Learning Workbench Solutions are Deployed

To deploy a package, you will need to export the completed model serialized Python object, with the Python Module, Pickle. This will create a file with the suffix of pkl, which is the file that you will be deploying. Azure Machine Learning Workbench expects that you will be deploying via Docker containers or creating an Azure cluster.  You will need to register the Docker container in the Machine Learning Container for it to be deployed.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Analyzing Salary Data with Power BI and R – Part 2

As advanced analytical techniques become more popular, companies are looking to hire people to find answers in their data. What kind of answers?  Predicting the future, determining what factors make customers leave, what kind of products can they get good customers to buy, what conditions are related, is this a valid transaction and similar questions. What answers can be provided have everything to do with the available data. The data I chose to analyze was the salary data provided by Brent Ozar, which is publicly available here.  I started looking at the data in a previous post where I did an initial review of the data and discussed the data analysis process.

Regression Analysis – What kind of relationships can I find in the Data?

Looking at how data is related is a very important step in data analysis.  Most often various items are analyzed using some linear regression algorithms to compare one or more variables together. For this kind of analysis, generally speaking all of the data needs to be represented numerically, which means that if data instead exists as categories of items, the data will need to be transformed. For example, on the Salary Data which I analyzed and published in Power BI, the job experience and salary roe compared using the ggplot library from R, and the two different values are included on respective axis. What I was hoping to find is that there was a strong relationship between these two values.  If you look at the graph, you can see this is not the case. Interestingly enough while the line shows an upward trend, you can see a drop in salaries for those with a lot of job experience.  Those people with the most experience, above 35 years are making less money than those with less experience.  The graph also shows those who are just starting in their careers are not necessarily making the very little money. What the data shows is there is no guarantee that the longer you work the more money you make.

Data Cleansing for Analysis

Because I am looking for data trends and not anomaly detection, I normalized the survey data.  I eliminated the 100 people who did not fill in the salary amount, and cut off the high and the low.  I used the box plots generated in Power BI to serve as a guideline for the ranges to exclude.  As I was also interested in determining the difference in the responses between male and female, so I did some data substitution on some of the values as I wanted to included more records.  In 2018, the only year that this question was asked, 87.6% of the respondents were male.  I made the decision to include all of the respondents where the number of respondents was less than .22% as male so that I would have more data to evaluate. I modified all of the data in Power BI using M code.  You can take a look at all of the modifications I made to the data here in the Power BI report I created, as I am making it available for the next 30 days.

Examining the Top 5%

Recently I have had some conversations with some colleges regarding salary, and that led me to want to review what people would like to make.  Most people would like to be making the most money possible in their profession, and are not interested in moving, which is why I chose not to do much with the geographic data.  I ran a number of different machine learning algorithms on the data trying to find a definitive set of results among those who reported making the most money.  The results of those experiments were inconclusive.  While I found some items which were common among the highest earners, the results were not statistically conclusive. There are a number of conceptions that people have regarding salary, and I chose to illustrate some of them to dispel some myths surrounding data. I also grouped the salaries into groups: 95% for above 153,565, 75 for above 67,789, and the rest for the average.  These numbers were based on the values in the box chart in the top left of the Power BI report.

Salary Conclusions – Myth vs Reality

I know that I have heard that if you want to make money you need to get into management. Being a good manager is not the same skill set as being a good database professional, and there are many people who do not want to be managers.  According to the data in the survey, you can be in the top 5% of wage earners and not be a manager. How about telecommuting? What is the impact on telecommuting and the top 5%?  Well, it depends if you are looking at the much smaller female population. The majority of females in the top 5% telecommute.  Those who commute 100% of the time do very well, as well as those who spend every day at a job site.  Males report working more hours and telecommuting less than females do as well.  If you look at people who are in the average category, they do not telecommute. The average category has 25% of people who work less than 40 hours a week too. If you look at the number of items in the category by country you can determine that in many cases, like Uganda, there are not enough survey respondents to draw any conclusions about salary in locations.

After spending quite a bit of time analyzing and visualizing the data, I was unable to determine a specific set of skills which to provide a roadmap of exactly what one needs to do to be in the top 5% of the salary for a data professional.  What I can tell you is more than likely there is someone with your level of work experience and position who is doing really well, and there is no reason why by the time that the next survey comes out, you are not the person who is in the top 5 percent.  This may mean working harder at your job and perhaps changing employers as the analysis shows that is the best way to make more money.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Analyzing Salary Data with Power BI and R – Part 1

CRISP-DM Diagram

The standard method for analyzing data is the CRoss Industry Standard Process for Data Mining [CRISP-DM].    Rather than describe the method, this post will walk through the process to illustrate how to analyze data using it. The data that I selected for analysis is the Brent Ozar Salary Survey information.The data is available under open source license and contains two years of answers to salary data with a total of over 6,000 responses. Understanding what is in the data determines what kind of answers it can provide. What can the data reveal?  Prior to drawing any conclusions, one needs to examine the data to determine the level of completeness, correctness and whether or not you have enough data to make decisions based upon the data.

Data Understanding

The first step in the process is to analyze the data to evaluate what kind of knowledge you can gather from the data.  The primary perspective of the data is salary, and the survey describes the characteristics which people with a certain salary level have.  The survey used to gather the data contained a number of drop down boxes and those fields can be used as categorical variables as there are a fixed number of possible responses.  Other survey items allowed people to freely enter anything, which makes it harder to statistically analyze some of the data.

 

Where Do Data Professionals Make the Most Money?

In the survey for 2018, the people who made the most money were from Hong Kong with an average salary of $263,289.  Before you start planning on moving, you will might want to look at the data a little closer.  There were 2 people who responded from Hong Kong.  One of them said he was making over 1.4 million dollars, the highest amount reported in the survey.  Given the fact that we only have two responses from Hong Kong, we will be unable to draw a definitive conclusion with 2 records. To be able to answer that question, more analysis will need to be done on the location and salary information and you will probably want to add market basket criteria because a dollar say in Hong Kong doesn’t go as far as the average apartment rental is $3,237 a month as it does say in Uganda where the rent is around $187 a month.

Using Power BI to Provide Data Understanding

The data modeling step of the CRISP-DM process anticipates that you will want to modify the raw data.  There may be records containing null or erroneous values that you will need to eliminate the entire record or substitute entries for a particular value.  You can also use this analysis to determine what conclusions you will be able to derive from the data.  For example, if you wanted to analyze what criteria are required for Microsoft Access Developers to make over $100,000 a year, you could easily find out in Power BI that it is not possible to do that analysis as there is no data for that set of criteria. If you want to do a year over year analysis of people who are working as DBAs, which I show in the second tab, you will need to change some of the categories as they changed from 2017 to 2018.

Analyzing Data with R and Power BI

Many times when providing a final report to explain your analysis, you will need to provide some documentation to demonstrate your conclusions.  In addition to creating some visualizations in Power BI, I also created some in R to include visualizations and analysis with R.  While I can include any R library I wish in Power BI Desktop, there are only 364 currently added to the Power BI Service in Azure.  If there is an R visualization you would like to add, you can send an email request to rinpbifb@microsoft.com and ask for it to be added.

Power BI Salary Data

For more information on Analyzing Data with Power BI and R, I recorded a video for Microsoft’s Power BI team which is available here.  For the next 30 days, for those of you who are interested, you can download the PBIX file used to analyze the data here. The data included in the PBIX also includes a list of the R libraries included in Power BI, R Libraries not supported by Power BI and the data and the accompanying M code used to clean the data.  The cleaning process and the analysis of the Salary data itself will be included in another post.  If you would like to find out when the next post in this series is available, please subscribe to my website for all of the latest updates.

*** UPDATE: My next post on further analysis of this data is available here.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

 

 

 

 

SQL Server 2017 Machine Learning Services Part 3 – Internals

After you have installed SQL Server 2017 with Machine Learning Services, you may notice a couple of interesting things.  One is that by default you will have 20 new users created.  These user ids are  by default named MSSQLSQLServer01, MSSQLSQLServer02, MSSQLSQLServer03… MSSQLSQLServer20, but if you have a named instance, like I have called SQLServer2017, the users are named with the named instance.  There is a subdirectory created for each User ID with is by default located in  \Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityDataExternal .  You do not want to remove these User IDs or rename them.  You may be wondering Why do you have all of these User IDs to use Machine Learning Services and what are they for? Keep on reading to find the answer

SQL Server Launchpad and User IDs

When calling external processes, internally SQL Server uses User IDs to call the Launchpad service, which is installed as part of Machine Learning Services and must be running for SQL Server to be able to execute code written in R or Python.  The number of users is set by default.  To change the number of users, open  up SQL Server Configuration Manager by typing SQLServerManager14.msc at the run prompt. For some unknowable reason Microsoft decided to hide this application which was previously available by looking at the installed programs on the server.  Now for some reason they think everyone should memorize this obscure command. Once you have the SQL Server Configuration Manager open, right click on the SQL Server Launchpad service and select the properties which will show the window, as shown below.  You will notice I am running an instance called SQLServer2017 which is listed in parenthesis in the window name.

SQL Server 2017 Launchpad Configuration

Clicking on the Advanced Tab shows an entry for External Users Count, which is shown highlighted. This value is set by default to 20 users.  This means that 20 different threads can concurrently call an R or Python process.  If you reduce this number to 0, no R or Python code can be run, and the SQL Server Launchpad service will not run.  The minimum number of users you can have and have the launchpad service still run is two, but changing the users to that low number is not recommended as those processes are needed to run Machine Learning Services to rn.  If you have more than 20 concurrent R or Python processes running, SQL Server will wait until one of these threads is no longer in use and once one is free, will use it to call another process. While the process is running you may see some GUIs or other non-decipherable data appear in the folders for a user.  The garbage cleanup runs soon after to delete anything that is in the folder, as they will eventually all be empty. What does the Launchpad Service do and what does that have to do with Machine Learning Services and SQL Server? Well, the short answer is the launchpad.exe is used to call R and Python.

SQL Server Internal Machine Learning Components

To run R or Python code in SQL Server, you will need to execute an external script, which I talked about in the first post of this series.  The following diagram illustrates what happens when that call is made and what executables are called.  When a request to run R or Python code is received by the sqlservr.exe, using a named pipe, SQL Server calls the Launchpad.exe. Every time a stored procedure or call to run R or Python is requested an Rlauncher or Python process is run.  Windows job objects to process the are also created if none exist, but if there are unused windows job objects initiated by a previous call and not presently in use they will be utilized.

The job objects containers will execute the code using the rterm.exe or Python.exe. The rxlink.dll processes messages to the BxlServer to process any SQL/R functions written in the R code, send monitoring information to the SQLPAL, create XEvents.  The Python35.dll will run the python code.  If the Python code is using the revoscalepy library it will call the SQLPAL to create XEvents to use it. Otherwise it will call the BxlServer and call the sqlsatellite.dll to send and retrieve data from SQL Server.  The data is sent back to SQL Server from the sqlsatellite.dll back to SQL Server.  The named pipe used to call launchpad.exe is created internally and is not part of any other named pipe process.  The launchpad.exe uses the User IDs to call R or Python external processes. The R and Python code is executed outside of SQLPAL and the processed data is returned by sqlsatellite.dll to SQL Server.

Hopefully this post answered the questions you had about what SQL Server is doing when you run Machine Learning Services. If you have any additional questions, please let me know by asking me on twitter @desertislesql or leaving me a comment on this post.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

 

SQL Server 2017 Machine Learning Services Part 2 – Memory Allocation

SQL Server 2017 fundamentally changed the underlying structure of SQL Server for reasons that had nothing to do with Machine Learning Services.  Understanding this new architecture will help you configure SQLServer to optimally run R and Python. When Microsoft set out to get SQL Server to work on Linux, the goal was to provide the nearly 30 years of development effort to a new operating system without having to re-write all of the code used to make SQL Server run on the Linux operating system. For SQL Server 2005, Microsoft created a SQLOS, which created an abstraction layer between the hardware and SQL Server.  This abstraction layer allowed SQL Server to take advantages of hardware changes by expanding the capability of SQL Server to take advantage of hardware changes even when the operating system had not implemented all of the code needed to fully implemented it. From a practical perspective, this mean when you configured SQL Server internally to use 100% of all available memory, this didn’t mean all of the memory on the server, it mean 100% of the memory allocated to SQL Server.

For SQL Server 2017, Microsoft created the SQL Server Platform Abstraction Layer [SQLPAL].  Like SQLOS before it, SQLPAL abstracts the calls to the operating system. It implemented the ability to be operating system independent by separating SQL Server Code from the operating system by creating abstraction layer between SQL Server and the Operating system which includes the management of memory, processing thread and IO. This layer of abstraction provides the ability to create one version of SQL Server code which can then be run both platforms, Linux or Windows operating systems.  SQL PAL manages all memory and threads used by SQL Server.

Machine Learning Resources and SQL Server Memory Allocation

Enabling Machine Learning Services on SQL Server which I discussed in a previous blog post, requires you to enable external scripts.  Machine Learning Services are run as external processes to SQLPAL. This means that when you are running Python or R code you are running it outside of the managed processes of SQL Server and SQLPAL.  This design means that the resources used to run Machine Learning Services will run outside of the resources allocated for SQL Server.  If you are planning on using Machine Learning Services you will want to review the server memory options which you may have set for SQL Server.  If you have set the max server memory For example, if your server has 16 GB of RAM memory, and you have allocated  8 GB to SQL Server and you estimate that the operating system will use an additional 4 GB, that means that machine learning services will have 4 GB remaining which it can use.

By design, Machine Learning Services will not starve out all of the memory for SQL Server because it doesn’t use it.  This means DBAs to not have to worry about SQL Server processes not running because some R program is using all the memory as it does not use the memory SQL Server has allocated.  You do have to worry about the amount of memory allocated to Machine Learning Services as by default, using our previous example where there was 4 GB which Machine Learning Services can use, it will only use 20% of the available memory or  819 KB of memory.  That  is not a lot of memory.  Most likely if you are doing a lot of Machine Learning Services work you will want to use more memory which means you will want to change the default memory allocation for external services.

SQL Server Resource Allocation

SQL Server manages all resources using the application layer, SQLOS. SQLOS is the interface between SQL Server and all of the underlying hardware resources, including of course memory.  Using the Resource Governor within SQL Server it is possible to allocate the resources used by specific processes to ensure that no single process will for example use all the memory, starving out other processes running on the machine. Configuring and using Resource Pools provides more important functions such as production applications to be allocated the majority of the SQL Server resources used by the SQLOS. This will ensure for example that an ad-hoc reporting query will not adversely impact the primary application.

Machine Learning Services Resource Allocation within SQL Server

The allocations for the Resource Governor for all SQLPAL functions can be found by running

SELECT * FROM sys.resource_governor_resource_pools WHERE name = 'default'

By default, the max cpu, memory and cpu cap are all set to 100 percent. To look at the resource allocation for Machine Learning Services, you will need to  look at the the external resource pools.

SELECT * FROM sys.resource_governor_external_resource_pools WHERE name = 'default'

By default, the maximum memory that Machine Learning Services can use, outside of the memory that has been allocated to SQL Server, is 20% of the remaining memory. If the processes running require more memory, the allocated percentage amounts for memory and external pool resources may need to be adjusted. The following settings will decrease the overall memory settings for SQLOS and increase the memory allocated to external processes from 20% to 50%

ALTER EXTERNAL RESOURCE POOL "default" WITH (max_memory_percent = 40);
ALTER RESOURCE GOVERNOR reconfigure;
GO

Using our previous example of 4 GB of memory available after the memory allocation to SQL Server and the OS, the memory available for Machine Learning Services would go from .819 GB to 2 GB.  Setting resources for the external resource pool will in no way impact the resources SQL Server uses.  If you run the previous queries listed above you will see the changes made to the external pool while the standard resource governor pool is not changed.

Determining How Much Memory is needed for Machine Learning Services with SQL Server

How do you know how much memory SQL Server needs for Machine Learning Services? Well since I am a consultant I feel compelled to say, it depends.  Given the relative newness of the Machine Learning Tools, there are not any really good guidelines as the memory which you are using greatly depends on the complexity and quantity of the R or Python code you are running as well as how much data these processes are running against.  It also depends what language you are using.  R is more memory intensive than R and unless you are using the Rx functions which are a part of the Machine Learning Services service, will not swap items in and out of memory. The best way to determine how much memory you are using is to monitor its use over time, and the best way to do that is to create a process for monitoring the external resources.

Best Practice Method for Monitoring Machine Learning Services Resources

Creating resource pools for machine learning to monitor use over time is considered a best practice method for ongoing monitoring of resources. The following code will create an external resource pool for processes running Machine Learning Services and classifying the resources run to use it. If you are familiar with setting up resource pools in SQL Server, this process is the same, it just needs to be applied to external resources as well to use the external resources. To monitor the Machine Learning Services, the first step is to create an external resource pool called ML_Resources instead of just using the default. I am going to allocate all of the external resources to it.

CREATE EXTERNAL RESOURCE POOL ML_Resources WITH (max_memory_percent = 100);

The next step in the process is to create a workload group.  The workload group, named MLworkloadGroup  in the code, is used as a container to hold processes which have been classified as ML processes.

CREATE WORKLOAD GROUP MLworkloadGroup WITH (importance = medium) USING "default", EXTERNAL "ML_resources";

The next step is to create a function for classifying processes running as R or Python so that they can be monitored in the workload group.

USE master
GO
CREATE FUNCTION is_ML_app()
RETURNS sysname
WITH schemabinding
AS
BEGIN
IF program_name() in ('Microsoft R Host', 'RStudio', ‘Python’, ‘Pythonw’) RETURN 'MLworkloadGroup';
RETURN 'default'
END;
GO

Once the function has been created, then the Resource Governor is directed to use the function so that all of the Python and R code are monitored in the external resource pool and turns on the Resource Governor with the reconfigure command.

ALTER RESOURCE GOVERNOR WITH (classifier_function = dbo.is_ML_app);
ALTER RESOURCE GOVERNOR   reconfigure;
GO

Going forward, all processes running R or Python will be classified and use all available memory.  After these steps are completed, you can obtain performance information from the DMVs sys.dm_resource_governor_resource_pool and  sys.dm_resource_governor_workload_groups by creating a query like this

USE master
GO
SELECT a.session_id, a.login_name,  b.name
FROM sys.dm_exec_sessions AS a
JOIN sys.dm_resource_governor_workload_groups AS b
ON a.group_id = b.group_id

 

Using the Windows Performance Monitor, you will now be able to take a look at the resources being used for Machine Learning Services and can then determine how much memory is needed based upon actual usage on the server.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2017 Machine Learning Services Part 1- Installation

With the release of SQL Server 2017, you now have the capability to incorporate both R and Python into SQL Server. As there  is a lot of material on the topic, this is the first post in the series which covers installation.  In future topics we will be covering the internal components, monitoring a R and Python code to determine performance impact on SQL Server, creating and maintaining R code, creating and maintaining Python code, and other related topics.

Installing Machine Learning Services

R was first introduced in the SQL Server 2016 and it was called R Services.  For SQL Server 2017, this same service was renamed to Machine Learning Services, and expanded to include Python. In the install there are three options, installing the Machine Learning Services, then selecting R and/or Python as you see in the attached picture.

Why you want to select Machine Learning Services(In-Database)

There are two installation options:  In-Database or Standalone.  If you are evaluating Machine Learning Services and you have no knowledge of what the load may be, start by selecting the Machine Learning Service In-Database.  There are several reasons why by default you want to select the In-Database option. One of the problems that Microsoft was looking to solve by incorporating advanced data analytics was to improve performance of the native code by greatly reducing data latency.  If you are analyzing a lot of data which is stored within SQL Server, the performance will be improved if the data does not need to be moved around on a network. Also, the licensing costs of installing R Server standalone also need to be evaluated with a Microsoft representative as well. An evaluation of the resource load on the network, as well as analysis of the code running on SQL Server should be performed prior to the decision to install the Machine Learning Server Standalone.

Internet Access Requirements for installing R and Python

The Machine Learning Service is an optional part of the SQL Server Install.  Because R and Python are both open source applications, Microsoft cannot include the R or Python executables within the install of SQL Server.  The executables must be downloaded from their respective locations on the internet, and the installation process is a little different if there is no internet access.  Each language has two installs, one for the executable and one for the server. If you do not access to the internet on the server where you are installing SQL Server 2017, you can download the files needed for the install.

Here are the links for SQL Server 2017 CU2.

Microsoft R Open

Microsoft R Server

Microsoft Python Open

Microsoft Python Server

If you are installing a different version, use the links provided on the Offline Installation screen. These links each will download a .cab file.  You will need to copy the cab files to a location where the server can access them and provide the path in the Offline Installation window.

 

What is Installed with Machine Learning Services

Machine Learning is an external process and communicates with launchpad.exe to access either R or Python.  For a quick check to see if the Machine Learning Services were installed, look for the SQL Server Launchpad service in the list of running services. It will also create by default 20 different external users which are used to call R or Python. There will be a subdirectory created for each user, with the name of the SQL Server instance name, which is by default MSSQLSERVER, followed by a number 00-20.  These subdirectories have nothing in them, as they are used temporarily when R or Python needs them and the information in them is eventually deleted by SQL Server. The default location is C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData.

The R tools are located in the C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES directory, and include RLaucher.dll which will load R.  If you want to run R on the server, which can be handy when updating R libraries, run the RGUI.exe, which will load up an interface where you can run R code.  In SQL Server 2017 CU2, Microsoft R Open 3.3.3 is installed along with R Server 9.2.0.  Microsoft R Open is a version that is 100% Open Source and completely compatible with the standard Open Source version of R, which is commonly referred to as Comprehensive R Archive Network [CRAN] R.  Microsoft rewrote some of the underlying functions so that they would be multi-threaded, which R is not, and incorporate the Intel Math Kernel libraries to improve the performance.   R Server is the version of R which contains the proprietary functions which Microsoft created for SQL Server which include the ability to load code in and out of memory, which will be discussed in a future post.

The Python tools are located in the C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES directory, and include Pythonlauncher.dll which will load Python 3.5. The installation includes the Anaconda distribution for Python, which includes not only the data science components of Anaconda, but also SciKitLearn and Pandas.  Microsoft has also included machine learning algorithms which were created for Python in the microsoftml and revoscalepy libraries.  There is a lot of interesting content will be discussing more about these libraries in a later post.

Configuring SQL Server to Run R or Python

Once the Machine Learning components are installed, there are some configuration steps which must be completed to permit R or Python to run on SQL Server.  If this is a new server, make sure to install SQL Server Management Studio, since it is not included in the SQL Server Install. From within an SSMS query window, the following script needs to be run to enable R

SP_CONFIGURE 'external scripts enabled', 1
GO
RECONFIGURE
GO

After this step completes successfully, a restart of SQL Server Services is required.  When stopping the service, you will be notified that SQL Server Launchpad also will need to be restarted.  I have noticed that for  some reason, Launchpad does not always restart when SQL Server is restarted, so you might want to check to make sure that it is running, as you cannot run R or Python unless the SQL Server Launchpad service is running.  After the restart, to check to see if R is working properly, run the following code from within an SSMS query window.

EXEC sp_execute_external_script @language =N'R',
@script=N'OutputDataSet <-InputDataSet',
@input_data_1 =N'SELECT 1 as CheckToSeeIfRIsWorking'
WITH RESULT SETS (([CheckToSeeIfRIsWorking] int not null));
GO

When run successfully, this script will return a 1. SQL Server is now ready to run R.  To check to see if Python can be run successfully, run this script.

EXEC sp_execute_external_script  @language =N'Python',
@script=N'OutputDataSet = InputDataSet',
@input_data_1 = N'SELECT 1 AS CheckToSeeIfPythonIsWorking'
WITH RESULT SETS ((CheckToSeeIfPythonIsWorking int not null));
GO

In my next post I will cover the SQL Server internal components which are run when R or Python code is run.  Please subscribe to my blog to be notified when the next installment will be available. If you have any questions, comments or ideas for future post topics, please leave me comments as I would really enjoy any feedback.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Applying Data Science to SQL Server

Data has been getting a lot of attention in the business world for a while now.  First there was big data, which was another way to store data so that later the data could be analyzed.  Recently the talk has been all about analyzing the data with new tools such as R and Python.  The reality is that people who have been working with databases doing work in business intelligence have been analyzing data for a while.  Learning a different toolset for analyzing data is not such a big leap, but an expansion of what they know.  As the field is rapidly expanding now, and demand is huge, now is a great time to learn the tools.

Traditional Data Science Development

Data scientist have created analysis solutions with data for a number of years.  The data is analyzed, cleaned, processed with various algorithms, and results are created.  When the process is complete, code has been created to provide meaning from a portion of the data and is ready to be migrated to production. Traditionally there has been a big gap between creating a solution and implementing the solution to be run against data on a regular basis.  Data Scientists traditionally are not part of the IT organization, they are actuaries or analysts, not the people who have anything to do with system processing. Recently I did some work for a company and after the data scientists were done creating a solution, they turned over all of their code to the Java team.  Six weeks later the code was released into production. This solution made no one happy.  Management thought it took too long.  The data scientist didn’t believe that the code that they created was what was implemented into production, and the java developers were tired of people blaming them for wrong code which required a long time to implement.

SQL Server Implementation of Data Science

Since SQL Server 2016 incorporates R and SQL Server 2017 has added the ability to include Python code into SQL Server, data science solutions can be incorporated as part of a scheduled process with SQL Server.  There is now a dev ops solution for incorporating R and Python into SQL Server.  One way of learning about the technology is through blogs and other online training which can help you get up to speed.  Many times though there is no substitute for hands on learning.  If you are attending PASS Summit 2017, and want to learn not only about data science, but how to incorporate it into SQL Server, I hope you can sign up for my all day training session on Applied Data Science for the SQL Server Professional.  I hope to see you there.

I have recently created a You Tube channel where I plan on sharing more data related content where I have included my first video about this conference.

If you are at PASS Summit, please introduce yourself as I would love to meet people who read my blog personally.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

What to do when SQL Server 2016 R Installation Does Not Work

I was recently helping a customer with their R installation.  After R was installed successfully, I went to the SQL Server Management Studio and ran the script needed to configure external services

EXEC sp_configure  'external scripts enabled', 1
RECONFIGURE WITH OVERRIDE
GO
RECONFIGURE;
GO

After this script ran successfully, I closed SSMS, and restarted the SQL Server service, noting when I did so that it told me that it was also going to be restarting SQL Server Launchpad, which is installed as part of R Server, was also going to be restarted.  Everything looked good until I tried to run my first R command.  I went for the super generic script

EXEC sp_execute_external_script @language=N'R'
,@script=N'OutputDataSet <-InputDataSet'
,@input_data_1=N'SELECT 1 as CheckToSeeIfRIsWorking'
WITH RESULT SETS
(
(
[CheckToSeeIfRIsWorking] INT NOT NULL
)
);

Much to my surprise after this I received an error

Msg 39019, Level 16, State 1, Line 1
An external script error occurred:
Unable to launch the runtime. ErrorCode 0x80070490: 1168(Element not found.).
Msg 11536, Level 16, State 1, Line 1
EXECUTE statement failed because its WITH RESULT SETS clause specified 1 result set(s), but the statement only sent 0 result set(s) at run time.

I looked in the log files and didn’t find any errors.  I checked the configuration manager to ensure that I had some user ids configured in the configuration manager.  Nothing seemed to make any difference.  Looking online, the only error that I saw which might possibly be close was a different error message about 8.3 naming and the working directory.

Resolving Error Code 0x80070490 Unable to launch the runtime for R in SQL Server

When I saw this error, I first checked the various SQL Log files. I didn’t find any errors.  I checked the configuration manager to ensure that I had some user ids configured in the configuration manager.  The default 20 users were there as expected. Nothing seemed to make any difference.  Looking online, the only error that I saw which might possibly be close was a different error message about 8.3 naming and the working directory.  I ran into a similar problem with R and Power BI.  R natively doesn’t understand anything but directories which are only 8 characters long.  If there are more letters than this in a directory, which is always the case with the default SQL Server directories, internally the directory is read as the first 8 characters and a tilde(~). This particular server also had an installation of SQL Server 2014 on it, and if you looked at the filename, it was impossible to tell which one was which as the directory read like this C:\PROGRA~1\MICROS~2\MSSQL1~1.MSS\MSSQL\EXTENS~1    I created a new directory with short names D:\Rwork and changed the working directory to the new subdirectory Rwork.  I was really hopeful that this would work.  But when I ran the external script listed above, it still did not work.  What had I done wrong?  I didn’t delete the previous working directory, which was in the default location C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\ExtensibilityData . When SQL Server creates this subdirectory, it also includes subdirectories for all of the users.  When I created the new directory, it didn’t have those subdirectories added.  I copied the subdirectories from the old working directory to the new one and tried again.  This time, success!

I hope you find this post helpful, as I know I could have used it when I was trying to figure this out. If it saved you time too, drop me a line and let me know.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Database Table Design

http://michaeljswart.com/2016/06/t-sql-tuesday-079-its-2016/comment-page-1/#comment-186750There are a number of different ways that you could decide to organize your data in a database. If you are creating a database to be used in a transactional system, your table design should follow a normalized design as much as possible.  Data should be grouped in logical groups, such as customers, products, sales, orders, quotes, tickets etc.  Redundantly repeating data in multiple places will cause problems in the future.  Your design may include hundreds of tables, and that is perfectly fine.

If the purpose of the database is for Power BI or for a data warehouse, dimensional modeling techniques should be deployed.  In this type of database design, generally speaking there are a number of tables containing descriptive data, such as product and customer and few fact tables which contain the actions which happened.  The actions include things like Sales.  The database design will look much like a star with the fact tables in the center and the dimension tables connected to it like satellites.  If you have one dimensional model connected to another dimension, that design is called a snowflake and some applications, like Analysis Services Multi-dimensional will not process it well.  Power BI and Analysis Services Tabular work very well with snowflake dimensions.

SnowflakeModel

Snowflake Data Model from Power BI

 

Table Design Gone Wrong

To paraphrase Ron White, the reason that I described database modeling is so that even people who know nothing about database design could appreciate my interview story. When I was working at a previous location, I assisted in providing technical reviews for database developer jobs.  We asked a number of typical questions about indexing and stored procedures, but I always tried to come up with at least one question which the candidate could not readily answer by cramming interview questions found on the internet.  I decided to ask one candidate, who did correctly answer the previous stock questions something that would let us know what kind of work he had really done.  I asked him “What do you do to determine how to design a table?” I was interested to find out what his thought process was, see if he would mention normal form or describe something he had done in the past.  I was completely surprised by his answer.

“Well, you can only have 256 columns in a table. After that you have to create a new one.”  This answer was a complete surprise.  I was really curious to find out where he had developed this completely warped view of how to determine what fields should go in a table.  It turned out that he learned all of his database skills from a co-worker, who had recently retired.  His co-worker had worked at the same location for a very long time and when he started used mainframes without any databases.  He had migrated some of the applications to databases and they wrote them this way because it “made sense”.   After that the interview was over, and we hired someone else.

I challenge anyone who is learning databases to please look up what people tell you to do on the internet.  This is useful for two reasons, the first being that it will probably help you learn the concept better to read about it another way.   The other reason is that you can find out if the person teaching you really knows what they are doing, so you will learn the correct way to do something.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur