Articles By Ginger Grant

Analyzing Salary Data with Power BI and R – Part 2

As advanced analytical techniques become more popular, companies are looking to hire people to find answers in their data. What kind of answers? Predicting the future, determining what factors make customers leave, figuring out what products good customers can be persuaded to buy, finding which conditions are related, deciding whether a transaction is valid, and similar questions. The answers that can be provided have everything to do with the available data. The data I chose to analyze is the salary data provided by Brent Ozar, which is publicly available here. I started looking at the data in a previous post where I did an initial review of the data and discussed the data analysis process.

Regression Analysis – What kind of relationships can I find in the Data?

Looking at how data is related is a very important step in data analysis. Most often, variables are analyzed using a linear regression algorithm to compare one or more variables against each other. For this kind of analysis, generally speaking, all of the data needs to be represented numerically, which means that if data instead exists as categories of items, it will need to be transformed. For example, in the salary data which I analyzed and published in Power BI, job experience and salary are compared using the ggplot library from R, and the two values are plotted on their respective axes. What I was hoping to find was a strong relationship between these two values. If you look at the graph, you can see this is not the case. Interestingly enough, while the line shows an upward trend, you can see a drop in salaries for those with a lot of job experience. People with the most experience, above 35 years, are making less money than those with less experience. The graph also shows that those who are just starting their careers are not necessarily making very little money. What the data shows is that there is no guarantee that the longer you work, the more money you make.
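
The full script is in the Power BI report, but a minimal sketch of that comparison in R looks something like the code below. It assumes the survey has already been loaded into a data frame named salary with columns YearsWithThisTypeOfJob and SalaryUSD; the actual column names in the survey file may differ.

library(ggplot2)

# Scatter plot of experience vs salary with a fitted linear trend line
ggplot(salary, aes(x = YearsWithThisTypeOfJob, y = SalaryUSD)) +
  geom_point(alpha = 0.3) +                  # one point per respondent
  geom_smooth(method = "lm", se = TRUE) +    # linear regression line with confidence band
  labs(x = "Years of job experience", y = "Salary (USD)")

# The fitted slope gives a rough sense of how weak the relationship is
summary(lm(SalaryUSD ~ YearsWithThisTypeOfJob, data = salary))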

Data Cleansing for Analysis

Because I am looking for data trends and not doing anomaly detection, I normalized the survey data. I eliminated the 100 people who did not fill in the salary amount, and cut off the high and the low values. I used the box plots generated in Power BI as a guideline for the ranges to exclude. I was also interested in determining the difference in the responses between male and female, so I did some data substitution on some of the values, as I wanted to include more records. In 2018, the only year that this question was asked, 87.6% of the respondents were male. I made the decision to code all of the respondents in categories making up less than .22% of responses as male so that I would have more data to evaluate. I modified all of the data in Power BI using M code. You can take a look at all of the modifications I made to the data here in the Power BI report I created, as I am making it available for the next 30 days.
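
The cleansing itself was done with M inside Power BI, but for readers who think in R, here is a rough equivalent sketch of the same steps. The data frame and column names (survey, SalaryUSD, Gender) are assumptions for illustration, and the percentile cut-offs stand in for the ranges read off the box plots.

library(dplyr)

cleaned <- survey %>%
  filter(!is.na(SalaryUSD)) %>%                          # drop respondents who left salary blank
  filter(SalaryUSD >= quantile(SalaryUSD, 0.01),
         SalaryUSD <= quantile(SalaryUSD, 0.99)) %>%     # trim the extreme high and low values
  mutate(Gender = if_else(Gender %in% c("Male", "Female"),
                          Gender, "Male"))               # fold the very small categories into the majority value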

Examining the Top 5%

Recently I have had some conversations with colleagues regarding salary, and that led me to want to review what people would like to make. Most people would like to be making the most money possible in their profession and are not interested in moving, which is why I chose not to do much with the geographic data. I ran a number of different machine learning algorithms on the data, trying to find a definitive set of results among those who reported making the most money. The results of those experiments were inconclusive. While I found some items which were common among the highest earners, the results were not statistically conclusive. There are a number of conceptions that people have regarding salary, and I chose to illustrate some of them to dispel some myths surrounding the data. I also grouped the salaries into bands: the 95th percentile for salaries above $153,565, the 75th percentile for salaries above $67,789, and the rest as average. These numbers were based on the values in the box chart in the top left of the Power BI report.
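
For reference, a sketch of that banding in R, using the cleaned data frame from the previous section; the cut points are the ones taken from the box chart.

# Band each respondent by the salary cut-offs read from the Power BI box chart
cleaned$SalaryBand <- cut(cleaned$SalaryUSD,
                          breaks = c(-Inf, 67789, 153565, Inf),
                          labels = c("Average", "75th percentile", "Top 5%"))

table(cleaned$SalaryBand)   # count of respondents in each band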

Salary Conclusions – Myth vs Reality

I know that I have heard that if you want to make money you need to get into management. Being a good manager is not the same skill set as being a good database professional, and there are many people who do not want to be managers. According to the data in the survey, you can be in the top 5% of wage earners and not be a manager. How about telecommuting? What is the relationship between telecommuting and the top 5%? Well, it depends on whether you are looking at the much smaller female population. The majority of females in the top 5% telecommute. Those who telecommute 100% of the time do very well, as do those who spend every day at a job site. Males report working more hours and telecommuting less than females do as well. If you look at people who are in the average category, they do not telecommute. The average category also has 25% of people who work less than 40 hours a week. If you look at the number of responses in each category by country, you can determine that in many cases, like Uganda, there are not enough survey respondents to draw any conclusions about salary by location.
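
The statements above come from cross-tabs along these lines; a hedged sketch, again assuming hypothetical column names TelecommuteDaysPerWeek and Gender on the banded data.

library(dplyr)

cleaned %>%
  filter(SalaryBand == "Top 5%") %>%            # only the highest earners
  count(Gender, TelecommuteDaysPerWeek) %>%     # how many telecommute, split by gender
  arrange(Gender, desc(n))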

After spending quite a bit of time analyzing and visualizing the data, I was unable to determine a specific set of skills that provides a roadmap of exactly what one needs to do to be in the top 5% of salaries for a data professional. What I can tell you is that more than likely there is someone with your level of work experience and position who is doing really well, and there is no reason why, by the time the next survey comes out, you cannot be the person in the top 5 percent. This may mean working harder at your job, and perhaps changing employers, as the analysis shows that is the best way to make more money.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Analyzing Salary Data with Power BI and R – Part 1

CRISP-DM Diagram

The standard method for analyzing data is the CRoss Industry Standard Process for Data Mining [CRISP-DM]. Rather than describe the method, this post will walk through the process to illustrate how to analyze data using it. The data that I selected for analysis is the Brent Ozar Salary Survey information. The data is available under an open source license and contains two years of salary survey answers with a total of over 6,000 responses. Understanding what is in the data determines what kind of answers it can provide. What can the data reveal? Prior to drawing any conclusions, one needs to examine the data to determine its level of completeness and correctness, and whether or not there is enough data to base decisions on.

Data Understanding

The first step in the process is to analyze the data to evaluate what kind of knowledge you can gather from it. The primary perspective of the data is salary, and the survey describes the characteristics which people with a certain salary level have. The survey used to gather the data contained a number of drop-down boxes, and those fields can be used as categorical variables as there are a fixed number of possible responses. Other survey items allowed people to freely enter anything, which makes it harder to statistically analyze some of the data.
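
In R, the drop-down fields translate directly into factors, while the free-entry fields do not; a small sketch, with column names that are assumptions for illustration.

# Drop-down answers have a fixed set of possible values, so they map cleanly to factors
survey$EmploymentStatus <- factor(survey$EmploymentStatus)
survey$JobTitle         <- factor(survey$JobTitle)
levels(survey$JobTitle)   # enumerate the categories the survey allowed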

 

Where Do Data Professionals Make the Most Money?

In the survey for 2018, the people who made the most money were from Hong Kong, with an average salary of $263,289. Before you start planning on moving, you might want to look at the data a little closer. There were 2 people who responded from Hong Kong. One of them said he was making over 1.4 million dollars, the highest amount reported in the survey. Given that we only have two responses from Hong Kong, we will be unable to draw a definitive conclusion from 2 records. To be able to answer that question, more analysis will need to be done on the location and salary information, and you will probably want to add cost-of-living (market basket) criteria, because a dollar does not go as far in Hong Kong, where the average apartment rents for $3,237 a month, as it does in, say, Uganda, where rent is around $187 a month.
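
A quick way to see why the Hong Kong number is misleading is to compute the average salary and the respondent count together; a sketch, assuming columns Country and SalaryUSD.

library(dplyr)

survey %>%
  group_by(Country) %>%
  summarise(AvgSalary   = mean(SalaryUSD, na.rm = TRUE),
            Respondents = n()) %>%
  arrange(desc(AvgSalary))
# Countries with only one or two respondents float to the top on a single outlier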

Using Power BI to Provide Data Understanding

The data modeling step of the CRISP-DM process anticipates that you will want to modify the raw data. There may be records containing null or erroneous values, and you will need to either eliminate the entire record or substitute an entry for a particular value. You can also use this analysis to determine what conclusions you will be able to derive from the data. For example, if you wanted to analyze what criteria are required for Microsoft Access Developers to make over $100,000 a year, you could easily find out in Power BI that it is not possible to do that analysis, as there is no data for that set of criteria. If you want to do a year-over-year analysis of people who are working as DBAs, which I show in the second tab, you will need to change some of the categories, as they changed from 2017 to 2018.

Analyzing Data with R and Power BI

Many times, when providing a final report to explain your analysis, you will need to provide some documentation to demonstrate your conclusions. In addition to creating some visualizations in Power BI, I also created some visualizations and analysis in R. While I can include any R library I wish in Power BI Desktop, there are only 364 currently added to the Power BI Service in Azure. If there is an R library you would like to see added, you can send an email request to rinpbifb@microsoft.com and ask for it to be added.
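
For anyone who has not built an R visual in Power BI before, the fields you drag onto the visual arrive in the script as a data frame named dataset, and the script simply plots it. A minimal sketch, assuming Salary and Gender fields have been added to the visual:

library(ggplot2)

# dataset is supplied by Power BI and contains the fields added to the R visual
ggplot(dataset, aes(x = Gender, y = Salary)) +
  geom_boxplot() +
  labs(title = "Salary distribution by gender")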

Power BI Salary Data

For more information on analyzing data with Power BI and R, I recorded a video for Microsoft's Power BI team which is available here. The video shows the cleaning process and some information regarding the analysis. The analysis of the salary data itself will be included in another post. If you would like to find out when the next post in this series is available, please subscribe to my website for all of the latest updates.

*** UPDATE: My next post on further analysis of this data is available here.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2017 Machine Learning Services Part 3 – Internals

After you have installed SQL Server 2017 with Machine Learning Services, you may notice a couple of interesting things. One is that by default you will have 20 new users created. These user IDs are by default named MSSQLSERVER01, MSSQLSERVER02, MSSQLSERVER03 … MSSQLSERVER20, but if you have a named instance, like mine which is called SQLServer2017, the users are named with the named instance. There is a subdirectory created for each user ID, which is by default located in \Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData. You do not want to remove these user IDs or rename them. You may be wondering why you have all of these user IDs for Machine Learning Services and what they are for. Keep on reading to find the answer.

SQL Server Launchpad and User IDs

When calling external processes, internally SQL Server uses these user IDs to call the Launchpad service, which is installed as part of Machine Learning Services and must be running for SQL Server to be able to execute code written in R or Python. The number of users is set by default. To change the number of users, open up SQL Server Configuration Manager by typing SQLServerManager14.msc at the run prompt. For some unknowable reason Microsoft decided to hide this application, which was previously available by looking at the installed programs on the server. Now for some reason they think everyone should memorize this obscure command. Once you have SQL Server Configuration Manager open, right click on the SQL Server Launchpad service and select Properties, which will show the window below. You will notice I am running an instance called SQLServer2017, which is listed in parentheses in the window name.

SQL Server 2017 Launchpad Configuration

Clicking on the Advanced tab shows an entry for External Users Count, which is shown highlighted. This value is set by default to 20 users. This means that 20 different threads can concurrently call an R or Python process. If you reduce this number to 0, no R or Python code can be run, and the SQL Server Launchpad service will not run. The minimum number of users you can have and still have the Launchpad service run is two, but changing the count to that low a number is not recommended, as those processes are needed for Machine Learning Services to run. If you have more than 20 concurrent R or Python processes running, SQL Server will wait until one of these threads is no longer in use, and once one is free, will use it to call another process. While a process is running you may see some GUIDs or other non-decipherable data appear in the folders for a user. The garbage cleanup runs soon after to delete anything that is in the folder, so eventually they will all be empty. What does the Launchpad service do, and what does that have to do with Machine Learning Services and SQL Server? Well, the short answer is that launchpad.exe is used to call R and Python.

SQL Server Internal Machine Learning Components

To run R or Python code in SQL Server, you will need to execute an external script, which I talked about in the first post of this series. The following diagram illustrates what happens when that call is made and which executables are called. When a request to run R or Python code is received by sqlservr.exe, SQL Server calls Launchpad.exe using a named pipe. Every time a stored procedure or call to run R or Python is requested, an Rlauncher or Python process is run. Windows job objects to process the request are also created if none exist; if there are unused Windows job objects initiated by a previous call and not presently in use, they will be reused.

The job object containers will execute the code using rterm.exe or Python.exe. The rxlink.dll passes messages to the BxlServer to process any SQL/R functions written in the R code, sends monitoring information to the SQLPAL, and creates XEvents. The Python35.dll runs the Python code. If the Python code is using the revoscalepy library, it will call the SQLPAL to create XEvents; otherwise it will call the BxlServer and call sqlsatellite.dll to send and retrieve data from SQL Server. The data is sent back to SQL Server from sqlsatellite.dll. The named pipe used to call launchpad.exe is created internally and is not part of any other named pipe process. The launchpad.exe uses the user IDs to call the R or Python external processes. The R and Python code is executed outside of SQLPAL, and the processed data is returned by sqlsatellite.dll to SQL Server.

Hopefully this post answered the questions you had about what SQL Server is doing when you run Machine Learning Services. If you have any additional questions, please let me know by asking me on twitter @desertislesql or leaving me a comment on this post.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2017 Machine Learning Services Part 2 – Memory Allocation

SQL Server 2017 fundamentally changed the underlying structure of SQL Server for reasons that had nothing to do with Machine Learning Services. Understanding this new architecture will help you configure SQL Server to optimally run R and Python. When Microsoft set out to get SQL Server to work on Linux, the goal was to bring nearly 30 years of development effort to a new operating system without having to re-write all of the code that makes SQL Server run. For SQL Server 2005, Microsoft created SQLOS, which provides an abstraction layer between the hardware and SQL Server. This abstraction layer allowed SQL Server to take advantage of hardware changes even when the operating system had not yet implemented all of the code needed to fully support them. From a practical perspective, this meant that when you configured SQL Server internally to use 100% of all available memory, this did not mean all of the memory on the server; it meant 100% of the memory allocated to SQL Server.

For SQL Server 2017, Microsoft created the SQL Server Platform Abstraction Layer [SQLPAL]. Like SQLOS before it, SQLPAL abstracts the calls to the operating system. It achieves operating system independence by separating the SQL Server code from the operating system with an abstraction layer that manages memory, processing threads and IO. This layer of abstraction makes it possible to create one version of the SQL Server code which can then run on both platforms, the Linux and Windows operating systems. SQLPAL manages all memory and threads used by SQL Server.

Machine Learning Resources and SQL Server Memory Allocation

Enabling Machine Learning Services on SQL Server, which I discussed in a previous blog post, requires you to enable external scripts. Machine Learning Services are run as external processes to SQLPAL. This means that when you are running Python or R code, you are running it outside of the managed processes of SQL Server and SQLPAL. This design means that the resources used to run Machine Learning Services will be outside of the resources allocated for SQL Server. If you are planning on using Machine Learning Services, you will want to review the server memory options which you may have set for SQL Server. For example, if your server has 16 GB of RAM, you have set max server memory to allocate 8 GB to SQL Server, and you estimate that the operating system will use an additional 4 GB, that means Machine Learning Services will have 4 GB remaining which it can use.

By design, Machine Learning Services will not starve out all of the memory for SQL Server, because it does not use it. This means DBAs do not have to worry about SQL Server processes not running because some R program is using all the memory, as it does not use the memory SQL Server has allocated. You do have to worry about the amount of memory allocated to Machine Learning Services. By default, using our previous example where there was 4 GB which Machine Learning Services can use, it will only use 20% of the available memory, or about 819 MB. That is not a lot of memory. Most likely, if you are doing a lot of Machine Learning Services work, you will want to use more memory, which means you will want to change the default memory allocation for external services.

SQL Server Resource Allocation

SQL Server manages all resources using the abstraction layer, SQLOS. SQLOS is the interface between SQL Server and all of the underlying hardware resources, including, of course, memory. Using the Resource Governor within SQL Server, it is possible to allocate the resources used by specific processes to ensure that no single process will, for example, use all the memory, starving out other processes running on the machine. Configuring and using resource pools allows important workloads, such as production applications, to be allocated the majority of the SQL Server resources managed by SQLOS. This will ensure, for example, that an ad-hoc reporting query will not adversely impact the primary application.

Machine Learning Services Resource Allocation within SQL Server

The allocations for the Resource Governor for all SQLPAL functions can be found by running

SELECT * FROM sys.resource_governor_resource_pools WHERE name = 'default'

By default, the max CPU, max memory, and CPU cap are all set to 100 percent. To look at the resource allocation for Machine Learning Services, you will need to look at the external resource pools.

SELECT * FROM sys.resource_governor_external_resource_pools WHERE name = 'default'

By default, the maximum memory that Machine Learning Services can use, outside of the memory that has been allocated to SQL Server, is 20% of the remaining memory. If the running processes require more memory, the allocated percentage for the external pool may need to be adjusted. The following settings increase the memory allocated to external processes from 20% to 50%.

ALTER EXTERNAL RESOURCE POOL "default" WITH (max_memory_percent = 50);
ALTER RESOURCE GOVERNOR reconfigure;
GO

Using our previous example of 4 GB of memory available after the memory allocation to SQL Server and the OS, the memory available for Machine Learning Services would go from 0.819 GB to 2 GB. Setting resources for the external resource pool will in no way impact the resources SQL Server uses. If you run the queries listed above again, you will see the change made to the external pool, while the standard Resource Governor pool is not changed.

Determining How Much Memory is needed for Machine Learning Services with SQL Server

How do you know how much memory SQL Server needs for Machine Learning Services? Well, since I am a consultant, I feel compelled to say: it depends. Given the relative newness of the machine learning tools, there are not any really good guidelines, as the memory you are using greatly depends on the complexity and quantity of the R or Python code you are running, as well as how much data these processes are running against. It also depends on which language you are using. R is more memory intensive than Python, and unless you are using the Rx functions which are part of Machine Learning Services, it will not swap data in and out of memory. The best way to determine how much memory you are using is to monitor its use over time, and the best way to do that is to create a process for monitoring the external resources.

Best Practice Method for Monitoring Machine Learning Services Resources

Creating resource pools for machine learning to monitor use over time is considered a best practice for ongoing monitoring of resources. The following code will create an external resource pool for processes running Machine Learning Services and classify those processes to use it. If you are familiar with setting up resource pools in SQL Server, this process is the same; it just needs to be applied to external resources as well. To monitor Machine Learning Services, the first step is to create an external resource pool called ML_Resources instead of just using the default. I am going to allocate all of the external resources to it.

CREATE EXTERNAL RESOURCE POOL ML_Resources WITH (max_memory_percent = 100);

The next step in the process is to create a workload group.  The workload group, named MLworkloadGroup  in the code, is used as a container to hold processes which have been classified as ML processes.

CREATE WORKLOAD GROUP MLworkloadGroup WITH (importance = medium) USING "default", EXTERNAL "ML_resources";

The next step is to create a function for classifying processes running as R or Python so that they can be monitored in the workload group.

USE master
GO
CREATE FUNCTION is_ML_app()
RETURNS sysname
WITH schemabinding
AS
BEGIN
IF program_name() in ('Microsoft R Host', 'RStudio', 'Python', 'Pythonw') RETURN 'MLworkloadGroup';
RETURN 'default'
END;
GO

Once the function has been created, the Resource Governor is directed to use the function so that all of the Python and R code is monitored in the external resource pool, and the Resource Governor is turned back on with the reconfigure command.

ALTER RESOURCE GOVERNOR WITH (classifier_function = dbo.is_ML_app);
ALTER RESOURCE GOVERNOR   reconfigure;
GO

Going forward, all processes running R or Python will be classified into the MLworkloadGroup and can use the memory available to the external pool (100% of it, in this example). After these steps are completed, you can obtain performance information from the DMVs sys.dm_resource_governor_resource_pools and sys.dm_resource_governor_workload_groups by creating a query like this

USE master
GO
SELECT a.session_id, a.login_name,  b.name
FROM sys.dm_exec_sessions AS a
JOIN sys.dm_resource_governor_workload_groups AS b
ON a.group_id = b.group_id

 

Using the Windows Performance Monitor, you will now be able to take a look at the resources being used for Machine Learning Services and can then determine how much memory is needed based upon actual usage on the server.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2017 Machine Learning Services Part 1- Installation

With the release of SQL Server 2017, you now have the capability to incorporate both R and Python into SQL Server. As there is a lot of material on the topic, this is the first post in a series, and it covers installation. Future posts will cover the internal components, monitoring R and Python code to determine the performance impact on SQL Server, creating and maintaining R code, creating and maintaining Python code, and other related topics.

Installing Machine Learning Services

R was first introduced in SQL Server 2016, where it was called R Services. For SQL Server 2017, this same service was renamed to Machine Learning Services and expanded to include Python. In the install, you select the Machine Learning Services feature and then select R and/or Python, as you see in the attached picture.

Why you want to select Machine Learning Services (In-Database)

There are two installation options: In-Database or Standalone. If you are evaluating Machine Learning Services and you have no knowledge of what the load may be, start by selecting Machine Learning Services (In-Database). There are several reasons why, by default, you want to select the In-Database option. One of the problems that Microsoft was looking to solve by incorporating advanced data analytics was to improve performance by greatly reducing data latency. If you are analyzing a lot of data which is stored within SQL Server, performance will be improved if the data does not need to be moved around on a network. The licensing costs of installing R Server standalone also need to be evaluated with a Microsoft representative. An evaluation of the resource load on the network, as well as analysis of the code running on SQL Server, should be performed prior to any decision to install the Machine Learning Server Standalone.

Internet Access Requirements for installing R and Python

The Machine Learning Service is an optional part of the SQL Server install. Because R and Python are both open source applications, Microsoft cannot include the R or Python executables within the install of SQL Server. The executables must be downloaded from their respective locations on the internet, and the installation process is a little different if there is no internet access. Each language has two installs, one for the open source distribution and one for the server components. If you do not have access to the internet on the server where you are installing SQL Server 2017, you can download the files needed for the install ahead of time.

Here are the links for SQL Server 2017 CU2.

Microsoft R Open

Microsoft R Server

Microsoft Python Open

Microsoft Python Server

If you are installing a different version, use the links provided on the Offline Installation screen. These links each will download a .cab file.  You will need to copy the cab files to a location where the server can access them and provide the path in the Offline Installation window.

 

What is Installed with Machine Learning Services

Machine Learning Services runs as an external process and communicates with launchpad.exe to access either R or Python. For a quick check to see if Machine Learning Services was installed, look for the SQL Server Launchpad service in the list of running services. The install will also create, by default, 20 different external users which are used to call R or Python. There will be a subdirectory created for each user, named after the SQL Server instance name, which is by default MSSQLSERVER, followed by a number from 01 to 20. These subdirectories have nothing in them, as they are used temporarily when R or Python needs them, and the information in them is eventually deleted by SQL Server. The default location is C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData.

The R tools are located in the C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES directory, and include RLauncher.dll, which will load R. If you want to run R on the server, which can be handy when updating R libraries, run RGUI.exe, which will load an interface where you can run R code. In SQL Server 2017 CU2, Microsoft R Open 3.3.3 is installed along with R Server 9.2.0. Microsoft R Open is a version that is 100% open source and completely compatible with the standard open source version of R, which is commonly referred to as Comprehensive R Archive Network [CRAN] R. Microsoft rewrote some of the underlying functions so that they would be multi-threaded, which base R is not, and incorporated the Intel Math Kernel libraries to improve performance. R Server is the version of R which contains the proprietary functions Microsoft created for SQL Server, which include the ability to load data in and out of memory; this will be discussed in a future post.
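
As a quick sketch of what updating a library from RGUI.exe looks like, the library path below is the default folder for this instance and is an assumption that may differ on your server; run RGUI.exe as an administrator so the install can write to it.

.libPaths()   # confirm which library folders this R installation reads from

# Install or update a package into the instance library (requires internet access on the server)
install.packages("ggplot2",
                 lib = "C:/Program Files/Microsoft SQL Server/MSSQL14.MSSQLSERVER/R_SERVICES/library")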

The Python tools are located in the C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES directory, and include Pythonlauncher.dll, which will load Python 3.5. The installation includes the Anaconda distribution for Python, which includes not only the data science components of Anaconda but also SciKitLearn and Pandas. Microsoft has also included machine learning algorithms which were created for Python in the microsoftml and revoscalepy libraries. There is a lot of interesting content here, and I will be discussing more about these libraries in a later post.

Configuring SQL Server to Run R or Python

Once the Machine Learning components are installed, there are some configuration steps which must be completed to permit R or Python to run on SQL Server. If this is a new server, make sure to install SQL Server Management Studio, since it is not included in the SQL Server install. From within an SSMS query window, the following script needs to be run to enable external scripts

SP_CONFIGURE 'external scripts enabled', 1
GO
RECONFIGURE
GO

After this step completes successfully, a restart of SQL Server Services is required.  When stopping the service, you will be notified that SQL Server Launchpad also will need to be restarted.  I have noticed that for  some reason, Launchpad does not always restart when SQL Server is restarted, so you might want to check to make sure that it is running, as you cannot run R or Python unless the SQL Server Launchpad service is running.  After the restart, to check to see if R is working properly, run the following code from within an SSMS query window.

EXEC sp_execute_external_script @language =N'R',
@script=N'OutputDataSet <-InputDataSet',
@input_data_1 =N'SELECT 1 as CheckToSeeIfRIsWorking'
WITH RESULT SETS (([CheckToSeeIfRIsWorking] int not null));
GO

When run successfully, this script will return a 1. SQL Server is now ready to run R.  To check to see if Python can be run successfully, run this script.

EXEC sp_execute_external_script  @language =N'Python',
@script=N'OutputDataSet = InputDataSet',
@input_data_1 = N'SELECT 1 AS CheckToSeeIfPythonIsWorking'
WITH RESULT SETS ((CheckToSeeIfPythonIsWorking int not null));
GO

In my next post I will cover the SQL Server internal components which are run when R or Python code is run.  Please subscribe to my blog to be notified when the next installment will be available. If you have any questions, comments or ideas for future post topics, please leave me comments as I would really enjoy any feedback.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Applying Data Science to SQL Server

Data has been getting a lot of attention in the business world for a while now. First there was big data, which was another way to store data so that later the data could be analyzed. Recently the talk has been all about analyzing the data with new tools such as R and Python. The reality is that people who have been working with databases, doing business intelligence work, have been analyzing data for a while. Learning a different toolset for analyzing data is not such a big leap, but an expansion of what they know.

As the field is rapidly expanding and demand is huge, now is a great time to learn the tools.

Traditional Data Science Development

Data scientists have created analysis solutions with data for a number of years. The data is analyzed, cleaned, processed with various algorithms, and results are created. When the process is complete, code has been created to provide meaning from a portion of the data and is ready to be migrated to production. Traditionally there has been a big gap between creating a solution and implementing the solution to be run against data on a regular basis. Data scientists traditionally are not part of the IT organization; they are actuaries or analysts, not the people who have anything to do with system processing. Recently I did some work for a company where, after the data scientists were done creating a solution, they turned over all of their code to the Java team. Six weeks later the code was released into production. This solution made no one happy. Management thought it took too long. The data scientists did not believe that the code they created was what was implemented into production, and the Java developers were tired of people blaming them for wrong code which took a long time to implement.

SQL Server Implementation of Data Science

Since SQL Server 2016 incorporates R, and SQL Server 2017 has added the ability to include Python code, data science solutions can be incorporated as part of a scheduled process with SQL Server. There is now a DevOps path for incorporating R and Python into SQL Server. One way of learning about the technology is through blogs and other online training which can help you get up to speed. Many times, though, there is no substitute for hands-on learning. If you are attending PASS Summit 2017 and want to learn not only about data science but also how to incorporate it into SQL Server, I hope you can sign up for my all-day training session on Applied Data Science for the SQL Server Professional. I hope to see you there.

I have recently created a YouTube channel where I plan on sharing more data-related content, and I have included my first video there, which is about this conference.

If you are at PASS Summit, please introduce yourself as I would love to meet people who read my blog personally.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

What to do when SQL Server 2016 R Installation Does Not Work

I was recently helping a customer with their R installation. After R was installed successfully, I went to SQL Server Management Studio and ran the script needed to enable external scripts

EXEC sp_configure  'external scripts enabled', 1
RECONFIGURE WITH OVERRIDE
GO
RECONFIGURE;
GO

After this script ran successfully, I closed SSMS and restarted the SQL Server service, noting when I did so that it told me that SQL Server Launchpad, which is installed as part of R Server, was also going to be restarted. Everything looked good until I tried to run my first R command. I went for the super generic script

EXEC sp_execute_external_script @language=N'R'
,@script=N'OutputDataSet <-InputDataSet'
,@input_data_1=N'SELECT 1 as CheckToSeeIfRIsWorking'
WITH RESULT SETS
(
(
[CheckToSeeIfRIsWorking] INT NOT NULL
)
);

Much to my surprise after this I received an error

Msg 39019, Level 16, State 1, Line 1
An external script error occurred:
Unable to launch the runtime. ErrorCode 0x80070490: 1168(Element not found.).
Msg 11536, Level 16, State 1, Line 1
EXECUTE statement failed because its WITH RESULT SETS clause specified 1 result set(s), but the statement only sent 0 result set(s) at run time.


Resolving Error Code 0x80070490 Unable to launch the runtime for R in SQL Server

When I saw this error, I first checked the various SQL log files and did not find any errors. I checked the configuration manager to ensure that I had some user IDs configured; the default 20 users were there as expected. Nothing seemed to make any difference. Looking online, the only error I saw which might possibly be close was a different error message about 8.3 naming and the working directory.

I ran into a similar problem with R and Power BI. R does not handle long Windows directory names well; internally a long directory name, which is always what you get with the default SQL Server directories, is read as the shortened 8.3 name ending in a tilde (~). This particular server also had an installation of SQL Server 2014 on it, and if you looked at the shortened name it was impossible to tell which instance was which, as the directory read like this: C:\PROGRA~1\MICROS~2\MSSQL1~1.MSS\MSSQL\EXTENS~1

I created a new directory with a short name, D:\Rwork, and changed the working directory to the new subdirectory Rwork by updating the rlauncher.config file and restarting the SQL Server Launchpad service. I was really hopeful that this would work, but when I ran the external script listed above, it still did not work. What had I done wrong? I did not delete the previous working directory, which was in the default location C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\ExtensibilityData. When SQL Server creates this subdirectory, it also includes subdirectories for all of the users. When I created the new directory, it did not have those subdirectories. I copied the subdirectories from the old working directory to the new one and tried again. This time, success!

I hope you find this post helpful, as I know I could have used it when I was trying to figure this out. If it saved you time too, drop me a line and let me know.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Database Table Design

There are a number of different ways that you could decide to organize your data in a database. If you are creating a database to be used in a transactional system, your table design should follow a normalized design as much as possible. Data should be grouped in logical groups, such as customers, products, sales, orders, quotes, tickets, etc. Redundantly repeating data in multiple places will cause problems in the future. Your design may include hundreds of tables, and that is perfectly fine.

If the purpose of the database is for Power BI or for a data warehouse, dimensional modeling techniques should be employed. In this type of database design, generally speaking, there are a number of tables containing descriptive data, such as product and customer, and a few fact tables which contain the actions that happened. The actions include things like sales. The database design will look much like a star, with the fact tables in the center and the dimension tables connected to them like satellites. If you have one dimension table connected to another dimension, that design is called a snowflake, and some applications, like Analysis Services Multidimensional, will not process it well. Power BI and Analysis Services Tabular work very well with snowflake dimensions.

Snowflake Data Model from Power BI

 

Table Design Gone Wrong

To paraphrase Ron White, the reason that I described database modeling is so that even people who know nothing about database design can appreciate my interview story. When I was working at a previous location, I assisted in providing technical reviews for database developer jobs. We asked a number of typical questions about indexing and stored procedures, but I always tried to come up with at least one question which the candidate could not readily answer by cramming interview questions found on the internet. I decided to ask one candidate, who did correctly answer the previous stock questions, something that would let us know what kind of work he had really done. I asked him, "What do you do to determine how to design a table?" I was interested to find out what his thought process was, and to see if he would mention normal form or describe something he had done in the past. I was completely surprised by his answer.

"Well, you can only have 256 columns in a table. After that you have to create a new one." This answer was a complete surprise. I was really curious to find out where he had developed this completely warped view of how to determine what fields should go in a table. It turned out that he learned all of his database skills from a co-worker who had recently retired. His co-worker had worked at the same location for a very long time and, when he started, used mainframes without any databases. He had migrated some of the applications to databases, and they wrote them this way because it "made sense". After that, the interview was over, and we hired someone else.

I challenge anyone who is learning databases to please look up what people tell you to do on the internet. This is useful for two reasons: the first is that it will probably help you learn the concept better to read about it another way. The other reason is that you can find out if the person teaching you really knows what they are doing, so you will learn the correct way to do something.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

What to use to Dynamically Write a Date Dimension in Power BI: M or DAX?

Recently I needed to create a date dimension for a Power BI model, as there was not one in the source database. There are two different ways that I could do this: using DAX from the Modeling tab within the Data View, or using M via the Query Editor window. As a general rule, when it is possible, data manipulation should be done in M, as it offers a greater level of compression. In this case, though, I am using a function in DAX, which is not the same as creating a calculated column.

Create a Date Table in DAX

To create a date table in DAX, go to the Data View in Power BI and select the Modeling tab. Click on the New Table button on the ribbon. For those who are wondering how you would go about writing either one, here is the source code for the DAX version.

 

DimDate = ADDCOLUMNS( CALENDAR(DATE(2017,1,1), DATE(2020,12,31)) ,
"Date Key", FORMAT ( [Date], "YYYYMMDD" ), //NumericDate
"Year", YEAR([Date]),
"Qtr Number", "Qtr " & FORMAT( [Date], "Q"),
"Q Number", "Q " & FORMAT( [Date], "Q"),
"Month Name" , FORMAT ( [Date], "mmmm" ) ,
"Month Short Name" , FORMAT ( [Date], "mmm" ) ,
"Month Number", MONTH([Date]),
"Month Year", FORMAT ( [Date], "mmm " ) & YEAR([Date]),
"Day Name", FORMAT ( [Date], "dddd" ), //Name for Each day of the week
"Day Short Name", FORMAT ( [Date],  "ddd" ),
"Day Number" , WEEKDAY ( [Date] ) //Sunday is 1
)  

This code uses the DAX CALENDAR function to create a contiguous set of dates between January 1, 2017 and December 31, 2020, with the field named "Date". The remaining columns are given names, and comments were added for clarity.

Create a Date Table in M

In the Power BI Query Editor, click on the New Source button and select Blank Query. Select the View tab and click on the Advanced Editor button.  The Advanced Editor is where the M query is stored.  Paste the following code in to create a new table called Query1, which of course you can rename.

let
Source = List.Dates( #date(2017,1,1), Number.From( #date(2020, 12,31) - Date.From( #date(2017,1,1) ))+1, #duration(1,0,0,0)),
#"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Changed Type" = Table.TransformColumnTypes(#"Converted to Table",{{"Column1", type date}}),
#"Renamed Columns" = Table.RenameColumns(#"Changed Type",{{"Column1", "Date"}}),
#"AddYear" = Table.AddColumn(#"Renamed Columns", "Year", each Text.End(Text.From([#"Date"], "en-US"), 4), type text),
#"AddMonth" = Table.AddColumn(AddYear, "Month Number",   each Date.ToText([Date], "MM")),
#"AddQuarter" = Table.AddColumn(AddMonth, "Quarter Number", each "Qtr " & Number.ToText(Date.QuarterOfYear([Date]))),
#"AddMonthName" = Table.AddColumn(#"AddQuarter", "Month Name", each Date.ToText([Date], "MMMM")),
#"AddMonthShortName" = Table.AddColumn(#"AddMonthName", "Month Short Name", each Date.ToText([Date], "MMM")),
#"AddShortMonthYear" = Table.AddColumn(#"AddMonthShortName", "Short Month Year", each [Month Short Name] &" " &  [Year]),
#"AddDayOfWeek" = Table.AddColumn(#"AddShortMonthYear", "Day of Week", each Date.ToText([Date], "dddd")),
#"AddDay" = Table.AddColumn(#"AddDayOfWeek", "Day", each Date.ToText([Date], "dd")),
#"AddDateKey" = Table.AddColumn(AddDay, "DateKey", each ([Year]&[Month Number]&[Day]))
in
#"AddDateKey"

To Create A Date Table Use Either M or DAX

Now the question is: which one should you use? To be honest, it does not matter. I could not see any difference when I tested it. To validate this answer I consulted Twitter, which sparked some very interesting comments and analysis. Marco Russo b | t is planning on writing a blog post on the details of it, but Jason Thomas b | t gave me this summary: "[The] Dictionary expands as needed as values are inserted–designed to reduce cost of re-alloc and re-org of hash buckets (at cost of memory waste)." The full explanation needs a post of its own to be sure, which I am looking forward to reading when Marco writes it. I will quote Kasper De Jonge b | t, who summed it up best: "I don't believe it will matter much, the date table will be so small regardless." Whichever way you choose to add a date table, DAX or M, you now have the code to do either one; either way, a date table is needed for time intelligence.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Missing Custom Power BI Visuals

One of the great new features of the July release was the ability to get all of the Power BI custom visuals from within Power BI. I had a bookmark to get the visuals from the Office Store, but it always seemed like kind of a kludgy solution. Personally, I liked the visuals better when they were on the Power BI website prior to March of 2017. The filters worked better and they also included a sample file. Now I have a different and more technical reason not to like the visuals in the Office Store: some of the Power BI custom visuals are not there.

Some Power BI Visuals Are Not in the Office Store

This week I decided to do a demo using the Aquarium custom visual. As readers of my blog know, I have used this custom visual before, but it has been a while and I have changed PCs since then. No worries, I can always go download the visual from the store, right? Wrong. The Aquarium visual is not available in the new store. Neither is Image Viewer; if you are looking to add that into your latest Power BI report, it is not available. What happened?

So Long and Thanks for All of the Fish

I found out from Adam Saxton b | t that moving Power BI custom visuals was not the simple cut-and-paste process that I had always assumed it was. The people who write custom visuals had to re-write them. What's more, unlike when the custom visuals were housed on the Power BI website, custom visual creators also had to pay $25 to register, or $99 for their company. This means that some custom visuals may never appear in the store, as the people who created them are not willing to pay money to give them away.

If you have the custom visuals, or, as in my case, you know someone who can give you a copy of a Power BI custom visual which was published prior to the move to the Office Store, the visual will still work when you upload it to the service. I have also been told that Microsoft is working on adding the Aquarium visual to the Office Store, so at some point it will again be available for download. For those who have noticed that the Box and Whisker custom visual is not the same as the previous version, I doubt they will be able to download the old one, but if they can find someone to give it to them, it will still work.

If I do find out when the Aquarium visual will be available from the store again, I will update this post. Until then, if there is a visual you want, I would try asking on Twitter, as that worked for me.

****UPDATE: On August 4, 2017 the Enlighten Aquarium is now available again! Here’s a link to the Office Store.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur