Articles for the Month of January 2018

Azure Machine Learning Workbench

Microsoft released Azure Machine Learning Workbench at the Ignite conference on September 25, 2017 as a public preview.  This tool is a new tool which they are adding to their Azure ecosystem, which includes the machine learning tool they introduced three years ago, Azure Machine Learning Studio. Microsoft has said they plan on keeping both products. When asked about the two products, they said that the earlier tool, Azure Machine Learning Studio, is targeted to developers who wanted to add machine learning to their current applications, as it is an easy to use tool that doesn’t require a person to be a trained data scientist.  Azure Machine Learning Workbench is targeted to data scientists who want to bring in other libraries, like TensorFlow for Python, and delve deep into the data.

Microsoft Moves into Machine Learning Management

Microsoft is looking for Azure Machine Learning Workbench for more than a tool to use for Machine Learning analysis. It is part of a system to manage and monitor the deployment of machine learning solutions with Azure Machine Learning Model Management. The management aspects are part of the application installation.  To install the Azure Machine Learning Workbench, the application download is available only by creating an account in Microsoft’s Azure environment, where a Machine Learning Model Management resource will be created as part of the install. Within this resource, you will be directed to create a virtual environment in Azure where you will be deploying and managing Machine Learning models.

This migration into management of machine learning components is part of a pattern first seen on the on-premises version of data science functionality.  First Microsoft helped companies manage the deployment of R code with SQL Server 2016 which includes the ability to move R code into SQL Server.  Providing this capability decreased the time it took to implement a data science solution by providing a means for the code can be deployed easily without the need for the R code to be re-written or included in another application. SQL Server 2017 expanded on this idea by allowing Python code to be deployed into SQL Server as well.  With the cloud service Model Management, Microsoft is hoping to centralize the implementation so that all Machine Learning services created can be managed in one place.

Hybrid Cloud, Desktop, and Python

While you must have an Azure account to use the Machine Learning Workbench, the application is designed to run on a locally on either a Mac or Windows computer.  There is a developer edition of the tool so that one can learn the tool and not incur a bill, which is the case with the previous product, Azure Machine Learning.  The download of Machine Learning Workbench must be accessed within an Azure account and is installed to your local computer.  When running the application from your computer, the application will prompt to log into your Azure account to load Azure Machine Learning Workbench.

The application is designed to use and create Python code.  Azure Machine Learning Workbench does not contain any accommodation to incorporate machine learning components written in R, just Python.  If you have created machine learning components using R, they can be incorporated into the Azure Machine Learning Model Management if you create webservices which encapsulate the R code. The R code does not interface into Workbench, but can be made to be a part of the managed projectes in Azure. While it is possible to create a webservice for R with the earlier product with Azure Machine Learning, there is no direct way to include R with Azure Machine Learning Workbench.  There are a number of sample templates to get started using Python templates including the ubiquitous Iris dataset, Linear regression and several others.   Once the project is created, you can use your favorite IDE, it creates python code which can be read anywhere.

Staying within Machine Learning Workbench application allows you access to arguably one of the neatest parts of the Machine Learning Workbench, the data parser. This tool which was originally code-named project Pendleton and designed to be an intuitive way to modify the contents of data even better than the previous leader in parsing data, Power BI’s Power Query.

You can select the option “Derive column by example” or “Split Column by Example” and then start typing in a new column.  For example, if you want to separate a column which contains the date and the time, if you right click on that date column and select “Split Column by Example” then type the date in the new column provided, the application will immediately determine that you want two columns and crate them. The date column and a time column be created for you after typing in one date.  After the sample columns have been created, you can approve the change or reject it if does not work how you want to.

Like Power Query, each change made to the data is included in the window called Steps on the right side of the application window. When you are done modifying the data, right click on the Data Preparations source icon, which in my example is called UFO Clean, to and the UI changes made to the data are used to create Python code to perform the changes. The generated Python code can be used to the source data programmatically.

The next step in the process is to write the python code needed to evaluate the data and create a model which would in my case determine where and when you are most likely to see the next UFO based on the dataset I have included in my project.  Unlike it’s counterpart Azure Machine Learning, you will need to know how to write the necessary code needed to create a machine learning analysis in Python for Azure Machine Learning Workbench. One could write the Python code to create a machine learning analysis in any Python editor.  If you chose to use Azure Machine Learning, the Python library scikit-learn is installed as part of the application.  Other libraries which you may want to use, such as the common library matplot, you will need to load within Azure Machine Learning Workbench.

Web Service: How Azure Machine Learning Workbench Solutions are Deployed

To deploy a package, you will need to export the completed model serialized Python object, with the Python Module, Pickle. This will create a file with the suffix of pkl, which is the file that you will be deploying. Azure Machine Learning Workbench expects that you will be deploying via Docker containers or creating an Azure cluster.  You will need to register the Docker container in the Machine Learning Container for it to be deployed.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Analyzing Salary Data with Power BI and R – Part 2

As advanced analytical techniques become more popular, companies are looking to hire people to find answers in their data. What kind of answers?  Predicting the future, determining what factors make customers leave, what kind of products can they get good customers to buy, what conditions are related, is this a valid transaction and similar questions. What answers can be provided have everything to do with the available data. The data I chose to analyze was the salary data provided by Brent Ozar, which is publicly available here.  I started looking at the data in a previous post where I did an initial review of the data and discussed the data analysis process.

Regression Analysis – What kind of relationships can I find in the Data?

Looking at how data is related is a very important step in data analysis.  Most often various items are analyzed using some linear regression algorithms to compare one or more variables together. For this kind of analysis, generally speaking all of the data needs to be represented numerically, which means that if data instead exists as categories of items, the data will need to be transformed. For example, on the Salary Data which I analyzed and published in Power BI, the job experience and salary roe compared using the ggplot library from R, and the two different values are included on respective axis. What I was hoping to find is that there was a strong relationship between these two values.  If you look at the graph, you can see this is not the case. Interestingly enough while the line shows an upward trend, you can see a drop in salaries for those with a lot of job experience.  Those people with the most experience, above 35 years are making less money than those with less experience.  The graph also shows those who are just starting in their careers are not necessarily making the very little money. What the data shows is there is no guarantee that the longer you work the more money you make.

Data Cleansing for Analysis

Because I am looking for data trends and not anomaly detection, I normalized the survey data.  I eliminated the 100 people who did not fill in the salary amount, and cut off the high and the low.  I used the box plots generated in Power BI to serve as a guideline for the ranges to exclude.  As I was also interested in determining the difference in the responses between male and female, so I did some data substitution on some of the values as I wanted to included more records.  In 2018, the only year that this question was asked, 87.6% of the respondents were male.  I made the decision to include all of the respondents where the number of respondents was less than .22% as male so that I would have more data to evaluate. I modified all of the data in Power BI using M code.  You can take a look at all of the modifications I made to the data here in the Power BI report I created, as I am making it available for the next 30 days.

Examining the Top 5%

Recently I have had some conversations with some colleges regarding salary, and that led me to want to review what people would like to make.  Most people would like to be making the most money possible in their profession, and are not interested in moving, which is why I chose not to do much with the geographic data.  I ran a number of different machine learning algorithms on the data trying to find a definitive set of results among those who reported making the most money.  The results of those experiments were inconclusive.  While I found some items which were common among the highest earners, the results were not statistically conclusive. There are a number of conceptions that people have regarding salary, and I chose to illustrate some of them to dispel some myths surrounding data. I also grouped the salaries into groups: 95% for above 153,565, 75 for above 67,789, and the rest for the average.  These numbers were based on the values in the box chart in the top left of the Power BI report.

Salary Conclusions – Myth vs Reality

I know that I have heard that if you want to make money you need to get into management. Being a good manager is not the same skill set as being a good database professional, and there are many people who do not want to be managers.  According to the data in the survey, you can be in the top 5% of wage earners and not be a manager. How about telecommuting? What is the impact on telecommuting and the top 5%?  Well, it depends if you are looking at the much smaller female population. The majority of females in the top 5% telecommute.  Those who commute 100% of the time do very well, as well as those who spend every day at a job site.  Males report working more hours and telecommuting less than females do as well.  If you look at people who are in the average category, they do not telecommute. The average category has 25% of people who work less than 40 hours a week too. If you look at the number of items in the category by country you can determine that in many cases, like Uganda, there are not enough survey respondents to draw any conclusions about salary in locations.

After spending quite a bit of time analyzing and visualizing the data, I was unable to determine a specific set of skills which to provide a roadmap of exactly what one needs to do to be in the top 5% of the salary for a data professional.  What I can tell you is more than likely there is someone with your level of work experience and position who is doing really well, and there is no reason why by the time that the next survey comes out, you are not the person who is in the top 5 percent.  This may mean working harder at your job and perhaps changing employers as the analysis shows that is the best way to make more money.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Analyzing Salary Data with Power BI and R – Part 1

CRISP-DM Diagram

The standard method for analyzing data is the CRoss Industry Standard Process for Data Mining [CRISP-DM].    Rather than describe the method, this post will walk through the process to illustrate how to analyze data using it. The data that I selected for analysis is the Brent Ozar Salary Survey information.The data is available under open source license and contains two years of answers to salary data with a total of over 6,000 responses. Understanding what is in the data determines what kind of answers it can provide. What can the data reveal?  Prior to drawing any conclusions, one needs to examine the data to determine the level of completeness, correctness and whether or not you have enough data to make decisions based upon the data.

Data Understanding

The first step in the process is to analyze the data to evaluate what kind of knowledge you can gather from the data.  The primary perspective of the data is salary, and the survey describes the characteristics which people with a certain salary level have.  The survey used to gather the data contained a number of drop down boxes and those fields can be used as categorical variables as there are a fixed number of possible responses.  Other survey items allowed people to freely enter anything, which makes it harder to statistically analyze some of the data.

 

Where Do Data Professionals Make the Most Money?

In the survey for 2018, the people who made the most money were from Hong Kong with an average salary of $263,289.  Before you start planning on moving, you will might want to look at the data a little closer.  There were 2 people who responded from Hong Kong.  One of them said he was making over 1.4 million dollars, the highest amount reported in the survey.  Given the fact that we only have two responses from Hong Kong, we will be unable to draw a definitive conclusion with 2 records. To be able to answer that question, more analysis will need to be done on the location and salary information and you will probably want to add market basket criteria because a dollar say in Hong Kong doesn’t go as far as the average apartment rental is $3,237 a month as it does say in Uganda where the rent is around $187 a month.

Using Power BI to Provide Data Understanding

The data modeling step of the CRISP-DM process anticipates that you will want to modify the raw data.  There may be records containing null or erroneous values that you will need to eliminate the entire record or substitute entries for a particular value.  You can also use this analysis to determine what conclusions you will be able to derive from the data.  For example, if you wanted to analyze what criteria are required for Microsoft Access Developers to make over $100,000 a year, you could easily find out in Power BI that it is not possible to do that analysis as there is no data for that set of criteria. If you want to do a year over year analysis of people who are working as DBAs, which I show in the second tab, you will need to change some of the categories as they changed from 2017 to 2018.

Analyzing Data with R and Power BI

Many times when providing a final report to explain your analysis, you will need to provide some documentation to demonstrate your conclusions.  In addition to creating some visualizations in Power BI, I also created some in R to include visualizations and analysis with R.  While I can include any R library I wish in Power BI Desktop, there are only 364 currently added to the Power BI Service in Azure.  If there is an R visualization you would like to add, you can send an email request to rinpbifb@microsoft.com and ask for it to be added.

Power BI Salary Data

For more information on Analyzing Data with Power BI and R, I recorded a video for Microsoft’s Power BI team which is available here.  For the next 30 days, for those of you who are interested, you can download the PBIX file used to analyze the data here. The data included in the PBIX also includes a list of the R libraries included in Power BI, R Libraries not supported by Power BI and the data and the accompanying M code used to clean the data.  The cleaning process and the analysis of the Salary data itself will be included in another post.  If you would like to find out when the next post in this series is available, please subscribe to my website for all of the latest updates.

*** UPDATE: My next post on further analysis of this data is available here.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur