The standard method for analyzing data is the CRoss Industry Standard Process for Data Mining [CRISP-DM]. Rather than describe the method, this post will walk through the process to illustrate how to analyze data using it. The data that I selected for analysis is the Brent Ozar Salary Survey information.The data is available under open source license and contains two years of answers to salary data with a total of over 6,000 responses. Understanding what is in the data determines what kind of answers it can provide. What can the data reveal? Prior to drawing any conclusions, one needs to examine the data to determine the level of completeness, correctness and whether or not you have enough data to make decisions based upon the data.
The first step in the process is to analyze the data to evaluate what kind of knowledge you can gather from the data. The primary perspective of the data is salary, and the survey describes the characteristics which people with a certain salary level have. The survey used to gather the data contained a number of drop down boxes and those fields can be used as categorical variables as there are a fixed number of possible responses. Other survey items allowed people to freely enter anything, which makes it harder to statistically analyze some of the data.
Where Do Data Professionals Make the Most Money?
In the survey for 2018, the people who made the most money were from Hong Kong with an average salary of $263,289. Before you start planning on moving, you will might want to look at the data a little closer. There were 2 people who responded from Hong Kong. One of them said he was making over 1.4 million dollars, the highest amount reported in the survey. Given the fact that we only have two responses from Hong Kong, we will be unable to draw a definitive conclusion with 2 records. To be able to answer that question, more analysis will need to be done on the location and salary information and you will probably want to add market basket criteria because a dollar say in Hong Kong doesn’t go as far as the average apartment rental is $3,237 a month as it does say in Uganda where the rent is around $187 a month.
The data modeling step of the CRISP-DM process anticipates that you will want to modify the raw data. There may be records containing null or erroneous values that you will need to eliminate the entire record or substitute entries for a particular value. You can also use this analysis to determine what conclusions you will be able to derive from the data. For example, if you wanted to analyze what criteria are required for Microsoft Access Developers to make over $100,000 a year, you could easily find out in Power BI that it is not possible to do that analysis as there is no data for that set of criteria. If you want to do a year over year analysis of people who are working as DBAs, which I show in the second tab, you will need to change some of the categories as they changed from 2017 to 2018.
Analyzing Data with R and Power BI
Many times when providing a final report to explain your analysis, you will need to provide some documentation to demonstrate your conclusions. In addition to creating some visualizations in Power BI, I also created some in R to include visualizations and analysis with R. While I can include any R library I wish in Power BI Desktop, there are only 364 currently added to the Power BI Service in Azure. If there is an R visualization you would like to add, you can send an email request to firstname.lastname@example.org and ask for it to be added.
Power BI Salary Data
For more information on Analyzing Data with Power BI and R, I recorded a video for Microsoft’s Power BI team which is available here. For the next 30 days, for those of you who are interested, you can download the PBIX file used to analyze the data here. The data included in the PBIX also includes a list of the R libraries included in Power BI, R Libraries not supported by Power BI and the data and the accompanying M code used to clean the data. The cleaning process and the analysis of the Salary data itself will be included in another post. If you would like to find out when the next post in this series is available, please subscribe to my website for all of the latest updates.
*** UPDATE: My next post on further analysis of this data is available here.
Data aficionado et SQL Raconteur