As advanced analytical techniques become more popular, companies are looking to hire people who can find answers in their data. What kind of answers? Predicting the future, determining which factors make customers leave, which products good customers can be persuaded to buy, which conditions are related, whether a transaction is valid, and similar questions. Which answers can be provided depends entirely on the available data. The data I chose to analyze was the salary data provided by Brent Ozar, which is publicly available here. I started looking at the data in a previous post, where I did an initial review of the data and discussed the data analysis process.
Regression Analysis – What kind of relationships can I find in the Data?
Looking at how data is related is a very important step in data analysis. Most often this is done with a linear regression algorithm, which models how one variable changes with one or more others. For this kind of analysis, generally speaking, all of the data needs to be represented numerically, which means that data that exists as categories will need to be transformed first. For example, on the Salary Data which I analyzed and published in Power BI, job experience and salary were compared using the ggplot library from R, with the two values plotted on their respective axes. I was hoping to find a strong relationship between these two values. If you look at the graph, you can see this is not the case. Interestingly enough, while the line shows an upward trend, you can see a drop in salaries for those with a lot of job experience. People with the most experience, above 35 years, are making less money than those with less experience. The graph also shows that those who are just starting their careers are not necessarily making very little money. What the data shows is that there is no guarantee that the longer you work, the more money you make.
Data Cleansing for Analysis
Because I am looking for data trends and not anomaly detection, I normalized the survey data. I eliminated the 100 people who did not fill in the salary amount, and cut off the high and the low values. I used the box plots generated in Power BI as a guideline for the ranges to exclude. I was also interested in the difference between male and female responses, so I did some data substitution on some of the values because I wanted to include more records. In 2018, the only year that this question was asked, 87.6% of the respondents were male. I made the decision to include as male all of the respondents whose gender response accounted for less than .22% of respondents, so that I would have more data to evaluate. I modified all of the data in Power BI using M code. You can take a look at all of the modifications I made to the data here in the Power BI report I created, as I am making it available for the next 30 days.
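The cleanup itself was done in Power BI with M code. As a rough Python sketch of the same idea, the function below drops missing salaries and then keeps only the values inside the box plot whiskers. The 1.5 times interquartile range rule is an assumption here, since the box plots only served as a guideline, and trim_salaries is a hypothetical name.

```python
from statistics import quantiles


def trim_salaries(salaries, whisker=1.5):
    """Drop missing salaries, then keep values inside the box plot whiskers.

    The whisker multiplier of 1.5 is the common box plot convention and is
    an assumption; it is not the exact range used in the Power BI report.
    """
    present = [s for s in salaries if s is not None]
    q1, _median, q3 = quantiles(present, n=4)  # quartile cut points
    iqr = q3 - q1
    low = q1 - whisker * iqr
    high = q3 + whisker * iqr
    return [s for s in present if low <= s <= high]


# Invented values for illustration only, NOT the survey data
raw = [None, 50000, 60000, 70000, 80000, 90000, 1000000]
print(trim_salaries(raw))
```

The missing entry and the one extreme value are removed, and the rest of the list is left alone.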
Examining the Top 5%
Recently I have had some conversations with some colleagues regarding salary, and that led me to want to review what people would like to make. Most people would like to be making the most money possible in their profession, and are not interested in moving, which is why I chose not to do much with the geographic data. I ran a number of different machine learning algorithms on the data, trying to find a definitive set of results among those who reported making the most money. The results of those experiments were inconclusive. While I found some items which were common among the highest earners, the results were not statistically conclusive. There are a number of conceptions that people have regarding salary, and I chose to illustrate some of them to dispel some myths surrounding data. I also divided the salaries into groups: 95% for above 153,565, 75% for above 67,789, and the rest as average. These numbers were based on the values in the box plot in the top left of the Power BI report.
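The grouping can be sketched in a few lines of Python. The two cutoffs and the group labels come from the paragraph above. The function name salary_group is made up for this example, and the original grouping was done in Power BI, not Python.

```python
# Cutoffs taken from the box plot values quoted in the text
TOP_5_CUTOFF = 153_565
TOP_25_CUTOFF = 67_789


def salary_group(salary):
    """Return the salary group label for a single salary value."""
    if salary > TOP_5_CUTOFF:
        return "95%"
    if salary > TOP_25_CUTOFF:
        return "75%"
    return "average"


# Invented sample salaries for illustration only
for amount in (50_000, 90_000, 200_000):
    print(amount, salary_group(amount))
```

A salary exactly equal to a cutoff falls into the lower group, because the text describes each group as being above the number.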
Salary Conclusions – Myth vs Reality
I know that I have heard that if you want to make money you need to get into management. Being a good manager is not the same skill set as being a good database professional, and there are many people who do not want to be managers. According to the data in the survey, you can be in the top 5% of wage earners and not be a manager. How about telecommuting? What is the impact of telecommuting on the top 5%? Well, it depends on whether you are looking at the much smaller female population. The majority of females in the top 5% telecommute. Those who telecommute 100% of the time do very well, as do those who spend every day at a job site. Males also report working more hours and telecommuting less than females do. People in the average category do not telecommute, and 25% of them work less than 40 hours a week. If you look at the number of respondents by country, you can see that in many cases, like Uganda, there are not enough survey respondents to draw any conclusions about salary in those locations.
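One way to check whether a country has enough respondents is to count responses per country and set aside the small groups before comparing salaries. The sketch below is in Python. The minimum of 30 respondents is an assumed threshold, not a number from the analysis, and the counts are invented.

```python
from collections import Counter

MIN_RESPONDENTS = 30  # assumed threshold, not taken from the survey analysis


def split_by_sample_size(countries, min_respondents=MIN_RESPONDENTS):
    """Count responses per country and separate large groups from small ones."""
    counts = Counter(countries)
    enough = {c: n for c, n in counts.items() if n >= min_respondents}
    too_few = {c: n for c, n in counts.items() if n < min_respondents}
    return enough, too_few


# Invented response counts for illustration only, NOT the survey data
responses = ["United States"] * 40 + ["Uganda"] * 2
enough, too_few = split_by_sample_size(responses)
print("enough data:", sorted(enough))
print("too few respondents:", sorted(too_few))
```

Countries in the second group are better left out of any salary comparison than reported with an average based on a handful of people.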
After spending quite a bit of time analyzing and visualizing the data, I was unable to determine a specific set of skills that would provide a roadmap of exactly what one needs to do to be in the top 5% of salaries for a data professional. What I can tell you is that, more than likely, there is someone with your level of work experience and position who is doing really well, and there is no reason why, by the time the next survey comes out, you cannot be the person who is in the top 5 percent. This may mean working harder at your job and perhaps changing employers, as the analysis shows that is the best way to make more money.
Data aficionado et SQL Raconteur