Fabric Features not found in Azure Synapse or Power BI – Direct Lake


With the general availability release of Fabric in November 2023, I am dedicating several posts to the features that are only in Fabric and not anywhere else. The first feature is Direct Lake. Direct Lake was created to address problems with Power BI Direct Query, and anyone who has used Direct Query knows what I am talking about. If you have implemented Direct Query, I am guessing you have run into one or all of these problems: managing the constant hits to the source database, which increase as you add users; user complaints about slow visuals; or the need to put apply buttons on all of your visuals to help with speed. Direct Query is a great idea. Who wants to import a bunch of data into Power BI? Directly connecting to the database sounds like a better idea, until you learn that each query goes from Power BI to the database and back separately for each user, which means that Power BI must send more queries as more people access reports. Users want to be able to access data quickly, have it scale well, and have access to the latest data.

Direct Lake Architecture

Direct Lake was designed to address the performance issues encountered by people who used Direct Query.  With Direct Lake you will connect to the existing data source live, but unlike Direct Query, you will be connecting not to a database, but to the Data Lake in OneLake. The diagram below from Microsoft shows the architecture of Direct Query for Power BI, Import, and of course Direct Lake.


Power BI Data Architecture for Semantic Data Modeling

Direct Lake Performance

Direct Lake uses the same underlying technology that Fabric does, an implementation of Delta Lake on top of parquet formatted files. Delta Lake is an open-source storage framework, commonly used with Spark, which allows parquet files to act more like a database: it provides ACID transactions and DML operations, so you can use common SQL statements to modify and access the data. Microsoft has taken Polaris, the underlying technology that they implemented in data warehousing, and used it to improve the way querying works so that accessing data in OneLake is faster.
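To make the idea of parquet files acting like a database more concrete, here is a toy Python sketch of how a Delta-style transaction log can decide which parquet files make up the current version of a table. The add and remove action names mirror the Delta Lake log format, but the file names and the live_files helper are made up for illustration, and this is not a description of how Fabric or Direct Lake is implemented internally.

```python
import json

# Each commit in a Delta-style log is a list of JSON actions.
# The file names below are hypothetical.
commits = [
    ['{"add": {"path": "part-0001.parquet"}}',
     '{"add": {"path": "part-0002.parquet"}}'],
    # An UPDATE rewrites a file: the old one is removed, a new one is added.
    ['{"remove": {"path": "part-0001.parquet"}}',
     '{"add": {"path": "part-0003.parquet"}}'],
]


def live_files(commits, version=None):
    """Replay commits in order and return the files visible at `version`."""
    files = set()
    last = len(commits) if version is None else version + 1
    for commit in commits[:last]:
        for line in commit:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


print(live_files(commits))             # files in the latest version
print(live_files(commits, version=0))  # the table as of the first commit
```

Because readers only ever see the files listed by a completed commit, a half-finished write is never visible, which is roughly where the database-like behavior comes from.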

Implementing Direct Lake

You need to have a Premium SKU, either P or F, to be able to implement Direct Lake. Notice that Premium Per User and A SKUs, which are commonly used for embedding, do not have the ability to use Direct Lake. You also need to access the Admin Portal to ensure that the XMLA Endpoint is set to Read Write and that the integration setting Allow XMLA Endpoints and Analyze in Excel is turned on. You can then provision a lakehouse in a Fabric premium workspace and use it as the data source for the semantic model behind a report you create in the Power BI Service. Now you may look at this and think, well this is all well and good, but I don’t update the data lake as fast as I do my database, how am I going to make this work?

Mirroring

Mirroring provides the ability to create a read only copy of your database in OneLake, which you can then use as a source for your Power BI reports. This was announced November 15, and should be releasing soon. The way this will work is you will make a connection to the data source within Fabric, and a read only copy of the database will be created in OneLake. There is no ETL required. Just make the connection and the data is available. I look forward to writing more about this feature when it is released, as it will make it possible to have real time data access in Power BI.

Current Limitations of Direct Lake

As this is a new feature, there are some elements of Direct Query that are not there yet. Before you decide to use it you might want to understand some of the current limitations. The first is that there is no way to create a Direct Lake connection from within Power BI Desktop. You must create the reports that use it in the service. You cannot mix the lakehouse with existing models, as the model must be sourced from a single lakehouse or data warehouse. You cannot use views created in the lakehouse as tables in your Power BI semantic model. Direct Lake cannot be used in embedded. You also cannot use calculated columns or tables. If you implement row level security (RLS), the model is switched to Direct Query. I imagine many of these limitations will be addressed over time, and when they are, I will be writing an update to this post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Four Different ways to incorporate Azure Synapse and Power BI

More and more I am working with Azure Synapse analytics as the location of data for Power BI, and there are 4 different ways that you can incorporate Power BI.

Incorporating Power BI inside of Synapse

The first is to connect Power BI to Azure Synapse to explore and visualize data. You can examine the datasets that you have loaded in your data lake with Power BI to help with the analysis of the data, either for a data science solution or to determine how you are going to transform the data. For more information on how to do this, check out my previous blog.

Traditional Data Warehousing Sourced in Power BI

Once you have curated the data within Azure Synapse and it is ready for reporting, Power BI can do that too. Of course if you have large data and a dedicated SQL pool in Azure Synapse where your data warehouse is located you can use Power BI to visualize that large data. Given the fact that this solution is for data greater than a terabyte, you may also want to create some aggregated views in Azure Synapse to aggregate the larger data set.
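As a rough illustration of what an aggregated view does, the sketch below builds one in an in-memory SQLite database from Python. SQLite is only a stand-in here so the example can run anywhere; in a dedicated SQL pool you would write the view in T-SQL, and the FactSales table, its columns, and the vw_MonthlySales name are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FactSales (SaleDate TEXT, Region TEXT, Amount REAL);
    INSERT INTO FactSales VALUES
        ('2023-01-05', 'West', 100.0),
        ('2023-01-20', 'West', 50.0),
        ('2023-01-11', 'East', 75.0),
        ('2023-02-02', 'East', 25.0);

    -- One row per month and region instead of one row per sale.
    CREATE VIEW vw_MonthlySales AS
    SELECT substr(SaleDate, 1, 7) AS SaleMonth,
           Region,
           SUM(Amount) AS TotalAmount,
           COUNT(*)    AS SaleCount
    FROM FactSales
    GROUP BY substr(SaleDate, 1, 7), Region;
""")

rows = conn.execute(
    "SELECT SaleMonth, Region, TotalAmount, SaleCount "
    "FROM vw_MonthlySales ORDER BY SaleMonth, Region"
).fetchall()
for row in rows:
    print(row)
```

Power BI would then read the three summarized rows rather than every sale, which is the point of aggregating before the data leaves the warehouse.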

Data lakehouse connection to Power BI

More and more I see companies wondering if they need a database. Some really don’t because their data sets are not that large, and what they want is a data source for Power BI. There are two different ways of accomplishing this. The first is to configure Power BI to connect to a Data Lake Gen2 to use data curated in Azure Synapse. Companies are using Azure Synapse to create data lakes with data designed for reporting. See how you can connect Power BI to an Azure Data Lake Gen2 account which contains this data.

Serverless Database data source in Power BI

For my clients, what I recommend is to create a data lakehouse by curating a finite data set for Power BI reporting with a series of files in a curated folder. I will then create a Serverless database in Azure Synapse and use that to connect to Power BI. What is great about that is if they need to run some kind of an ad hoc query they can do that as well, providing a lower cost solution than was previously available.

If you are interested in hearing me talk about this topic, you might be interested in the recorded talk I gave to the DFW Power BI User group.


Azure Synapse and Power BI

There are a number of different use cases for Azure Synapse, which no doubt you have already read about. Of course you can use Azure Synapse to build a scalable data warehouse or use it to create machine learning solutions with Spark, but have you thought of why you might want to use it with Power BI? There are two ways in which you can integrate Azure Synapse with Power BI: connecting to a Power BI workspace from within Azure Synapse, or using Azure Synapse to provide the source of all of the data used in the Power BI data set. We will explore both methods in this blog.

Linking to Power BI from within Azure Synapse

From within Synapse you have the ability to access a Power BI workspace so that you can use Power BI from within Synapse. Your Power BI workspace can be in a different data center than the Azure Synapse workspace, but they both must be in the same tenant. You can use Power BI to look at any data you wish, as the data you use can be from any location. When this blog was written, it was only possible to connect to one Power BI workspace from within Azure Synapse. In order to run Power BI as shown here, first I needed to create a Linked Service from within Synapse. Select Power BI from the list of options, then select the workspace from the list available for your tenant and create the linked service. After the linked service is created, you can go to Develop in Azure Synapse and see the Power BI option listed on the left side of the screen.

Connecting Power BI to an Azure Data Lake Gen 2

As much of the functionality of Analysis Services is being migrated to Power BI, many people are choosing to use Power BI instead of migrating to Analysis Services, especially if they have Power BI Premium. Another option you have is to gather the data that you are going to load into Power BI into an Azure Data Lake Gen 2 and create a “Data Lakehouse” as the container where you gather and clean up the data you are going to use in Power BI. There are a few things that you will want to keep in mind if you want to configure Azure Synapse to do this. The first is that your Power BI tenant and your Azure Data Lake Gen 2 must be in the same location. Right now this is an issue as Azure Synapse is not available in all data centers. To configure Power BI to use Azure, you need to go to the Admin section as shown above and select the subscription, resource group and storage account. You will want to make sure that you have the proper permissions granted in Azure RBAC for the user groups who need to access the data, as they need Storage Blob Data Owner permissions. The Power BI service also needs the Reader and Data Access roles granted. You can use Azure Synapse to gather the data and then visualize it with Power BI. I see a lot of use cases for this model, as I suspect more people are going to be moving away from creating a database for Power BI and instead moving the datastore to Azure.

 


Upgrading a data model from Power BI to Tabular got a lot harder

If you were looking to upgrade a Power BI model to Analysis Services Tabular, hopefully you decided to do it before March 1, 2019, as that was the day Microsoft decided to remove the tool from Azure. I know I needed to upgrade after that and I was sorely disappointed. I also blew my schedule, as the upgrade went from a 20 minute process to something that took over 2 hours. I am writing this blog to save you some time, as I learned a few tricks which I wish I had known about before upgrading.

Preparing your Power BI model to Upgrade to Tabular

Unless you are upgrading to Analysis Services on SQL Server 2019, chances are you are going to have to review your DAX code and make some modifications, as DAX on the other versions of SQL Server is not the same as in Power BI. I was upgrading to AS on SQL Server 2016, and there were some commands that I had to manually edit out of the JSON file. If you have any new DAX commands, take them out of your Power BI model, which means you will not have to manually edit the JSON file to remove them when the new commands are flagged as errors. Make sure your Power BI model does not include commands such as SELECTEDVALUE and GENERATESERIES, or any of the automatically generated date hierarchies. After your Power BI Desktop file is clean, leave it running, as you are going to need to have it running for the next step.

Connect to your Power BI Model in Analysis Services

Power BI is running a version of Analysis Services tabular, and once you figure out what version is running, you can connect to it. Open up the Task Manager and take a look at Power BI Desktop in the Processes tab by clicking on the arrow next to it so that you see all of the processes running. You can see by the picture here that there are 19 processes running with one Power BI file loaded. Right click on the instance of Microsoft SQL Server Analysis Services running within Power BI Desktop and select Go to Details. This command will bring you to the Details tab in Task Manager and will highlight the process, which gives you the PID you need to connect to Analysis Services on your computer. You will notice my PID is 3992. Using this number I went to the command prompt and typed

netstat -anop TCP | findstr 3992

This command returns the address I need to access Power BI model from Analysis Services, 127.0.0.1:51328
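One thing to watch with findstr is that it matches the PID text anywhere on the line, so a port number that happens to contain the same digits will also show up. A small Python sketch of a stricter filter is shown here. It assumes the usual five-column layout that Windows netstat prints for TCP rows, the find_listening_address helper is hypothetical, and the second row of the sample text is made up.

```python
def find_listening_address(netstat_output, pid):
    """Return the local address of the LISTENING TCP socket owned by `pid`."""
    for line in netstat_output.splitlines():
        parts = line.split()
        # Expected columns: Proto, Local Address, Foreign Address, State, PID
        if len(parts) == 5 and parts[0] == "TCP":
            proto, local, foreign, state, owner = parts
            if state == "LISTENING" and owner == str(pid):
                return local
    return None


# Sample text in the shape netstat prints; the second row is invented.
sample = """
  TCP    127.0.0.1:51328        0.0.0.0:0              LISTENING       3992
  TCP    127.0.0.1:51400        127.0.0.1:51328        ESTABLISHED     4100
"""

print(find_listening_address(sample, 3992))
```

The value it returns is the server name to type into the Management Studio connection dialog.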

That’s it. Now I can look at my Power BI model in SQL Server Analysis Services from within Management Studio. It looks like this

TL;DR This model is not usable and you will have to do some tweaks to it to make it work. The first step is to fix the connections, then edit some JSON. Right click on the connection and select Properties. Click on the ellipsis button next to the connection string, and you will get this error message.

Once you click ok, you can edit the connection to something that will work.

All of my Power BI tables were connecting to views in SQL Server, which is a best practice as this allows you to select only the fields you need and provides you with the ability to sort your data as it is read in, which can improve the compression. I changed the Provider to SQL Server, and fixed the connection. Once the connections are changed, you will need to create an XMLA script and edit it. Right click on the database with the GUID name and select Script->Script Database As->Create To->File. Give the file a name, as you will be using it after connecting to your Analysis Services instance.

Editing the XMLA File

Open up the XMLA file you created from Analysis Services. The first thing you will need to fix is the compatibility level. Power BI set the value to 1465. 1400 is for Azure AS and SQL Server 2017, and SQL Server 2016 uses 1200, so I edited the script to change the value to 1200. I also changed the database name to “SampleAS” from the GUID that it was assigned. I then executed the script, and got my first error

Executing the query …

The JSON DDL request failed with the following error: Unrecognized JSON property: variations. Check path ‘model.tables[0].columns[11].variations’, line 362, position 30..

Run complete

I pressed Ctrl+G and entered the line number to fix the first error, which was on this line

                "formatString": "dddd\\, mmmm d\\, yyyy",

I deleted that line and ran it again. This time I got this error.

Executing the query …

The JSON DDL request failed with the following error: Unrecognized JSON property: variations. Check path ‘model.tables[0].columns[11].variations’, line 361, position 30..

Run complete

 

Line 361 starts with the word variations. The section is contained within square brackets [].

                "variations": [
                  {
                    "name": "Variation",
                    "relationship": "dc6b309e-0967-4eea-8e3f-7a308661dca5",
                    "defaultHierarchy": {
                      "table": "LocalDateTable_c6cfa169-57d3-4da3-bcbc-21b677dde835",
                      "hierarchy": "Date Hierarchy"
                    },
                    "isDefault": true
                  }
                ],

Date hierarchies are not supported the same way in AS, so I needed to edit the XMLA to remove this section. I had to repeat this step for every auto-generated date hierarchy in the model, which is why I recommend that you remove them up front, as it will save you a lot of time. Another line which caused an error was this line.

                  "query": "SELECT * FROM [Rates]",

My view is not called Rates. It is called vw_PaidRates. The table in the Power BI model is called Rates, but there is no table or view in the database with that name, so I needed to change Rates to the name of the view, vw_PaidRates. I repeated this for every table in the XMLA file. I estimate that the editing process took over an hour to eliminate all of the code that would not work. After that I was able to create a project in Visual Studio using the Import From Server (Tabular) option.
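If you have many tables, the manual edits described above could be scripted. The following Python sketch is one way that might look, assuming the scripted file is a JSON create command with the database, model, tables and partitions nested the way the snippets above suggest. The prepare_script and strip_key helpers and the sample dictionary are hypothetical, and I have not run this against a real model, so treat it as a starting point only.

```python
import json


def strip_key(node, key):
    """Recursively delete every occurrence of `key` from nested dicts/lists."""
    if isinstance(node, dict):
        node.pop(key, None)
        for value in node.values():
            strip_key(value, key)
    elif isinstance(node, list):
        for item in node:
            strip_key(item, key)


def prepare_script(script, new_name, level, view_names):
    """Apply the manual edits described above to a parsed create script."""
    db = script["create"]["database"]
    db["name"] = new_name                 # replace the GUID database name
    db["compatibilityLevel"] = level      # e.g. 1200 for SQL Server 2016
    strip_key(db, "variations")           # auto date hierarchy references
    for table in db["model"]["tables"]:
        view = view_names.get(table["name"])
        if view is None:
            continue
        for partition in table.get("partitions", []):
            source = partition.get("source", {})
            if "query" in source:
                source["query"] = f"SELECT * FROM [{view}]"
    return script


# A cut-down stand-in for the scripted file (assumed shape, made-up values).
script = {"create": {"database": {
    "name": "hypothetical-guid-name",
    "compatibilityLevel": 1465,
    "model": {"tables": [{
        "name": "Rates",
        "columns": [{"name": "RateDate",
                     "variations": [{"name": "Variation"}]}],
        "partitions": [{"source": {"type": "query",
                                   "query": "SELECT * FROM [Rates]"}}],
    }]},
}}}

fixed = prepare_script(script, "SampleAS", 1200, {"Rates": "vw_PaidRates"})
print(json.dumps(fixed, indent=2))
```

For a real file you would load it with json.load, run prepare_script, and write the result back out before executing it in Management Studio. Any remaining errors would still need to be fixed by hand.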

Upgrade Process from Power BI to Tabular

The process was tedious, and it made me long for the 15 minute Azure AS conversion. The method I outlined here is not supported by Microsoft, so it is up to you to make it work. I have been able to successfully convert a model, it just takes a lot of time. Fixing the Power BI model up front helps, and it would also help if little of your text is formatted, but there are some things, like the query fixes, that you just have to do on your own. I understand that the conversion process is much easier if you are using SQL Server 2019, as that version of Analysis Services is mostly compatible with the version created by Power BI. Unfortunately I don’t have any clients yet who are using 2019. I hold out hope it will be easier to do next time, but I will allow for a lot more time to make the conversion happen.


Using Power BI without a Power BI Account or Active Directory Entries

Believe it or not there is a way to use Power BI without signing up or having an account. With an embedded application, you can use one Power BI ID and have all users use this account to log in. Now before you judge this concept as a horrible security idea, keep on reading as you may change your mind. In this embedded application all users log in using a unique ID and password, and the data seen in Power BI is restricted based upon that ID. How do you accomplish this seemingly incongruent task and why would you ever do such a thing? Well it turns out there is a logical reason for wanting to implement a Power BI application this way.

Creating a way to Securely Access Power BI Data for Customers

There are many companies which would like to provide Power BI reports which would allow customers to interactively work with their data, but they don’t want to create Power BI accounts for customers as that can be a lot of work from an administrative standpoint. For the same reason, these customers are not added to the corporate network, which means they are not added to Active Directory. For example, if Desert Isle SQL contracts with Acme Corporation to create a custom conference display, Acme might want to show me a report showing when the components were purchased, when they were modified, when the order is in process and when the order is completed. How do I show a Power BI report containing this information? From an application design perspective, data from all of the customers should be stored in the same place, and Desert Isle SQL should only see their own orders when logging in to Acme’s site. Here is the workflow that I want to implement.

Passing Login information to Power BI

When creating an embedded application, connecting requires a connection string. It is possible to pass additional information on the connection by modifying the gateway to use an effective identity and then passing the role information you want to use. There are two configuration steps you need to complete to make this work. The Power BI gateway needs to be configured to use CustomData through the Map user names screen. Also, SSAS needs to be configured to use roles, as the role will restrict the data that users can access. The CustomData can contain a comma delimited list of values, which can include the data I need to access the role. In the DAX for the role, CUSTOMDATA() returns the value that was passed in, so the role filter can compare it with a column, along the lines of DimTerritory[TerritoryName] = IF(CUSTOMDATA() = "username", "territory"). This will restrict the data that a customer can see based on the territory they have been assigned. The ID can then be used to implement row level security in Power BI with either the embedded data model or with Analysis Services Tabular. By using this method, you have the ability to restrict the data for each user and use one Power BI account all at the same time.
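On the application side, the identity information travels in the request for an embed token. The Python sketch below only builds the request body and does not call any service. The field names follow my understanding of the effective identity section of the Power BI REST API token request, so check them against the current documentation, and the account, role, territory and dataset ID are all hypothetical.

```python
import json


def build_token_request(master_user, role, custom_data, dataset_id):
    """Build the JSON body for an embed-token request with an effective identity."""
    return {
        "accessLevel": "View",
        "identities": [{
            "username": master_user,    # the single shared Power BI account
            "roles": [role],            # role defined in the model
            "customData": custom_data,  # value read by CUSTOMDATA() in the role
            "datasets": [dataset_id],
        }],
    }


body = build_token_request(
    master_user="reports@acme.example",   # hypothetical account
    role="TerritoryRole",                 # hypothetical role name
    custom_data="Southwest",              # hypothetical territory for this login
    dataset_id="00000000-0000-0000-0000-000000000000",
)
print(json.dumps(body, indent=2))
```

The application would look up the territory for whoever logged in to its own site and pass it as the custom data, so the customer never needs a Power BI account.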

 

Costs for Implementing Multi-User Power BI Systems

Unfortunately, this solution means that you are going to be purchasing the embedded version of Power BI as this functionality is not covered with a Pro License. Embedded applications require that you purchase an embedded license or have a premium account.  The pricing for embedded has changed quite a bit from 2017 when it was introduced.  Pricing is all about capacity, unless you use a Premium account.

Power BI can be implemented in a number of different ways, and this implementation is one that you may see more of in the future. There are a lot of different things that you can implement with Power BI and it is hard to keep up with all of the changes. If you are interested in learning more about some advanced Power BI topics, join me in person in Boston for a full day of Advanced Power BI Training on Friday, September 21. I look forward to meeting you there or anywhere else we might meet up.


Many to Many and Composite Modeling Power BI

The July 2018 release of Power BI includes a lot of neat new features, like wallpaper and of course getting rid of the big formatting frame on Power BI Visuals, but Composite Modeling really stands out. Have you ever created a direct connection to a tabular model and then wanted to add that one Excel spreadsheet for another sales group or needed to add some economic web data? I have. If I was lucky I was able to get it added to the tabular model.  Otherwise I had to give up on the idea. Adding more tables to the model which is using direct query is now possible with Composite Model.

Turning on Composite Modeling

Since this feature was just released, it is still in preview.  In order to be able to use it, go to the Power BI File tab then Options and Settings->Options->Preview features, as shown in the picture.  You will have to restart Power BI after this option is changed. This option not only turns on Composite Modeling, it also enables Many to Many data modeling.

Many to Many Data Modeling

In Power BI, if you have tried to create a relationship between two columns and both of them have duplicate values, you will get an error. If you need to join two tables and they have duplicate key values, more than likely you are familiar with the work around, which is to create another intermediate table which has a column with unique values to create the relationship. If you hide the table you can almost pretend that it really works. When you have Composite Models enabled in the Preview Features, you can create a direct relationship between tables which do not have unique keys! The Many to Many relationship will be created for you in Power BI.
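The difference between the relationship types comes down to whether the key column is unique on each side. This small Python sketch classifies a relationship that way and also builds the intermediate table used in the work around. The function names and sample values are made up, and this is only an illustration of the idea, not how Power BI decides cardinality internally.

```python
from collections import Counter


def has_duplicates(values):
    """True when at least one key value appears more than once."""
    return any(count > 1 for count in Counter(values).values())


def relationship_kind(left_keys, right_keys):
    """Classify a relationship by whether each side's key column is unique."""
    left_many = has_duplicates(left_keys)
    right_many = has_duplicates(right_keys)
    if left_many and right_many:
        return "many-to-many"
    if left_many or right_many:
        return "one-to-many"
    return "one-to-one"


def bridge_table(left_keys, right_keys):
    """The classic work around: a table holding each key value exactly once."""
    return sorted(set(left_keys) | set(right_keys))


budget_regions = ["East", "East", "West"]   # hypothetical Budget[Region]
sales_regions = ["East", "West", "West"]    # hypothetical Sales[Region]

print(relationship_kind(budget_regions, sales_regions))
print(bridge_table(budget_regions, sales_regions))
```

With the preview feature on, the first result is a relationship you can create directly, and the bridge table in the second result is no longer needed.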

Caveats with the Composite Modeling Feature

If you are like me, after playing around with both Many to Many and Composite Modeling you are probably ready to deploy these features into production. That is a problem. As Composite Modeling is a preview feature, it cannot be loaded to the Power BI Service, for now. It will work as a desktop prototype, but you cannot distribute it, yet.

There is another big factor regarding Many to Many which you may have noticed when trying to get this feature to work. It is not possible to incorporate Many to Many with SQL Server Analysis Services Tabular Models, yet. It is my understanding that Microsoft plans to have this available for Analysis Services data sources, but for now, it is not possible to model them.

If you are planning on releasing something that uses either of these two features, for now you can create and use them in a desktop prototype. I am looking forward to the day, which I hope will be later this year, when these features can be published to the Power BI Service.


Gaining Better DAX Skills

If you work with Excel Pivot Tables, Power BI, or Analysis Services Tabular, more than likely you are working to improve your DAX skills. I know I am. While I do a lot of work with DAX, I am always looking to get better at writing it and better at explaining how to write it for clients or class attendees. As I find things which I think others may find helpful, I write about them here to help spread the knowledge, especially to those people who have a background in SQL.

DAX History

Microsoft has been working on the DAX language for about 10 years and released it as part of the Excel 2010 Power Pivot Add-in, back when Microsoft was incorporating Power BI functionality in Excel. Microsoft wrote DAX so that it would be like Excel, so that people who are familiar with Excel would have an easier time writing it. While DAX is easier than MDX, it is different enough from Excel that it can get quite complex. One of the founders of the DAX language is Jeffrey Wang, who is still actively working on enhancing the language further. He works not only with developers but also with people in the technical community who work extensively with DAX to get their input. I also found out recently that he took time out to do a technical review of a book on the DAX language by Phil Seamark.

Leveraging SQL Knowledge to Learn DAX

If, like me, your skills lie not in Excel but in SQL, reading a book which builds upon SQL knowledge to understand DAX is really helpful. Beginning DAX with Power BI provides examples of how to write something in SQL and then shows how the same functionality would be written in DAX, sort of like a SQL-to-DAX language library. Showing how to get the answer from both SQL and DAX is a great technique to provide a better understanding of how to write DAX. I worked with a client recently who validated all of the DAX in the tabular model by writing SQL to ensure they were getting the correct answers. I think the validation process would have been improved with this book, which takes DAX out of the black magic category and explains the way it works.

Performance Tuning DAX

Once you start writing a lot of DAX, you will at some point write a query which performs badly.  What do you do to avoid poorly performing DAX? Phil provides some “old-school debugging” techniques to understand what is being generated and where it is going wrong. He also shows how you can use DAX Studio and SSMS to help figure out how and why your DAX is working the way it is. You will definitely want to reference Chapter 8 to learn more about this important topic.

Other Methods for Learning DAX and Improving Power BI skills

If you are looking for a hands-on environment to better understand how DAX works and how to improve your Power BI skills I have two opportunities for you in the US. I will be giving Advanced Power BI Classes in Columbus, Ohio on July 27 and Boston, Massachusetts on September 20.  Of course we will be covering more advanced DAX development and tuning as well as applied visualization theory to improve the interactivity of your Power BI reports as well as hands on labs to implement the new drill-through features and enhancements in Power BI and other class topics.  I hope to meet you there or at some of the other conferences I will be attending this year.  Check out my Engagement page and hopefully find a place where we can meet up as I would love to meet you.


Creating Power BI Date Tables by Marking as Date Table

In a previous post, I wrote about the fact that Power BI did not have the capability to Mark a table as date, but that has changed! Now if you want a date table to use numeric keys to join to other tables you can and still get the time dimension functionality to work.  Now there is no need to use some kind of a work around to get a date table, you can just make it one.

Marking a Date Table in Power BI

Naturally there is more than one way to mark a table as a date table in Power BI. The first way is to click on the ellipsis (you know, the three dots) next to the date table name in the Power BI Fields list, which I highlighted in yellow. The menu contains an option to Mark as date table. If you select, say, the locations table, Power BI is smart enough to know that this is not a date table and the option will not appear. The first time that you select the option it will show the window shown above and prompt you to select the column in the table which is an actual date value. Once you select the OK button, the menu item will have a check mark next to the Mark as date table option, and if you select it again you will see this window. Generally speaking there is no reason to unmark a date table, but if you want to, you can.

The other place in Power BI where you can mark a date table is in the report view, where there is an option in the Modeling tab to Mark as Date Table. Note this option will be disabled unless you have clicked on a table first. If you are in the grid view or the relationship view you won’t see the option to Mark as Date Table, as it is only available in the Modeling tab of the report view.

Time Dimension Functions: The reason for marking a Date Table

For those of you who may be wondering why you would want to mark a table as a date table anyway, there is a very good reason. Marking a table as a date table allows you to use all of the cool features which Microsoft added to DAX to do date calculations for you. While it is possible to not use any time based DAX functions, unless you are rather expert at DAX, like these guys, you probably will want to use the nice DAX functions Microsoft created for handy things like TOTALYTD or SAMEPERIODLASTYEAR. Now just because you have a date table added does not mean that you can use any date in those calculations. You need to reference the table that you have marked as a date table, not a date field within another table. That means that CALCULATE(SUM(ResellerSales_USD[SalesAmount_USD]), SAMEPERIODLASTYEAR(DateTime[DateKey])) will return the correct value and CALCULATE(SUM(ResellerSales_USD[SalesAmount_USD]), SAMEPERIODLASTYEAR(ResellerSales_USD[DateKey])) will not. You need to reference the table marked as a date table every single time for time dimension functions. For an example of a Power BI Desktop DAX calculation not working when you are not using a marked date table, check out this post.
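To show why the calculation has to go through the date table, here is a small Python sketch of the same idea. The fact rows only carry a numeric key, so the only way to work out what the same period last year means is to look the key up in the date table. This is a loose analogy and not how the DAX engine evaluates SAMEPERIODLASTYEAR, and the keys and amounts are made up.

```python
from datetime import date

# Date table: numeric key -> calendar date (the column picked when marking).
date_table = {
    20170115: date(2017, 1, 15),
    20170220: date(2017, 2, 20),
    20180115: date(2018, 1, 15),
    20180220: date(2018, 2, 20),
}

# Fact rows join to the date table on the numeric key. Amounts are invented.
sales = [
    (20170115, 100.0),
    (20170220, 40.0),
    (20180115, 130.0),
    (20180220, 55.0),
]


def total_for(year, month):
    """Sum sales whose date-table date falls in the given year and month."""
    return sum(
        amount
        for key, amount in sales
        if date_table[key].year == year and date_table[key].month == month
    )


def same_period_last_year(year, month):
    """Shift the period back one year using the date table, then total it."""
    return total_for(year - 1, month)


print(total_for(2018, 1), same_period_last_year(2018, 1))
```

Nothing in the sales rows alone says that key 20170115 is one year before 20180115. That knowledge lives in the date table, which is why the DAX functions need to be pointed at it.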

Power BI Update Requests

Power BI updates every month, and you may miss some of the things that the team released last month if you do not check out this site every month. I know that the Power BI team takes ideas for upcoming releases from https://ideas.powerbi.com, and there is one thing that I really wish would get some more votes so that it can be added soon, which is another idea, like Marking Date Tables, that started out in SSAS Tabular. Having recently worked on a project with a ton of measures, the next feature that I hope Power BI inherits from SSAS Tabular is the ability to create folders for measures. If you would like that too, vote for it here.


Analyzing Salary Data with Power BI and R – Part 2

As advanced analytical techniques become more popular, companies are looking to hire people to find answers in their data. What kind of answers? Predicting the future, determining what factors make customers leave, what kind of products they can get good customers to buy, what conditions are related, whether this is a valid transaction, and similar questions. What answers can be provided has everything to do with the available data. The data I chose to analyze was the salary data provided by Brent Ozar, which is publicly available here. I started looking at the data in a previous post, where I did an initial review of the data and discussed the data analysis process.

Regression Analysis – What kind of relationships can I find in the Data?

Looking at how data is related is a very important step in data analysis. Most often various items are analyzed using some linear regression algorithms to compare one or more variables together. For this kind of analysis, generally speaking all of the data needs to be represented numerically, which means that if data instead exists as categories of items, the data will need to be transformed. For example, on the salary data which I analyzed and published in Power BI, the job experience and salary were compared using the ggplot library from R, with the two values on their respective axes. What I was hoping to find is that there was a strong relationship between these two values. If you look at the graph, you can see this is not the case. Interestingly enough, while the line shows an upward trend, you can see a drop in salaries for those with a lot of job experience. Those people with the most experience, above 35 years, are making less money than those with less experience. The graph also shows that those who are just starting in their careers are not necessarily making very little money. What the data shows is there is no guarantee that the longer you work the more money you make.
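For anyone who wants to see what the trend line is measuring, here is a short Python sketch of a least squares fit together with the correlation coefficient. I did the actual analysis in R with ggplot, so this is a different tool showing the same idea. The numbers below are synthetic values chosen only to mimic the shape described above and are not taken from the survey.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Synthetic illustration only: years of experience, salary in thousands.
years = [1, 5, 10, 20, 30, 40]
salary = [60, 80, 95, 110, 105, 85]

slope, intercept, r = linear_fit(years, salary)
print(f"slope={slope:.2f} intercept={intercept:.1f} r={r:.2f}")
```

A positive slope with a correlation well below 1 is the pattern described above: the line trends upward, but experience alone does not explain much of the variation in salary.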

Data Cleansing for Analysis

Because I am looking for data trends and not anomaly detection, I normalized the survey data. I eliminated the 100 people who did not fill in the salary amount, and cut off the high and the low values. I used the box plots generated in Power BI as a guideline for the ranges to exclude. As I was also interested in determining the difference in the responses between male and female, I did some data substitution on some of the values because I wanted to include more records. In 2018, the only year that this question was asked, 87.6% of the respondents were male. I made the decision to include, as male, all of the respondents in categories where the number of respondents was less than .22%, so that I would have more data to evaluate. I modified all of the data in Power BI using M code. You can take a look at all of the modifications I made to the data here in the Power BI report I created, as I am making it available for the next 30 days.
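The cleansing itself was done in Power BI with M code, which is not reproduced here. As a rough illustration of the same two steps (drop the rows with no salary, then cut off the high and low ends), the Python sketch below uses the conventional box plot whiskers of 1.5 times the interquartile range. That multiplier is an assumption on my part for the example; the actual cutoffs I used were read off the box plots. The function name and the sample rows are hypothetical.

```python
from statistics import quantiles

def trim_salaries(rows, k=1.5):
    """Drop rows with no salary, then drop rows outside the box plot whiskers.

    k=1.5 is the common whisker convention, assumed here for illustration.
    """
    answered = [row for row in rows if row.get("salary") is not None]
    q1, _, q3 = quantiles([row["salary"] for row in answered], n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [row for row in answered if low <= row["salary"] <= high]

# Hypothetical survey rows, not real responses.
survey = [
    {"salary": None},
    {"salary": 50000},
    {"salary": 60000},
    {"salary": 70000},
    {"salary": 80000},
    {"salary": 90000},
    {"salary": 1450000},
]

cleaned = trim_salaries(survey)
print(len(survey), "rows before,", len(cleaned), "rows after")
```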

Examining the Top 5%

Recently I have had some conversations with some colleagues regarding salary, and that led me to want to review what people would like to make. Most people would like to be making the most money possible in their profession, and are not interested in moving, which is why I chose not to do much with the geographic data. I ran a number of different machine learning algorithms on the data, trying to find a definitive set of results among those who reported making the most money. The results of those experiments were inconclusive. While I found some items which were common among the highest earners, the results were not statistically conclusive. There are a number of conceptions that people have regarding salary, and I chose to illustrate some of them to dispel some myths surrounding data. I also grouped the salaries into groups: 95% for above 153,565, 75% for above 67,789, and the rest for the average. These numbers were based on the values in the box plot in the top left of the Power BI report.
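The grouping described above is a simple bucketing rule. A minimal Python sketch, using the two cutoffs from the report and treating "above" as strictly greater than (an assumption about the boundary case), might look like this. The function name and label strings are illustrative.

```python
TOP_5_CUTOFF = 153_565   # salaries above this fall in the "95%" group
TOP_25_CUTOFF = 67_789   # salaries above this fall in the "75%" group

def salary_group(salary):
    """Assign a salary to one of the three groups used in the report."""
    if salary > TOP_5_CUTOFF:
        return "95%"
    if salary > TOP_25_CUTOFF:
        return "75%"
    return "Average"

# Made-up salaries to show each branch.
for amount in (45_000, 98_000, 200_000):
    print(amount, salary_group(amount))
```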

Salary Conclusions – Myth vs Reality

I know that I have heard that if you want to make money you need to get into management. Being a good manager is not the same skill set as being a good database professional, and there are many people who do not want to be managers. According to the data in the survey, you can be in the top 5% of wage earners and not be a manager. How about telecommuting? What is the impact of telecommuting on the top 5%? Well, it depends on whether you are looking at the much smaller female population. The majority of females in the top 5% telecommute. Those who telecommute 100% of the time do very well, as do those who spend every day at a job site. Males also report working more hours and telecommuting less than females do. If you look at people who are in the average category, they do not telecommute. The average category also has 25% of people who work less than 40 hours a week. If you look at the number of items in the category by country, you can determine that in many cases, like Uganda, there are not enough survey respondents to draw any conclusions about salary in those locations.

After spending quite a bit of time analyzing and visualizing the data, I was unable to determine a specific set of skills that provides a roadmap of exactly what one needs to do to be in the top 5% of salaries for a data professional. What I can tell you is that, more than likely, there is someone with your level of work experience and position who is doing really well, and there is no reason why, by the time the next survey comes out, you cannot be the person who is in the top 5 percent. This may mean working harder at your job and perhaps changing employers, as the analysis shows that is the best way to make more money.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Analyzing Salary Data with Power BI and R – Part 1

CRISP-DM Diagram

The standard method for analyzing data is the CRoss Industry Standard Process for Data Mining [CRISP-DM]. Rather than describe the method, this post will walk through the process to illustrate how to analyze data using it. The data that I selected for analysis is the Brent Ozar Salary Survey information. The data is available under an open source license and contains two years of answers to salary questions, with a total of over 6,000 responses. Understanding what is in the data determines what kind of answers it can provide. What can the data reveal? Prior to drawing any conclusions, one needs to examine the data to determine its level of completeness and correctness, and whether or not there is enough data to make decisions based upon it.

Data Understanding

The first step in the process is to analyze the data to evaluate what kind of knowledge you can gather from it. The primary perspective of the data is salary, and the survey describes the characteristics which people with a certain salary level have. The survey used to gather the data contained a number of drop-down boxes, and those fields can be used as categorical variables as there are a fixed number of possible responses. Other survey items allowed people to freely enter anything, which makes it harder to statistically analyze some of the data.

 

Where Do Data Professionals Make the Most Money?

In the survey for 2018, the people who made the most money were from Hong Kong, with an average salary of $263,289. Before you start planning on moving, you might want to look at the data a little closer. There were 2 people who responded from Hong Kong. One of them said he was making over 1.4 million dollars, the highest amount reported in the survey. Given the fact that we only have two responses from Hong Kong, we are unable to draw a definitive conclusion. To be able to answer that question, more analysis will need to be done on the location and salary information, and you will probably want to add market basket criteria, because a dollar in Hong Kong, where the average apartment rental is $3,237 a month, does not go as far as it does in, say, Uganda, where the rent is around $187 a month.
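One way to avoid being misled by an average built on two records is to always report the number of respondents next to the mean and to flag groups that fall below some minimum. The Python sketch below does that. The rows are invented for the example, and the minimum of 30 responses is a hypothetical threshold, not a rule taken from the survey or from Power BI.

```python
from collections import defaultdict
from statistics import mean

MIN_RESPONSES = 30  # hypothetical threshold for "enough data"

def salary_by_country(rows, min_responses=MIN_RESPONSES):
    """Return count, mean salary and an enough-data flag for each country."""
    groups = defaultdict(list)
    for country, salary in rows:
        groups[country].append(salary)
    return {
        country: {
            "n": len(salaries),
            "mean": mean(salaries),
            "enough_data": len(salaries) >= min_responses,
        }
        for country, salaries in groups.items()
    }

# Invented rows: two responses for one country, three for another.
rows = [
    ("Hong Kong", 100_000),
    ("Hong Kong", 400_000),
    ("United States", 90_000),
    ("United States", 110_000),
    ("United States", 100_000),
]

# A small threshold is used only so this tiny example flags one group.
summary = salary_by_country(rows, min_responses=3)
for country, stats in summary.items():
    print(country, stats)
```

The country with the highest mean in this toy data is also the one flagged as not having enough data, which is the same situation as the Hong Kong average in the survey.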

Using Power BI to Provide Data Understanding

The data modeling step of the CRISP-DM process anticipates that you will want to modify the raw data. There may be records containing null or erroneous values, in which case you will need to either eliminate the entire record or substitute an entry for the particular value. You can also use this analysis to determine what conclusions you will be able to derive from the data. For example, if you wanted to analyze what criteria are required for Microsoft Access Developers to make over $100,000 a year, you could easily find out in Power BI that it is not possible to do that analysis, as there is no data for that set of criteria. If you want to do a year over year analysis of people who are working as DBAs, which I show in the second tab, you will need to change some of the categories, as they changed from 2017 to 2018.
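The same "is there any data for this question?" check can be sketched outside of Power BI. The Python below counts the rows that match a set of criteria before any analysis is attempted. The field names (job_title, salary) and the rows are hypothetical and are not the survey's actual column names.

```python
def count_matching(rows, min_salary=0, **criteria):
    """Count rows that match every criterion and meet the salary floor."""
    return sum(
        1
        for row in rows
        if row["salary"] > min_salary
        and all(row.get(field) == value for field, value in criteria.items())
    )

# Hypothetical rows with hypothetical field names.
responses = [
    {"job_title": "DBA", "salary": 95_000},
    {"job_title": "DBA", "salary": 120_000},
    {"job_title": "Access Developer", "salary": 60_000},
]

dba_over_100k = count_matching(responses, min_salary=100_000, job_title="DBA")
access_over_100k = count_matching(
    responses, min_salary=100_000, job_title="Access Developer"
)
print(dba_over_100k, access_over_100k)
```

A count of zero for a set of criteria means the question cannot be answered from this data, no matter which algorithm is applied afterwards.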

Analyzing Data with R and Power BI

Many times when providing a final report to explain your analysis, you will need to provide some documentation to demonstrate your conclusions. In addition to creating some visualizations in Power BI, I also created some visualizations and analysis in R. While I can include any R library I wish in Power BI Desktop, there are only 364 currently added to the Power BI Service in Azure. If there is an R visualization you would like to add, you can send an email request to rinpbifb@microsoft.com and ask for it to be added.

Power BI Salary Data

For more information on Analyzing Data with Power BI and R, I recorded a video for Microsoft’s Power BI team which is available here. The video shows the cleaning process and some information regarding the analysis process. The analysis of the Salary data itself will be included in another post. If you would like to find out when the next post in this series is available, please subscribe to my website for all of the latest updates.

*** UPDATE: My next post on further analysis of this data is available here.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur