Analysis Services Tabular Best Practices – Part 1

Recently I have been working on Analysis Services Tabular as a data source to Power BI and wanted to share some of the best practices for developing a model. Following these guidelines will improve the performance and scale of your project. There is a lot of material to cover, so I have broken up the information into this series.

Analysis Services Internal Dictionary Encoding

Understanding the internal components drives a number of design elements. The rules for Tabular are different because the underlying technology is different.  Analysis Service is a columnar based database, which means that it is most efficient on summarizing columns of data.  This methodology is incorporated in the Vertipaq Engine which is compresses the data in memory. The data is stored within a dictionary and is modified to store information optimally. Internally all of the data stored is translated into an integer value.  This means that two date fields when joined in a model will be as efficient as two integer values, as all values are internally represented as numbers.  Dictionary entries are created for each column.  The most efficient compression would be if the column contains one value as there would only need to have one entry.  If there were 4 different numbers in the column and the values were sorted on that column, there would be 4 dictionary entries.  If instead the numbers were ordered sequentially, such as 1,2,3,4,1,2,3,4, compression would not be possible and there would be an entry for each row.  When possible, data should be sorted so that like numbers are together.   Whenever possible, do not load unique values, as this data cannot be compressed.

Create tables with Underlying Queries with Views

Microsoft recommends that when getting data from a relational database, the data retrieved should be encapsulated with views.  This creates a record in the source data of the information used in by Power BI in the source database which can help ensure that the data structures are not modified in a way that will adversely impact the model.  The views can be materialized which can prevent the structure from changing without changing the view.  This provides visibility to the database administrator of the structure needed for Analysis Services.  Using views also can insulate Analysis Services from being dependent on the existing data model as the view does not have to match existing data types.  All views need to include an explicit list of columns for a deliberate inclusion of requested columns and should be sorted.

Extra columns should be eliminated from the view and the model as they will increase the size for no reason. The views should only contain columns which are actually used in either the DAX, reporting fields or to join tables together.  Do not bring in every field in the source table as it is not a best practice. It is more efficient to only bring in the columns needed rather than hide columns not used.  If the columns are not needed in DAX or for joining tables, do not include them in the model.  The views need to be written to run as efficient as possible on the source system, so be sure to tune the views as well.

Data Type Selection

The data type selected will impact the physical storage used, not the compression of the models in memory.  It is important whenever possible to reduce the cardinality of the data in order to be able to sort the data effectively.  When storing decimal numbers, unless you need many significant digits, store the data as Currency as it will take less space in physical storage than decimal.

Following these rules will help your model be better interpreted for dictionary encoding and will improve the compression of your data meaning it will load faster.  In the next part of this series I will discuss other things you can do to improve usability and structure of the data model. Let me know if you find this helpful by adding a comment or subscribing to this blog.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Using Power BI without an Power BI Account or Active Directory Entries

Believe it or not there is a way to use Power BI without signing up or having an account.  With an embedded application, you can use one Power BI ID and have all users use this account to log in. Now before you judge this is concept as a horrible security idea, keep on reading as you may change your mind. In this embedded application all users to login using a unique id and password and restrict the data seen in Power BI data based upon that id. How do you accomplish this seemingly incongruent task and why would you ever do such a thing? Well it turns out there is a logical reason for wanting to implement a Power BI application this way.

Creating a way to Securely Access Power BI Data for Customers

There are many companies which would like to provide Power BI reports which would allow customers to interactively work with their data, but they don’t want to create Power BI accounts for customers as that can be a lot of work from an administrative standpoint.  For the same reason, these customers are not added to the corporate network which means they are not added Active Directory.  For example, if Desert Isle SQL contracts with Acme Corporation to create a custom conference display, Acme might want to show me a report showing when the components were purchased, when they were modified and when the order is in process and when the order is completed.  How do I show a Power BI report containing information? From an application design perspective data from all of the customers should be stored in the same place and Desert Isle SQL should only see their orders when logging in to Acme’s site.  Here is the workflow that I want to implement.

 

Passing Login information to Power BI

When creating an embedded application, connecting requires a connection string.  It is possible to pass additional information to the connection string buy modifying the gateway to use effective identity and then pass the role information you want to use.  There are two configuration steps you need to complete to make this work.  The Power BI gateway needs to be configured to use CustomData through the Map User name screen. Also SSAS needs to be configured to use Roles as the role will restrict the data that users can access. The CustomData can contain a comma delimited list of values, which can include the data I need to have to access the role.  In the DAC for the role, the CUSTOMDATA field as performs as if it was a table. The DAX in the role would provide permissions based on the value of that table DimTerritory[TerritoryName] = IF(CUSTOMDATA() = “username”.“territory” .  This will restrict the data that a customer can see based on the territory they have assigned.  The Id can then be used to implement Row level security in Power BI with either the embedded data model or with Analysis Services Tabular. By using this method, you have the ability to restrict the data for each user and use one Power BI account all at the same time.

 

Costs for Implementation Multi-User Power BI Systems

Unfortunately, this solution means that you are going to be purchasing the embedded version of Power BI as this functionality is not covered with a Pro License. Embedded applications require that you purchase an embedded license or have a premium account.  The pricing for embedded has changed quite a bit from 2017 when it was introduced.  Pricing is all about capacity, unless you use a Premium account.

Power BI can be implemented in a number of different ways, and this implementation is one that you may see more of in the future.  There are a lot of different things that you can implement Power BI and it is hard to keep up with all of the changes. If you are interested in learning more about some of Advanced Power BI topics, join me in person in Boston for a full day of Advanced Power BI Training on Friday, September 21.  I look forward to meeting you here or anywhere else we might meet up.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Many to Many and Composite Modeling Power BI

The July 2018 release of Power BI includes a lot of neat new features, like wallpaper and of course getting rid of the big formatting frame on Power BI Visuals, but Composite Modeling really stands out. Have you ever created a direct connection to a tabular model and then wanted to add that one Excel spreadsheet for another sales group or needed to add some economic web data? I have. If I was lucky I was able to get it added to the tabular model.  Otherwise I had to give up on the idea. Adding more tables to the model which is using direct query is now possible with Composite Model.

Turning on Composite Modeling

Since this feature was just released, it is still in preview.  In order to be able to use it, go to the Power BI File tab then Options and Settings->Options->Preview features, as shown in the picture.  You will have to restart Power BI after this option is changed. This option not only turns on Composite Modeling, it also enables Many to Many data modeling.

Many to Many Data Modeling

In Power BI if you have tried to create a join relationship between two columns and one of them has duplicate values, you will get an error.  If you need to join two tables and they have duplicate key values, more than likely you are familiar with the work around, which is to create another intermediate table which has a column with unique values to create the relationship.  If you hide the table you can almost pretend that it really works. When you have Composite Models enabled in the Preview Features, you can create a direct relationship between tables which do not have unique keys! The Many to many relationship will be created for you in Power BI.

Caveats with the Composite Modeling Feature

Like me after playing around with both Many to Many and Composite Modeling you are probably all ready to deploy these features into production. That is a problem. As Composite modeling is a preview feature, it cannot be loaded to the Power BI Service, for now.  It will work as a desktop prototype, but you cannot distribute it, yet.

There is another big factor regarding Many to Many which you may have noticed when trying to get this feature to work. It is not possible to incorporate Many to Many with SQL Server Analysis Services Tabular Models, yet. It is my understanding that Microsoft plans to have this available for Analysis Services data sources, but for now, it is not possible to model them.

If you are planning on releasing either of these two features it is now possible to create and to use them in a desktop prototype.  I am looking forward to the day, which I hope will be later this year, when these features are available to be distributed to the Power BI Service.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Gaining Better DAX Skills

If you work with Excel Pivot Tables, Power BI, or Analysis Services Tabular more than likely you are working to improve your DAX skills. I know I am. While I do a lot of work with DAX, I am always looking to get better at writing it and better at explaining how to write it for clients or class attendees.  As I find things which I think others may find helpful, I write about them here to help spread the knowledge, especially to those people who have background in SQL.

DAX History

Microsoft has been working on the DAX language for about 10 years and released it in as part of Excel 2010 Power Pivot Add-in, back when Microsoft was incorporating Power BI functionality in Excel. Microsoft wrote DAX so that it would be like Excel so that people who are familiar with Excel would have an easier time writing it. While DAX is easier than MDX, it is enough different from Excel that it can get quite complex. One of the founders of the DAX language is Jeffery Wang, who is still actively working on enhancing the language further.  He works not only with developers but also people in the technical community who work extensively with DAX to get their input. I also found out recently that he took time out to do a technical review of a book on the DAX language by Phil Seamark b | t.

Leveraging SQL Knowledge to Learn DAX

If like me, your skills like not in Excel but in SQL, reading a book which builds upon SQL knowledge to understand DAX is really helpful.  Beginning DAX with Power BI provides examples of how to write something in SQL then shows how the same functionality would be written in DAX, sort of like a SQL-to-DAX language library.  Showing how to get the answer from SQL and DAX is a great technique to provide a better understanding of how to write DAX. I worked a client recently who validated all of the DAX in the tabular model by writing SQL to ensure they were getting the correct answers. I think the validation process would have been improved with this book which takes DAX out of the black magic category and explains the way it works.

Performance Tuning DAX

Once you start writing a lot of DAX, you will at some point write a query which performs badly.  What do you do to avoid poorly performing DAX? Phil provides some “old-school debugging” techniques to understand what is being generated and where it is going wrong. He also shows how you can use DAX Studio and SSMS to help figure out how and why your DAX is working the way it is. You will definitely want to reference Chapter 8 to learn more about this important topic.

Other Methods for Learning DAX and Improving Power BI skills

If you are looking for a hands-on environment to better understand how DAX works and how to improve your Power BI skills I have two opportunities for you in the US. I will be giving Advanced Power BI Classes in Columbus, Ohio on July 27 and Boston, Massachusetts on September 20.  Of course we will be covering more advanced DAX development and tuning as well as applied visualization theory to improve the interactivity of your Power BI reports as well as hands on labs to implement the new drill-through features and enhancements in Power BI and other class topics.  I hope to meet you there or at some of the other conferences I will be attending this year.  Check out my Engagement page and hopefully find a place where we can meet up as I would love to meet you.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Data Science with Python

KD Nuggets Data Science/ Machine Learning PollFor those of you who might have missed it, the website KDnuggets released their latest internet survey on data science tools, and Python came out ahead, again. Python has continued to gain as a tool that people are using for Data Science.  The article accompanying the graphic is very interesting as it brings up two data related points. The first is the survey only had “over 2300 votes” and “…one vendor – RapidMiner – had a very active campaign to vote in KDnuggets poll”.  This points the fallacy in completely relying on data with an insufficiently sized data set, as it is possible to skew the results, which is true both for surveys and data science projects.  If you look at the remaining results one thing also strikes me as interesting. Anaconda and sci-kit learn are Python libraries.  Tensorflow could be used for either R or Python.  This does tend to increase the argument for more use of R or Python over RapidMiner.  The survey also made me want to check out RapidMiner.

Thoughts around Rapid Miner for Machine Learning

While I have not had enough time to fully analyze Rapid Miner, I thought I would give my initial analysis here and do a more detailed review of it in another post.  Rapid Miner scored well in the Kaggle Survey, but also it ranked highly on the 2018 Garner Magic Quadrant for Data Science Platforms.  Rapid Miner is trying to be a tool not only for data scientists, but also for business analysts as well.  The UI is pretty intuitive, which is good because the help is not what it should be. I also was less than impressed at its data visualization capabilities, as R and Python both provide much better visuals. Of course, I used the free version of the software, which works but it is limiting.  It looks like a lot of the new stuff is going to be only available on the paid version, which decreases my desire to really learn this tool.

Machine Learning Tools

Recently I have done a number of talks on Python in SQL Server, literally all around the world, including Brisbane, Australia tomorrow and Saturday, June 2 as well as in Christchurch New Zealand. As R was written in New Zealand, I thought that it would be the last place where people would be looking to use Python with Data Science, but several of the attendees of my precon on Machine Learning for SQL Server told me that where they worked, Python was being used to solve data science problems. Now of course this is anecdotal sample, as we are not talking about a statistically significant sample set, but that doesn’t keep it from being interesting.   The demand for Python training continues to increase as Microsoft has announced they are working on incorporating Machine Learning Service blog series with SQL Server Central.  The first two post have been released. Let me know what you think of them.

Upcoming Events

I am looking forward to talking about Machine Learning with SQL Server in Brisbane both at an intense day long session and at a one hour session on Implementing Python in SQL Server 2017 at SQL Saturday #713 – Brisbane, Australia. I look forward to seeing you there. For those who can’t make it, well, hopefully our paths will cross at a future event.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Creating Power BI Date Tables by Marking as Date Table

In a previous post, I wrote about the fact that Power BI did not have the capability to Mark a table as date, but that has changed! Now if you want a date table to use numeric keys to join to other tables you can and still get the time dimension functionality to work.  Now there is no need to use some kind of a work around to get a date table, you can just make it one.

Marking a Date Table in Power BI

Naturally there is more than one way to mark a table as a date table in Power BI.  The first way is to click on the ellipse (you know the three dots) next to the date table Name  in the Power BI Fields, which I highlighted in yellow. The menu contains an option to Mark as date table.  If you select say the locations table, Power BI is smart enough to know that this is not a date table and the option will not appear. The first time that you select the table it will show the window shown above and prompt you to select the column in the table which is an actual value.  Once you select the OK button, the menu item will have a check mark next to the Mark As Date Option and if you select it again you will see this window. Generally speaking there is no reason to Unmark a date table, but if you want to, you can.

The other place in Power BI where you can Mark a Date table is in the report view there is an option in the Modeling tab to Mark as Date Table. Note this option will be disabled unless you have clicked on a table first.  The icon is only available in the Modeling tab of the report view.  If you are in the grid view or the relationship view you won’t see the option to Mark as Date Table, it is only available in the Report Modeling tab.

Time Dimension Functions: The reason for marking a Date Table

For those of you who may be wondering: Why would I want to mark a table as a date anyway there is a very good reason.  Marking a table as a date allows you to use all of the cool features which Microsoft added to DAX to do date calculations for you.  While it is possible to not use any time based DAX functions, unless you are rather expert at DAX, like these guys, you probably will want to use the nice DAX functions Microsoft created for handy things like TOTALYTD or SAMEPERIODLASTYEAR.  Now just because you have a date table added does not mean that you can use any date in those calculations, you need to reference a table that you have marked as date, not a date field within the table.  That means that  CALCULATE(SUM(ResellerSales_USD[SalesAmount_USD]), SAMEPERIODLASTYEAR(DateTime[DateKey]))  will return the correct value and CALCULATE(SUM(ResellerSales_USD[SalesAmount_USD]), SAMEPERIODLASTYEAR(ResellserSales[DateKey]))  will not.  You need to reference the table marked as a date table every single time in for Time Dimensions.   For and example of a Power BI Desktop DAX calculation not working when you are not using a marked date table,  check out this post.

Power BI Update Requests

Power BI updates every month and you may miss some of the things that the team released last month if you do not check out this site every month.  I know that the Power BI teams takes ideas for upcoming releases from Https://ideas.powerbi.com, and there is one thing that I really wish would get some more votes so that it can be added soon, which is another idea, like Marking Date Tables, which started out in SSAS Tabular. Having recently worked on a project with a ton of measures. The next feature that I hope that Power BI inherits from SSAS tabular is the ability to create folders for measures.   If you would like that too, vote for it here.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Limitations in Time Series Data Analysis and the growth of Advanced Data Analytics

As someone who regularly analyzes data, I have done my share of time series analysis to determine trends over time.  I am struck by the fallibility of this sort of analysis.  For those who are unfamiliar with this time of analysis, time series analysis is performed to try to identify patters in the noise of data to help predict future trends through the use of algorithms like ARIMA. As I kid I remember hearing in fast announcer voice the following text “Past performance is no indication of future results”.  As a matter of fact this is a rule that the SEC requires mutual funds to tell all of their investors this statement.  Yet I get asked to do it anyway.  While I enjoy working with data and using advance analysis techniques including R and Python, I think it is important to realize the limitations of this sort of analysis.  It is considered a good experiment in Machine Learning if you are 85% right.  This is not acceptable if you are talking about a self-driving car as running people over 15% of the time is generally not considered acceptable. There are times when looking at the future that the data is not always going to provide an answer.  When looking to find answers in data, that needs to be something people keep in mind.  While you can find some answers in data, other answers will require prognostication or plan old guessing.

Impact on Technology realized in 2026

http://michaeljswart.com/2016/06/t-sql-tuesday-079-its-2016/comment-page-1/#comment-186750Data analysis is all about pattern matching, and while I don’t find it to be infallible, looking at a wideset of data has led me to plan accordingly.  While I am no Faith Popcorn, my analysis of what I see in the marketplace has led me to make some changes in my own life as I believe change is coming to the industry.  Adam Mechanic’s prompting for looking ahead to 2016 has provided the impetus to publish these theories.  What I see in the marketplace is the tools which are used to support databases are improving.  I see the ability of software to provide relevant hints and automate tuning of database queries and performance to continually improve, meaning there will be less of a need to employ people to perform this task.  I see with databases being pushed more and more to the cloud and managed services less and less need to employ many people to perform dba roles. Where I see the industry moving is towards more people being employed in analyzing the data to determine meaning from it.  I see that in 2026 very little data analysis being performed with R and most analysis being performed in Python.  This means that if you are looking ahead, and are employed in areas where people are being supplemented with tools, the time is now to learn skills in areas where there is growth. If you have been thinking about learning data science, Python and advanced analytics tools now is the time to start so that you will be prepared for the future.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Preparing for Exam 70-774 – Perform Cloud Data Science with Azure Machine Learning

There are a number of reasons why you might want to take a Microsoft cert exam. Maybe you want to focus your studies on a tangible thing, or you think it will help further your career, or you work for a Microsoft Partner and they required a certain number of people to pass the exam to maintain their current partner status.  I am not going to get into the long argument regarding whether or not a cert will help you in your career, or not, I can tell you why you might want to take the 70-774 exam. Machine Learning, or Data Science if you prefer, is an important analytic skill to have to analyze data.  I believe that it will only become more useful overtime. Azure Machine Learning is a good tool for learning the analysis process.  Once you have the concepts down, then should you need to use other tools to perform analysis it is just a matter of learning a new tool.  I talk to a number of people who are trying to learn new things, and the study them in their spare time.  It’s very easy to spend time vaguely studying something, but you may find that having a target set of items to study will focus your time, and as a bonus you get a neat badge and some measure of proof that you were spending time on the computer learning new things and not just watching cat videos.

Exam 70-774 Preparation Tips


While you could always buy the book for the exam (shameless plug as I was one of the authors), the book will not be enough and you will still need to write some code, and do some additional studying. This exam one of two needed for the MCSA in Data Science and you an take the exams in any order. The best place to start is by first looking at the 70-774 exam reference page from Microsoft.  There are four different sections in the exam, and I have created some links for each section which will help you prepare for the exam. In studying for exams in the past, the best way I have found to prepare is to look at everything on the outline and make sure that I know it.

Prepare Data for Analysis in Azure Machine Learning and Export from Azure Machine Learning

Normalizing Data
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data

TanH
https://reference.wolfram.com/language/ref/Tanh.html

ZScore
http://stattrek.com/statistics/dictionary.aspx?definition=z-score
http://howto.commetrics.com/methodology/statistics/normalization/

Min Max
https://www.quora.com/What-is-the-meaning-of-min-max-normalization

PCA
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/principal-component-analysis
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/principal-component-analysis
https://stackoverflow.com/questions/9590114/importance-of-pca-or-svd-in-machine-learning

SVD
http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/

Canonical-correlation analysis (CCA)
https://en.wikipedia.org/wiki/Canonical_correlation

Singular Value Decomposition (SVD)
http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/

Develop Machine Learning Models

Team Data Science
https://docs.microsoft.com/fi-fi/azure/machine-learning/team-data-science-process/python-data-access

K-Means
https://www.datascience.com/blog/k-means-clustering

Confusion Matrix
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
https://en.wikipedia.org/wiki/Confusion_matrix
https://en.wikipedia.org/wiki/F1_score

Ordinal Regression
https://en.wikipedia.org/wiki/Ordinal_regression

Poisson regression
https://en.wikipedia.org/wiki/Poisson_regression

Mean Absolute Error and Root Mean Squared Error
http://www.eumetrain.org/data/4/451/english/msg/ver_cont_var/uos3/uos3_ko1.htm

Cross Validation
https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f

Operationalize and Manage Azure Machine Learning Services

Connect to a published Machine Learning web service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/publish-a-machine-learning-web-service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/consume-web-service-with-web-app-template
https://docs.microsoft.com/en-us/azure/machine-learning/studio/manage-new-webservice

Use Other Services for Machine Learning

Microsoft Cognitive Toolkit
https://www.microsoft.com/en-us/cognitive-toolkit/

BrainScript
https://docs.microsoft.com/en-us/cognitive-toolkit/brainscript-basic-concepts

Streamline development by using existing resources
https://docs.microsoft.com/en-us/azure/machine-learning/studio/gallery-how-to-use-contribute-publish
Perform database analytics by using SQL Server R Services on Azure
https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/provision-vm
https://docs.microsoft.com/en-us/machine-learning-server/install/r-server-vm-data-science
https://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf
http://blog.revolutionanalytics.com/2017/07/xgboost-support-added-to-rattle.html
https://github.com/JohnLangford/vowpal_wabbit/wiki

I hope you have found this test preparation material helpful.  If you passed the exam, let me know by sending me a comment.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Azure Machine Learning Workbench

Microsoft released Azure Machine Learning Workbench at the Ignite conference on September 25, 2017 as a public preview.  This tool is a new tool which they are adding to their Azure ecosystem, which includes the machine learning tool they introduced three years ago, Azure Machine Learning Studio. Microsoft has said they plan on keeping both products. When asked about the two products, they said that the earlier tool, Azure Machine Learning Studio, is targeted to developers who wanted to add machine learning to their current applications, as it is an easy to use tool that doesn’t require a person to be a trained data scientist.  Azure Machine Learning Workbench is targeted to data scientists who want to bring in other libraries, like TensorFlow for Python, and delve deep into the data.

Microsoft Moves into Machine Learning Management

Microsoft is looking for Azure Machine Learning Workbench for more than a tool to use for Machine Learning analysis. It is part of a system to manage and monitor the deployment of machine learning solutions with Azure Machine Learning Model Management. The management aspects are part of the application installation.  To install the Azure Machine Learning Workbench, the application download is available only by creating an account in Microsoft’s Azure environment, where a Machine Learning Model Management resource will be created as part of the install. Within this resource, you will be directed to create a virtual environment in Azure where you will be deploying and managing Machine Learning models.

This migration into management of machine learning components is part of a pattern first seen on the on-premises version of data science functionality.  First Microsoft helped companies manage the deployment of R code with SQL Server 2016 which includes the ability to move R code into SQL Server.  Providing this capability decreased the time it took to implement a data science solution by providing a means for the code can be deployed easily without the need for the R code to be re-written or included in another application. SQL Server 2017 expanded on this idea by allowing Python code to be deployed into SQL Server as well.  With the cloud service Model Management, Microsoft is hoping to centralize the implementation so that all Machine Learning services created can be managed in one place.

Hybrid Cloud, Desktop, and Python

While you must have an Azure account to use the Machine Learning Workbench, the application is designed to run on a locally on either a Mac or Windows computer.  There is a developer edition of the tool so that one can learn the tool and not incur a bill, which is the case with the previous product, Azure Machine Learning.  The download of Machine Learning Workbench must be accessed within an Azure account and is installed to your local computer.  When running the application from your computer, the application will prompt to log into your Azure account to load Azure Machine Learning Workbench.

The application is designed to use and create Python code.  Azure Machine Learning Workbench does not contain any accommodation to incorporate machine learning components written in R, just Python.  If you have created machine learning components using R, they can be incorporated into the Azure Machine Learning Model Management if you create webservices which encapsulate the R code. The R code does not interface into Workbench, but can be made to be a part of the managed projectes in Azure. While it is possible to create a webservice for R with the earlier product with Azure Machine Learning, there is no direct way to include R with Azure Machine Learning Workbench.  There are a number of sample templates to get started using Python templates including the ubiquitous Iris dataset, Linear regression and several others.   Once the project is created, you can use your favorite IDE, it creates python code which can be read anywhere.

Staying within Machine Learning Workbench application allows you access to arguably one of the neatest parts of the Machine Learning Workbench, the data parser. This tool which was originally code-named project Pendleton and designed to be an intuitive way to modify the contents of data even better than the previous leader in parsing data, Power BI’s Power Query.

You can select the option “Derive column by example” or “Split Column by Example” and then start typing in a new column.  For example, if you want to separate a column which contains the date and the time, if you right click on that date column and select “Split Column by Example” then type the date in the new column provided, the application will immediately determine that you want two columns and crate them. The date column and a time column be created for you after typing in one date.  After the sample columns have been created, you can approve the change or reject it if does not work how you want to.

Like Power Query, each change made to the data is included in the window called Steps on the right side of the application window. When you are done modifying the data, right click on the Data Preparations source icon, which in my example is called UFO Clean, to and the UI changes made to the data are used to create Python code to perform the changes. The generated Python code can be used to the source data programmatically.

The next step in the process is to write the python code needed to evaluate the data and create a model which would in my case determine where and when you are most likely to see the next UFO based on the dataset I have included in my project.  Unlike it’s counterpart Azure Machine Learning, you will need to know how to write the necessary code needed to create a machine learning analysis in Python for Azure Machine Learning Workbench. One could write the Python code to create a machine learning analysis in any Python editor.  If you chose to use Azure Machine Learning, the Python library scikit-learn is installed as part of the application.  Other libraries which you may want to use, such as the common library matplot, you will need to load within Azure Machine Learning Workbench.

Web Service: How Azure Machine Learning Workbench Solutions are Deployed

To deploy a package, you will need to export the completed model serialized Python object, with the Python Module, Pickle. This will create a file with the suffix of pkl, which is the file that you will be deploying. Azure Machine Learning Workbench expects that you will be deploying via Docker containers or creating an Azure cluster.  You will need to register the Docker container in the Machine Learning Container for it to be deployed.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Analyzing Salary Data with Power BI and R – Part 2

As advanced analytical techniques become more popular, companies are looking to hire people to find answers in their data. What kind of answers?  Predicting the future, determining what factors make customers leave, what kind of products can they get good customers to buy, what conditions are related, is this a valid transaction and similar questions. What answers can be provided have everything to do with the available data. The data I chose to analyze was the salary data provided by Brent Ozar, which is publicly available here.  I started looking at the data in a previous post where I did an initial review of the data and discussed the data analysis process.

Regression Analysis – What kind of relationships can I find in the Data?

Looking at how data is related is a very important step in data analysis.  Most often various items are analyzed using some linear regression algorithms to compare one or more variables together. For this kind of analysis, generally speaking all of the data needs to be represented numerically, which means that if data instead exists as categories of items, the data will need to be transformed. For example, on the Salary Data which I analyzed and published in Power BI, the job experience and salary roe compared using the ggplot library from R, and the two different values are included on respective axis. What I was hoping to find is that there was a strong relationship between these two values.  If you look at the graph, you can see this is not the case. Interestingly enough while the line shows an upward trend, you can see a drop in salaries for those with a lot of job experience.  Those people with the most experience, above 35 years are making less money than those with less experience.  The graph also shows those who are just starting in their careers are not necessarily making the very little money. What the data shows is there is no guarantee that the longer you work the more money you make.

Data Cleansing for Analysis

Because I am looking for data trends and not anomaly detection, I normalized the survey data.  I eliminated the 100 people who did not fill in the salary amount, and cut off the high and the low.  I used the box plots generated in Power BI to serve as a guideline for the ranges to exclude.  As I was also interested in determining the difference in the responses between male and female, so I did some data substitution on some of the values as I wanted to included more records.  In 2018, the only year that this question was asked, 87.6% of the respondents were male.  I made the decision to include all of the respondents where the number of respondents was less than .22% as male so that I would have more data to evaluate. I modified all of the data in Power BI using M code.  You can take a look at all of the modifications I made to the data here in the Power BI report I created, as I am making it available for the next 30 days.

Examining the Top 5%

Recently I have had some conversations with some colleges regarding salary, and that led me to want to review what people would like to make.  Most people would like to be making the most money possible in their profession, and are not interested in moving, which is why I chose not to do much with the geographic data.  I ran a number of different machine learning algorithms on the data trying to find a definitive set of results among those who reported making the most money.  The results of those experiments were inconclusive.  While I found some items which were common among the highest earners, the results were not statistically conclusive. There are a number of conceptions that people have regarding salary, and I chose to illustrate some of them to dispel some myths surrounding data. I also grouped the salaries into groups: 95% for above 153,565, 75 for above 67,789, and the rest for the average.  These numbers were based on the values in the box chart in the top left of the Power BI report.

Salary Conclusions – Myth vs Reality

I know that I have heard that if you want to make money you need to get into management. Being a good manager is not the same skill set as being a good database professional, and there are many people who do not want to be managers.  According to the data in the survey, you can be in the top 5% of wage earners and not be a manager. How about telecommuting? What is the impact on telecommuting and the top 5%?  Well, it depends if you are looking at the much smaller female population. The majority of females in the top 5% telecommute.  Those who commute 100% of the time do very well, as well as those who spend every day at a job site.  Males report working more hours and telecommuting less than females do as well.  If you look at people who are in the average category, they do not telecommute. The average category has 25% of people who work less than 40 hours a week too. If you look at the number of items in the category by country you can determine that in many cases, like Uganda, there are not enough survey respondents to draw any conclusions about salary in locations.

After spending quite a bit of time analyzing and visualizing the data, I was unable to determine a specific set of skills which to provide a roadmap of exactly what one needs to do to be in the top 5% of the salary for a data professional.  What I can tell you is more than likely there is someone with your level of work experience and position who is doing really well, and there is no reason why by the time that the next survey comes out, you are not the person who is in the top 5 percent.  This may mean working harder at your job and perhaps changing employers as the analysis shows that is the best way to make more money.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur