Articles By Ginger Grant

Azure Synapse and Power BI

There are a number of different use cases for Azure Synapse, which no doubt you have already read about. Of course you can use Azure Synapse to build a scalable Data warehouse, use it to create machine learning solutions with Spark, but have you thought of why you might want to use it with Power BI? There are two ways which you can integrate Azure Synapse with Power BI: connecting to a Power BI Workspace from within Azure Synapse or using Azure synapse to provide the source of all of the data used in the Power BI data set. We will explore both methods in this blog.

Linking to Power BI from within Azure Synapse

From within Synapse you have the ability to access a Power BI workspace so that you can use Power BI from within Synapse.  Your Power BI tenant can be in a different data center than the Azure Synapse Workspace, but they both must be in the same Power BI Tenant.  You can use Power BI to look at any data you wish, as the data you use can be from any location. When this blog was written, it was only possible to connect to one Power BI workspace from within Azure Synapse. In order to run Power BI as shown here, first I needed to create a Linked Service from within Synapse. Select Power BI from the list of options, and then select the workspace from the list available for your tenant and create the linked service.  After the linked service is created, you can go to develop in Azure Synapse, and see the Power BI option listed on the left side of the screen.

Connecting Power BI to an Azure Data Lake Gen 2

 

 

 

 

 

 

 

 

 

 

 

 

 

As much of the functionality of Analysis services is being migrated to Power BI, many people are choosing to use Power BI instead of migrating to Analysis services, especially if you have Power BI Premium.   Another option you have is to gather the data that you are going to load into Power BI into an Azure data lake Gen 2 and create a “Data Lakehouse” as the container where you gather and clean up the data you are going to use in Power BI.  There are a few things that you will want to keep in mind if you want to configure Azure Synapse to do this.  The first is that your Power BI tenant and your Azure Data Lake Gen 2 must be in the same location.  Right now this is an issue as Azure Synapse is not available in all data centers.  To configure Power BI to use Azure, you need to go to the Admin section as shown above and select the subscription, resource group and storage account.  You will want to make sure that you have the proper permissions granted in Azure RBAC for user groups who need to access the data as they need storage blob owner permissions.  The Power BI services also need reader and data access roles granted as well.  You can use Azure Synapse to gather the data and then visualize it with Power BI.  I see a lot of use cases for this model as I suspect more people are going to be moving away from creating a database for Power BI and instead moving the datastore to Azure.

 

Regards,

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Exploring Azure Synapse

In the past few months, I have been examining Azure Synapse and what it can do.  When it was first released in November of 2019, the first functionality that was rolled out was an update of Azure SQL DW.  For this reason, many people think that Synapse is just an improved version of a cloud data warehouse.  Microsoft did improve SQL DW when it moved it to Synapse.  The biggest architectural design change is the separation of the code from the compute, a theme with many web projects, which allows the compute power to be increased when need dictates and scaled down when computing needs change.  Within Synapse, resources are allocated as Pools and you can define a sql pools to run data warehouse and later change the compute to a different resource.  You will still need to partition your DW as large datasets require partitioning to perform well.  Subsequently Microsoft Released the Azure Synapse Studio to be a container for a larger environment of tools and notebooks to interact with them.

Non-Data Warehouse Elements of Azure Synapse

To me the more interesting parts about Azure Synapse have nothing to do with data warehouses.  Azure Synapse also contains the ability to query files stored in Azure Data Lake Gen 2 as if they were SQL files. This is a great way to analyze large data without first cleaning it up and putting it into a relational environment. Within Synapse you can formulate a query using syntax for selecting parts of files, providing the ability to look at many files as if they were one. You can also create processes which bring data into your synapse environment using Orchestration, a process that people who are familiar with Azure Data Factory will find very familiar. Synapse also contains the ability to analyze data in Cosmos DB without doing ETL or moving the data at all using a scalable architecture which will not impact the transactions being processed simultaneously on the same Cosmos DB.

Azure Synapse and Spark

Check out the promotional video here

By far the most interesting component of Azure Synapse is the Spark connection. Microsoft has added the ability to create Spark Pools into Azure Synapse.  To be honest I was somewhat surprised that this functionality is included here first and not in Azure Machine Learning, where to use Spark you need to access clusters created them in Databricks.  Spark provides the ability to dynamically scale resources when running processes.  This is very handy when writing machine learning code which can really use the performance improvements Spark brings.  Because this is Microsoft’s Spark, you can also write your code to access it in .Net if you like, in addition to the more common Spark Languages, Scala, R or Python.  You can also incorporate the AutoML API created for Azure Machine learning in R and Python so that you can use the power of Azure to select your algorithm and hyperparameters instead of spending time doing it yourself.

Getting up to Speed with Synapse

There is a lot to learn when it comes to Synapse as it combines a lot of different components into one environment. As more and more data is being migrated to the cloud, it is uniquely designed to handle both big data components containing raw data, managed data lakes as well as more traditional data warehouse needs.  It can also be the location where all of the data is processed, secured, cleaned and analyzed using Machine Learning. There is a lot to cover and since it is new, there is not a lot of places yet where you can learn more about it.  If you are interested in a deep dive on Azure Synapse and how to use it in a Modern Data Warehouse, sign up for my precon at PASS Summit 2020 where I will cover the topic in depth.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Configuring Databricks for Koalas

koala getting hlep from a Firefighter

The Apache Spark open source organization maintains all of the documentation for Apache Spark, which is a set of APIs which are used in Databricks and other big data processing applications.  The documentation provides the detailed information about the libraries, but the instructions for loading libraries in Databricks are not exactly the same as are used in Databricks, so if you follow the Spark installation instructions, you will get nowhere. If you follow the steps listed you will be up and running in no time.

Installing Options – Cluster or Notebook ?

If you are not using a ML workspace you can add in using dbutils like this.
dbutils.library.installPyPI("koalas")
dbutils.library.restartPython()

Unfortunately if you are using an ML workspace, this will not work and you will get the error message org.apache.spark.SparkException: Library utilities are not available on Databricks Runtime for Machine Learning. The Koalas github documentation  says “In the future, we will package Koalas out-of-the-box in both the regular Databricks Runtime and Databricks Runtime for Machine Learning”.  What this means is if you want to use it now

Most of the time I want to install on the whole cluster as I segment libraries by cluster.  This way if I want those libraries I just connect to the cluster that has them. Now the easiest way to install a library is to open up a running Databricks cluster (start it if it is not running) then go to the Libraries tab at the top of the screen. My cluster is called Yucca, and you can see that it is running because the circle next to the name is green.

After you are on the Libraries table you will see two buttons.  Click on the one labeled Install New.  A window will appear.  Select the library source of PYPI and in the Package text box enter the word koalas.  Then click on the install button.

Install Databricks LibraryThe installation may take a few minutes.  When it is complete you will see a green  status circle and the word installed.

After this you are ready to use the new library, once you import it as shown here.

 

Why do I want to install Koalas in Databricks?

If you have written Python code for Machine Learning, chances are you are using Pandas. Pandas dataframes are practically the standard for manipulating the data in Python.  They are not however part of the Spark API.  While you can move your Python code over to Databricks without making any changes to it, that is not advisable.  Databricks is able to scale pandas, so adding more resources to your code may not improve the performance.  When writing Python code for Databricks you need to use the Spark APIs in order to ensure that your code can scale and will perform optimally.   Prior to April of 2019, that meant that you had to use Spark dataframes and not pandas dataframes, which could involve a bit of rework when porting code as much code was written in pandas.  In April of last year Koalas was added to Spark, meaning that changing code to use a pandas dataframe to a koalas dataframe means that you only have to change one word. Koalas contains all of the functionalities of a pandas dataframe, so if you are familiar with one you can use the other.

More About Koalas

koala getting hlep from a Firefighter

Photo Credit:New York Fox 5

It is impossible for me to load the library without thinking about the Australian Bush Fires which are burning the homes of people and Koalas.  If your finances allow it, please consider donating to the firefighters as I am sure they can use help to save the homes of people and animals in Australia.

 

Regards,

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Introduction to Databricks

As I have been doing some work on Databricks, I thought that it would make sense that I start writing about it. Databricks is a scalable environment used to run R, Python and Scala code in the cloud. It currently can be run in either AWS or Microsoft’s Azure Cloud. For those of you who are budget-minded when it comes to learning new tools, there is also a free tier, which is available here Community.cloud.databricks.com . It has somewhat limited compute capacity, but if you are just starting out you might find it helpful.

Backstory

Databricks is an implementation of Apache Spark, part of the Hadoop collective and was a replacement for Map Reduce.  Many of the people who worked on that open source project were students at Berkeley, where Apache Spark was created and added to Apache in 2013.  Like many development projects, after it was completed, they had some ideas on how to improve the code.  This time they decided to not make it open source but make it a commercial product so they could make some money for their development efforts. In April of 2017 Databricks was released on AWS and in March 2018 it was released in Azure.

Creating an Azure Databricks Service

Creating a Databricks Service is very straight-forward.  There are only a few things that you need to complete when creating a new Databricks instance. The location becomes very important if you are looking at higher level performing instances which may not be available in all locations. If you are just getting started don’t worry about high level hosting services as you most likely will not need them and most of the compute options are available in most data centers.  As always in Azure you want to make sure that you are hosting your Databricks service in the same location as your data so you will not need to pay to transfer data between data centers.

The Pricing Tier contains three options: Standard, Premium and Trial(Premium 14 Days).  The trial is pretty self-explanatory and is a great way to get started using Databricks. They are of course a few differences between Standard and Premium. Premium has extra features needed for teams including Role-Based rights for the components of Databricks.  And if you want ODBC authentication and Audit logs you will need to use Premium.  For more information on the cost of Databricks pricing tiers, check out Microsoft’s pricing link for more information.

Once you have an instance created, you can start using Databricks.  The application is contained within a managed instance, so once you launch Databricks you will be in their environment, which looks the same as the free edition.

 

Clusters, Notebooks and Data

These three components are the most important parts of Databricks as they include the compute power, where you write code and the information you work with respectively. These components are all separated in Databricks to improve scaling and provide a familiar environment to create and run code.

Cluster

The most important Databrick element, as it contains the compute.  This is also the part of Databricks which will greatly increase your bill as the more resources you use to run code the more money you need to run it.  One nice thing is clusters by default will terminate in 120 minutes of inactivity.  I generally drop this to 20 minutes.  If I am using it naturally it will not terminate, but if I am not using it, I want the charges to stop. You can also automatically spin up clusters to run jobs, so that they will only be in use when the job needs them. More about that in another post.

Notebooks

Databricks Notebook Import

Databricks Notebook Import

There are 3 supported languages in Databricks, R, Scala and Python, and within Databricks all of these languages are written in Notebooks. You don’t have to write your code in the environment.  You can write it locally and then import it. However, if you want to export your Notebook then run it locally, it gets trickier. Natively all of the Notebooks in Databricks are saved as .dbc files.  You can’t read them from anywhere else.  Fortunately there is a workaround to format the Notebook files as .ipynb files which can be read by any notebook.  Dave Wentzel from Microsoft has an elegant solution to convert .dbc to .ipynb which he includes in his blog here.

Data

You have a lot of options with data.  You can import a dataset into your environment to play with or you can connect to just about anything you can think of.  When you start doing data connections is when you stop using the community edition as you will want to use the Azure version to this to connect to various data resources like Azure SQL and blob storage.  More on how to that later in an upcoming post.

If you are interested in hearing more about Databricks and are in Chicago, I am teaching an all day class as part of SQL Saturday Chicago and would love to have you attend. More information on that class is here.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Upgrading a data model from Power BI to Tabular got a lot harder

If you were looking to upgrade a Power BI model to Analysis Services Tabular, hopefully you decided to do it before March 1, 2019, that was the day Microsoft decided to remove the tool from Azure. I know I needed to upgrade after that and I was really sorely disappointed. I also blew my schedule as the upgrade process went from a 20 minute process to something that took over 2 hours. I am writing this blog to save you some time as I learned a few tricks which I wish I would have known about before upgrading.

Preparing your Power BI model to Upgrade to Tabular

Unless you are upgrading to analysis services on SQL Server 2019, chances are you are going to have to review your DAX code and make some modifications as DAX on the other versions of SQL Server are not the same as Power BI. I was upgrading to AS on SQL Server 2016, there were some commands that I had to manual edit out of the JSON file. If you have any new DAX commands, take them out of your Power BI Model which means you will not have to manually edit the JSON file to remove them when the new commands are flagged as errors. Make sure your Power BI Model does not include commands such as SELECTEDVALUE, GENERATESERIES as well as all of the automatically generated date hierarchies. After your Power BI desktop file is clean, leave it running as you are going to need to have it running for the next step.

Connect to your Power BI Model in Analysis Services

Power BI is running a version of Analysis Services tabular and once you figure out what version is running; you can connect to it.  Open up the Task Manager and take a look at the Power BI Desktop in the Processes tab by clicking on the arrow at the right-hand side so that you see all of the processes running.  You see by the picture here, that there are 19 processes running with one Power BI file loaded. Right click on the instance of Microsoft SQL Server Analysis Services running within Power BI Desktop and select Go to Details. This command will bring you to the Details table in Task Manager and will highlight which version and PID to connect in Analysis Services on your computer. You will notice my PID is 3992.  Using this number I went to the command prompt and typed

Netstat -anop TCP | findstr 3392. 

This command returns the address I need to access Power BI model from Analysis Services, 127.0.0.1:51328

That’s it. Now I can look at my Power BI model in SQL Sever Analysis Services from within Management Studio.  It looks like this

TL;DR This model is not usable and you will have to do some tweaks to it to make it work.  The first step is to fix the connections, then edit some JSON. Right click on the connection and select properties. Click on the ellipse button next to the connection string, and you will get this error message. 

Once you click ok, you can edit the connection to something that will work.

All of my Power BI tables were connecting to views in SQL Server, which is a best practice as this allows you to select only the fields you need and provides you with the ability to sort your data as it is read in, which can improve the compression.  I changed the Provider to SQL Server, and fixed the connection.  Once the connections are changed, you will need to create an XMLA script and edit it. Right click on the Database with the GUID name and select Script->Script Database As->Create To-> File.  Give the file a name as you will be using it after connecting to your analysis Services instance. 

Editing the XMLA File

Open up the XMLA file you created in Analysis Sr.  The first thing you will need to fix is the compatibility level.  Power BI set the value to 1465. 1400 is for Azure AS, SQL Server 2017 and 2016 use 1200, so I edited the script to change the value 1200.  I also changed the database name to “SampleAS” from the GUID that it was assigned. I then executed the script, and got my first error

Executing the query …

The JSON DDL request failed with the following error: Unrecognized JSON property: variations. Check path ‘model.tables[0].columns[11].variations’, line 362, position 30..

Run complete

I typed Ctrl G and entered the line number to fix the first error which was on this line

                “formatString”: “dddd\\, mmmm d\\, yyyy”,

I deleted that line and ran it again. This time I got this error.

Executing the query …

The JSON DDL request failed with the following error: Unrecognized JSON property: variations. Check path ‘model.tables[0].columns[11].variations’, line 361, position 30..

Run complete

 

Line 361 starts with the word variations. The section is contained within square brackets [].

   “variations”: [

                  {

                    “name”: “Variation”,

                    “relationship”: “dc6b309e-0967-4eea-8e3f-7a308661dca5”,

                    “defaultHierarchy”: {

                      “table”: “LocalDateTable_c6cfa169-57d3-4da3-bcbc-21b677dde835”,

                      “hierarchy”: “Date Hierarchy”

                    },

                    “isDefault”: true

                  }

                ],

Date Hierarchies are not supported the same way in AS, so I need to edit the XMLA to remove this section. I will need to repeat this step for every auto-generated data hierarchy I have, which is why I recommend that you remove them as it will save you a lot of time. Another line which caused an error was this line.

                  “query”: “SELECT * FROM [Rates]”,

My view is not called Rates. It is called vw_PaidRates. The table in the Power BI model is called Rates. There is no table or view called rates so I needed to change Rates to the name of the view vw_PaidRates. I repeated this for every table in the XMLA file. I estimate that the editing process took over an hour to eliminate all of the code that would not work. After that I was able to create an project in visual studio using the Import From Server(Tabular) option.

Upgrade Process from Power BI to Tabular

The process was tedious, and it made me long for the 15 minute Azure AS conversion. The method I outlined here is not supported by Microsoft, so it is up to you to make it work.  I have been able to successfully convert a model, just takes a lot of time. Fixing the Power BI model up front helps, and it would also help if little of your text is formatted but there are some things, like the query fixes that you just have to do on your own. I understand that the conversion process is much easier if you are using SQL Server 2019 as that version of Analysis Services is mostly compatible with the version created by Power BI.  Unfortunately I don’t have any clients yet who are using 2019.  I hold out hope it will be easier to do next time, but I will allow for a lot more time to make the conversion happen.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

5 Tips for Writing Better SQL

More people are starting to write SQL who have a background in programming as more and more people are analyzing data as data science related jobs are exploding.  SQL is a little different than other languages and here are some common patterns which should be avoided.  Here are 5 things, in no particular order, you can do to improve your SQL.

1. Use the correct join for the job

There are a number of different joins you can use in SQL, but there are 3 common ones, Left Join, Right Join and Inner join.  When joining two tables you will get only the records which match with an inner join.  Any records which cannot be joined will be excluded.  For this reason, many people only write left or right joins which automatically return all of the rows in the left or right hand side, respectively.  These joins are always slower that Inner joins, and may not be necessary.  Evaluate your joins to see if the record count changes if you do an inner join before you do a left or right join.

2. Do not Loop through rows of data

SQL is designed to process data in sets, not one record at a time.  In fact, it does not work well processing one row at a time.  Looping through records one at a time in a cursor or a while loop is to be avoided.  Think about how you might write the code so that it processes the records at one time.

3. Avoid Nesting SQL code

SQL is not an object-oriented language.  This means the performance and readability does not improve when views, stored Procedures Or cursors are called from other objects. Try to do all of the work in one place.  For more information on nested views, check out my post on the topic.

4. Do not use column names which include spaces or reserved words

While you can get away with using spaces or reserved words by putting the name of the column in brackets, most people are not used to writing code this way and it is inevitable that someone will leave off the square brackets and the code will not work. Have pity on those who will be supporting your code and do not include spaces.  If you are unsure of what the reserved words are, see if they change color in the editor.

5.Use Sets in SQL and Avoid Temp tables

Someone who attended I taught a class recently told me that they didn’t think that it was possible to write a stored procedure without a temp table because all of the stored procedures at her work had them.  A common pattern which I see is to processing a bunch of records and putting them in a temp table, then processing them again. This pattern should be avoided. You can process records in one query. Look at writing more complex queries which look at all of the data in one query.

Resolving Common SQL Pitfalls for People who Develop Lots of Code

Sometimes all it takes to figure out how to write SQL using sets is to have someone show you how.  If you are in Phoenix, I can show you how on February 5th when I am doing a talk at Galvanize on this topic. I would be happy to show you some tips and tricks which would help gather and analyze data for an experiment.  Not in the area or can’t make it?  Check out my class at DataCamp.  You can try the class for free to see what you think. I look forward to hearing your feedback.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Reviewing Azure Databricks and Data Lake Analytics

Databricks is a recent addition to Azure that is greatly influencing the technology choices that people are making when determining how to process data.  Prior to the introduction of Databricks to Azure in March of 2018, if you had a lot of unstructured data which was stored in HDFS clusters, and wanted to analyze it in a scalable fashion, the choice was Data Lake and using USQL with Data Lake Analytics.  With the introduction of Databricks, there is now a choice for analysis between Data Lake Analytics and Databricks for analyzing data.

Analyzing Data with Data Lake Analytics

Data Lake Analytics offers many of the same features as Databricks.  You can write code to analyze data and the analysis can be automatically parallelized to scale.  Microsoft has released a new version of Data Lake, which they are calling Data Lake Storage Gen2 to improve the performance of analysis performed with Data Lakes.  The difference, between the old version and the new one, is the hierarchical namespace to Azure Blob Storage which provides an indexing capability which means that operations can be performed on a directory rather than enumerating through all of the data.  Data stored within a Data Lake can be accessed just like HDFS and Microsoft has provided a new driver for accessing data in a Data Lake which can be used with SQL Data Warehouse, HDinsight and Databricks.  With Data Lake Analytics, the data analysis is designed to be performed in U-SQL. While it supports R and Python libraries, users of the technology will need to get up to speed on U-SQL which is a lot like C#.  This knowledge needs to be learned. Since U-SQL is so new, only a few years old, there is not a large number of people who are familiar with it.

Analyzing Data with Databricks

When analyzing data with Databricks, there are three different languages which you can use: R, Scala, and Python.  Data can be read in from a variety of different Azure Storage options, including Blob Storage, Data Lake, and by using a JDBC connection. You can also connect to Azure SQL DB, as well as Azure SQL Data Warehouse. Since there are three different languages which can be used, there is no reason to learn a new language as most people are already very familiar with at least one of the three supported languages.

In addition to the ability to develop code, Databricks offers some other features which are not found in Data Lake Analytics.  Many projects anticipate that people are going to be working in teams and will need to have an environment to share code and version it.  This capability is baked into Azure Databricks as it provides an environment for sharing data with others and natively saving the data to a GitHub repository.  The development environment is Jupyter Notebooks which provides a great way to document the code and include data samples, all at the same time.  Databricks also includes a job schedule component so that work created in Databricks can use a native scheduler which has the ability to retry and send configurable messages on error or completion.  These additional features, plus the ability to code in a language which is already widely used in the industry, give Databricks the edge in determining which technology to use going forward.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Fallibility of Machine Learning: Why Lenovo will not sell me a laptop

Machine Learning is being adopted by more and more companies to assist in the sales process.  Like all technology, Machine Learning is not correct all of the time.  In fact, models with 75% accuracy are commonly accepted as good models and implemented for production.  This means 25% of the time the model is wrong, meaning that the algorithm will incorrectly flag 25% of the business. What happens to transactions where an algorithm has incorrectly determined the transaction is not viable? Legitimate business is turned away.  This happened to me when I tried to do business with Lenovo.

Lenovo’s Ordering System: No Laptop For You!

Buying a laptop from Lenovo reminded me of an episode of Seinfeld when Elaine was trying to buy soup.  For some unknown reason, when I placed an order on their website and gave them my money, Lenovo gave me a Cancellation Notice, the email equivalent of “No Soup for you!”  After placing an order, about 15 minutes later, I received a cancellation notice.  I called customer service.  They looked at the order and advised me the system incorrectly cancelled the order.  I was told to place the order again as they had resolved the problem.  I created a new order, and just like the last time, I received the No Laptop for You cancellation email.  I called back. This time I was told that the system thinks I am a fraud. Now I have no laptop and I have been insulted. I asked if the system could be overridden because I was not a fraud.  Customer service verified my method of payment and told me that were going to assign a case number to it as that would ensure the transaction would go through, and they would get credit for the order as they were going to place it.  Apparently, customer service has some kind of financial incentive for placing sales. That did not work either as, once I again I received the No Laptop for You cancellation email.  Not only did I not get a laptop, the person I spoke to also lost out as he was not going to get a credit for the sale.  I called back again and this time they told me that they had no idea what was wrong with the system but it had flagged me as a fraud and a case number did not get assigned last time as it was supposed to, which was the reason that that order was canceled, again.  They placed the order again and once again I received the No Laptop for You cancellation email. Every attempt at buying a laptop had failed. I had struck out with customer service as had received advice 3 times and every time I got a  No Laptop for You cancellation email. At this point I tried getting the situation resolved via social media. Publicly Lenovo said they wanted to help, and sent me one direct message letting me know they would fix the system, and that was the last I ever heard from them.  By not sending me another email, the message they sent me instead was No Laptop For You!

Relying on Machine Learning can Cost businesses Sales and leave them wondering about Toilet Seats

I tried to give Lenovo nearly $2000 and they refused to take my money.  How many other transactions are they ignoring?  Over 500? That does not seem like a terribly high number.  500 transactions for $2000 a piece would mean Lenovo’s sales are needlessly down 1 million dollars because they implemented a system which turns away sales and actively prevents sales despite the best intentions of their employees to close a sale.  Blindly relying on the accuracy of a computer program to determine with 100% accuracy whether or not a transaction is viable or not is not just a bad idea but is a bad business decision which can cost millions of dollars in sales.  While you may not have been rejected to buy a product, most people I know have seen lists of recommended products on websites which do not reflect things you want to purchase.  A friend of mine who was remodeling a bathroom, bought a toilet seat on Amazon. When he logged in again, he continued to see a myriad selection of toilet seat product recommendations for the next six months at the exclusion of other products he might actually want to buy.  Apparently, the machine learning algorithm determined that because he bought one toilet seat, he was a Toilet Seat Connoisseur and wanted to decorate his house with a variety of rare and unusual of toilet seats for the next six months.

Combining Machine Learning with People

I create machine learning solutions for clients and provide training sessions to help people learn how to write machine learning models. I understand the process and the steps which are used to create a machine learning experiment. First you gather and clean the data, then train it using a set of algorithms against a set of data, and then you create a model.  The problem “Should I cancel this sale” is has two possible answers, yes or no, meaning it is a binary classification for anomaly detection. Never have I created a model which was 100% accurate as that is not possible. I tell clients that is not possible and help them implement solutions to handle conditions when the model is wrong. Machine Learning needs to work in concert with people who have the ability to resolve problems which are flagged by the system, as there is a place for people in all automated systems.

Most normal people would have probably given up after their order was canceled twice, but I persisted as I was amazed that such a big company like Lenovo could continue to be so wrong, and I wanted to prove I was not a fraud.  Continued failure to successfully place an order convinced me that I did not want to do business with Lenovo. If a company does not want to resolve an issue where they will receive money, how likely are they to want to resolve a situation which costs them money, such as a warranty claim? Based on my experience, I have no confidence that one could get Lenovo customer service to solve a problem as they do not have the ability, even when they are financially incentivized to do so. Machine Learning and AI may decrease the number of people needed, but when things go wrong people are needed to fix them. When a machine learning model is wrong, and this will happen, the policy should be to permit your customer service people to create successful sales.  If instead, your customer service insults and ignores customers when machine learning models go wrong, sales will go down as customers will be going to competitors.

I researched laptops as I was interested in having a lightweight powerful laptop which I could haul through various airports to use at clients and conferences, like Live 360 SQL Server  where I will be speaking on December 3. Fortunately there are other companies who have determined they do not need to create some kind of machine learning score to sell a laptop, they just sell laptops to people who go to their website and give them money with no problems.  Using the same address and credit card information which Lenovo flagged as fraudulent, I bought my new HP laptop, which I will be happy to demonstrate next time you see me at a conference or class.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Analysis Services Tabular Best Practices – Part 3

This is my last installment in my series on improving Analysis Services Tabular. If you want to start at the beginning, please click here. I started writing this because I wanted one place to point clients to improve the performance and scalability of Tabular Models, and there are a lot of places you need to go to find all of this information.  I hope you find it helpful.  Let me know if you have any questions or comments regarding this post I hope you leave some comments or perhaps ask me a question on twitter.  With that here are more things that you need to do to improve your Analysis Services Tabular implementations.

Optimize your DAX Code

While it is not easy to performance tune DAX you can do it, by evaluating the DAX Query Plan and VeritPaq Queries, and SQLBI’s VertiPaq Analyzer. Also, you can also look to use functions which perform better, for example COUNTROWS instead of DISTINCTCOUNT or ADDCOLUMNS instead of SUMMARIZE. Whenever possible use the CALCULATE function instead of the FILTER function, as CALCULATE filters for context inside the parenthesis and are more efficient. Also all of the iterative functions SUMX, COUNTX etc., should be used sparingly as the row-by-row transactions they create are less efficient and should be used only when SUM or COUNT will not work.  When evaluating if a value missing, if it is possible, use ISEMPTY instead of ISBLANK as ISEMPTY looks only for the presence of a row, which is faster than the evaluation performed by ISBLANK.

Sort by Settings

There are times when the sort needs to be specified as something other than alphabetic.  A common place where this is done is for months, as people want to see January appearing first rather than April.  April will appear by default as the dates are sorted alphabetically.  To change this, modify the Month Name column property Sort by Column to sort by the Month number field so the months will always appear with January first.  This process should be repeated for any column where the alphabetic sort is not likely what people want to see, such as Day of Week.

Formatting Fields

All values should be formatted in a way that will look good on reports. Most of the time this formatting has been included, but there are some instances where the general format is used instead. Most users want numbers which include a thousands separator.  In analysis services tabular, this is accomplished by using a custom format and the format style #,##0.  This will add the thousands separator when needed and will never show a decimal number.

Hiding values

All values not used in a report should be hidden.  This will make things less confusing for people who are writing reports as the values that appear are ones which they actually use. For purposes of joining tables, many times you need to bring in key values to create relationships, but they are not values which should be reported upon as they have no business meaning.

Business Friendly Column and Table Names

When creating Power BI reports, the column names in the table are the names used in the reports.  As a best practice, the names in the tabular model should be business friendly names.   All of the column names visible in the model should be names which can be readily understood by anyone reading the report.  Names should include special characters like spaces and should be appropriately capitalized in the normal function names appear.  For example WK_NUM would appear as something like “Week Number” would be understandable to the user and look appropriate on reports. Using views can also provide the ability to change the names to user readable names outside of the model, which is also considered a best practice as then the model does not need to be changed to have user specified names.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Analysis Services Tabular Best Practices – Part 2

I was mentoring a client recently trying to help them with some tips for using tabular. I couldn’t find all of the information I was looking for, so I decided to write all of them down so that next time I could just go to my blog for the information.  There are a lot of things to consider so I broke them up into 3 parts.  To make sure you get all of them, please subscribe to my blog.  And now, here are some more tips for tabular.

Marking all Date tables as Date

Tabular models include a number of Time Intelligence functions which provide the ability to quickly and easily calculate things such as Rolling 12 months, Year-to-date calculation, and more. In order for this functionality to be possible, the table needs to be marked as date.  In order for this feature to be available, there needs to be one contiguous date field in the table, meaning there should be no gaps in the date field. If there is one default missing date field which is many years from the start of the contiguous date values, the field needs to be removed or all of the dates after that date need to be added ensuring there are no gaps in the dates. You can choose to join on the date value or another key.

Modify Timestamps to Split Date and Time

When there is a field where the date and time are both needed, the values should be separated so that there is both a date field and a time field.   Having date time in two fields assists in the dictionary encoding as the date and time fields can be separately sorted into columns where the values are the same, decreasing the number of dictionary entries.  To further improve compression, only include the seconds if absolutely necessary, as add decreasing the cardinality will increase compression.

Active Relationships and Modeling Role Playing Dimensions

All relationships between tables should be active relationships.  If you have any dashed lines which connect portions of your model, these are inactive relationships and should be removed, unless you are writing DAX Code to use them. DAX code must be written to use inactive relationships and users often find it confusing. As a best practice for situations where there are multiple dates in the fact table, such as Order Date, Ship Date and Due Date, is to have 3 different date tables. The other date tables can be created by using a calculated table to create a second date table and join the other value to that date table.  Step-by-step instructions for creating a calculated table are included in the Reference section1.  Having multiple date tables is a best practice design for tabular models as it is a common for there to be multiple dates in one table.

Partitioning Tabular Models

Partitioning Tabular models does not improve query performance. In fact multiple partitions can actually make performance worse.  Partitioning should be implemented to assist in the loading of the data.  If you do not need to create a partition to make a load timeframe, then there is no reason to do it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur