5 Tips for Writing Better SQL

More people are starting to write SQL who have a background in programming as more and more people are analyzing data as data science related jobs are exploding.  SQL is a little different than other languages and here are some common patterns which should be avoided.  Here are 5 things, in no particular order, you can do to improve your SQL.

1. Use the correct join for the job

There are a number of different joins you can use in SQL, but there are 3 common ones, Left Join, Right Join and Inner join.  When joining two tables you will get only the records which match with an inner join.  Any records which cannot be joined will be excluded.  For this reason, many people only write left or right joins which automatically return all of the rows in the left or right hand side, respectively.  These joins are always slower that Inner joins, and may not be necessary.  Evaluate your joins to see if the record count changes if you do an inner join before you do a left or right join.

2. Do not Loop through rows of data

SQL is designed to process data in sets, not one record at a time.  In fact, it does not work well processing one row at a time.  Looping through records one at a time in a cursor or a while loop is to be avoided.  Think about how you might write the code so that it processes the records at one time.

3. Avoid Nesting SQL code

SQL is not an object-oriented language.  This means the performance and readability does not improve when views, stored Procedures Or cursors are called from other objects. Try to do all of the work in one place.  For more information on nested views, check out my post on the topic.

4. Do not use column names which include spaces or reserved words

While you can get away with using spaces or reserved words by putting the name of the column in brackets, most people are not used to writing code this way and it is inevitable that someone will leave off the square brackets and the code will not work. Have pity on those who will be supporting your code and do not include spaces.  If you are unsure of what the reserved words are, see if they change color in the editor.

5.Use Sets in SQL and Avoid Temp tables

Someone who attended I taught a class recently told me that they didn’t think that it was possible to write a stored procedure without a temp table because all of the stored procedures at her work had them.  A common pattern which I see is to processing a bunch of records and putting them in a temp table, then processing them again. This pattern should be avoided. You can process records in one query. Look at writing more complex queries which look at all of the data in one query.

Resolving Common SQL Pitfalls for People who Develop Lots of Code

Sometimes all it takes to figure out how to write SQL using sets is to have someone show you how.  If you are in Phoenix, I can show you how on February 5th when I am doing a talk at Galvanize on this topic. I would be happy to show you some tips and tricks which would help gather and analyze data for an experiment.  Not in the area or can’t make it?  Check out my class at DataCamp.  You can try the class for free to see what you think. I look forward to hearing your feedback.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Reviewing Azure Databricks and Data Lake Analytics

Databricks is a recent addition to Azure that is greatly influencing the technology choices that people are making when determining how to process data.  Prior to the introduction of Databricks to Azure in March of 2018, if you had a lot of unstructured data which was stored in HDFS clusters, and wanted to analyze it in a scalable fashion, the choice was Data Lake and using USQL with Data Lake Analytics.  With the introduction of Databricks, there is now a choice for analysis between Data Lake Analytics and Databricks for analyzing data.

Analyzing Data with Data Lake Analytics

Data Lake Analytics offers many of the same features as Databricks.  You can write code to analyze data and the analysis can be automatically parallelized to scale.  Microsoft has released a new version of Data Lake, which they are calling Data Lake Storage Gen2 to improve the performance of analysis performed with Data Lakes.  The difference, between the old version and the new one, is the hierarchical namespace to Azure Blob Storage which provides an indexing capability which means that operations can be performed on a directory rather than enumerating through all of the data.  Data stored within a Data Lake can be accessed just like HDFS and Microsoft has provided a new driver for accessing data in a Data Lake which can be used with SQL Data Warehouse, HDinsight and Databricks.  With Data Lake Analytics, the data analysis is designed to be performed in U-SQL. While it supports R and Python libraries, users of the technology will need to get up to speed on U-SQL which is a lot like C#.  This knowledge needs to be learned. Since U-SQL is so new, only a few years old, there is not a large number of people who are familiar with it.

Analyzing Data with Databricks

When analyzing data with Databricks, there are three different languages which you can use: R, Scala, and Python.  Data can be read in from a variety of different Azure Storage options, including Blob Storage, Data Lake, and by using a JDBC connection. You can also connect to Azure SQL DB, as well as Azure SQL Data Warehouse. Since there are three different languages which can be used, there is no reason to learn a new language as most people are already very familiar with at least one of the three supported languages.

In addition to the ability to develop code, Databricks offers some other features which are not found in Data Lake Analytics.  Many projects anticipate that people are going to be working in teams and will need to have an environment to share code and version it.  This capability is baked into Azure Databricks as it provides an environment for sharing data with others and natively saving the data to a GitHub repository.  The development environment is Jupyter Notebooks which provides a great way to document the code and include data samples, all at the same time.  Databricks also includes a job schedule component so that work created in Databricks can use a native scheduler which has the ability to retry and send configurable messages on error or completion.  These additional features, plus the ability to code in a language which is already widely used in the industry, give Databricks the edge in determining which technology to use going forward.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Fallibility of Machine Learning: Why Lenovo will not sell me a laptop

Machine Learning is being adopted by more and more companies to assist in the sales process. Like all technology, Machine Learning is not correct all of the time. In fact, models with 75% accuracy are commonly accepted as good models and implemented for production. This means 25% of the time the model is wrong, meaning that the algorithm will incorrectly flag 25% of the business. What happens to transactions where an algorithm has incorrectly determined the transaction is not viable? Legitimate business is turned away. This happened to me when I tried to do business with Lenovo. If you are someone who is genuinely interested in how technology has adapted, then you may want to check out this quick guide on some tech tips that might be able to help you with any issues that crop up, unfortunately for me, Lenovo was my issue and I needed their help.

Lenovo’s Ordering System: No Laptop For You!

Buying a laptop from Lenovo reminded me of an episode of Seinfeld when Elaine was trying to buy soup. For some unknown reason, when I placed an order on their website and gave them my money, Lenovo gave me a Cancellation Notice, the email equivalent of “No Soup for you!” After placing an order, about 15 minutes later, I received a cancellation notice. I called customer service. They looked at the order and advised me the system incorrectly cancelled the order. I was told to place the order again as they had resolved the problem. I created a new order, and just like the last time, I received the No Laptop for You cancellation email. I called back. This time I was told that the system thinks I am a fraud. Now I have no laptop and I have been insulted. I asked if the system could be overridden because I was not a fraud. Customer service verified my method of payment and told me that were going to assign a case number to it as that would ensure the transaction would go through, and they would get credit for the order as they were going to place it. Apparently, customer service has some kind of financial incentive for placing sales. That did not work either as, once I again I received the No Laptop for You cancellation email. Not only did I not get a laptop, the person I spoke to also lost out as he was not going to get a credit for the sale. I called back again and this time they told me that they had no idea what was wrong with the system but it had flagged me as a fraud and a case number did not get assigned last time as it was supposed to, which was the reason that that order was canceled, again. They placed the order again and once again I received the No Laptop for You cancellation email. Every attempt at buying a laptop had failed. I had struck out with customer service as had received advice 3 times and every time I got a No Laptop for You cancellation email. At this point I tried getting the situation resolved via social media. Publicly Lenovo said they wanted to help, and sent me one direct message letting me know they would fix the system, and that was the last I ever heard from them. By not sending me another email, the message they sent me instead was No Laptop For You!

Relying on Machine Learning can Cost businesses Sales and leave them wondering about Toilet Seats

I tried to give Lenovo nearly $2000 and they refused to take my money. How many other transactions are they ignoring? Over 500? That does not seem like a terribly high number. Jokes apart, many companies tend to limit the sales of a particular new product. This can be a part of market research where companies tend to use tools like Conjoint Analysis (look for – What is Conjoint Analysis?) or similar tools where they can find out the actual requirements of the consumer as well as a viable price point for it. However, this might not be the case as I don’t remember the customer service person mentioning it. Also, 500 transactions for $2000 a piece is huge. That would mean Lenovo’s sales are needlessly down 1 million dollars because they implemented a system that turns away deals and actively prevents sales despite the best intentions of their employees to close a sale. As a result, Lenovo’s customers may have a negative impression of the company, which could lead to a decrease in laptop and computer sales. In this case, the company’s original and powerful brand image can be revived by implementing a marketing campaign, removing negative web content (with the help of an online reputation management firm), and executing effective branding strategies. Moreover, Lenevo may need to consider the primary cause of this problem — the over-reliance on machine learning.

Blindly relying on the accuracy of a computer program to determine with 100% accuracy whether or not a transaction is viable or not is not just a bad idea but is a bad business decision that can cost millions of dollars in sales. The company can instead train its employees to generate more sales by enrolling them in a sales training program. In the case of an already existing training program, it would make sense to find ways to improve retail sales training program to prepare employees for increased sales. With effective training and the use of machine learning, it is possible that any company like Lenevo is likely in line to get more revenue. Along with that, it seems crucial to showcase such products on the website that are available in the inventory. A story could solidify this statement. A friend of mine who was remodeling a bathroom bought a toilet seat on Amazon. When he logged in again, he continued to see a myriad selection of toilet seat product recommendations for the next six months at the exclusion of other products he might actually want to buy. Apparently, the machine learning algorithm determined that because he bought one toilet seat, he was a Toilet Seat Connoisseur and wanted to decorate his house with a variety of rare and unusual of toilet seats for the next six months.

Combining Machine Learning with People

I create machine learning solutions for clients and provide training sessions to help people learn how to write machine learning models. I understand the process and the steps which are used to create a machine learning experiment. First you gather and clean the data, then train it using a set of algorithms against a set of data, and then you create a model. The problem “Should I cancel this sale” is has two possible answers, yes or no, meaning it is a binary classification for anomaly detection. Never have I created a model which was 100% accurate as that is not possible. I tell clients that is not possible and help them implement solutions to handle conditions when the model is wrong. Machine Learning needs to work in concert with people who have the ability to resolve problems which are flagged by the system, as there is a place for people in all automated systems.

Most normal people would have probably given up after their order was canceled twice, but I persisted as I was amazed that such a big company like Lenovo could continue to be so wrong, and I wanted to prove I was not a fraud. Continued failure to successfully place an order convinced me that I did not want to do business with Lenovo. If a company does not want to resolve an issue where they will receive money, how likely are they to want to resolve a situation which costs them money, such as a warranty claim? Based on my experience, I have no confidence that one could get Lenovo customer service to solve a problem as they do not have the ability, even when they are financially incentivized to do so. Machine Learning and AI may decrease the number of people needed, but when things go wrong people are needed to fix them. When a machine learning model is wrong, and this will happen, the policy should be to permit your customer service people to create successful sales. If instead, your customer service insults and ignores customers when machine learning models go wrong, sales will go down as customers will be going to competitors.

I researched laptops as I was interested in having a lightweight powerful laptop which I could haul through various airports to use at clients and conferences, like Live 360 SQL Server where I will be speaking on December 3. Fortunately there are other companies who have determined they do not need to create some kind of machine learning score to sell a laptop, they just sell laptops to people who go to their website and give them money with no problems. Using the same address and credit card information which Lenovo flagged as fraudulent, I bought my new HP laptop, which I will be happy to demonstrate next time you see me at a conference or class.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Analysis Services Tabular Best Practices – Part 3

This is my last installment in my series on improving Analysis Services Tabular. If you want to start at the beginning, please click here. I started writing this because I wanted one place to point clients to improve the performance and scalability of Tabular Models, and there are a lot of places you need to go to find all of this information.  I hope you find it helpful.  Let me know if you have any questions or comments regarding this post I hope you leave some comments or perhaps ask me a question on twitter.  With that here are more things that you need to do to improve your Analysis Services Tabular implementations.

Optimize your DAX Code

While it is not easy to performance tune DAX you can do it, by evaluating the DAX Query Plan and VeritPaq Queries, and SQLBI’s VertiPaq Analyzer. Also, you can also look to use functions which perform better, for example COUNTROWS instead of DISTINCTCOUNT or ADDCOLUMNS instead of SUMMARIZE. Whenever possible use the CALCULATE function instead of the FILTER function, as CALCULATE filters for context inside the parenthesis and are more efficient. Also all of the iterative functions SUMX, COUNTX etc., should be used sparingly as the row-by-row transactions they create are less efficient and should be used only when SUM or COUNT will not work.  When evaluating if a value missing, if it is possible, use ISEMPTY instead of ISBLANK as ISEMPTY looks only for the presence of a row, which is faster than the evaluation performed by ISBLANK.

Sort by Settings

There are times when the sort needs to be specified as something other than alphabetic.  A common place where this is done is for months, as people want to see January appearing first rather than April.  April will appear by default as the dates are sorted alphabetically.  To change this, modify the Month Name column property Sort by Column to sort by the Month number field so the months will always appear with January first.  This process should be repeated for any column where the alphabetic sort is not likely what people want to see, such as Day of Week.

Formatting Fields

All values should be formatted in a way that will look good on reports. Most of the time this formatting has been included, but there are some instances where the general format is used instead. Most users want numbers which include a thousands separator.  In analysis services tabular, this is accomplished by using a custom format and the format style #,##0.  This will add the thousands separator when needed and will never show a decimal number.

Hiding values

All values not used in a report should be hidden.  This will make things less confusing for people who are writing reports as the values that appear are ones which they actually use. For purposes of joining tables, many times you need to bring in key values to create relationships, but they are not values which should be reported upon as they have no business meaning.

Business Friendly Column and Table Names

When creating Power BI reports, the column names in the table are the names used in the reports.  As a best practice, the names in the tabular model should be business friendly names.   All of the column names visible in the model should be names which can be readily understood by anyone reading the report.  Names should include special characters like spaces and should be appropriately capitalized in the normal function names appear.  For example WK_NUM would appear as something like “Week Number” would be understandable to the user and look appropriate on reports. Using views can also provide the ability to change the names to user readable names outside of the model, which is also considered a best practice as then the model does not need to be changed to have user specified names.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Analysis Services Tabular Best Practices – Part 2

I was mentoring a client recently trying to help them with some tips for using tabular. I couldn’t find all of the information I was looking for, so I decided to write all of them down so that next time I could just go to my blog for the information.  There are a lot of things to consider so I broke them up into 3 parts.  To make sure you get all of them, please subscribe to my blog.  And now, here are some more tips for tabular.

Marking all Date tables as Date

Tabular models include a number of Time Intelligence functions which provide the ability to quickly and easily calculate things such as Rolling 12 months, Year-to-date calculation, and more. In order for this functionality to be possible, the table needs to be marked as date.  In order for this feature to be available, there needs to be one contiguous date field in the table, meaning there should be no gaps in the date field. If there is one default missing date field which is many years from the start of the contiguous date values, the field needs to be removed or all of the dates after that date need to be added ensuring there are no gaps in the dates. You can choose to join on the date value or another key.

Modify Timestamps to Split Date and Time

When there is a field where the date and time are both needed, the values should be separated so that there is both a date field and a time field.   Having date time in two fields assists in the dictionary encoding as the date and time fields can be separately sorted into columns where the values are the same, decreasing the number of dictionary entries.  To further improve compression, only include the seconds if absolutely necessary, as add decreasing the cardinality will increase compression.

Modeling Role Playing Dimensions and Active Relationships

All relationships between tables should be active relationships.  If you have any dashed lines which connect portions of your model, these are inactive relationships and should be removed, unless you are writing DAX Code to use them. DAX code must be written to use inactive relationships and users often find it confusing. As a best practice for situations where there are multiple dates in the fact table, such as Order Date, Ship Date and Due Date, is to have 3 different date tables. The other date tables can be created by using a calculated table to create a second date table and join the other value to that date table.  Step-by-step instructions for creating a calculated table are included in the Reference section1.  Having multiple date tables is a best practice design for tabular models as it is a common for there to be multiple dates in one table.

Partitioning Tabular Models

Partitioning Tabular models does not improve query performance. In fact multiple partitions can actually make performance worse.  Partitioning should be implemented to assist in the loading of the data.  If you do not need to create a partition to make a load timeframe, then there is no reason to do it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Analysis Services Tabular Best Practices – Part 1

Recently I have been working on Analysis Services Tabular as a data source to Power BI and wanted to share some of the best practices for developing a model. Following these guidelines will improve the performance and scale of your project. There is a lot of material to cover, so I have broken up the information into this series.

Analysis Services Internal Dictionary Encoding

Understanding the internal components drives a number of design elements. The rules for Tabular are different because the underlying technology is different.  Analysis Service is a columnar based database, which means that it is most efficient on summarizing columns of data.  This methodology is incorporated in the Vertipaq Engine which is compresses the data in memory. The data is stored within a dictionary and is modified to store information optimally. Internally all of the data stored is translated into an integer value.  This means that two date fields when joined in a model will be as efficient as two integer values, as all values are internally represented as numbers.  Dictionary entries are created for each column.  The most efficient compression would be if the column contains one value as there would only need to have one entry.  If there were 4 different numbers in the column and the values were sorted on that column, there would be 4 dictionary entries.  If instead the numbers were ordered sequentially, such as 1,2,3,4,1,2,3,4, compression would not be possible and there would be an entry for each row.  When possible, data should be sorted so that like numbers are together.   Whenever possible, do not load unique values, as this data cannot be compressed.

Create tables with Underlying Queries with Views

Microsoft recommends that when getting data from a relational database, the data retrieved should be encapsulated with views.  This creates a record in the source data of the information used in by Power BI in the source database which can help ensure that the data structures are not modified in a way that will adversely impact the model.  The views can be materialized which can prevent the structure from changing without changing the view.  This provides visibility to the database administrator of the structure needed for Analysis Services.  Using views also can insulate Analysis Services from being dependent on the existing data model as the view does not have to match existing data types.  All views need to include an explicit list of columns for a deliberate inclusion of requested columns and should be sorted.

Extra columns should be eliminated from the view and the model as they will increase the size for no reason. The views should only contain columns which are actually used in either the DAX, reporting fields or to join tables together.  Do not bring in every field in the source table as it is not a best practice. It is more efficient to only bring in the columns needed rather than hide columns not used.  If the columns are not needed in DAX or for joining tables, do not include them in the model.  The views need to be written to run as efficient as possible on the source system, so be sure to tune the views as well.

Data Type Selection

The data type selected will impact the physical storage used, not the compression of the models in memory.  It is important whenever possible to reduce the cardinality of the data in order to be able to sort the data effectively.  When storing decimal numbers, unless you need many significant digits, store the data as Currency as it will take less space in physical storage than decimal.

Following these rules will help your model be better interpreted for dictionary encoding and will improve the compression of your data meaning it will load faster.  In the next part of this series I will discuss other things you can do to improve usability and structure of the data model. Let me know if you find this helpful by adding a comment or subscribing to this blog.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Using Power BI without an Power BI Account or Active Directory Entries

Believe it or not there is a way to use Power BI without signing up or having an account.  With an embedded application, you can use one Power BI ID and have all users use this account to log in. Now before you judge this is concept as a horrible security idea, keep on reading as you may change your mind. In this embedded application all users to login using a unique id and password and restrict the data seen in Power BI data based upon that id. How do you accomplish this seemingly incongruent task and why would you ever do such a thing? Well it turns out there is a logical reason for wanting to implement a Power BI application this way.

Creating a way to Securely Access Power BI Data for Customers

There are many companies which would like to provide Power BI reports which would allow customers to interactively work with their data, but they don’t want to create Power BI accounts for customers as that can be a lot of work from an administrative standpoint.  For the same reason, these customers are not added to the corporate network which means they are not added Active Directory.  For example, if Desert Isle SQL contracts with Acme Corporation to create a custom conference display, Acme might want to show me a report showing when the components were purchased, when they were modified and when the order is in process and when the order is completed.  How do I show a Power BI report containing information? From an application design perspective data from all of the customers should be stored in the same place and Desert Isle SQL should only see their orders when logging in to Acme’s site.  Here is the workflow that I want to implement.

 

Passing Login information to Power BI

When creating an embedded application, connecting requires a connection string.  It is possible to pass additional information to the connection string buy modifying the gateway to use effective identity and then pass the role information you want to use.  There are two configuration steps you need to complete to make this work.  The Power BI gateway needs to be configured to use CustomData through the Map User name screen. Also SSAS needs to be configured to use Roles as the role will restrict the data that users can access. The CustomData can contain a comma delimited list of values, which can include the data I need to have to access the role.  In the DAC for the role, the CUSTOMDATA field as performs as if it was a table. The DAX in the role would provide permissions based on the value of that table DimTerritory[TerritoryName] = IF(CUSTOMDATA() = “username”.“territory” .  This will restrict the data that a customer can see based on the territory they have assigned.  The Id can then be used to implement Row level security in Power BI with either the embedded data model or with Analysis Services Tabular. By using this method, you have the ability to restrict the data for each user and use one Power BI account all at the same time.

 

Costs for Implementation Multi-User Power BI Systems

Unfortunately, this solution means that you are going to be purchasing the embedded version of Power BI as this functionality is not covered with a Pro License. Embedded applications require that you purchase an embedded license or have a premium account.  The pricing for embedded has changed quite a bit from 2017 when it was introduced.  Pricing is all about capacity, unless you use a Premium account.

Power BI can be implemented in a number of different ways, and this implementation is one that you may see more of in the future.  There are a lot of different things that you can implement Power BI and it is hard to keep up with all of the changes. If you are interested in learning more about some of Advanced Power BI topics, join me in person in Boston for a full day of Advanced Power BI Training on Friday, September 21.  I look forward to meeting you here or anywhere else we might meet up.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Many to Many and Composite Modeling Power BI

The July 2018 release of Power BI includes a lot of neat new features, like wallpaper and of course getting rid of the big formatting frame on Power BI Visuals, but Composite Modeling really stands out. Have you ever created a direct connection to a tabular model and then wanted to add that one Excel spreadsheet for another sales group or needed to add some economic web data? I have. If I was lucky I was able to get it added to the tabular model.  Otherwise I had to give up on the idea. Adding more tables to the model which is using direct query is now possible with Composite Model.

Turning on Composite Modeling

Since this feature was just released, it is still in preview.  In order to be able to use it, go to the Power BI File tab then Options and Settings->Options->Preview features, as shown in the picture.  You will have to restart Power BI after this option is changed. This option not only turns on Composite Modeling, it also enables Many to Many data modeling.

Many to Many Data Modeling

In Power BI if you have tried to create a join relationship between two columns and one of them has duplicate values, you will get an error.  If you need to join two tables and they have duplicate key values, more than likely you are familiar with the work around, which is to create another intermediate table which has a column with unique values to create the relationship.  If you hide the table you can almost pretend that it really works. When you have Composite Models enabled in the Preview Features, you can create a direct relationship between tables which do not have unique keys! The Many to many relationship will be created for you in Power BI.

Caveats with the Composite Modeling Feature

Like me after playing around with both Many to Many and Composite Modeling you are probably all ready to deploy these features into production. That is a problem. As Composite modeling is a preview feature, it cannot be loaded to the Power BI Service, for now.  It will work as a desktop prototype, but you cannot distribute it, yet.

There is another big factor regarding Many to Many which you may have noticed when trying to get this feature to work. It is not possible to incorporate Many to Many with SQL Server Analysis Services Tabular Models, yet. It is my understanding that Microsoft plans to have this available for Analysis Services data sources, but for now, it is not possible to model them.

If you are planning on releasing either of these two features it is now possible to create and to use them in a desktop prototype.  I am looking forward to the day, which I hope will be later this year, when these features are available to be distributed to the Power BI Service.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Gaining Better DAX Skills

If you work with Excel Pivot Tables, Power BI, or Analysis Services Tabular more than likely you are working to improve your DAX skills. I know I am. While I do a lot of work with DAX, I am always looking to get better at writing it and better at explaining how to write it for clients or class attendees.  As I find things which I think others may find helpful, I write about them here to help spread the knowledge, especially to those people who have background in SQL.

DAX History

Microsoft has been working on the DAX language for about 10 years and released it in as part of Excel 2010 Power Pivot Add-in, back when Microsoft was incorporating Power BI functionality in Excel. Microsoft wrote DAX so that it would be like Excel so that people who are familiar with Excel would have an easier time writing it. While DAX is easier than MDX, it is enough different from Excel that it can get quite complex. One of the founders of the DAX language is Jeffery Wang, who is still actively working on enhancing the language further.  He works not only with developers but also people in the technical community who work extensively with DAX to get their input. I also found out recently that he took time out to do a technical review of a book on the DAX language by Phil Seamark b | t.

Leveraging SQL Knowledge to Learn DAX

If like me, your skills like not in Excel but in SQL, reading a book which builds upon SQL knowledge to understand DAX is really helpful.  Beginning DAX with Power BI provides examples of how to write something in SQL then shows how the same functionality would be written in DAX, sort of like a SQL-to-DAX language library.  Showing how to get the answer from SQL and DAX is a great technique to provide a better understanding of how to write DAX. I worked a client recently who validated all of the DAX in the tabular model by writing SQL to ensure they were getting the correct answers. I think the validation process would have been improved with this book which takes DAX out of the black magic category and explains the way it works.

Performance Tuning DAX

Once you start writing a lot of DAX, you will at some point write a query which performs badly.  What do you do to avoid poorly performing DAX? Phil provides some “old-school debugging” techniques to understand what is being generated and where it is going wrong. He also shows how you can use DAX Studio and SSMS to help figure out how and why your DAX is working the way it is. You will definitely want to reference Chapter 8 to learn more about this important topic.

Other Methods for Learning DAX and Improving Power BI skills

If you are looking for a hands-on environment to better understand how DAX works and how to improve your Power BI skills I have two opportunities for you in the US. I will be giving Advanced Power BI Classes in Columbus, Ohio on July 27 and Boston, Massachusetts on September 20.  Of course we will be covering more advanced DAX development and tuning as well as applied visualization theory to improve the interactivity of your Power BI reports as well as hands on labs to implement the new drill-through features and enhancements in Power BI and other class topics.  I hope to meet you there or at some of the other conferences I will be attending this year.  Check out my Engagement page and hopefully find a place where we can meet up as I would love to meet you.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Data Science with Python

KD Nuggets Data Science/ Machine Learning PollFor those of you who might have missed it, the website KDnuggets released their latest internet survey on data science tools, and Python came out ahead, again. Python has continued to gain as a tool that people are using for Data Science.  The article accompanying the graphic is very interesting as it brings up two data related points. The first is the survey only had “over 2300 votes” and “…one vendor – RapidMiner – had a very active campaign to vote in KDnuggets poll”.  This points the fallacy in completely relying on data with an insufficiently sized data set, as it is possible to skew the results, which is true both for surveys and data science projects.  If you look at the remaining results one thing also strikes me as interesting. Anaconda and sci-kit learn are Python libraries.  Tensorflow could be used for either R or Python.  This does tend to increase the argument for more use of R or Python over RapidMiner.  The survey also made me want to check out RapidMiner.

Thoughts around Rapid Miner for Machine Learning

While I have not had enough time to fully analyze Rapid Miner, I thought I would give my initial analysis here and do a more detailed review of it in another post.  Rapid Miner scored well in the Kaggle Survey, but also it ranked highly on the 2018 Garner Magic Quadrant for Data Science Platforms.  Rapid Miner is trying to be a tool not only for data scientists, but also for business analysts as well.  The UI is pretty intuitive, which is good because the help is not what it should be. I also was less than impressed at its data visualization capabilities, as R and Python both provide much better visuals. Of course, I used the free version of the software, which works but it is limiting.  It looks like a lot of the new stuff is going to be only available on the paid version, which decreases my desire to really learn this tool.

Machine Learning Tools

Recently I have done a number of talks on Python in SQL Server, literally all around the world, including Brisbane, Australia tomorrow and Saturday, June 2 as well as in Christchurch New Zealand. As R was written in New Zealand, I thought that it would be the last place where people would be looking to use Python with Data Science, but several of the attendees of my precon on Machine Learning for SQL Server told me that where they worked, Python was being used to solve data science problems. Now of course this is anecdotal sample, as we are not talking about a statistically significant sample set, but that doesn’t keep it from being interesting.   The demand for Python training continues to increase as Microsoft has announced they are working on incorporating Machine Learning Service blog series with SQL Server Central.  The first two post have been released. Let me know what you think of them.

Upcoming Events

I am looking forward to talking about Machine Learning with SQL Server in Brisbane both at an intense day long session and at a one hour session on Implementing Python in SQL Server 2017 at SQL Saturday #713 – Brisbane, Australia. I look forward to seeing you there. For those who can’t make it, well, hopefully our paths will cross at a future event.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur