Articles

Fabric Features not found in Azure Synapse or Power BI – Data Wrangler

The second element in my series on new Fabric features is Data Wrangler. Data Wrangler is an entirely new feature found inside the Data Engineering and Machine Learning experiences of Fabric. It was created to help analyze data in a lakehouse using Spark and generated code. You may find that there's a lot of data in the data lake that you need to evaluate to determine how you might incorporate it into a data model. It's important to examine the data to see what it contains. Is there anything missing? Is anything incorrectly typed? Is there bad data? There is an easy method to discover what is missing or wrong with your data which uses some techniques commonly employed by data scientists. Data Wrangler is used inside notebooks in the Data Engineering or Machine Learning environments, as the functionality does not exist within the Power BI experience.

Step-by-step Data Wrangling

The example I am going to walk through here uses the public holiday sample dataset, which you can load by clicking on the Use a sample button within Data Engineering and selecting Public Holidays. Once you create a lakehouse with the sample data, you can follow along with the steps in this blog for doing your own data wrangling.

From within the Data Engineering experience, select Notebook from the + New dropdown menu to create a new notebook. You will need to add some data to the lakehouse inside of the new notebook. To do this, click on the green Add button in the lakehouse section of the screen to add your Public Holiday lakehouse. After the lakehouse is loaded, you will have two file folders; if you click on the one labelled Files, you will see two more folders. Click on the folder called sample_datasets, which makes it turn green, and you will see a list of files in the sample_datasets pane which opens up. If you click on the ellipsis menu next to public_holidays.parquet, you will see a popup window. In that window, select Load data and you will see two options, Spark and Pandas. The screen image shows what it looks like.

When Fabric was first released, the only option available for Data Wrangler was pandas, which is not Spark compliant. For more information regarding pandas, check out my blog on Koalas where I cover why you don't want to use pandas. I am very happy to see the Spark DataFrames option, as Spark code can take advantage of Spark's superior memory and data management. However, the code Data Wrangler generates is pandas no matter which option you select; it is only converted to Spark afterwards. We now need to run the code which was generated for us so the data is loaded into a Spark DataFrame, by going to the Run menu and selecting Run all. This will start up the Spark cluster and run the code. You can look in the bottom left corner to see when it is complete.
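For reference, the generated cell looks roughly like the short sketch below; the relative Files path is an assumption based on how the sample loaded into my lakehouse, so adjust it to match yours.

# Generated-style load of the sample parquet file into a Spark DataFrame (path is a placeholder)
df = spark.read.parquet("Files/sample_datasets/public_holidays.parquet")
# display() renders the DataFrame in the notebook results pane
display(df)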

Once you see the data you can start wrangling it. Go to the Data tab and select Transform DataFrame in Data Wrangler, then select the dataframe which was created for you, df.

Using Data Wrangler


In the Data Wrangler screen which is pictured here, you will see several interesting items. The first is at the top of the screen, where a message lets you know that the code shown in Data Wrangler is pandas and that it will be converted to PySpark when you leave Data Wrangler. This is important because any code you decide to keep will be Spark compliant once it is added to your notebook. The top of the screen looks a lot like something you might see within Power Query if you have all the view items checked. Data Wrangler also shows missing and distinct counts as well as the top three values for each column.

On the left side of the screen there is a list of operations which will generate code for you. I am going to select the first two of them, starting with Drop duplicate rows, which you can see in the screenshot. This immediately generates a line of code to drop duplicate rows across all columns, but nothing happens until the Apply button is clicked, which I will do so the code is run. The applied steps are called cleaning steps and they appear on the bottom left side of the screen. Even after a step is applied it is not removed from the list of operations, which seems odd. I am going to select the next option, which is to drop missing values, and I am then prompted with a dropdown box to select which columns I want to check for missing values. I am going to pick all, then apply the change. At this point, since the first two steps deleted the duplicates and everything with missing values, the next steps will do nothing, as there is nothing missing. Clicking the option Preview code for all steps allows me to see the generated code.
If I want to run this code every time I get this dataset, I will want to add it to a notebook, so I select Add code to notebook. In my notebook, I want to add PySpark code, which is different from the pandas code that was generated. Optionally I can add pandas code instead, but I see no reason to do that. A popup window will appear showing me the code, at which point I can just click on the Add button. The generated code is added to my notebook, saving me the tedium of having to write common code like this.
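For reference, the PySpark equivalent of the two cleaning steps I applied looks roughly like the sketch below; the dataframe name df matches the one Fabric created, but treat the rest as an illustrative approximation rather than the exact generated output.

# Drop rows that are duplicated across all columns
df_clean = df.dropDuplicates()
# Drop rows that have a missing value in any column
df_clean = df_clean.dropna(how="any")
display(df_clean)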

Data Wrangler is another automated tool which will write code for me, similar to what I might expect if I were using Copilot. Data Wrangler presents decision points, and the code it generates is based on my choices about which option is best in a given scenario. I can always return to Data Wrangler to add other steps or make different selections from the operations I chose. This tool can be really useful and I plan on using it to clean up my data in notebooks.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Exploring Azure Synapse

In the past few months, I have been examining Azure Synapse and what it can do. When it was first released in November of 2019, the first functionality rolled out was an update of Azure SQL DW. For this reason, many people think that Synapse is just an improved version of a cloud data warehouse. Microsoft did improve SQL DW when it moved it to Synapse. The biggest architectural design change is the separation of storage from compute, a theme with many web projects, which allows the compute power to be increased when need dictates and scaled down when computing needs change. Within Synapse, resources are allocated as pools, and you can define a SQL pool to run a data warehouse and later change the compute to a different resource. You will still need to partition your DW, as large datasets require partitioning to perform well. Subsequently, Microsoft released Azure Synapse Studio to be a container for a larger environment of tools and the notebooks to interact with them.

Non-Data Warehouse Elements of Azure Synapse

To me the more interesting parts of Azure Synapse have nothing to do with data warehouses. Azure Synapse also contains the ability to query files stored in Azure Data Lake Storage Gen2 as if they were SQL tables. This is a great way to analyze large data without first cleaning it up and putting it into a relational environment. Within Synapse you can formulate a query using syntax for selecting parts of files, providing the ability to look at many files as if they were one. You can also create processes which bring data into your Synapse environment using orchestration, a process that people who have used Azure Data Factory will find very familiar. Synapse also contains the ability to analyze data in Cosmos DB without doing ETL or moving the data at all, using a scalable architecture which will not impact the transactions being processed simultaneously on the same Cosmos DB.
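To give a feel for what querying files as if they were tables looks like, here is a rough sketch that submits a serverless SQL query from Python; the workspace name, storage account, container, credentials and file path are all placeholders, not values from this post.

import pyodbc

# Placeholder connection to a Synapse serverless SQL endpoint
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=master;UID=<user>;PWD=<password>;"
)

# OPENROWSET lets you select directly over parquet files sitting in the data lake
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/holidays/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
"""

for row in conn.cursor().execute(query):
    print(row)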

Azure Synapse and Spark

Check out the promotional video here

By far the most interesting component of Azure Synapse is the Spark connection. Microsoft has added the ability to create Spark pools in Azure Synapse. To be honest, I was somewhat surprised that this functionality is included here first and not in Azure Machine Learning, where to use Spark you need to access clusters created in Databricks. Spark provides the ability to dynamically scale resources when running processes. This is very handy when writing machine learning code, which can really use the performance improvements Spark brings. Because this is Microsoft's Spark, you can also write your code to access it in .NET if you like, in addition to the more common Spark languages: Scala, R or Python. You can also incorporate the AutoML API created for Azure Machine Learning in R and Python so that you can use the power of Azure to select your algorithm and hyperparameters instead of spending time doing it yourself.
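As a small illustration of using a Spark pool from a Synapse notebook, here is a PySpark sketch; the abfss path and column name are placeholders.

# spark is pre-created in a Synapse notebook attached to a Spark pool
df = spark.read.parquet("abfss://<container>@<storageaccount>.dfs.core.windows.net/holidays/*.parquet")

# A simple aggregation that the Spark pool distributes across its nodes
df.groupBy("countryOrRegion").count().orderBy("count", ascending=False).show(10)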

Getting up to Speed with Synapse

There is a lot to learn when it comes to Synapse as it combines a lot of different components into one environment. As more and more data is migrated to the cloud, it is uniquely designed to handle big data components containing raw data and managed data lakes, as well as more traditional data warehouse needs. It can also be the location where all of the data is processed, secured, cleaned and analyzed using machine learning. There is a lot to cover, and since it is new, there are not a lot of places yet where you can learn more about it. If you are interested in a deep dive on Azure Synapse and how to use it in a Modern Data Warehouse, sign up for my precon at PASS Summit 2020 where I will cover the topic in depth.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Using Koalas in Spark

The Apache Spark open source organization maintains all of the documentation for Apache Spark, a set of APIs which are used in Databricks and other big data processing applications. The documentation provides detailed information about the libraries, but the instructions for loading libraries are not exactly the same as those used in Databricks, so if you follow the Spark installation instructions, you will get nowhere. If you follow the steps listed here, you will be up and running in no time.

Installing Options – Cluster or Notebook?

If you are not using an ML workspace you can add it in using dbutils like this:
dbutils.library.installPyPI("koalas")      # install the koalas package from PyPI for this notebook session
dbutils.library.restartPython()            # restart Python so the newly installed library can be imported

Unfortunately, if you are using an ML workspace, this will not work and you will get the error message org.apache.spark.SparkException: Library utilities are not available on Databricks Runtime for Machine Learning. The Koalas GitHub documentation says "In the future, we will package Koalas out-of-the-box in both the regular Databricks Runtime and Databricks Runtime for Machine Learning". What this means is that if you want to use it now on an ML runtime, you will need to install the library on the cluster, as described below.

Most of the time I want to install on the whole cluster, as I segment libraries by cluster. This way, if I want those libraries I just connect to the cluster that has them. The easiest way to install a library is to open up a running Databricks cluster (start it if it is not running) and then go to the Libraries tab at the top of the screen. My cluster is called Yucca, and you can see that it is running because the circle next to the name is green.

Once you are on the Libraries tab you will see two buttons. Click on the one labeled Install New. A window will appear. Select the library source of PyPI and in the Package text box enter the word koalas. Then click on the Install button.

The installation may take a few minutes. When it is complete you will see a green status circle and the word installed.

After this you are ready to use the new library, once you import it as shown below.
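A minimal sketch of that import, with a quick made-up dataframe to confirm the library loaded:

import databricks.koalas as ks      # the koalas package installed above

# Create a small koalas DataFrame using the familiar pandas-style constructor
kdf = ks.DataFrame({"city": ["Brisbane", "Christchurch"], "visits": [3, 1]})
print(kdf.head())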

 

Why do I want to install Koalas in Databricks?

If you have written Python code for machine learning, chances are you are using pandas. Pandas dataframes are practically the standard for manipulating data in Python. They are not, however, part of the Spark API. While you can move your Python code over to Databricks without making any changes to it, that is not advisable. Databricks is not able to scale pandas, so adding more resources to your cluster will not improve the performance of pandas code. When writing Python code for Databricks you need to use the Spark APIs in order to ensure that your code can scale and will perform optimally. Prior to April of 2019, that meant that you had to use Spark dataframes and not pandas dataframes, which could involve a bit of rework when porting code, as much code was written in pandas. In April of last year Koalas was added to Spark, meaning that changing code from a pandas dataframe to a Koalas dataframe means you only have to change one word. Koalas mirrors the functionality of a pandas dataframe, so if you are familiar with one you can use the other.
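To show the one-word change, here is a small sketch; the file path and column names are hypothetical, and note that pandas reads through the driver's local file system while Koalas reads through Spark.

import pandas as pd
pdf = pd.read_csv("/dbfs/FileStore/tables/sales.csv")     # pandas: single-node, runs on the driver only
pdf.groupby("region")["amount"].mean()

import databricks.koalas as ks
kdf = ks.read_csv("/FileStore/tables/sales.csv")          # koalas: same API, distributed on Spark
kdf.groupby("region")["amount"].mean()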

More About Koalas

Koala getting help from a firefighter

Photo Credit: New York Fox 5

It is impossible for me to load the library without thinking about the Australian Bush Fires which are burning the homes of people and Koalas.  If your finances allow it, please consider donating to the firefighters as I am sure they can use help to save the homes of people and animals in Australia.

 

Regards,

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Introduction to Databricks

As I have been doing some work on Databricks, I thought that it would make sense to start writing about it. Databricks is a scalable environment used to run R, Python and Scala code in the cloud. It currently can be run in either AWS or Microsoft's Azure cloud. For those of you who are budget-minded when it comes to learning new tools, there is also a free tier, which is available at community.cloud.databricks.com. It has somewhat limited compute capacity, but if you are just starting out you might find it helpful.

Backstory

Databricks is an implementation of Apache Spark, which is part of the Hadoop ecosystem and was a replacement for MapReduce. Many of the people who worked on that open source project were students at Berkeley, where Apache Spark was created; it was added to Apache in 2013. Like many development projects, after it was completed, they had some ideas on how to improve the code. This time they decided not to make it open source, but to make it a commercial product so they could make some money for their development efforts. In April of 2017 Databricks was released on AWS, and in March 2018 it was released in Azure.

Creating an Azure Databricks Service

Creating a Databricks service is very straightforward. There are only a few things that you need to complete when creating a new Databricks instance. The location becomes very important if you are looking at higher performing instances, which may not be available in all locations. Security considerations also matter if sensitive information will be stored and accessed, although that mostly comes into the picture if you are working on a company project rather than just learning. If you are just getting started, don't worry about high level hosting services as you most likely will not need them, and most of the compute options are available in most data centers. As always in Azure, you want to make sure that you are hosting your Databricks service in the same location as your data so you will not need to pay to transfer data between data centers.

The Pricing Tier contains three options: Standard, Premium and Trial (Premium 14 Days). The trial is pretty self-explanatory and is a great way to get started using Databricks. There are of course a few differences between Standard and Premium. Premium has extra features needed for teams, including role-based rights for the components of Databricks. And if you want ODBC authentication and audit logs you will need to use Premium. For more information on the cost of the Databricks pricing tiers, check out Microsoft's pricing page.

Once you have an instance created, you can start using Databricks. The application is contained within a managed instance, so once you launch Databricks you will be in their environment, which looks the same as the free edition.

 

Clusters, Notebooks and Data

These three components are the most important parts of Databricks as they include, respectively, the compute power, the place where you write code, and the information you work with. These components are all separated in Databricks to improve scaling and provide a familiar environment to create and run code.

Cluster

The cluster is the most important Databricks element, as it contains the compute. This is also the part of Databricks which will most increase your bill, as the more resources you use to run code, the more money it costs. One nice thing is that clusters by default will terminate after 120 minutes of inactivity. I generally drop this to 20 minutes. If I am using the cluster it naturally will not terminate, but if I am not using it, I want the charges to stop. You can also automatically spin up clusters to run jobs, so that they will only be in use when the job needs them. More about that in another post.
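For scripted setups, the same auto-termination setting can be supplied when a cluster is created through the Databricks REST API. The sketch below is illustrative only; the workspace URL, token, runtime version and node type are placeholders.

import requests

workspace = "https://<your-databricks-workspace>"      # placeholder
token = "<personal-access-token>"                       # placeholder

cluster_spec = {
    "cluster_name": "Yucca",
    "spark_version": "6.4.x-scala2.11",                 # example runtime version
    "node_type_id": "Standard_DS3_v2",                  # example Azure node type
    "num_workers": 2,
    "autotermination_minutes": 20,                      # stop billing after 20 idle minutes
}

response = requests.post(
    f"{workspace}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
print(response.json())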

Notebooks

Databricks Notebook Import

There are three supported languages in Databricks: R, Scala and Python, and within Databricks all of these languages are written in notebooks. You don't have to write your code in the environment; you can write it locally and then import it. However, if you want to export your notebook and then run it locally, it gets trickier. Natively, all of the notebooks in Databricks are saved as .dbc files, and you can't read them from anywhere else. Fortunately, there is a workaround to format the notebook files as .ipynb files, which can be read by any notebook tool. Dave Wentzel from Microsoft has an elegant solution to convert .dbc to .ipynb which he includes in his blog here.

Data

You have a lot of options with data. You can import a dataset into your environment to play with, or you can connect to just about anything you can think of. When you start doing data connections is when you stop using the Community Edition, as you will want to use the Azure version to connect to various data resources like Azure SQL and blob storage. More on how to do that in an upcoming post.
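As a taste of what those data connections look like, here is a hedged sketch of reading a CSV file from Azure Blob Storage in a Databricks notebook; the storage account, container, key and file name are placeholders.

# Provide the storage account key for this Spark session (placeholders throughout)
spark.conf.set(
    "fs.azure.account.key.<storageaccount>.blob.core.windows.net",
    "<storage-account-access-key>",
)

# Read a CSV file directly from blob storage into a Spark DataFrame
df = spark.read.csv(
    "wasbs://<container>@<storageaccount>.blob.core.windows.net/sales.csv",
    header=True,
)
display(df)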

If you are interested in hearing more about Databricks and are in Chicago, I am teaching an all day class as part of SQL Saturday Chicago and would love to have you attend. More information on that class is here.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Fallibility of Machine Learning: Why Lenovo will not sell me a laptop

Machine Learning is being adopted by more and more companies to assist in the sales process. Like all technology, Machine Learning is not correct all of the time. In fact, models with 75% accuracy are commonly accepted as good models and implemented in production. This means 25% of the time the model is wrong, meaning that the algorithm will incorrectly flag 25% of the business. What happens to transactions where an algorithm has incorrectly determined the transaction is not viable? Legitimate business is turned away. This happened to me when I tried to do business with Lenovo.

Lenovo’s Ordering System: No Laptop For You!

Buying a laptop from Lenovo reminded me of an episode of Seinfeld when Elaine was trying to buy soup. For some unknown reason, when I placed an order on their website and gave them my money, Lenovo gave me a Cancellation Notice, the email equivalent of "No Soup for you!" After placing an order, about 15 minutes later, I received a cancellation notice. I called customer service. They looked at the order and advised me the system had incorrectly cancelled the order. I was told to place the order again as they had resolved the problem. I created a new order, and just like the last time, I received the No Laptop for You cancellation email. I called back. This time I was told that the system thinks I am a fraud. Now I had no laptop and I had been insulted. I asked if the system could be overridden because I was not a fraud. Customer service verified my method of payment and told me that they were going to assign a case number to it, as that would ensure the transaction would go through, and they would get credit for the order as they were going to place it. Apparently, customer service has some kind of financial incentive for placing sales. That did not work either as, once again, I received the No Laptop for You cancellation email. Not only did I not get a laptop, the person I spoke to also lost out as he was not going to get credit for the sale. I called back again and this time they told me that they had no idea what was wrong with the system, but it had flagged me as a fraud and a case number did not get assigned last time as it was supposed to, which was the reason that order was canceled, again. They placed the order again and once again I received the No Laptop for You cancellation email. Every attempt at buying a laptop had failed. I had struck out with customer service, as I had received advice three times and every time I got a No Laptop for You cancellation email. At this point I tried getting the situation resolved via social media. Publicly Lenovo said they wanted to help, and sent me one direct message letting me know they would fix the system, and that was the last I ever heard from them. By not sending me another email, the message they sent me instead was No Laptop For You!

Relying on Machine Learning can Cost businesses Sales and leave them wondering about Toilet Seats

I tried to give Lenovo nearly $2,000 and they refused to take my money. How many other transactions are they ignoring? Over 500? That does not seem like a terribly high number, and 500 transactions at $2,000 apiece is huge. That would mean Lenovo's sales are needlessly down a million dollars because they implemented a system that turns away deals and actively prevents sales despite the best intentions of their employees to close a sale. As a result, Lenovo's customers may be left with a negative impression of the company, which could lead to a further decrease in laptop and computer sales. Lenovo may need to consider the primary cause of this problem: the over-reliance on machine learning.

Blindly relying on the accuracy of a computer program to determine whether or not a transaction is viable is not just a bad idea, it is a bad business decision that can cost millions of dollars in sales. Machine learning should support the people making sales, not replace them. A story could solidify this statement. A friend of mine who was remodeling a bathroom bought a toilet seat on Amazon. When he logged in again, he continued to see a myriad selection of toilet seat product recommendations for the next six months, to the exclusion of other products he might actually want to buy. Apparently, the machine learning algorithm determined that because he bought one toilet seat, he was a Toilet Seat Connoisseur who wanted to decorate his house with a variety of rare and unusual toilet seats.

Combining Machine Learning with People

I create machine learning solutions for clients and provide training sessions to help people learn how to write machine learning models. I understand the process and the steps used to create a machine learning experiment. First you gather and clean the data, then you train a set of algorithms against the data, and then you create a model. The problem "Should I cancel this sale?" has two possible answers, yes or no, meaning it is a binary classification for anomaly detection. Never have I created a model which was 100% accurate, as that is not possible. I tell clients it is not possible and help them implement solutions to handle conditions when the model is wrong. Machine learning needs to work in concert with people who have the ability to resolve problems which are flagged by the system, as there is a place for people in all automated systems.
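To make that concrete, here is a minimal sketch, with made-up data and thresholds, of how a fraud-style binary classifier can route low-confidence predictions to a person instead of automatically cancelling the order:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: two features per order (order amount, account age in days); 1 = fraud
X_train = np.array([[2000, 1], [50, 300], [1500, 2], [40, 600], [1800, 450], [30, 50]])
y_train = np.array([1, 0, 1, 0, 0, 0])
model = LogisticRegression().fit(X_train, y_train)

new_orders = np.array([[1900, 400], [1950, 1]])
fraud_probability = model.predict_proba(new_orders)[:, 1]

for order, p in zip(new_orders, fraud_probability):
    if p > 0.9:
        print(order, "auto-cancel")                      # model is very confident it is fraud
    elif p < 0.1:
        print(order, "auto-approve")                     # model is very confident it is legitimate
    else:
        print(order, "send to a person for review")      # model is unsure, so a human decides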

Most normal people would probably have given up after their order was canceled twice, but I persisted as I was amazed that such a big company as Lenovo could continue to be so wrong, and I wanted to prove I was not a fraud. Continued failure to successfully place an order convinced me that I did not want to do business with Lenovo. If a company does not want to resolve an issue where they will receive money, how likely are they to resolve a situation which costs them money, such as a warranty claim? Based on my experience, I have no confidence that one could get Lenovo customer service to solve a problem, as they do not have the ability, even when they are financially incentivized to do so. Machine Learning and AI may decrease the number of people needed, but when things go wrong, people are needed to fix them. When a machine learning model is wrong, and this will happen, the policy should be to permit your customer service people to create successful sales. If instead your customer service insults and ignores customers when machine learning models go wrong, sales will go down as customers go to competitors.

I researched laptops as I was interested in having a lightweight, powerful laptop which I could haul through various airports to use at clients and conferences, like Live 360 SQL Server where I will be speaking on December 3. Fortunately, there are other companies who have determined they do not need to create some kind of machine learning score to sell a laptop; they just sell laptops to people who go to their website and give them money, with no problems. Using the same address and credit card information which Lenovo flagged as fraudulent, I bought my new HP laptop, which I will be happy to demonstrate next time you see me at a conference or class.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Science with Python

KDnuggets Data Science / Machine Learning Poll

For those of you who might have missed it, the website KDnuggets released their latest internet survey on data science tools, and Python came out ahead, again. Python has continued to gain ground as a tool that people are using for data science. The article accompanying the graphic is very interesting as it brings up two data-related points. The first is that the survey only had "over 2300 votes" and "…one vendor – RapidMiner – had a very active campaign to vote in KDnuggets poll". This points to the fallacy of completely relying on an insufficiently sized data set, as it is possible to skew the results, which is true both for surveys and for data science projects. If you look at the remaining results, one other thing strikes me as interesting: Anaconda and scikit-learn are Python tools, and TensorFlow can be used from either R or Python. This does tend to strengthen the argument for more use of R or Python over RapidMiner. The survey also made me want to check out RapidMiner.

Thoughts around Rapid Miner for Machine Learning

While I have not had enough time to fully analyze RapidMiner, I thought I would give my initial impressions here and do a more detailed review in another post. RapidMiner scored well in the KDnuggets survey, and it also ranked highly in the 2018 Gartner Magic Quadrant for Data Science Platforms. RapidMiner is trying to be a tool not only for data scientists but also for business analysts. The UI is pretty intuitive, which is good because the help is not what it should be. I was also less than impressed with its data visualization capabilities, as R and Python both provide much better visuals. Of course, I used the free version of the software, which works but is limiting. It looks like a lot of the new features are going to be available only in the paid version, which decreases my desire to really learn this tool.

Machine Learning Tools

Recently I have done a number of talks on Python in SQL Server, literally all around the world, including Brisbane, Australia tomorrow and Saturday, June 2, as well as in Christchurch, New Zealand. As R was written in New Zealand, I thought that it would be the last place where people would be looking to use Python for data science, but several of the attendees of my precon on Machine Learning for SQL Server told me that where they worked, Python was being used to solve data science problems. Now of course this is an anecdotal sample, as we are not talking about a statistically significant sample set, but that doesn't keep it from being interesting. The demand for Python training continues to increase as Microsoft has incorporated Machine Learning Services into SQL Server, and I am working on a Machine Learning Services blog series with SQL Server Central. The first two posts have been released. Let me know what you think of them.

Upcoming Events

I am looking forward to talking about Machine Learning with SQL Server in Brisbane both at an intense day long session and at a one hour session on Implementing Python in SQL Server 2017 at SQL Saturday #713 – Brisbane, Australia. I look forward to seeing you there. For those who can’t make it, well, hopefully our paths will cross at a future event.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Preparing for Exam 70-774 – Perform Cloud Data Science with Azure Machine Learning

There are a number of reasons why you might want to take a Microsoft cert exam. Maybe you want to focus your studies on a tangible goal, or you think it will help further your career, or you work for a Microsoft Partner that requires a certain number of people to pass the exam to maintain their current partner status. I am not going to get into the long argument regarding whether or not a cert will help you in your career, but I can tell you why you might want to take the 70-774 exam. Machine Learning, or Data Science if you prefer, is an important skill for analyzing data, and I believe it will only become more useful over time. Azure Machine Learning is a good tool for learning the analysis process. Once you have the concepts down, should you need to use other tools to perform analysis, it is just a matter of learning a new tool. I talk to a number of people who are trying to learn new things, and they study them in their spare time. It's very easy to spend time vaguely studying something, but you may find that having a target set of items to study will focus your time, and as a bonus you get a neat badge and some measure of proof that you were spending time on the computer learning new things and not just watching cat videos.

Exam 70-774 Preparation Tips


While you could always buy the book for the exam (shameless plug, as I was one of the authors), the book will not be enough; you will still need to write some code and do some additional studying. This exam is one of two needed for the MCSA in Data Science, and you can take the exams in any order. The best place to start is by looking at the 70-774 exam reference page from Microsoft. There are four different sections in the exam, and I have created some links for each section which will help you prepare. In studying for exams in the past, the best way I have found to prepare is to look at everything on the outline and make sure that I know it.

Prepare Data for Analysis in Azure Machine Learning and Export from Azure Machine Learning

Normalizing Data
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data

TanH
https://reference.wolfram.com/language/ref/Tanh.html

ZScore
http://stattrek.com/statistics/dictionary.aspx?definition=z-score
http://howto.commetrics.com/methodology/statistics/normalization/

Min Max
https://www.quora.com/What-is-the-meaning-of-min-max-normalization

PCA
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/principal-component-analysis
https://stackoverflow.com/questions/9590114/importance-of-pca-or-svd-in-machine-learning

SVD
http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/

Canonical-correlation analysis (CCA)
https://en.wikipedia.org/wiki/Canonical_correlation

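Since several of the normalization topics above (z-score, min-max, TanH) come down to a couple of formulas, here is a small illustrative sketch in Python; the sample values are made up.

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])                 # made-up sample values

# Z-score: subtract the mean and divide by the standard deviation
z_score = (x - x.mean()) / x.std()

# Min-max: rescale the values to the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# TanH: squashes values into the range (-1, 1)
tanh = np.tanh(z_score)

print(z_score, min_max, tanh)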

Develop Machine Learning Models

Team Data Science
https://docs.microsoft.com/fi-fi/azure/machine-learning/team-data-science-process/python-data-access

K-Means
https://www.datascience.com/blog/k-means-clustering

Confusion Matrix
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
https://en.wikipedia.org/wiki/Confusion_matrix
https://en.wikipedia.org/wiki/F1_score

Ordinal Regression
https://en.wikipedia.org/wiki/Ordinal_regression

Poisson regression
https://en.wikipedia.org/wiki/Poisson_regression

Mean Absolute Error and Root Mean Squared Error
http://www.eumetrain.org/data/4/451/english/msg/ver_cont_var/uos3/uos3_ko1.htm

Cross Validation
https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f
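For the model-evaluation topics in this section, here is a quick scikit-learn sketch of a confusion matrix and cross validation; the dataset is scikit-learn's built-in breast cancer sample, used purely for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, model.predict(X_test)))

# 5-fold cross validation gives a more stable estimate of accuracy
print(cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5).mean())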

Operationalize and Manage Azure Machine Learning Services

Connect to a published Machine Learning web service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/publish-a-machine-learning-web-service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/consume-web-service-with-web-app-template
https://docs.microsoft.com/en-us/azure/machine-learning/studio/manage-new-webservice

Use Other Services for Machine Learning

Microsoft Cognitive Toolkit
https://www.microsoft.com/en-us/cognitive-toolkit/

BrainScript
https://docs.microsoft.com/en-us/cognitive-toolkit/brainscript-basic-concepts

Streamline development by using existing resources
https://docs.microsoft.com/en-us/azure/machine-learning/studio/gallery-how-to-use-contribute-publish
Perform database analytics by using SQL Server R Services on Azure
https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/provision-vm
https://docs.microsoft.com/en-us/machine-learning-server/install/r-server-vm-data-science
https://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf
http://blog.revolutionanalytics.com/2017/07/xgboost-support-added-to-rattle.html
https://github.com/JohnLangford/vowpal_wabbit/wiki

I hope you have found this test preparation material helpful.  If you passed the exam, let me know by sending me a comment.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Azure Machine Learning Workbench

Microsoft released Azure Machine Learning Workbench at the Ignite conference on September 25, 2017 as a public preview. It is a new tool being added to the Azure ecosystem, which already includes the machine learning tool introduced three years ago, Azure Machine Learning Studio. Microsoft has said they plan on keeping both products. When asked about the two products, they said that the earlier tool, Azure Machine Learning Studio, is targeted at developers who want to add machine learning to their current applications, as it is an easy to use tool that doesn't require a person to be a trained data scientist. Azure Machine Learning Workbench is targeted at data scientists who want to bring in other libraries, like TensorFlow for Python, and delve deep into the data.

Microsoft Moves into Machine Learning Management

Microsoft intends Azure Machine Learning Workbench to be more than a tool for machine learning analysis. It is part of a system to manage and monitor the deployment of machine learning solutions with Azure Machine Learning Model Management. The management aspects are part of the application installation. To install Azure Machine Learning Workbench, the application download is available only by creating an account in Microsoft's Azure environment, where a Machine Learning Model Management resource will be created as part of the install. Within this resource, you will be directed to create a virtual environment in Azure where you will be deploying and managing machine learning models.

This migration into management of machine learning components is part of a pattern first seen in the on-premises version of data science functionality. First, Microsoft helped companies manage the deployment of R code with SQL Server 2016, which includes the ability to move R code into SQL Server. Providing this capability decreased the time it took to implement a data science solution by providing a means for the code to be deployed easily, without the need for the R code to be re-written or included in another application. SQL Server 2017 expanded on this idea by allowing Python code to be deployed into SQL Server as well. With the cloud service Model Management, Microsoft is hoping to centralize the implementation so that all machine learning services created can be managed in one place.

Hybrid Cloud, Desktop, and Python

While you must have an Azure account to use Machine Learning Workbench, the application is designed to run locally on either a Mac or a Windows computer. There is a developer edition of the tool so that one can learn the tool and not incur a bill, which is also the case with the previous product, Azure Machine Learning. The download of Machine Learning Workbench must be accessed within an Azure account and is installed to your local computer. When running the application from your computer, it will prompt you to log into your Azure account to load Azure Machine Learning Workbench.

The application is designed to use and create Python code. Azure Machine Learning Workbench does not contain any accommodation to incorporate machine learning components written in R, just Python. If you have created machine learning components using R, they can be incorporated into Azure Machine Learning Model Management if you create web services which encapsulate the R code. The R code does not interface with Workbench, but it can be made a part of the managed projects in Azure. While it is possible to create a web service for R with the earlier product, Azure Machine Learning, there is no direct way to include R with Azure Machine Learning Workbench. There are a number of sample templates to get started with, including the ubiquitous Iris dataset, linear regression and several others. Once the project is created, you can use your favorite IDE, as the project creates Python code which can be read anywhere.

Staying within the Machine Learning Workbench application gives you access to arguably one of the neatest parts of the tool, the data parser. This tool, originally code-named Project Pendleton, was designed to be an intuitive way to modify the contents of data, even better than the previous leader in parsing data, Power BI's Power Query.

You can select the option "Derive column by example" or "Split Column by Example" and then start typing in a new column. For example, if you want to separate a column which contains the date and the time, right click on that date column, select "Split Column by Example", and type the date in the new column provided; the application will immediately determine that you want two columns and create them. A date column and a time column will be created for you after typing in one date. After the sample columns have been created, you can approve the change or reject it if it does not work the way you want.

Like Power Query, each change made to the data is included in the window called Steps on the right side of the application window. When you are done modifying the data, right click on the Data Preparations source icon, which in my example is called UFO Clean, and the UI changes made to the data are used to create Python code which performs the changes. The generated Python code can be used to transform the source data programmatically.

The next step in the process is to write the Python code needed to evaluate the data and create a model, which in my case would determine where and when you are most likely to see the next UFO based on the dataset I have included in my project. Unlike its counterpart Azure Machine Learning, Azure Machine Learning Workbench requires you to know how to write the code needed to create a machine learning analysis in Python. One could write the Python code for a machine learning analysis in any Python editor. If you choose to use Azure Machine Learning Workbench, the Python library scikit-learn is installed as part of the application. Other libraries which you may want to use, such as the common library matplotlib, will need to be loaded within Azure Machine Learning Workbench.

Web Service: How Azure Machine Learning Workbench Solutions are Deployed

To deploy a package, you will need to export the completed model as a serialized Python object, using the Python module pickle. This will create a file with the suffix .pkl, which is the file that you will be deploying. Azure Machine Learning Workbench expects that you will be deploying via Docker containers or creating an Azure cluster. You will need to register the Docker container with Machine Learning Model Management for it to be deployed.
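As a hedged illustration of that serialization step, the sketch below pickles a generic scikit-learn model; it is not Workbench-specific code, just the pattern the deployment expects.

import pickle
from sklearn.linear_model import LinearRegression

# A trivial model stands in for the real experiment
model = LinearRegression().fit([[1], [2], [3]], [2, 4, 6])

# Serialize the trained model to a .pkl file, which is the artifact that gets deployed
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# The scoring web service later loads the same object back to make predictions
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored.predict([[4]]))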

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Incorporating Cognitive Services

There has been a lot of very advanced research on developing algorithms which can analyze facial expressions, authenticate voices and understand language. Microsoft has decided to make this research available by creating a series of products which allow people to incorporate it into their applications. The cognitive service that I investigated first was the Language Understanding Intelligent Service [LUIS].

Teaching the Computer to Understand Text with Cognitive Services

There is a very good example of how to make LUIS understand text here. In the sample, you can click on a button containing text or enter text free form. What LUIS does with the text is shown in the grey box on the right, where the JSON returned displays the score LUIS gave to the intent "TurnOn". LUIS does not turn on lights for you, but there is a really good example of some code where people are using LUIS to control their home automation.

Before you can implement a solution with LUIS you need to define the intents which are listed in the JSON script.  An intent is an action you have defined. Some example intents might be to Find a Hotel in Seattle or Tell me Amazon’s Stock Price or a lot of the other things people have Alexa do for them. The scope of what you would have LUIS do for you is a lot more focused, as the number of Intents allowed is limited, and you will have to write the code to perform the Intent.

Steps to Understand LUIS Text

As LUIS is currently in preview mode, and therefore free, this is a great time to start learning the new technology. To get started, you will need to create an account at www.luis.ai, and once that is complete, create a New App. When creating an app, one of the number of different supported languages must be selected. No key is required, as a free key will be generated later. An app requires Intents; LUIS evaluates the text to see how likely it is that an Intent applies. The text that is evaluated is compared to an Utterance, which you also need to create. For example, if you have an Intent for "SearchHotels", an Utterance would be Find me a Hotel. While this is a perfectly good Utterance, there is no reference to a location, which is something pretty important when looking for hotels. Entities are the descriptive parts of the Utterance. If I said show me hotels in [$geography] and replaced what was in the brackets with a city, then I would have a better idea of what hotels to return. If I add the pre-built entity geography, then LUIS will be able to describe a location, which of course can be added to my utterances if I put square brackets [] around the entity name and a dollar sign $ in front of the name. I can add words people may use to describe a location with Features. If I add the word "near", I can add the synonym catty-corner so that LUIS will understand that that word means "near". Once I have a complete list of Intents, Utterances, and Features, I can train the application so it can be tested and used in a component.
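Once the app is trained and published, it can be called over HTTP from any language. The sketch below is a rough Python illustration; the endpoint region, app ID, key and query are placeholders, and the exact URL format depends on the LUIS version you publish with.

import requests

# Placeholder values -- copy the real endpoint, app ID and key from luis.ai after publishing
endpoint = "https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/<app-id>"
params = {"subscription-key": "<your-key>", "q": "Find me a hotel near the airport", "verbose": "true"}

response = requests.get(endpoint, params=params).json()

# The JSON contains the scored intents and any recognized entities
for intent in response.get("intents", []):
    print(intent["intent"], intent["score"])
for entity in response.get("entities", []):
    print(entity["entity"], entity["type"])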

Applying Cognitive Services in Real World Applications

Once I have a customized app created for LUIS to understand text, I used it to create a bot to explore how I could use the rules I implemented on the website. I used the Microsoft Bot Framework to create an application which calls the LUIS component I created. To reference the code created in LUIS, the application contains a reference key which provides the ability to call LUIS from within my application. As I don't write much about C# code here, I didn't include the code, but I would be happy to share it if you would like. Just drop me a line and I will post it.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Questions and Answers on Machine Learning with R

Recently, I did a webinar on Machine Learning and R. I received a number of questions during the presentation. Due to time constraints, I was unable to answer all of them, so I have provided the Questions and Answers here.

Question: Can I Use R in SQL Server to plot non-linear regression curves? We use IC50 and others in Michaelis-Menten kinetics for bio-chemical work.

R running on SQL Server provides the functionality of standard CRAN R packages with the additional capability to run the ScaleR functions provided by SQL Server's implementation of R. Any other functionality performed in R can therefore also be performed on SQL Server. Like all R code, you will need to install the required R libraries in the appropriate subdirectory. The directory is something like C:\Program Files\Microsoft SQL Server\MSSQL13.<instancename>\R_SERVICES\library

Question: Can you execute/invoke an external R script from T-SQL?

Yes. With SQL Server 2016 it is possible to run an R script from within T-SQL by using the sp_execute_external_script command. This link should help you with more detailed information on how to execute R scripts within T-SQL

Question: Is it possible to run R processes on different boxes other than SQL Server itself for scalability reasons?

You have the option of installing the R Server on another server. Just keep in mind that you do have to account for the additional overhead of moving all the data over the network, which needs to weigh in on your decision to move processing to a different server.

Question: Can we join data generated from an R script to SQL data directly or does it have to be inserted into a table first?

Data generated from an R script interacts with the data the same way other data in a stored procedure does. It is not necessary to write the data to a table for the data to be processed.

Question: When would you use R Server versus R on SQL Server?

R on SQL Server is an implementation of R Server. When running R on SQL Server, R Server is running.

Question: What is the maximum file size I can load on R server?

R Server uses ScaleR, which provides the ability to use not only memory but also disk storage, providing nearly limitless capability for file size processing.

Question: Is knitr integrated or can it be integrated into R for SQL Server?

While you can install the knitr library on SQL Server to implement the functionality provided, there is no direct integration path for incorporating the wide functionality knitr provides within SQL Server. For more information about knitr, please see this link.

Question: Does MS R Open have all the functionalities of CRAN R? Does it support all packages?

MS R Open is fully CRAN R compliant. All code which runs in CRAN R can run in MS R Open and all packages are supported. Since MS R Open was rewritten to use the Intel Math Kernel Libraries, it provides technical advantages over CRAN R. MS R Open is partially multi-threaded, instead of single-threaded like CRAN R, and is up to 38% faster.

Question: Do you have a link for the Microsoft Machine learning cheat sheet?

Sure. The link can be found here.

Question: Why Use Microsoft R Open over other Analytics Tools such as Open R, Python, Matlab?

Microsoft R Open is fully CRAN R compliant, and also multi-threaded and faster, providing a clear benefit over CRAN R. Python is a great tool for data analytics, but unlike R, it is not designed solely for statistical analysis and has a wider functionality scope. R is focused solely on providing statistical data analysis. Matlab is a great tool, but given its complicated user interface and high licensing cost, many users may gravitate towards R, as it is free and there is a lot of good support for learning R available online as well.

Question: Can R be used with older versions of SQL Server or only 2016?

While it is possible to use R with other versions of SQL Server using tools such as RODBC, the R Server and full integration of R in SQL Server is only available in SQL Server 2016.

Question: Do I need all R server and R client and Microsoft R Open to be installed to run R from SQL Server?

To run R on SQL Server, R Server needs to be installed. For more information on how to install R on SQL Server please see this link. To connect to the server and use the ScaleR functionality, the client machine will need to have R Client installed.

Question: What are the benefits of Microsoft Standalone R Server?

R Server standalone is available to connect to data stored in HDFS and Teradata in addition to SQL Server. To incorporate data from other data sources and provide the ability to process more data than will fit in memory, it might make a lot of sense to deploy a standalone R Server.

Question: Are there any memory limitations in R when handling big data?

R is designed to run in memory, meaning that if you have more data than available memory, you may run out of memory. When running using the ScaleR functions, memory and disk are used to provide nearly limitless ability to process data.

Question: Can one call R from Python?

While it may be possible to create Python code which calls R, I am unaware of the syntax. This functionality would not be supported in the context of SQL Server.

Question: Can you use R charts in SSRS?

There are two different ways to incorporate R charts in SQL Server. Since a stored procedure can provide an R visualization, it is possible to use SSRS to call a stored procedure which creates a graphic to be rendered in SSRS. Also, since Microsoft announced the incorporation of Power BI with SSRS, and Power BI provides the ability to create R graphics, incorporating Power BI which renders R would be another way to use R charts in SSRS.

Question: Any classes like “R for the nervous DBA” that you can recommend? I have a statistician who is running R on my SQL server.

I do not know of any R classes specifically designed for SQL Server DBAs, but I am working on a post for additional configuration information which you might find helpful. Please check back in a few weeks for updates.

Question: What is a Pirate’s favorite language?

RRRRRR Mad Props to Phillip for sending me a pirate joke. I love pirate jokes. I feel compelled to tell a few when I am presenting R and I have been told all of mine are really bad.

If you have any additional questions about SQL Server and R or any pirate jokes, please comment on this post or send me a message via twitter @desertislesql.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur