Introduction to Cortana Analytics

Microsoft previewed Cortana Analytics in July 13, 2015, and since then, they have published a lot of information on their site about it. Based on what I’ve seen on the internet, there appears to be a lot of confusion as to what Cortana Analytics is. This is completely understandable when you consider the number of different products the name Cortana has represented for Microsoft. My favorite is the image with the picture of a blue girl, which is from the Xbox game Halo 3. A video game was character was the first place Microsoft used the name Cortana in 2007. At the Microsoft BUILD Developer Conference in April 2014, the name Cortana was used for the Microsoft version of the Apple’s Siri phone application. If you are interested in hearing about it, I’ve included a link to the Channel 9 video here where they talk about Cortana. Finally, a year later Microsoft comes out with a product called Cortana Analytics. No wonder people are confused.

Cortana Analytics is not a Product

Cortana Analytics: the bow tying different apps together

Cortana Analytics: the bow tying different applications together

To help bring clarity to what Cortana Analytics is and is not, I wanted to start out with what I think is the most confusing point. Cortana Analytics is not a product, but a name given to a bunch of other applications which are designed to work together. In essence, Microsoft tied a bow around a bunch of applications and called the bow, Cortana Analytics. Here’s an example scenario. Start by sending water meter data from the physical meters to the cloud, where you aggregate, analyze, store and end up with a Power BI application on your phone showing you a visualization of some aspect of the data. To make this happen from a technical perspective using Microsoft’s tools, one would need to probably create an Event Hub, run a Streaming Analytics Process, use Data Factory to call a Machine Learning experiment, migrate the data to an Azure Storage account of some kind and then create a Power BI report to be sent to your phone. All of that, is Cortana Analytics. It is not one product, but a big bow tying all of the applications they have designed to work together under one name. Power BI is part of it. On that note, I recently saw Microsoft do a demo with Power BI where they integrated the Cortana-phone like functionality of talking to Power BI and it displayed the information it was asked. I have no idea when this will be released, but it sure was a neat demonstration. In this demo, they mentioned they were adding Cortana funcationality to Power BI, which really didn’t help the confusion level with the name.

Cortana Analytics Web Presentation

I recently recorded a video presentation of Cortana Analytics where I described in greater detail the components which make up Cortana Analytics and how they work together. That video is available here. As I am working more with the components which make up Cortana Analtics, such as Machine Learning and Power BI, I will definitely be devoting more blog post to the topic, so please subscribe to my blog if you are interested in learning more about it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

What is the difference between Machine Learning and Data Mining?

An Example of Machine Learning: Google's Self-Driving Car

An Example of Machine Learning: Google’s Self-Driving Car

Often times when I give a talk about machine learning, I get a question about what is data mining and what is machine learning, which got me to thinking about the differences. Data mining has been implemented as a tool in databases for a while. SSIS even has a data mining task to run prediction queries on an SSAS data source. Machine Learning is commonly represented by Google’s self-driving car. After reading the article I linked about Google’s car or study the two disciplines, one can come to the understanding that they are not all that different. Both require the analysis of massive amounts of data to come to a conclusion. Google uses that information in the car to tell it to stop or go. In data mining, the software is used to identify patterns in data, which are used to classify the data into groups.

Data Mining is a subset of Machine Learning

There are four general categorizations of Machine Learning: Anomaly Detection, Clustering, Classification, and Regression. To determine the results, algorithms are run against data to find the patterns that the data contains. For data mining the algorithms tend to be more limited than machine learning. In essence all data mining is machine learning, but all machine learning is not data mining.

Goals of Machine Learning

There are some people who will argue that there is no difference between the two disciplines as the algorithms, such as Naïve Bayes or Decision trees are common to both as is the process to finding the answers. While I understand the argument, I tend to disagree. Machine learning is designed to give computers the ability to learn without specifically being programmed to do so, by extrapolating the large amounts of data which have been fed to it to come up with results which fit that pattern. The goal of machine learning is what differentiates it from data mining as it is designed to find meaning from the data based upon patterns identified in the process.

Deriving Meaning from the Data

As more and more data is gathered, the goal of turning data into information is being widely pursued. The tools to do this have greatly improved as well. Like Lotus 123, the tools that were initially used to create machine learning experiments bear little resemblance to the tools available today. As the science behind the study of data continues to improve, more and more people are taking advantage of the ability of new tools such as Azure Machine Learning to us data to answer all sorts of questions, from which customer is likely to leave aka Customer Churn or is it time to shut down a machine for maintenance. Whatever you chose to call it, it’s a fascinating topic, and one I plan on spending more time pursuing.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

DIY Machine Learning – Supervised Learning

When I first heard about supervised learning I had a picture in my head of a kindergarten class with a teacher trying to get the small humans to read. And perhaps that isn’t a bad analogy when talking about Machine Learning in general as it is based on the same principles as school, repetition and trial. After that the analogy falls apart though when you get to the specific criteria needed for Supervised Learning. There are two broad categories for types of machine learning which have the binary descriptions of supervised learning, which fall into the binary categories of Supervised and Unsupervised. This means you only have to know the one set of criteria for supervised learning, to determine which type you need.

Training Data

A problem solved with supervised learning will have a well-defined set of variables for its sample data and a known outcome choice. Unsupervised learning has an undefined set of variables as the task is to find the structure from data where it is not apparent nor is the type of outcome known. An example of Supervised learning would be determining if email was spam or not. You have a set of emails, which you can evaluate by examining a set of training data and you can determine using the elements of the email such as recipient, sender, IP, topic, number of recipient, field masking and other criteria to determine whether or not the email should be placed in the spam folder. Supervised learning is very dependent upon the training data to determine a result, as it uses training data to determine the results. Too much training and your experiment starts to memorize the answers, rather than developing a technique to derive solutions from them.

When Supervised Learning Should be employed in a Machine Learning Experiment

As the field of data science continues to proliferate, more people start are becoming interested in Machine Learning. Having the ability to learn with a free tool like Azure Machine Learning helps too. Like many tools while there are many things you can do, so knowing when you should do something is a big step in the right direction. While unsupervised learning provides a wide canvas for making a decision, creating a successful experiment can take more time as there are so many concepts to explore. If you have a good set of test data and a limited amount of time to come up with an answer, the better solution is to create a supervised learning experiment. The next step in the plan is to figure out what category the problem uses, a topic I plan to explore in depth in a later post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Upcoming and Recent Events

24HOPPassSpeakingThe PASS organization is a professional organization which sponsors a number of different technical events in the technical community. Recently, I have been honored to be selected to speak at not one but two events hosted by PASS, a professional organization which provides a lot of great resources to improve knowledge of all things SQL Server and related technologies to the world. The PASS Business Intelligence Chapter provides training on all things related to Business Intelligence via the web. I was selected to talk at the last meeting in May. Thank you to all of the people who were able to attend my talk on Top 10 SSIS Tuning Tricks live. If you had to work, no problem all of the talks hosted by the PASS Business Intelligence Virtual Chapter Recordings are available on www.Youtube.com. The recording of my Top 10 SSIS Tuning Tricks session is available here.

24 Hours of PASS

Periodically PASS provides a 24 Hour Training session on SQL Related topics to provide training live to every time zone in the world. As this event is watched by people around the world, it is a real honor to be selected for this event. This time the speakers were selected from people who had not yet spoken at the PASS Summit Convention, as the theme was Growing Our Community. The theme is just another way the PASS organization is working to improve people’s skills. Not only do they provide the opportunity to learn all things data, but also provide professional development through growing the speaking skills by providing many avenues to practice these skills.

Data Analytics with Azure Machine Learning

My abstract on Improving Data Analytics with Azure Machine Learning was selected by the 24 Hours of PASS. As readers of my blog are aware, I have been working on Azure Machine Learning [ML] this year and look forward to discussing how to integrate Azure ML into current environments. Data analytics with ML are yet another way to derive meaning from data being collected and stored. I find the application of data analytic fascinating, and hope to show you why if you are able to attend. There are a number of wonderful talks scheduled at this event, so I encourage you to check out the schedule at attend as many as you can. To be sure I’ll be signing up for a number of sessions as well.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

What is a Modern Data Warehouse?

As I was honored enough to be selected to give a PreCon on the Internals of the Modern Data Warehouse, I thought that I would take the time to explain why I felt drawn to the topic. There are a lot of places that haven’t given much thought to the changes in technology which have happened over the last few years. The major feature upgrades to SQL Server in 2012 and 2014 have meant that they can use column store indexes which makes things faster and maybe better High Availability. While those things are certainly valuable improvements there is a lot more that you can do to derive value from your data and companies want more than just a well-organized, running data warehouse.

Data is a Valuable Asset

In 2010, Borders Group Inc. was allowed by the Federal Trade Commission to sell their customer information to Barnes and Noble as part of their bankruptcy sale of their assets. In 2015, RadioShack is doing the same thing. Businesses understand that data is valuable and they are interested in using it to drive decision making. Amazon, Netflix and Target are well known for their use of customer information to drive sales, but they are far from the only ones doing this. This is one of the bigger trends identified recently in the business press. The heads of companies are now looking for their data teams to do more with their data so that they too can have the dream information systems they are reading about.

Total Destruction of the Existing DW is Not Required

Excavator working with earth and sand in sandpitWhile a lot of the time, it might be nice to level everything and start over, that is not always an option. The major reason for this is that the data warehouse environment already in place has a lot of value. You want to add to the value already there, not destroy what you have. Also it would take a long time to recreate the environment and no one is patient enough to wait for that. Alternatively you could expand into areas of new technology as your data grows. Perhaps this mean you archive some of your data from your database to a Hadoop cluster instead of backing up the data in some far off location. This would allow you to use Sqoop to bring the data back when you need it, providing ready access to the data. Perhaps you want to provide the users more self-service BI capabilities, moving the data analysis into the hands of the people who are more familiar with the data? You could add the capabilities of Power View in Excel, Power Designer or Tableau to your environment.

Incorporating Social Media Information

The business world operates not only on a batch cycle. More and more companies want to know what is being said about them so they can respond appropriately. With tools like Azure Event Hubs, Data Factory, Streaming Analytics, and Machine Learning this isn’t as hard to do as it might sound. We’ll review these products so that attendees will understand how these tools can provide greater insight not only into their own data, but the data building about them outside of the company firewall.

For More Information

I really hope you can join me in Huntington Beach on April 10 for a full day of exploring these concepts. I always look forward to events like the precon and of course SQL Saturday #389 – Huntington Beach which is the following day.

 

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Complex Data Analysis and Azure Machine Learning Presentation Wrap Up

Thank you for all of the people who signed up for my webinar on Data Analysis with Azure Machine Learning [ML]. I hope after watching it that you find reasons to agree that the most important thing you need to know to get started in Machine Learning is not Math, but having good knowledge of the data you want to analyze. There’s no reason not to investigate as Azure Machine Learning is free.  In order to take more time with the questions after the presentation than the webinar format allowed,  I am posting my answers here, where I am able to answer them in greater detail.

How would one choose a subset of data to “train” the model? For example, would I choose a random 1000 rows from my data set?

It is important to select a subset of data which is representative of the data which wish to evaluate. Sometime a random 1000 rows will do that, and other times you will need to use other criteria, like transactions throughout a given date range to be a better representative sample. It all comes down to knowing your data well enough to know that the data used for testing is similar to what you will be ultimately using for analysis.

Do you have to rerun or does it save results?

The process of creating an experiment requires that for each run you need to re-run the data as it does not save results.

Does Azure ML use the same logic as data mining?

In a word, no. If you look at the algorithms used for data mining you will see they overlap with some of the models available in Azure ML. Azure ML provides a richer set of models, plus a greater ability to either call models created by others or write custom models.

How much does Azure ML cost?

There is no cost for Azure ML. You can sign up and use it for free.  Click here for more information on Azure ML.

If I am using Data Factory, can I use Azure ML ?

Data Factory added the ability to call Azure ML in December, providing another place to incorporate Azure ML analytics. When an Azure experiment is complete, it is published as a web service so that the experiment can be called by any program which chooses to call it. Using the Azure ML experiments from directly within Data Factory decreases the need to write custom code, while allowing the logic to be incorporated into routine data collection processes.

http://azure.microsoft.com/blog/2014/12/16/azure-data-factory-updates-integration-with-azure-machine-learning-2/

If you have more questions about Azure ML or would like to see me present on the topic live and live in Southern California, I hope you can attend SQL Saturday #389 – Huntington Beach where I will be presenting on Azure ML and Top ten SSIS tips. I hope to see you there.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Math and Machine Learning

MLModelsI had an interesting conversation with someone at SQL Saturday Phoenix, an event that I am happy I was able to attend, regarding knowing math and getting started in Machine Learning. As someone who had majored in Math in college, he was sure that you had to know a lot of math to do Machine Learning. While I know that having really good math skills can always be helpful when creating statistical models based on probability, a big part of Machine Learning, I do not believe that you need to know a lot of math to do Azure Machine Learning [ML].

Azure Machine Learning and Throwing Spaghetti Against the Wall

For those of you who cook, you may have heard of an old school way of testing to see if the spaghetti is done. You throw the spaghetti against the wall and if it sticks, the pasta is done. If it falls right off, keep the spaghetti in the pot for a while longer. Testing machine learning models is similar, but instead of throwing the computer against the wall, you keep on testing using the large number of models available in Azure ML. Once you have determined the classification of your data, there are a number of different models for the classification which you can try without knowing all of the statistical formulas behind each model. I have listed all of the models from Azure ML here so that you can take a look at the large number of models available. By taking a representative sample of your data, and testing all of the related models, determining which one will provide a result is not terribly difficult. The reason it is not very hard is you do not have to understand the underlying math needed to run the model. Instead you need to learn how to read a ROC curve, which I included in my last blog post. While you can pick the appropriate model by having a deep understanding of the formula behind each model, you can achieve similar results by running all of the models and selecting the model based on the data.

Advanced Statistical Analysis and Azure ML

While Azure ML contains a lot of good tools to get started if you do not have a data scientist background, which recruiters lament not enough people do, why would you use Azure ML if you have coded a bunch of R Modules already to analyze your data? Because you can use Azure ML to call those modules as well and provides a framework to raise visibility and share those modules with people within your organization or the world, if you prefer.

How to Pick the Right Model

I am going to demonstrate how to pick the right model in an upcoming webinar, which is probably easier to explain in that fashion rather than in a blog post. If you want to see how to determine which model to use and not know a lot of Math, I hope you take the time to attend. Azure ML offers the ability to integrate analysis into your data environment without having to be a data scientist, while providing advanced features to accommodate those really good at math, which I will be talking about in an upcoming Preconvention event for SQL Saturday in Huntington Beach. If you happen to be in Southern California on April 10th I hope you will be able to attend that event.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Getting Started with Machine Learning – Result Analysis

Recently I’ve started working with Azure Machine Learning and looking at what I consider the most challenging part, picking the right analysis. For those people who haven’t ventured into Azure Machine Learning, it looks a lot like a data flow in SSIS. After that you need to train or more to the point evaluate which model works best. The answer to that question takes a while. What kind of data do you have? Are you looking to find errors? Determine whether data classified in a certain way can predict a result? Perform a regression analysis of data over time? Group data together to identify trends?

Is your Model better than a Monkey throwing Darts?

While you can analyze your variables and rank them to determine the chance that the variables indicate a result, there is another method that is also used to determine an outcome, the coin toss. This lowly method of analysis is right half the time. If you have more than two outcomes, or to speak the language of Machine Learning, the outcome is not binary, there is another method used to determine the accuracy of predictions, monkeys. I have read about the various skills of monkeys in both literature and financial analysis. Think about it for a minute and you may remember reading or hearing about monkeys typing on a keyboard who have been able to write Shakespeare, or a blog post.  This is known as the Infinite monkey theorem. Another thing monkeys have been known to do is throw darts. Various financial publications have been measuring the success of mutual funds to monkeys throwing darts at stocks since the last century. The goal is of course to create a model that has the better success as a monkey or a quarter. The question is how?

Probability of Picking the Right Model

ROC [Receiver Operating Characteristic] Curves are used to ensure the machine learning model generated is better than a monkey throwing darts. Your goal is a perfect game of golf. Chances are your ROC curve will be somewhere between the two. In looking at the ROC curve generated here, you can see 3 lines, a light grey, a red and a blue one.

RocCurve

The diagonal line represents a coin toss. If you were able to get one of your scored datasets to be a 1, meaning that you got a true positive rate every time, you would have played a perfect game of golf. Chances are you will have two lines like I do here and one, in this example the blue line, has a higher number of true positive rates than does the red line, so the results generated by that model are more accurate.

More ML More of the Time

I find myself spending more time with Azure ML, meaning that I will be devoting a lot more future blog posts on this topic. I am also speaking on Azure ML both as part of a Pre-Convention Event on the Modern Data Warehouse and at SQL Saturday in Phoenix. If you happen to be in Phoenix, I would love to meet you. SQL Saturdays are great learning events and I am happy that I was selected to participate in this one.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur