Preparing for Exam 70-774 – Perform Cloud Data Science with Azure Machine Learning

There are a number of reasons why you might want to take a Microsoft cert exam. Maybe you want to focus your studies on a tangible thing, or you think it will help further your career, or you work for a Microsoft Partner and they required a certain number of people to pass the exam to maintain their current partner status.  I am not going to get into the long argument regarding whether or not a cert will help you in your career, or not, I can tell you why you might want to take the 70-774 exam. Machine Learning, or Data Science if you prefer, is an important analytic skill to have to analyze data.  I believe that it will only become more useful overtime. Azure Machine Learning is a good tool for learning the analysis process.  Once you have the concepts down, then should you need to use other tools to perform analysis it is just a matter of learning a new tool.  I talk to a number of people who are trying to learn new things, and the study them in their spare time.  It’s very easy to spend time vaguely studying something, but you may find that having a target set of items to study will focus your time, and as a bonus you get a neat badge and some measure of proof that you were spending time on the computer learning new things and not just watching cat videos.

Exam 70-774 Preparation Tips


While you could always buy the book for the exam (shameless plug as I was one of the authors), the book will not be enough and you will still need to write some code, and do some additional studying. This exam one of two needed for the MCSA in Data Science and you an take the exams in any order. The best place to start is by first looking at the 70-774 exam reference page from Microsoft.  There are four different sections in the exam, and I have created some links for each section which will help you prepare for the exam. In studying for exams in the past, the best way I have found to prepare is to look at everything on the outline and make sure that I know it.

Prepare Data for Analysis in Azure Machine Learning and Export from Azure Machine Learning

Normalizing Data
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data

TanH
https://reference.wolfram.com/language/ref/Tanh.html

ZScore
http://stattrek.com/statistics/dictionary.aspx?definition=z-score
http://howto.commetrics.com/methodology/statistics/normalization/

Min Max
https://www.quora.com/What-is-the-meaning-of-min-max-normalization

PCA
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/principal-component-analysis
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/principal-component-analysis
https://stackoverflow.com/questions/9590114/importance-of-pca-or-svd-in-machine-learning

SVD
http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/

Canonical-correlation analysis (CCA)
https://en.wikipedia.org/wiki/Canonical_correlation

Singular Value Decomposition (SVD)
http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/

Develop Machine Learning Models

Team Data Science
https://docs.microsoft.com/fi-fi/azure/machine-learning/team-data-science-process/python-data-access

K-Means
https://www.datascience.com/blog/k-means-clustering

Confusion Matrix
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
https://en.wikipedia.org/wiki/Confusion_matrix
https://en.wikipedia.org/wiki/F1_score

Ordinal Regression
https://en.wikipedia.org/wiki/Ordinal_regression

Poisson regression
https://en.wikipedia.org/wiki/Poisson_regression

Mean Absolute Error and Root Mean Squared Error
http://www.eumetrain.org/data/4/451/english/msg/ver_cont_var/uos3/uos3_ko1.htm

Cross Validation
https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f

Operationalize and Manage Azure Machine Learning Services

Connect to a published Machine Learning web service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/publish-a-machine-learning-web-service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/consume-web-service-with-web-app-template
https://docs.microsoft.com/en-us/azure/machine-learning/studio/manage-new-webservice

Use Other Services for Machine Learning

Microsoft Cognitive Toolkit
https://www.microsoft.com/en-us/cognitive-toolkit/

BrainScript
https://docs.microsoft.com/en-us/cognitive-toolkit/brainscript-basic-concepts

Streamline development by using existing resources
https://docs.microsoft.com/en-us/azure/machine-learning/studio/gallery-how-to-use-contribute-publish
Perform database analytics by using SQL Server R Services on Azure
https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/provision-vm
https://docs.microsoft.com/en-us/machine-learning-server/install/r-server-vm-data-science
https://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf
http://blog.revolutionanalytics.com/2017/07/xgboost-support-added-to-rattle.html
https://github.com/JohnLangford/vowpal_wabbit/wiki

I hope you have found this test preparation material helpful.  If you passed the exam, let me know by sending me a comment.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Data Factory – Executing an Azure Machine Learning Web Service

My last blog post showed how to copy data to a blob storage account, which I needed to do to be able to call an Azure Machine Learning [ML] Web Service. When calling a ML Web Service, the data must be in an Azure Blob Storage account. Once a ML model has been trained, and a web services has been created, it’s ready for production. Calling the experiment in Data Factory allows the ML to be run with tens of thousands of rows as part of a scheduled process. Prior to inserting the ML web service in Data Factory, make sure that you test it to ensure there are no errors with the web service, as Data Factory does not expose all of the ML errors which may be encountered by the web service.

Creating Azure Machine Learning Data Factory Pipelines

ML DF PipelineTwo new steps need to be added to the existing Data Factory Pipeline, one to call the ML Web Service and one for the output. The ML pipeline requires two pieces of JSON code, a linked service to make the connection to the web service and a pipeline to invoke the job and specify the inputs and the Outputs. For the Output, the first step requires no JSON as first a blob storage container in Azure needs to be created to store it. The next steps involve writing JSON to create a linked service to connect to it and lastly an Output dataset needs to be defined.

Calling Machine Learning Service

The Linked Service for ML is going to need some information from the Web Service, the URL and the API key. Chances are neither of these have been committed to memory, instead open up Azure ML, go to Web Service and copy them. For the URL, look under the API Help Page grid, there are two options, Request/Response and Batch Execution. Clicking on Batch Execution loads a new page Batch Execution API Document. The URL can be found under Request URI. When copying the URL, you do not need to include any text after the word “jobs”. The rest of the URL, “?api-version=2.0”. Copying the entire URL will cause an error. Going back to the web Services page, The API Key appears on the dashboard section of Azure ML and there is a convenient button for copying it. Using these two pieces of information, it is now possible to create the Data Factory Linked Service to make the connection to the web service, which here I called AzureMLLinkedService

{
"name": "AzureMLLinkedService",
"properties": {
"description": "Connecting ML Experiment”
"hubName": " GingerDataFactoryTest_hub",
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/fbe056b6d4c74d7f9d1954367dc3fa61/services/xxa56efd75b745e28cd0512822d17eae/jobs",
"apiKey": "**********"
}
}
}

We will need another linked service for the Output, which takes the data from the experiment and writes it to a blob. The field names in the experiment are listed.

{
"name": "OutputML",
"properties": {
"structure": [
{"name": "Age", "type": "Int32" }
,
{ "name": "workclass", "type": "string" }
,
{ "name": "education-num", "type": "Int32" }
,
{ "name": "marital-status", "type": "String" }
,
{ "name": "occupation", "type": "String" }
,
{ "name": "relationship", "type": "String" }
,
{ "name": "race", "type": "String" }
,
{ "name": "sex", "type": "String" }
,
{ "name": "hours-per-week", "type": "Int32" }
,
{ "name": "native-country", "type": "String" }
,
{"name": "Scored Labels","type": "Int32"}
,
{"name": "Scored Probabilities","type": "Decimal"}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "LinkedServiceOutput",
"typeProperties": {
"tableName": "ExperimentMLOutput"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

The API key will show the actual value until you save it, at which point it will change to the stars you see here. This Linked Service will be referenced in the next bit of JSON for the pipeline


"name": "PipelineML",
"properties": {
"description": "Use Azure ML Model",
"activities": [
{
"type": "AzureMLBatchExecution",
"typeProperties": {
"webServiceInput": "InputDataSetBlob",
"webServiceOutputs": {
"output1": "OutputDataSetBlob"
},
"globalParameters": {}
},
"inputs": [
{
"name": "InputDataSetBlob"
}
],
"outputs": [
{
"name": "OutputDataSetBlob"
}
],
"policy": {
"timeout": "02:00:00",
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "MLActivity",
"description": "Execute Experiment",
"linkedServiceName": "AzureMLLinkedService"
}
],
"start": "2016-08-19T10:30:00Z",
"end": "2016-08-20T23:30:00Z",
"isPaused": true,
"hubName": " GingerDataFactoryTest_hub ",
"pipelineMode": "Scheduled"
}
}

Lastly another Dataset needs to be created to process the output. The data will be written to a file called Output.csv, which is in a folder called mloutput01/ which is located in the Blob storage container, which is the same one I used previously for the input folder used earlier.   This file will be overwritten every single time this is run.

{
"name": "OutputDataSetBlob",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureBlobStorageLinkedService",
"typeProperties": {
"fileName": "output.csv",
"folderPath": "mloutput01/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

If you add this code onto the previous Data Factory code, you can take data from the database and use it to run a Azure ML experiment and run as much data as you want through the experiment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Making Linear Regression Understandable for Machine Learning

When starting with Azure Machine Learning, it is sort of hard to wrap one’s brain around what kind of insight that Machine Learning can provide. When doing data analysis, often times we are looking for patterns. Does the volume of data really go up at the end of the month or is just the additional processes that make it seem that way? Does anyone really know if sales really pick up in August or is that just legerdemain from the sales department? Linear Regression can help determine that.

Relationships between Different Items

There are two types of indicators for linear correlation, positive and negative as shown on the following charts. The Y axis represents Grades, and the x axis is changed to show positive and negative correlationlinear-Grades of the amount of X on grades. When X is the amount of studylinear-CatVideos hours, there is a positive correlation and the line goes up. When X is changed to watching cat videos, there is a negative correlation. If you can’t draw a line around the points there is no correlation. If I were to create a graph where X indicated the quantity of the bags of Cheese Doodles consumed on grades, it would not be possible to draw a straight linear-CheeseDoodlesline, where the data points cluster around it. Since this is Line-ar regression, if that line doesn’t exist there is no correlation. Knowing there is no correlation is also useful.

 

Calculating Linear Regression

While the variable relationship is really easy to see without Math, there is an underlying formula that describes Linear Regression, and lest all of the math majors get upset I thought I would include the formula

Yi = a0 + b0Xi + ei

Y –  is the value of the Y axis, which in our example is grades

a – Is the point where the line intersects Y, or more clearly stated, where the line is. Now ideally your data should intersect at those points but since the line is sort of a guide, this won’t exactly match.

b – Contains the slope of the line

X – Is the value of the X axis, which depending on the example you are looking at is

E – This contains the error

Machine Learning with Linear Regression

In the blog examples, there are only two values, grades and something else. Machine learning can take all of your input variables and determine which values, if any impact the result. Hopefully this information provides you with a good use case for machine learning. In case you were unaware, Azure ML is availablefor free. All you need to do is sign up for an account at https://studio.azureml.net . There are a few size limitations as far as how much data you can load, but you can load enough to determine if machine learning will work in your environment.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

2015: Year End Wrap up for Releases and More

As 2015 draws to a close, I started thinking back about everything that has happened this year. 2015 GraphicTechnically this has been a big year as a many new applications were released. Here are just some of them, with links included to provide more detail.

This short list could be a lot longer as it doesn’t count the number of updates released to Power BI, which occur several times a month, the CTP releases for SQL Server 2016, the new web version of BIML, or PowerShell. It’s really hard to keep up with everything that is changing. It’s a good thing that so many people are willing to help others learn how through speaking and blogs which make learning new things easier.

Community Involvement in 2015

Keeping up with all of these events is difficult, especially given the pace of releases.  I spend a lot of time reading various blogs, watching videos and going to hear people speak. I also have been able to talk about topics of particular interest, many Power BI and Machine Learning. This year I spoke a different times at a number of different events including: Speaker Idol, two different user groups, seven webinars, five SQL Saturdays and other Tech Events. I’ve got a number of engagements on the books for next year, including PASS BA Con and SQL Saturday #461 – Austin. 2016 is shaping up to be busy too and hopefully our paths will cross.  I list all of my speaking events on my Engagement Page and I hope that you might take a look at it from time to time if you are interested in catching up in person sometime. Next year I am hoping my list of speaking engagements changes somewhat as I plan on trying harder to get accepted to speak at events where I submitted and was turned down in 2015. On a more positive note, views of my blog are up 1000%, and the number of website subscribers has more than doubled. Thank you very much for continuing to read this site and I hope you find my thoughts helpful. I posted once a week this year, which I thought was pretty good until I talked to Ken Fischer b | t who blogs twice a week. I’ll have to try harder next year. If you think of a topic you think would make a good blog post, let me know as I am always interested in feedback.

Keeping Up the Pace in 2016

Next year there will be no slowdown in the things to learn as SQL Server 2016 is going to be released. Although the exact date has not been announced, my sources tell me to look for it around May-June. The next release of SQL Server is going to be huge as it will include new tools Microsoft added to integrate Big Data and open source platforms to SQL Server. PolyBase, JSON and R are all going to be part of with SQL Server. Personally, I find the R integration most Datazen and SSRS are going to be integrated in the next release too which should really increase the implementation of mobile reporting visualizations.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

I’m Not Good at Math

How many times have you heard someone say, “I’m not good at Math”? Often times this statement is used as a reason why something technical cannot possibly be pursued. It’s a self-inflicted limitation; a reason that entire areas of study cannot be pursued. If you have ever said this, stop it. Don’t repeat it even if you believe you are not good at math. Why? Because while you may not be good at math now, there is no reason why that should stop you from learning it.

Math, Music and Programming

Years ago, back in the days before PCs and more importantly computer science degrees offered by major universities, IBM was working on developing mainframe computers and needed people to help them develop them. Since there were no computer MathandMusicscience degrees being offered at that time, they hired people with degrees in Math and Music. Music? Why Music? Music uses the same part of the brain as math does. This is one of the reasons educators think that music should be taught to small children as it has been shown to improve math scores. Personally I have found it interesting to ask technical people if they play or have played an instrument. Ask around yourself and you may be surprised at the large number of people in technical fields who have or do play a musical instrument. Musicians have the brain training needed to be good technical people, regardless of their math skills.

Learning Limits

There are no limits to what you can learn, other than the limits you put on yourself. The brain is very complex and there are infinite ways to train it to do something. Generally speaking one is not good at math because they haven’t learned it.  Oddly enough, discouraging one’s ability to learn often starts in school. If this sounds familiar, remember life isn’t school.  Often times a school setting isn’t the best way to learn anything. Performance in class is not indicative of one’s ability to learn. It may have be the ability of the instructor to teach or willingness to focus at that time. I am willing to bet you don’t view the world the way you did when you were sixteen, so why would you judge your ability to learn with that same filter?

Machine Learning is a Skill Which Can Be Learned

I know a very smart developer who told me recently that he wasn’t good at math, so he couldn’t possibly do machine learning. Really. PowerShell, Networking, TSQL, C#, SSIS, MDX and DAX you could learn but you can’t teach yourself Machine Learning? I am not going to say it is easy, but I wouldn’t say that about any of the other development and IT tasks either. If you can learn one of those, you can learn Machine Learning too, despite your belief in your math skills. There is no reason why not. I think Yoda said it best “Do or do not. There is no Try”. There is nothing really stopping you.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Azure Stream Analytics Hopping – Part 3

When incorporating streaming data from Azure Streaming Analytics, it is important to select the data to accomplish the goals of the tasks at hand. The sample company Eosh is streaming a lot of sensor data and has a variety of questions which the data is will answer.  Eosh is a concrete delivery company which is streaming the data back from their vehicles to the central dispatch system using Microsoft’s Stream Analytics. There’s a much more detailed description of what Eosh is streaming and their data needs in the first post in this series. After reviewing when Tumbling Windows and Sliding Windows, are used, in this post we are going to discuss another option for streaming data, Hopping Windows.

When to Use Hopping Windows

Eosh wants to use Hopping Windows to determine the previous action when water is added to the concrete mix. There is a flow meter sensor in the water tank which detects when the driver flips the switch to add more water. There are a number of different reasons for adding water, one being that the pouring is complete and the driver is washing out the remaining concrete. Another reason could be that the driver is stuck in traffic and the water is added to keep the concrete from setting up within the mixer. Depending on the type of concrete in the mixer, if too much water is added, the concrete will no longer have the required strength and can’t be used to create a load bearing structure. It is very important that concrete used in structural concrete be created according to specification, as concrete mixed incorrectly will crumble over time, something commonly seen in Detroit.  If too much water is added the vehicle may be routed to a different location so the concrete can be used for a non-load bearing purpose, like creating sidewalks.

Overlapping Hops

HoppingSliceBy design, all hops contain an overlapping previous time slice. The picture provides a good visualization for how the data slices are created. Eohs wants to look at the events which happened 5 minutes prior so that the adding water event can be appropriately categorized. The following Streaming query can provide that data

 

SELECT System.TimeStamp AS OutTime, VehicleID, COUNT(*)
FROM Input TIMESTAMP BY WaterStartPour
GROUP BY VehicleID, HoppingWindow(minute,10 , 5)

This query will create 10 minute slices of time. Each slice will look at the last 5 minutes previous reported and 5 minutes past that. By slicing the data in this way, the context around adding water can be evaluated to determine what kind of water add event took place. Eosh can then use this data to determine if the concrete can be delivered to the original location or if it needs to be rerouted.  This later processing will be accomplished via machine learning, which I will talk about in a later post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Limiting the Number of Results with TABLESAMPLE

In the world of exploding piles of data, there are times you just need less not more. If you are working on a proof of concept, just want to test out some code, or want to pass a selection of data to a machine learning experiment, you don’t need several million records. Having that much data will just keep you from getting to the answer in a reasonable period of time. There are a number of standard ways people limit the data, the most common being some kind of a date filter. Using a date range though often times does not provide the variability needed. This is a particular problem with data used in a machine learning experiment it is designed to create an algorithm based on data pattern extrapolated over time. For example if you are doing any kind of regression analysis on a retail client and you either exclude or include the Christmas shopping season, the algorithm created will not be correct. The goal is to have less data for analysis pulled from the entire set of data. Fortunately SQL Server since 2005 has several methods for selecting random data

Using TABLESAMPLE

Until recently, I hadn’t used the Transact SQL TABLESAMPLE clause, but I ran into a situation where I needed to test some things and not wait all year for a result. TABLESAMPLE to the rescue. There are a couple of things where it won’t work. If you have derived tables, tables from a linked server (bad idea), or are writing a view and you want to return random data you can’t use TABLESAMPLE. If those conditions do not apply, you can use it.

The random sample that TABLESAMPLE provides is based on the number of data pages, not the number of records. If you want the number of rows to be specifically limited you will need to use Top(n) as well. I’ve written all my samples based upon AdventureWorksDW so you can run them for yourself later. I’ve listed the variety of ways to call TABLESAMPLE and shown the number of records returned.

SELECT *
FROM [dbo].[FactInternetSales] –60398 Rows returned prior to Table Sample
TABLESAMPLE(10 PERCENT)

(6073 row(s) affected)

Let’s say you want to return the same sample set multiple times. For that you will need some value. I picked 11, but you could pick any other you like.

SELECT *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]
TABLESAMPLE(10 PERCENT)
REPEATABLE (11)

 (6489 row(s) affected)

When looking at the number of records returned, the values are not 10 percent exactly or particularly consistent in the number of rows returned. If you only want 6039 records returned, you can try the following code, but it doesn’t really do what it says.

SELECT *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]  
TABLESAMPLE(6039 ROWS)

 (5640 row(s) affected)

This code will provide the exact number of records desired. Noticed I upped the number of rows returned in order to get 6039 rows. If the sample is 6039 you cannot guarantee that you have enough rows returned.

SELECT top 6039 *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales] --60398
TABLESAMPLE(6200 ROWS)

(6039 row(s) affected)

Really Random Sample

RandomSampleTABLESAMPLE has some really useful functionality, but for machine learning I need a truly random sample, which TABLESAMPLE does not provide. For that I need NEWID(). The following sample returns approximately 1% (.01) of the 60398 rows.

Select * from [AdventureWorksDW2014].[dbo].[FactInternetSales]
Where 0.01>= Cast(checksum(newid(), [ProductKey]) & 0x7fffffff as float) / Cast(0x7fffffff as int)

 

Just for fun I ran the same code 3 times and got a variety of rows returned.

(600 row(s) affected)
(607 row(s) affected)
(622 row(s) affected)

The ProductKey is added so that the NEWID() function will calculate a sample for each row. The WHERE statement calculates a random float between 0 and 1. This will truly give me the random sample I would need for a machine learning experiment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Using Tumbling Windows to Select Data from Azure Stream Analytics – Part 2

TumblingWindows
Tumbling Windows are another way of selecting data from an Azure Stream to drive an Azure ML Experiment.  Once again of my examples here are going to be based on the concrete company Eohs, which I referenced in a previous post when talking about Streaming Windows. Eohs is streaming data, via Azure Stream Analytics [ASA] and we need evaluate a portion of that data for an Azure Machine Learning experiment. The experiments don’t need all of the data; only a portion of that data is required. Some of the data will be reported on in real time, and other portions of data will be used for analysis at a longer window. The necessary data will be extracted via an Azure Stream Analytics Query using Windowing. In this post, we will be talking about Tumbling Windows.

Eohs: Streaming Sensor Data

Eohs has installed a tracking system which sends GPS positioning and sensor data which is sent back in near real time to the dispatching company. The dispatchers are able to monitor on their screens the location of the truck, speed, heading and some sensor information delivered every 20 seconds which allow them to know if the truck is loading concrete, pouring concrete, adding water, seatbelt information, and if the passenger door is opened. Eohs is interested in using the sensor data received to figure out if they will need to perform maintenance on their concrete mixing drums. The drums need to have maintenance performed on them based on the drum speed, concrete pouring sensor, and the amount of water added when in use.

Using Azure ML to Determine when to Perform Maintenance

TumblingWindowSince Eohs is streaming their data with ASA, we monitor the sensor information for the water and the drum speed over time to see if maintenance is required on the concrete drum. The Azure ML experiment will look at the combination of the water, drum speed and time of day to determine if maintenance is required. We will need to evaluate the sensors every 15 minutes.

Tumbling Windows in Azure Stream Analytics

We want to look at the performance of the sensors in 15 minute increments, so to do this we are going to use a tumbling window. Tumbling windows are designed to read data in fixed increments, so our query is going to read them every 30 minutes. Using the Stream Analytics Query Language, this query will provide the data.

SELECT VehicleID, Avg(DrumSpeedSensor), avg(PouringSensor), avg(WaterSensor), System.Timestamp as EvalTime 
FROM VehicleTrackingSystem TIMESTAMP BY EntryTime
Group by VehicleID, TumblingWindow(minute, 15)

This query will return the data every 15 mint. The EvalTime will be the single time value when the query was run. TIMESTAMP BY EntryTime will ensure that the data is evaluated based upon when the data was created instead of the time that the data reached the Azure server as sometimes data packets may be received out of order. Having our data split into multiple streams like this will allow for multiple experiments to be performed on our Azure Data Stream.

For Part 3 of this series we will talk about Hopping Windows and how and when to use that technique on our data. If you are interested in knowing when my next post will be available, please subscribe and you’ll receive an email when the next post is available.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Incorporating Azure Stream Analytics with Azure ML – Part 1

Using the Azure Stream Analytics Query Language to Drive an ML Experiment

In the past I have talked about some of the components of Azure Machine Learning, but I thought it might make more sense to talk about creating a solution, rather than the individual components.  As that will take a while, this post  begins a multi-part series to bring in some real world examples to make the concepts around streaming data and Azure Machine Learning [ML] less abstract by starting with the data, adding several ML experiments, then talking about ways to implement the solution. The blog series is focused on the streaming data from a sample company the concrete company Eohs.

Streaming Data in Azure

Eohs has installed a vehicle tracking system which sends GPS positioning and sensor data which is sent back in near real time to the dispatching company. The dispatchers are able to monitor on their screens the location of the truck, speed, heading and some sensor information delivered every 20 seconds which allow them to know if the truck is loading concrete, pouring concrete, adding water, seat belt information, and if the passenger door is opened. Eohs has some policies for their drivers which can involve termination if they are violated. Drivers are not permitted to stop the truck anywhere other than the assigned delivery location, which cuts down on fraud and helps reduce insurance costs. This data is streamed via Azure Stream Analytics [ASA].

Cortana Analytics Implementation of Azure ML

Since Eohs is streaming their data with ASA, we want to implement an Azure ML Experiment to notify dispatch in real time any violation of their policies. As I discussed in a previous blog, since Cortana Analytics includes Azure ML and Stream Analytics, this would using the components is considered a Cortana Analytics implementation. We have created a Machine Learning Experiment which will look at the GPS position of the delivery location, and determine if a driver is stopped for an extra-ordinary length of time at a delivery location, as well as stopped in a non-delivery location. The dispatchers are immediately notified of this, so they can call the driver to figure out what is happening to the truck. What kind of data is needed to be sent to the Azure ML experiment to analyze?

Sliding Windows in Azure Stream Analytics

SlidingWindowsThe Azure ML Experiment needs to evaluate all of the vehicle data which shows that the truck is stopped for a while, generally speaking greater than 90 seconds. After all some traffic lights take 90 seconds to get through, so eliminating the short stops would be helpful in decreasing the data needed to be evaluated. ASA uses a SQL-like query language which makes it easy to split the data so only the data that the experiment needs will be sent. We want to evaluate a window of time where data returned is only the data where the vehicle shows it is stopped for 91 seconds. Finding the 91 second stops is considered a sliding window. Here’s the code you would need to do this.

SELECT VehicleID, Avg(GPSLat), avg(GPSLong), min(Speed), max(PourSensor),Max(WaterSensor), dateadd(second, -91, System.Timestamp) as StartEvalTime
, System.Timestamp as EndEvalTime
FROM VehicleTrackingSystem TIMESTAMP by SensorTime
Group by VehicleID, SlidingWindow(second,91)
HAVING min(Speed) <1

 

EndEvalTime is the Time that this event was calculated by the system. Since I wanted both the start and end evaluation time, the time was calculated by using the DATEADD function. If one of the data elements arrived out of order, using the TIMESTAMP function will ensure that they events will be evaluated in the order they happened instead of the order when the data was received.

Other Windowing in Azure Stream Analytics

window_slideASA also supports two other windowing functions, Tumbling and Hopping. In my next post I will be discussing how and when to use a Tumbling Window. If you are interested in reading the posts as they occur, please subscribe to desertislesql.com to be notified when the next post is available.

 

 

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

What did I just Evaluate in Azure Machine Learning?

Recently I read a twitter message where the person said they had just run an Azure Machine Learning [ML] experiment and it worked successfully. He wondered what the results mean? I thought it might be helpful to explain how to interpret some of the results. There are a number of different types of algorithms you can use in Azure ML, each has a different way of evaluating the result of the experiment. Azure ML has three major algorithm categories: classification, clustering and regression. Since I believe classification and clustering are more often discussed when reading about ML, first I thought I’d talk about regression.

When should one use Regression Models?

RegressionModelEvaluateWhile it is possible to make a reasonable guess on the other algorithm categories mean, regression algorithms are not so intuitive. A regression algorithms are used to predict a specific value output number based upon a series of variables. In order to run this kind of experiment, you will need a key to uniquely identify the record to be able to use this model, and be looking to return a number for a results.

Reading the Results

If you visualize the output of a ML project, the visualization looks like the picture on the left. The number you want to pay attention to is the coefficient of determination. This value tells you how good your predictions are. The closer this number is to 1, the better your variables are at predicting the results. If you look at the values in my experiment, it will round up to the number 1, since the value is 0.642447. That means the variables I am evaluating can mostly be used to determine a value. If you read about when to use a regression models, the common example using them is to predict a home value, which if you have all the information people use for creating real estate comparisons, such as number of rooms, closeness to a busy street, lot size etc.

No Result is a Result

Sometime there is a value in finding nothing. Recently I did some ML analysis for a company, who provided a set of data for analysis to determine how it was influencing the outcome. After running a number of different algorithms, weights, and other changes in the experiment, I determined that their data didn’t impact the final outcome in any significant way. Prior to presenting the findings to them, I thought that they would not have a favorable opinion of the analysis I did, as they were looking for an answer I didn’t find. What did happen is they were surprised and happy that I was able to show them that their thoughts about the data were not true, so they could focus their efforts to different areas. Keep that in mind next time you find nothing, because no result is an answer too.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur