Data Factory – Troubleshooting Errors

IBM PunchCard computer reader

Punch card and punch card reading machine

If you have been reading my Blog Series on Data Factory, you will notice that I didn’t talk about what to do when you have errors, until now. Data Factory is different from most other programming languages which you may be familiar with, such as C#, Java, R, SSIS or VB. In those languages one can step through the code and look at the variables or values while the code is running. perform any of the standard set of troubleshooting techniques which are used in other languages to help resolve errors. Data Factory doesn’t provide a way to really determine what is going on internally when the code is running. Debugging Data Factory harkens back to day when people lined up punch cards and waited to get the output at the other side.

Data Factory Error Codes

Unfortunately, while developing Data Factory I became very familiar with errors. All of the errors show up at the end and provide very little insight as to what in the process failed. Here’s an example.

Database operation failed on server ‘Sink:DBName01.database.windows.net’ with SQL Error Number ‘40197’. Error message from database execution : The service has encountered an error processing your request. Please try again. Error code 4815. A severe error occurred on the current command. The results, if any, should be discarded.

I didn’t get any results to discard, so that helpful hint was not applicable, ever. I was able to resolve this error a few times by resolving different reasons why the error might have occurred by looking through the code and guessing at what line could possibly have caused the error. Generally speaking these errors were found after the first few things I tried did not work. Metaphorically, I had to wait for my punch cards to be read through the machine to see if I had correctly guessed what might be wrong. I have heard this process described as “black box”, but I think a more accurate description would be a punch card computer, as black box is too cool of a name for a process this heinous. In one instance, this error was received when the data in the field was longer than the field definition. That took a while to find since I had over 25 fields to review. Another time I got this same error when I had a typo in a field name, making the message appear completely arbitrary.

Data Factory Error 1000

Batch Execution failed. The response from the Machine Learning service at endpoint…(excluded specific job reference codes here)… {“Exception”:{“ErrorId”:”LibraryException”,”ErrorCode”:”1000″,”ExceptionType”:”ModuleException”,”Message”:”Error 1000: TLC library exception: Exception of type ‘Microsoft.Numerics.AFxLibraryException’ was thrown

If you search on the internets, you will notice that Error 1000 comes up a lot. The reason for that is it is a catch-all error number. There are 999 actual error messages coded and anything that isn’t one of those errors is error 1000. This problem could be anything. In my case it was because my web service wanted me to strongly type the data coming in, instead of having the data default to text, which worked outside of the web service. I did not resolve my problem after reading the error message and coming up with a logical guess based on my code. I had to create an Official support ticket to answer this question as the answer eluded me.

Data Factory Troubleshooting Error Strategies

Regrettably, there is no really good way of resolving errors. It’s not possible to look at anything in process and see how the data is being processed. Instead errors pop up when the execution fails. There are a few things you can do though. If you are calling an Azure Machine Learning Web service, I recommend modifying your input data so that if it looks like a number but it is read in as text, to use the meta data editor to type the data. Make sure you test the batch execution prior to loading it in Data Factory. For other pipelines, I employed a binary sort to determine what field might have the error. I commented out half the fields, and half the fields until I could determine what field had an error. Eventually I figured out the field which must be short. I also decreased the size of data using the code to 3 rows so I wouldn’t have to wait so long for it to fail.

Data Factory Series

This post is the end of my five part series on Azure Data Factory. I hope you have found it both interesting and useful when trying to learn Data Factory. If you have found this series interesting, please subscribe to my blog to be notified of the latest topics. Given that I plan on doing a lot of speaking in the very near future on topics such as R, SQL Server 2016 and Power BI, those topics are going to be on my blog in the very near future.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Factory – Migrating Data from Azure Blob Storage to SQL Server

There are instances where data resides in Azure Blob Storage and the data is needed in a SQL database. For example, if one ran a Machine Learning experiment in Data Factory, the results would be stored in Azure Blob storage, and for analysis purposes, it may make a lot more sense to move the data to SQL database. Moving data around in Data Factory, means writing JSON. In this example we will be using an Azure SQL DB, but it is not essential that the data be stored in Azure. An on-premises SQL Server could also be used, as long as a gateway was added for the connection, the other steps would be the same. There are five different Data Factory elements required to move data from an Azure blob to a database: a pipeline for the data, a data set containing the definition for the blob, a linked service for the blob, a data set containing a definition for the SQL Data, and a linked service to connect to the SQL database.

JSON Data Service

The data to be moved to SQL is stored in a blob storage container called OutputML, and both the linked service and that data set are included in a previous post on running an ML experiment.  In this linked service, the JSON creates the field definition to be written to a table in a SQL database called CensusMLOutput. There are fewer data types than there are in SQL, meaning the JSON here doesn’t exactly match the table definition, but the less granular datatypes are accepted by SQL.

"name": "OutputML",
"properties": {
"structure": [
{
"name": "Age",
"type": "Int32"
},
{
"name": "workclass",
"type": "string"
},
{
"name": "education-num",
"type": "Int32"
},
{
"name": "marital-status",
"type": "String"
},
{
"name": "occupation",
"type": "String"
},
{
"name": "relationship",
"type": "String"
},
{
"name": "race",
"type": "String"
},
{
"name": "sex",
"type": "String"
},
{
"name": "hours-per-week",
"type": "Int32"
},
{
"name": "native-country",
"type": "String"
},
{
"name": "Scored Labels",
"type": "Int32"
},
{
"name": "Scored Probabilities",
"type": "Decimal"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "LinkedServiceOutput",
"typeProperties": {
"tableName": "CensusMLOutput"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

JSON for Linked Service Output

The Data set defined references a Linked Service named LinkedServiceOutput. In this JSON the information needed to connect to the database where the table is for the code to write to it.

{
"name": "LinkedServiceOutput",
"properties": {
"description": "",
"hubName": "GingerDataFactoryTest_hub",
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=jytr4gph.database.windows.net;Initial Catalog=MLData;Integrated Security=False;User ID=gingerg;Password=**********;Connect Timeout=30;Encrypt=True"
}
}
}

The code includes my id and a password, which is encrypted when the linked service is saved. Now that we have the data components defined, all that is required is for a Azure Data Factory pipeline to move the data.

 

JSON Data Factory Pipeline to Move Data to SQL

The pipeline PipelineCopyMLOutput is pretty straightforward, as it defines the action which should take place, copy and implements it. One thing to note that unlike copying a csv file, the data in a table is appened, meaning every time that this pipeline runs, more data will be added to the table. This code does not contain anything to prevent data from being duplicated, which will happen if the input does not change.


{
"name": "PipelineCopyMLOutput",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"skipHeaderLineCount": 1
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "OutputDataSetBlob"
}
],
"outputs": [
{
"name": "OutputML"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "Copy Activity"
}
],
"start": "2016-08-24T16:44:00Z",
"end": "2016-08-25T19:00:00Z",
"isPaused": true,
"hubName": "GingerDataFactoryTest_hub",
"pipelineMode": "Scheduled"
}
}

To run all of this JSON, you can wait an for it to run on schedule or run it Ad-hoc, which I detail in this post .

Data Factory Workflow

Combing all of the Data Factory components which are included in this and in previous posts, the entire work flow diagram is shown below. In the first pipeline, data is copied from the database to blob storage. Next the blob storage data is used to run an Azure ML experiment which outputs data to blob storage. Lastly the results from the experiment are copied to a database.  Notice all of the lovely green checks which exist in the diagram.

DataFactory

This blog series on data factory has covered everything from creating Azure components needed to using Data Factory to run a ML Web service and sending the results to the data base.  In my next and last post for a while on Data Factory, I will be discussing troubleshooting, an essential process in getting all the code to work. To be notified when new posts appear, please subscribe to my blog to receive the latest. I hope that you have found this to be useful. If so, please leave me comments or message me on Twitter as I would love to hear what others are doing with Data Factory.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Data Factory – Executing an Azure Machine Learning Web Service

My last blog post showed how to copy data to a blob storage account, which I needed to do to be able to call an Azure Machine Learning [ML] Web Service. When calling a ML Web Service, the data must be in an Azure Blob Storage account. Once a ML model has been trained, and a web services has been created, it’s ready for production. Calling the experiment in Data Factory allows the ML to be run with tens of thousands of rows as part of a scheduled process. Prior to inserting the ML web service in Data Factory, make sure that you test it to ensure there are no errors with the web service, as Data Factory does not expose all of the ML errors which may be encountered by the web service.

Creating Azure Machine Learning Data Factory Pipelines

ML DF PipelineTwo new steps need to be added to the existing Data Factory Pipeline, one to call the ML Web Service and one for the output. The ML pipeline requires two pieces of JSON code, a linked service to make the connection to the web service and a pipeline to invoke the job and specify the inputs and the Outputs. For the Output, the first step requires no JSON as first a blob storage container in Azure needs to be created to store it. The next steps involve writing JSON to create a linked service to connect to it and lastly an Output dataset needs to be defined.

Calling Machine Learning Service

The Linked Service for ML is going to need some information from the Web Service, the URL and the API key. Chances are neither of these have been committed to memory, instead open up Azure ML, go to Web Service and copy them. For the URL, look under the API Help Page grid, there are two options, Request/Response and Batch Execution. Clicking on Batch Execution loads a new page Batch Execution API Document. The URL can be found under Request URI. When copying the URL, you do not need to include any text after the word “jobs”. The rest of the URL, “?api-version=2.0”. Copying the entire URL will cause an error. Going back to the web Services page, The API Key appears on the dashboard section of Azure ML and there is a convenient button for copying it. Using these two pieces of information, it is now possible to create the Data Factory Linked Service to make the connection to the web service, which here I called AzureMLLinkedService

{
"name": "AzureMLLinkedService",
"properties": {
"description": "Connecting ML Experiment”
"hubName": " GingerDataFactoryTest_hub",
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/fbe056b6d4c74d7f9d1954367dc3fa61/services/xxa56efd75b745e28cd0512822d17eae/jobs",
"apiKey": "**********"
}
}
}

We will need another linked service for the Output, which takes the data from the experiment and writes it to a blob. The field names in the experiment are listed.

{
"name": "OutputML",
"properties": {
"structure": [
{"name": "Age", "type": "Int32" }
,
{ "name": "workclass", "type": "string" }
,
{ "name": "education-num", "type": "Int32" }
,
{ "name": "marital-status", "type": "String" }
,
{ "name": "occupation", "type": "String" }
,
{ "name": "relationship", "type": "String" }
,
{ "name": "race", "type": "String" }
,
{ "name": "sex", "type": "String" }
,
{ "name": "hours-per-week", "type": "Int32" }
,
{ "name": "native-country", "type": "String" }
,
{"name": "Scored Labels","type": "Int32"}
,
{"name": "Scored Probabilities","type": "Decimal"}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "LinkedServiceOutput",
"typeProperties": {
"tableName": "ExperimentMLOutput"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

The API key will show the actual value until you save it, at which point it will change to the stars you see here. This Linked Service will be referenced in the next bit of JSON for the pipeline


"name": "PipelineML",
"properties": {
"description": "Use Azure ML Model",
"activities": [
{
"type": "AzureMLBatchExecution",
"typeProperties": {
"webServiceInput": "InputDataSetBlob",
"webServiceOutputs": {
"output1": "OutputDataSetBlob"
},
"globalParameters": {}
},
"inputs": [
{
"name": "InputDataSetBlob"
}
],
"outputs": [
{
"name": "OutputDataSetBlob"
}
],
"policy": {
"timeout": "02:00:00",
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "MLActivity",
"description": "Execute Experiment",
"linkedServiceName": "AzureMLLinkedService"
}
],
"start": "2016-08-19T10:30:00Z",
"end": "2016-08-20T23:30:00Z",
"isPaused": true,
"hubName": " GingerDataFactoryTest_hub ",
"pipelineMode": "Scheduled"
}
}

Lastly another Dataset needs to be created to process the output. The data will be written to a file called Output.csv, which is in a folder called mloutput01/ which is located in the Blob storage container, which is the same one I used previously for the input folder used earlier.   This file will be overwritten every single time this is run.

{
"name": "OutputDataSetBlob",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureBlobStorageLinkedService",
"typeProperties": {
"fileName": "output.csv",
"folderPath": "mloutput01/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

If you add this code onto the previous Data Factory code, you can take data from the database and use it to run a Azure ML experiment and run as much data as you want through the experiment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Factory – Copying Data from a Database to an Azure Blob Store

My previous post provides instructions on Getting Started with Data Factory. To borrow a line from Ron White, “I told you that story so I could tell you this one”. If you followed the steps in that post, you have a Data Factory resource. Time to use it. This post describes how to process some data by copying data currently stored in the database to a Blob Storage account. To complete that task, you will need to write some JSON. Please be careful when you do this as writing JSON may involve cursing JSON. Since Jason was in the list of top four male baby names in the US from 1973 – 1982, chances are you may work near someone named Jason when writing JSON, who may think the cursing is directed at him.

Author and Deploy Linked Services

Author and DeployTo be able to pass data from a database, a connection needs to be established to a database. That connection is made by creating a Linked Service. In Data Factory, if you create a New data store, a new Linked Service JSON template will be created. After Clicking on the New Data Store Icon, a list of databases will appear. In this example I am going to pick an Azure SQL Database. If I wanted to access a local SQL Server Database, a gateway to connect to it would also be required. The JSON Linked service template will appear under the menu Drafts and will be labeled Draft-1. A copy of the coded is included below.

{"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"description": "",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>; UserID=<username>@<servername>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30"
}
}
}

If nothing is changed and the Deploy icon at the top of the screen is clicked, a new item AzureSqlLinkedService will appear under Linked services. There will be a circle with an exclamation point next to it as the connection does not contain valid data, but you can deploy it. To get rid of the error, replace all of the information default provided after the colon with valid values, starting with the name, which has been defaulted to AzureSqlLinkedService.The name property dictates what the Linked service will be called and  cannot be change the name later. To change the name later, highlight the name of the service click on the Clone,  change the name, deploy, highlight the old name, click on the …More text at the top of the screen, select delete, then confirm the deletion the Linked service by clicking on the Yes Button. Conversely, just select a good name before clicking on the Deploy icon. Within the connection string, replace all of the items between < and > with valid values. If there are any errors after deploying, fix the data, click on Deploy and the circle with the exclamation point will go away. Invalid database names are accepted, as long as they are in the right format. If the name is not valid, an error will be received when data factory is run.

Because we need a connection to a database and a Azure Blob, two Linked Services are required, one for each different type. Prior to completing this step, create an Azure Blob storage account by clicking on Add on All Resources. Create the second Linked service, like the first. Click on New data store then select Azure Storage. Using the template for an Azure Blob Storage linked services, I have modified it below adding the “hubName” as it is required


{
"name": "GingerAzureBSLinkedService01",
"properties": {
"description": "Test Azure Blob Storage Account for DF",
"hubName": "GingerDataFactoryTest_hub",
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=gingerblobstorage01;AccountKey=**********"
}
}
}

The hub name is the name of your Data Factory resource you created earlier, and the name will be from you storage account. You will need to get the actual Account Key and replace it. Once the JSON is deployed, the value will appear as a line of asterisk again.

Create a Data Factory Dataset

Two datasets representing the data accessed in the two linked services must be written in JSON. Click on …More and select New dataset. Unlike Linked Services, you actually do have an option to create a New dataset as the name is consistent. Creating the JSON for the database is a bit of a pain as you have to define every single column accessed and as well as the datatype for each one. Of course both order and capitalization are important, as they must exactly match what is in the database, or you will get an error after the code is run. You won’t see any errors when you type it. Here is a sample

{
"name": "InputDataSet",
"properties": {
"structure": [
{
"name": "Age",
"type": "Int32"
},
{
"name": "workclass",
"type": "string"
},
{
"name": "education-num",
"type": "Int32"
},
{
"name": "marital-status",
"type": "String"
},
{
"name": "occupation",
"type": "String"
},
{
"name": "relationship",
"type": "String"
},
{
"name": "race",
"type": "String"
},
{
"name": "sex",
"type": "String"
},
{
"name": "hours-per-week",
"type": "Int32"
},
{
"name": "native-country",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "InputLinkedServiceAzureDB01",
"typeProperties": {
"tableName": "vCensusInfo"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}

In this sample, I am actually getting data from a view, not a table. Even though all the code references a table, using a view instead works fine. The linked service name does of course represent the Linked service created in the previous step.

Fortunately, it is not necessary to list every column when inserting values to a Blob so the JSON for that is much easier. While it is possible not to enter a file name, if you ever wish to use the data in the blob store later, you will need to create one. This JSON will write a file input.csv to a blob store to the path mlinput01.  If this is run twice, the file will be overwritten the second time.

{
"name": "InputDataSetBlob",
"properties": {
"published": false,
"description": "Input Blob Dataset to feed Azure ML",
"type": "AzureBlob",
"linkedServiceName": "AzureBlobStorageLinkedService",
"typeProperties": {
"fileName": "inputdata.csv",
"folderPath": "mlinput01/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

JSON for a Data Factory Pipeline

The last JSON required needs to provide the action, in our example copying from a database to a Azure Blob store. Click on …More and select New pipeline.

The JSON code looks like this

{
"name": "PipelineTemplate",
"properties": {
"description": "<Enter the pipeline description here>",
"activities": [],
"start": "<The start date-time of the duration in which data processing will occur or the data slices will be processed. Example : 2014-05-01T00:00:00Z>",
"end": "<The end date-time of the duration in which data processing will occur or the data slices will be processed. Example: 2014-05-05T00:00:00Z>"
}
}

That JSON looks really simple. Unfortunately, for this to work a lot more JSON is required, especially under the activities


{
"name": "PipelineCopy01",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type": "BlobSink",
"blobWriterAddHeader": true,
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataSet"
}
],
"outputs": [
{
"name": "InputDataSetBlob"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "Copy Activity"
}
],
"start": "2016-08-07T00:01:00Z",
"end": "2016-08-08T00:01:00Z",
"isPaused": false,
"hubName": "GingerDataFactoryTest_hub",
"pipelineMode": "Scheduled"
}
}

Once the JSON has all of the curly braces and commas necessary, it can be deployed without a red circle. Scroll back over to the right to get the the Action menu again and select Diagram. The diagram should appear as shown below.

Ad-Hoc Running Data Factory Pipelines

This job is scheduled to run on the hour, but chances are you want to run it whenever you have the code done, rather than waiting. To run the pipeline, underneath the Contents section, click on the Datasets. A list of datasets will appear. Select the one for the Azure Blob storage output, which in my example is called InputDataSetBlob. Select a value under slices, then click on it and another window will appear on the right. Click on one of the items on that window. Another screen will appear on the right. On this window on the top left corner click on Run. Ideally at this point if your spelling and capitalization all works out, after some duration, the status should say Succeeded. To validate that the transfer actually took place, go to your blob storage account, and you should see the file created. You can even download the file to make sure it actually has data in it. If you don’t change the name, and run it again, the data will be overwritten, and no error will occur, so feel free to run this as much as you like.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Factory – Getting Started

For those unfamiliar with Azure Data Factory, it is the application in Azure which moves data and calls processes to use the data. In a lot of ways this makes data factory a lot like SSIS. In addition to moving data around, data factory also contains all of the monitoring tools found in an SSMS job. If you add those things together you get a pretty good picture of what data factory does.

Setting up Data Factory in Azure

Data Factory is a resource which lives in a Resource Group in Azure. To add it to the list of available resources, click on the plus button in the AzureDataFactoryCreateAzure portal and search for Data Factory. Select the icon and this window will appear. Click on the create button at the bottom of the screen to enter the setup information. First pick a name for the service, then select from the dropdown box the appropriate subscription from the options you have available. Like everything else in Azure the services exist within in a container, either pick an existing container to use, or create a new one. Generally speaking I will create a new one so if for any reason later on I want to destroy what I have done, I can just delete the container. Pick the location based on two factors, Azure Data Factory is not available everywhere so you are limited to use only the ones where it is available. If you pick one where it isn’t available, you will get an error message letting you know why you cannot create the resource. Whenever possible within Azure to pick the same resource where your data lives. There are charges within Azure if you migrate data across resources and no charge if you stay in the same resource. You may want to go look at where the data lives which will be used in Data Factory before deciding where to put it. I always check the Pin to Dashboard option so that I can find the resource later, but it is not required and can be done later. Click on the create button to create a Data Factory Resource. If you have selected Pin to Dashboard you will see a little window which says Deploying Data Factory. This little window goes away once Data Factory is completed, and you will have an entry in the list of resources for Data Factory.

Data Factory Tiles

Clicking on the Data Factory resource icon will bring up the following window. Initially the only button you will care about is the one on the far left StartingWithDataFactoryAuthor and Deploy. This is where you will create the Data Factory Components. The authoring is all done in JSON. It is not the most intuitive language in the world and is very picky about brackets. There is some intellisense, and when you click on a bracket, it will highlight the bracket which it lines up with, which is somewhat helpful. In my experience, JSON can be frustrating, as it is rather picky. I don’t think my experience is unique because Microsoft has created two titles which write the JSON components for you, Copy Data and Sample Pipelines. The Copy data function has a wizard to walk through gathering the data required to write the JSON. I have had mixed success in getting it to work. Sometimes it did, and sometimes I got an error. It is in preview, so I imaging over time the performance will improve. Sample pipelines has a few samples of tasks you might want to complete that can be deployed in your environment. Diagram view is very important as it opens a window on the right hand side of the screen showing you a picture of what has been created in JSON. Monitor and Manage brings up an entirely new tab, which is welcomed by those of us who are tired of sliding to the right to see new windows. This screen shows a diagram and what has run. Metrics and Operations open a window showing what has run.

I hope you found this post handy for creating an Azure Data Factory resource. In a later post I will discuss what you can do with Azure Data Factory and provide more detail around the Contents section of this screen and discuss Datasets, Pipelines and Linked Services. Please subscribe to my blog to be notified of my latest posts.

 

 

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

 

 

 

 

 

 

 

 

 

 

 

Making Linear Regression Understandable for Machine Learning

When starting with Azure Machine Learning, it is sort of hard to wrap one’s brain around what kind of insight that Machine Learning can provide. When doing data analysis, often times we are looking for patterns. Does the volume of data really go up at the end of the month or is just the additional processes that make it seem that way? Does anyone really know if sales really pick up in August or is that just legerdemain from the sales department? Linear Regression can help determine that.

Relationships between Different Items

There are two types of indicators for linear correlation, positive and negative as shown on the following charts. The Y axis represents Grades, and the x axis is changed to show positive and negative correlationlinear-Grades of the amount of X on grades. When X is the amount of studylinear-CatVideos hours, there is a positive correlation and the line goes up. When X is changed to watching cat videos, there is a negative correlation. If you can’t draw a line around the points there is no correlation. If I were to create a graph where X indicated the quantity of the bags of Cheese Doodles consumed on grades, it would not be possible to draw a straight linear-CheeseDoodlesline, where the data points cluster around it. Since this is Line-ar regression, if that line doesn’t exist there is no correlation. Knowing there is no correlation is also useful.

 

Calculating Linear Regression

While the variable relationship is really easy to see without Math, there is an underlying formula that describes Linear Regression, and lest all of the math majors get upset I thought I would include the formula

Yi = a0 + b0Xi + ei

Y –  is the value of the Y axis, which in our example is grades

a – Is the point where the line intersects Y, or more clearly stated, where the line is. Now ideally your data should intersect at those points but since the line is sort of a guide, this won’t exactly match.

b – Contains the slope of the line

X – Is the value of the X axis, which depending on the example you are looking at is

E – This contains the error

Machine Learning with Linear Regression

In the blog examples, there are only two values, grades and something else. Machine learning can take all of your input variables and determine which values, if any impact the result. Hopefully this information provides you with a good use case for machine learning. In case you were unaware, Azure ML is availablefor free. All you need to do is sign up for an account at https://studio.azureml.net . There are a few size limitations as far as how much data you can load, but you can load enough to determine if machine learning will work in your environment.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

2015: Year End Wrap up for Releases and More

As 2015 draws to a close, I started thinking back about everything that has happened this year. 2015 GraphicTechnically this has been a big year as a many new applications were released. Here are just some of them, with links included to provide more detail.

This short list could be a lot longer as it doesn’t count the number of updates released to Power BI, which occur several times a month, the CTP releases for SQL Server 2016, the new web version of BIML, or PowerShell. It’s really hard to keep up with everything that is changing. It’s a good thing that so many people are willing to help others learn how through speaking and blogs which make learning new things easier.

Community Involvement in 2015

Keeping up with all of these events is difficult, especially given the pace of releases.  I spend a lot of time reading various blogs, watching videos and going to hear people speak. I also have been able to talk about topics of particular interest, many Power BI and Machine Learning. This year I spoke a different times at a number of different events including: Speaker Idol, two different user groups, seven webinars, five SQL Saturdays and other Tech Events. I’ve got a number of engagements on the books for next year, including PASS BA Con and SQL Saturday #461 – Austin. 2016 is shaping up to be busy too and hopefully our paths will cross.  I list all of my speaking events on my Engagement Page and I hope that you might take a look at it from time to time if you are interested in catching up in person sometime. Next year I am hoping my list of speaking engagements changes somewhat as I plan on trying harder to get accepted to speak at events where I submitted and was turned down in 2015. On a more positive note, views of my blog are up 1000%, and the number of website subscribers has more than doubled. Thank you very much for continuing to read this site and I hope you find my thoughts helpful. I posted once a week this year, which I thought was pretty good until I talked to Ken Fischer b | t who blogs twice a week. I’ll have to try harder next year. If you think of a topic you think would make a good blog post, let me know as I am always interested in feedback.

Keeping Up the Pace in 2016

Next year there will be no slowdown in the things to learn as SQL Server 2016 is going to be released. Although the exact date has not been announced, my sources tell me to look for it around May-June. The next release of SQL Server is going to be huge as it will include new tools Microsoft added to integrate Big Data and open source platforms to SQL Server. PolyBase, JSON and R are all going to be part of with SQL Server. Personally, I find the R integration most Datazen and SSRS are going to be integrated in the next release too which should really increase the implementation of mobile reporting visualizations.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

I’m Not Good at Math

How many times have you heard someone say, “I’m not good at Math”? Often times this statement is used as a reason why something technical cannot possibly be pursued. It’s a self-inflicted limitation; a reason that entire areas of study cannot be pursued. If you have ever said this, stop it. Don’t repeat it even if you believe you are not good at math. Why? Because while you may not be good at math now, there is no reason why that should stop you from learning it.

Math, Music and Programming

Years ago, back in the days before PCs and more importantly computer science degrees offered by major universities, IBM was working on developing mainframe computers and needed people to help them develop them. Since there were no computer MathandMusicscience degrees being offered at that time, they hired people with degrees in Math and Music. Music? Why Music? Music uses the same part of the brain as math does. This is one of the reasons educators think that music should be taught to small children as it has been shown to improve math scores. Personally I have found it interesting to ask technical people if they play or have played an instrument. Ask around yourself and you may be surprised at the large number of people in technical fields who have or do play a musical instrument. Musicians have the brain training needed to be good technical people, regardless of their math skills.

Learning Limits

There are no limits to what you can learn, other than the limits you put on yourself. The brain is very complex and there are infinite ways to train it to do something. Generally speaking one is not good at math because they haven’t learned it.  Oddly enough, discouraging one’s ability to learn often starts in school. If this sounds familiar, remember life isn’t school.  Often times a school setting isn’t the best way to learn anything. Performance in class is not indicative of one’s ability to learn. It may have be the ability of the instructor to teach or willingness to focus at that time. I am willing to bet you don’t view the world the way you did when you were sixteen, so why would you judge your ability to learn with that same filter?

Machine Learning is a Skill Which Can Be Learned

I know a very smart developer who told me recently that he wasn’t good at math, so he couldn’t possibly do machine learning. Really. PowerShell, Networking, TSQL, C#, SSIS, MDX and DAX you could learn but you can’t teach yourself Machine Learning? I am not going to say it is easy, but I wouldn’t say that about any of the other development and IT tasks either. If you can learn one of those, you can learn Machine Learning too, despite your belief in your math skills. There is no reason why not. I think Yoda said it best “Do or do not. There is no Try”. There is nothing really stopping you.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Azure Stream Analytics Hopping – Part 3

When incorporating streaming data from Azure Streaming Analytics, it is important to select the data to accomplish the goals of the tasks at hand. The sample company Eosh is streaming a lot of sensor data and has a variety of questions which the data is will answer.  Eosh is a concrete delivery company which is streaming the data back from their vehicles to the central dispatch system using Microsoft’s Stream Analytics. There’s a much more detailed description of what Eosh is streaming and their data needs in the first post in this series. After reviewing when Tumbling Windows and Sliding Windows, are used, in this post we are going to discuss another option for streaming data, Hopping Windows.

When to Use Hopping Windows

Eosh wants to use Hopping Windows to determine the previous action when water is added to the concrete mix. There is a flow meter sensor in the water tank which detects when the driver flips the switch to add more water. There are a number of different reasons for adding water, one being that the pouring is complete and the driver is washing out the remaining concrete. Another reason could be that the driver is stuck in traffic and the water is added to keep the concrete from setting up within the mixer. Depending on the type of concrete in the mixer, if too much water is added, the concrete will no longer have the required strength and can’t be used to create a load bearing structure. It is very important that concrete used in structural concrete be created according to specification, as concrete mixed incorrectly will crumble over time, something commonly seen in Detroit.  If too much water is added the vehicle may be routed to a different location so the concrete can be used for a non-load bearing purpose, like creating sidewalks.

Overlapping Hops

HoppingSliceBy design, all hops contain an overlapping previous time slice. The picture provides a good visualization for how the data slices are created. Eohs wants to look at the events which happened 5 minutes prior so that the adding water event can be appropriately categorized. The following Streaming query can provide that data

 

SELECT System.TimeStamp AS OutTime, VehicleID, COUNT(*)
FROM Input TIMESTAMP BY WaterStartPour
GROUP BY VehicleID, HoppingWindow(minute,10 , 5)

This query will create 10 minute slices of time. Each slice will look at the last 5 minutes previous reported and 5 minutes past that. By slicing the data in this way, the context around adding water can be evaluated to determine what kind of water add event took place. Eosh can then use this data to determine if the concrete can be delivered to the original location or if it needs to be rerouted.  This later processing will be accomplished via machine learning, which I will talk about in a later post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Limiting the Number of Results with TABLESAMPLE

In the world of exploding piles of data, there are times you just need less not more. If you are working on a proof of concept, just want to test out some code, or want to pass a selection of data to a machine learning experiment, you don’t need several million records. Having that much data will just keep you from getting to the answer in a reasonable period of time. There are a number of standard ways people limit the data, the most common being some kind of a date filter. Using a date range though often times does not provide the variability needed. This is a particular problem with data used in a machine learning experiment it is designed to create an algorithm based on data pattern extrapolated over time. For example if you are doing any kind of regression analysis on a retail client and you either exclude or include the Christmas shopping season, the algorithm created will not be correct. The goal is to have less data for analysis pulled from the entire set of data. Fortunately SQL Server since 2005 has several methods for selecting random data

Using TABLESAMPLE

Until recently, I hadn’t used the Transact SQL TABLESAMPLE clause, but I ran into a situation where I needed to test some things and not wait all year for a result. TABLESAMPLE to the rescue. There are a couple of things where it won’t work. If you have derived tables, tables from a linked server (bad idea), or are writing a view and you want to return random data you can’t use TABLESAMPLE. If those conditions do not apply, you can use it.

The random sample that TABLESAMPLE provides is based on the number of data pages, not the number of records. If you want the number of rows to be specifically limited you will need to use Top(n) as well. I’ve written all my samples based upon AdventureWorksDW so you can run them for yourself later. I’ve listed the variety of ways to call TABLESAMPLE and shown the number of records returned.

SELECT *
FROM [dbo].[FactInternetSales] –60398 Rows returned prior to Table Sample
TABLESAMPLE(10 PERCENT)

(6073 row(s) affected)

Let’s say you want to return the same sample set multiple times. For that you will need some value. I picked 11, but you could pick any other you like.

SELECT *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]
TABLESAMPLE(10 PERCENT)
REPEATABLE (11)

 (6489 row(s) affected)

When looking at the number of records returned, the values are not 10 percent exactly or particularly consistent in the number of rows returned. If you only want 6039 records returned, you can try the following code, but it doesn’t really do what it says.

SELECT *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]  
TABLESAMPLE(6039 ROWS)

 (5640 row(s) affected)

This code will provide the exact number of records desired. Noticed I upped the number of rows returned in order to get 6039 rows. If the sample is 6039 you cannot guarantee that you have enough rows returned.

SELECT top 6039 *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales] --60398
TABLESAMPLE(6200 ROWS)

(6039 row(s) affected)

Really Random Sample

RandomSampleTABLESAMPLE has some really useful functionality, but for machine learning I need a truly random sample, which TABLESAMPLE does not provide. For that I need NEWID(). The following sample returns approximately 1% (.01) of the 60398 rows.

Select * from [AdventureWorksDW2014].[dbo].[FactInternetSales]
Where 0.01>= Cast(checksum(newid(), [ProductKey]) & 0x7fffffff as float) / Cast(0x7fffffff as int)

 

Just for fun I ran the same code 3 times and got a variety of rows returned.

(600 row(s) affected)
(607 row(s) affected)
(622 row(s) affected)

The ProductKey is added so that the NEWID() function will calculate a sample for each row. The WHERE statement calculates a random float between 0 and 1. This will truly give me the random sample I would need for a machine learning experiment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur