Data Factory – Troubleshooting Errors

Punch card and punch card reading machine

If you have been reading my blog series on Data Factory, you will have noticed that I didn’t talk about what to do when you have errors, until now. Data Factory is different from most other programming tools you may be familiar with, such as C#, Java, R, SSIS or VB. In those tools you can step through the code, look at the variables or values while the code is running, and perform the standard set of troubleshooting techniques used to resolve errors. Data Factory doesn’t provide a way to really determine what is going on internally when the code is running. Debugging Data Factory harkens back to the days when people lined up punch cards and waited to get the output on the other side.

Data Factory Error Codes

Unfortunately, while developing in Data Factory I became very familiar with errors. All of the errors show up at the end of a run and provide very little insight as to what part of the process failed. Here’s an example.

Database operation failed on server ‘Sink:DBName01.database.windows.net’ with SQL Error Number ‘40197’. Error message from database execution : The service has encountered an error processing your request. Please try again. Error code 4815. A severe error occurred on the current command. The results, if any, should be discarded.

I didn’t get any results to discard, so that helpful hint was never applicable. I was able to resolve this error a few times by looking through the code, guessing at what line could possibly have caused it, and fixing whatever that guess suggested. Generally speaking, the real cause was found only after the first few things I tried did not work. Metaphorically, I had to wait for my punch cards to be read through the machine to see if I had correctly guessed what might be wrong. I have heard this process described as “black box”, but I think a more accurate description would be a punch card computer, as black box is too cool of a name for a process this heinous. In one instance, this error was received when the data in a field was longer than the field definition. That took a while to find since I had over 25 fields to review. Another time I got this same error when I had a typo in a field name, making the message appear completely arbitrary.

Data Factory Error 1000

Batch Execution failed. The response from the Machine Learning service at endpoint…(excluded specific job reference codes here)… {"Exception":{"ErrorId":"LibraryException","ErrorCode":"1000","ExceptionType":"ModuleException","Message":"Error 1000: TLC library exception: Exception of type 'Microsoft.Numerics.AFxLibraryException' was thrown

If you search on the internets, you will notice that Error 1000 comes up a lot. The reason is that it is a catch-all error number. There are 999 actual error messages coded, and anything that isn’t one of those errors is error 1000. This problem could be anything. In my case it was because my web service wanted me to strongly type the data coming in, instead of having the data default to text, which worked outside of the web service. I did not resolve my problem by reading the error message and coming up with a logical guess based on my code. I had to create an official support ticket, as the answer eluded me.

Data Factory Troubleshooting Error Strategies

Regrettably, there is no really good way of resolving errors. It’s not possible to look at anything in process and see how the data is being handled. Instead, errors pop up when the execution fails. There are a few things you can do though. If you are calling an Azure Machine Learning web service, I recommend using the Metadata Editor to explicitly type any input data that looks like a number but is read in as text, rather than letting it default. Make sure you test the batch execution prior to loading it in Data Factory. For other pipelines, I employed a binary search to determine which field might have the error: I commented out half the fields, then half of those, until I could determine which field was the problem. Eventually I figured out the field whose definition was too short. I also decreased the input data to 3 rows so I wouldn’t have to wait so long for it to fail; a sketch of that trick is shown below.
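For example, in the copy pipeline described later in this series, the source can temporarily be swapped for a query that returns only a few rows. This is a minimal sketch, assuming a SqlSource reading from the vCensusInfo view used later in the series; swap the full query back in once the pipeline works.

"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT TOP 3 * FROM vCensusInfo"
},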

Data Factory Series

This post is the end of my five part series on Azure Data Factory. I hope you have found it both interesting and useful when trying to learn Data Factory. If you have found this series interesting, please subscribe to my blog to be notified of the latest topics. Given that I plan on doing a lot of speaking in the very near future on topics such as R, SQL Server 2016 and Power BI, those topics will be showing up on my blog soon as well.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Factory – Migrating Data from Azure Blob Storage to SQL Server

There are instances where data resides in Azure Blob Storage and the data is needed in a SQL database. For example, if one ran a Machine Learning experiment in Data Factory, the results would be stored in Azure Blob Storage, and for analysis purposes it may make a lot more sense to move the data to a SQL database. Moving data around in Data Factory means writing JSON. In this example we will be using an Azure SQL DB, but it is not essential that the data be stored in Azure. An on-premises SQL Server could also be used, as long as a gateway is added for the connection; the other steps would be the same, and a hedged sketch of such a linked service follows below. There are five different Data Factory elements required to move data from an Azure blob to a database: a pipeline for the data, a data set containing the definition for the blob, a linked service for the blob, a data set containing a definition for the SQL data, and a linked service to connect to the SQL database.
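For the on-premises case, only the SQL linked service changes, as it has to reference a Data Management Gateway. The following is a sketch only; the linked service name and the placeholder values are made up, and it assumes a gateway has already been installed and registered for the Data Factory.

{
"name": "OnPremSqlLinkedService",
"properties": {
"type": "OnPremisesSqlServer",
"description": "Hypothetical on-premises SQL Server connection",
"typeProperties": {
"connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>",
"gatewayName": "<gatewayname>"
}
}
}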

JSON Data Service

The data to be moved to SQL is stored in a blob storage container called OutputML, and both the linked service and that data set are included in a previous post on running an ML experiment. In this data set, the JSON provides the field definitions to be written to a table called CensusMLOutput in a SQL database. Data Factory supports fewer data types than SQL Server does, meaning the JSON here doesn’t exactly match the table definition, but the less granular data types are accepted by SQL.

"name": "OutputML",
"properties": {
"structure": [
{
"name": "Age",
"type": "Int32"
},
{
"name": "workclass",
"type": "string"
},
{
"name": "education-num",
"type": "Int32"
},
{
"name": "marital-status",
"type": "String"
},
{
"name": "occupation",
"type": "String"
},
{
"name": "relationship",
"type": "String"
},
{
"name": "race",
"type": "String"
},
{
"name": "sex",
"type": "String"
},
{
"name": "hours-per-week",
"type": "Int32"
},
{
"name": "native-country",
"type": "String"
},
{
"name": "Scored Labels",
"type": "Int32"
},
{
"name": "Scored Probabilities",
"type": "Decimal"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "LinkedServiceOutput",
"typeProperties": {
"tableName": "CensusMLOutput"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

JSON for Linked Service Output

The data set defined above references a linked service named LinkedServiceOutput. This JSON contains the information needed to connect to the database that holds the table the code writes to.

{
"name": "LinkedServiceOutput",
"properties": {
"description": "",
"hubName": "GingerDataFactoryTest_hub",
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=jytr4gph.database.windows.net;Initial Catalog=MLData;Integrated Security=False;User ID=gingerg;Password=**********;Connect Timeout=30;Encrypt=True"
}
}
}

The code includes my ID and a password, which is encrypted when the linked service is saved. Now that we have the data components defined, all that is required is an Azure Data Factory pipeline to move the data.


JSON Data Factory Pipeline to Move Data to SQL

The pipeline PipelineCopyMLOutput is pretty straightforward, as it defines the action which should take place, a copy, and implements it. One thing to note is that, unlike copying to a csv file, the data in a table is appended, meaning every time this pipeline runs, more data will be added to the table. This code does not contain anything to prevent data from being duplicated, which will happen if the input does not change; one possible mitigation is sketched after the pipeline JSON below.


{
"name": "PipelineCopyMLOutput",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"skipHeaderLineCount": 1
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "OutputDataSetBlob"
}
],
"outputs": [
{
"name": "OutputML"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "Copy Activity"
}
],
"start": "2016-08-24T16:44:00Z",
"end": "2016-08-25T19:00:00Z",
"isPaused": true,
"hubName": "GingerDataFactoryTest_hub",
"pipelineMode": "Scheduled"
}
}
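If duplicates become a problem, one option, which I did not use here, is the SqlSink property sqlWriterCleanupScript, which runs a statement against the destination before each slice is written. The snippet below is a sketch on my part rather than part of the pipeline above, and a TRUNCATE only makes sense because every run rewrites the entire output table.

"sink": {
"type": "SqlSink",
"sqlWriterCleanupScript": "TRUNCATE TABLE CensusMLOutput",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}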

To run all of this JSON, you can wait for it to run on schedule or run it ad-hoc, which I detail in this post.

Data Factory Workflow

Combining all of the Data Factory components included in this and in previous posts, the entire workflow diagram is shown below. In the first pipeline, data is copied from the database to blob storage. Next, the blob storage data is used to run an Azure ML experiment which outputs data to blob storage. Lastly, the results from the experiment are copied to a database. Notice all of the lovely green checks which exist in the diagram.

Data Factory workflow diagram

This blog series on Data Factory has covered everything from creating the Azure components needed, to using Data Factory to run an ML web service and sending the results to the database. In my next and last post for a while on Data Factory, I will be discussing troubleshooting, an essential process in getting all the code to work. To be notified when new posts appear, please subscribe to my blog to receive the latest. I hope that you have found this to be useful. If so, please leave me comments or message me on Twitter as I would love to hear what others are doing with Data Factory.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Data Factory – Executing an Azure Machine Learning Web Service

My last blog post showed how to copy data to a blob storage account, which I needed to do to be able to call an Azure Machine Learning [ML] web service. When calling an ML web service, the data must be in an Azure Blob Storage account. Once an ML model has been trained and a web service has been created, it’s ready for production. Calling the experiment in Data Factory allows the ML to be run with tens of thousands of rows as part of a scheduled process. Prior to inserting the ML web service in Data Factory, make sure that you test it to ensure there are no errors with the web service, as Data Factory does not expose all of the ML errors which may be encountered by the web service.

Creating Azure Machine Learning Data Factory Pipelines

Two new steps need to be added to the existing Data Factory pipeline, one to call the ML web service and one for the output. The ML piece requires two bits of JSON code: a linked service to make the connection to the web service, and a pipeline to invoke the job and specify the inputs and the outputs. For the output, the first step requires no JSON, as a blob storage container needs to be created in Azure to store it. The next steps involve writing JSON to create a linked service to connect to it, and lastly an output dataset needs to be defined.

Calling Machine Learning Service

The linked service for ML is going to need some information from the web service: the URL and the API key. Chances are neither of these have been committed to memory; instead, open up Azure ML, go to Web Services and copy them. For the URL, look under the API Help Page grid, where there are two options, Request/Response and Batch Execution. Clicking on Batch Execution loads a new page, Batch Execution API Document. The URL can be found under Request URI. When copying the URL, you do not need to include any text after the word “jobs”; copying the rest of the URL, “?api-version=2.0”, will cause an error. Going back to the Web Services page, the API key appears on the dashboard section of Azure ML and there is a convenient button for copying it. Using these two pieces of information, it is now possible to create the Data Factory linked service to make the connection to the web service, which here I called AzureMLLinkedService.

{
"name": "AzureMLLinkedService",
"properties": {
"description": "Connecting ML Experiment”
"hubName": " GingerDataFactoryTest_hub",
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/fbe056b6d4c74d7f9d1954367dc3fa61/services/xxa56efd75b745e28cd0512822d17eae/jobs",
"apiKey": "**********"
}
}
}

We will also need a data set for the output, which receives the data from the experiment. The field names from the experiment are listed in its structure.

{
"name": "OutputML",
"properties": {
"structure": [
{"name": "Age", "type": "Int32" }
,
{ "name": "workclass", "type": "string" }
,
{ "name": "education-num", "type": "Int32" }
,
{ "name": "marital-status", "type": "String" }
,
{ "name": "occupation", "type": "String" }
,
{ "name": "relationship", "type": "String" }
,
{ "name": "race", "type": "String" }
,
{ "name": "sex", "type": "String" }
,
{ "name": "hours-per-week", "type": "Int32" }
,
{ "name": "native-country", "type": "String" }
,
{"name": "Scored Labels","type": "Int32"}
,
{"name": "Scored Probabilities","type": "Decimal"}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "LinkedServiceOutput",
"typeProperties": {
"tableName": "ExperimentMLOutput"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

The API key will show the actual value until you save it, at which point it will change to the stars you see here. This linked service will be referenced in the next bit of JSON for the pipeline.


"name": "PipelineML",
"properties": {
"description": "Use Azure ML Model",
"activities": [
{
"type": "AzureMLBatchExecution",
"typeProperties": {
"webServiceInput": "InputDataSetBlob",
"webServiceOutputs": {
"output1": "OutputDataSetBlob"
},
"globalParameters": {}
},
"inputs": [
{
"name": "InputDataSetBlob"
}
],
"outputs": [
{
"name": "OutputDataSetBlob"
}
],
"policy": {
"timeout": "02:00:00",
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "MLActivity",
"description": "Execute Experiment",
"linkedServiceName": "AzureMLLinkedService"
}
],
"start": "2016-08-19T10:30:00Z",
"end": "2016-08-20T23:30:00Z",
"isPaused": true,
"hubName": " GingerDataFactoryTest_hub ",
"pipelineMode": "Scheduled"
}
}

Lastly, another dataset needs to be created to process the output. The data will be written to a file called output.csv in a folder called mloutput01/, located in the same blob storage container I used earlier for the input folder. This file will be overwritten every single time this is run.

{
"name": "OutputDataSetBlob",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureBlobStorageLinkedService",
"typeProperties": {
"fileName": "output.csv",
"folderPath": "mloutput01/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

If you add this code onto the previous Data Factory code, you can take data from the database and use it to run an Azure ML experiment, pushing as much data as you want through the experiment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Factory – Copying Data from a Database to an Azure Blob Store

My previous post provides instructions on Getting Started with Data Factory. To borrow a line from Ron White, “I told you that story so I could tell you this one”. If you followed the steps in that post, you have a Data Factory resource. Time to use it. This post describes how to process some data by copying data currently stored in the database to a Blob Storage account. To complete that task, you will need to write some JSON. Please be careful when you do this as writing JSON may involve cursing JSON. Since Jason was in the list of top four male baby names in the US from 1973 – 1982, chances are you may work near someone named Jason when writing JSON, who may think the cursing is directed at him.

Author and Deploy Linked Services

To be able to pass data from a database, a connection needs to be established to the database. That connection is made by creating a linked service. In Data Factory, if you create a New data store, a new linked service JSON template will be created. After clicking on the New data store icon, a list of databases will appear. In this example I am going to pick an Azure SQL Database. If I wanted to access a local SQL Server database, a gateway to connect to it would also be required. The JSON linked service template will appear under the menu Drafts and will be labeled Draft-1. A copy of the code is included below.

{"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"description": "",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>; UserID=<username>@<servername>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30"
}
}
}

If nothing is changed and the Deploy icon at the top of the screen is clicked, a new item AzureSqlLinkedService will appear under Linked services. There will be a circle with an exclamation point next to it as the connection does not contain valid data, but you can deploy it. To get rid of the error, replace all of the default information provided after each colon with valid values, starting with the name, which has been defaulted to AzureSqlLinkedService. The name property dictates what the linked service will be called, and the name cannot be changed later. To change the name later, highlight the name of the service, click on Clone, change the name, deploy, highlight the old name, click on the …More text at the top of the screen, select Delete, then confirm the deletion of the linked service by clicking on the Yes button. Better yet, just select a good name before clicking on the Deploy icon. Within the connection string, replace all of the items between < and > with valid values; a filled-in sketch with made-up values follows below. If there are any errors after deploying, fix the data, click on Deploy, and the circle with the exclamation point will go away. Invalid database names are accepted as long as they are in the right format; if the name is not valid, an error will be received when Data Factory is run.
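For illustration, a filled-in version might look like the following. The server name and user here are made up, and the password is left as a placeholder on purpose; only the format matters.

{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"description": "",
"typeProperties": {
"connectionString": "Data Source=tcp:myserver01.database.windows.net,1433;Initial Catalog=MLData;User ID=myuser@myserver01;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30"
}
}
}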

Because we need a connection to a database and an Azure blob, two linked services are required, one for each type. Prior to completing this step, create an Azure Blob storage account by clicking on Add on All Resources. Create the second linked service like the first: click on New data store, then select Azure Storage. Using the template for an Azure Blob Storage linked service, I have modified it below, adding the "hubName" as it is required.


{
"name": "GingerAzureBSLinkedService01",
"properties": {
"description": "Test Azure Blob Storage Account for DF",
"hubName": "GingerDataFactoryTest_hub",
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=gingerblobstorage01;AccountKey=**********"
}
}
}

The hub name is the name of the Data Factory resource you created earlier, and the account name comes from your storage account. You will need to get the actual account key and replace it. Once the JSON is deployed, the value will appear as a line of asterisks again.

Create a Data Factory Dataset

Two datasets representing the data accessed in the two linked services must be written in JSON. Click on …More and select New dataset. Unlike linked services, you actually do have an option named New dataset, so the name is consistent. Creating the JSON for the database is a bit of a pain, as you have to define every single column accessed as well as the datatype for each one. Of course both order and capitalization are important, as they must exactly match what is in the database, or you will get an error after the code is run. You won’t see any errors when you type it. Here is a sample.

{
"name": "InputDataSet",
"properties": {
"structure": [
{
"name": "Age",
"type": "Int32"
},
{
"name": "workclass",
"type": "string"
},
{
"name": "education-num",
"type": "Int32"
},
{
"name": "marital-status",
"type": "String"
},
{
"name": "occupation",
"type": "String"
},
{
"name": "relationship",
"type": "String"
},
{
"name": "race",
"type": "String"
},
{
"name": "sex",
"type": "String"
},
{
"name": "hours-per-week",
"type": "Int32"
},
{
"name": "native-country",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "InputLinkedServiceAzureDB01",
"typeProperties": {
"tableName": "vCensusInfo"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}

In this sample, I am actually getting data from a view, not a table. Even though all the code references a table, using a view instead works fine. The linked service name does of course represent the Linked service created in the previous step.

Fortunately, it is not necessary to list every column when inserting values into a blob, so the JSON for that is much easier. While it is possible not to enter a file name, if you ever wish to use the data in the blob store later, you will need to create one. This JSON will write a file inputdata.csv to the blob store path mlinput01. If this is run twice, the file will be overwritten the second time.

{
"name": "InputDataSetBlob",
"properties": {
"published": false,
"description": "Input Blob Dataset to feed Azure ML",
"type": "AzureBlob",
"linkedServiceName": "AzureBlobStorageLinkedService",
"typeProperties": {
"fileName": "inputdata.csv",
"folderPath": "mlinput01/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

JSON for a Data Factory Pipeline

The last JSON required needs to provide the action, in our example copying from a database to an Azure blob store. Click on …More and select New pipeline.

The JSON code looks like this

{
"name": "PipelineTemplate",
"properties": {
"description": "<Enter the pipeline description here>",
"activities": [],
"start": "<The start date-time of the duration in which data processing will occur or the data slices will be processed. Example : 2014-05-01T00:00:00Z>",
"end": "<The end date-time of the duration in which data processing will occur or the data slices will be processed. Example: 2014-05-05T00:00:00Z>"
}
}

That JSON looks really simple. Unfortunately, for this to work a lot more JSON is required, especially under the activities section.


{
"name": "PipelineCopy01",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type": "BlobSink",
"blobWriterAddHeader": true,
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataSet"
}
],
"outputs": [
{
"name": "InputDataSetBlob"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "Copy Activity"
}
],
"start": "2016-08-07T00:01:00Z",
"end": "2016-08-08T00:01:00Z",
"isPaused": false,
"hubName": "GingerDataFactoryTest_hub",
"pipelineMode": "Scheduled"
}
}

Once the JSON has all of the curly braces and commas necessary, it can be deployed without a red circle. Scroll back over to the right to get the Action menu again and select Diagram. The diagram should appear as shown below.

Ad-Hoc Running Data Factory Pipelines

This job is scheduled to run on the hour, but chances are you want to run it whenever you have the code done, rather than waiting. To run the pipeline, underneath the Contents section, click on Datasets. A list of datasets will appear; select the one for the Azure Blob storage output, which in my example is called InputDataSetBlob. Click on a value under slices and another window will appear on the right; click on one of the items in that window and yet another screen appears on the right. In the top left corner of this window, click on Run. Ideally at this point, if your spelling and capitalization all work out, after some duration the status should say Succeeded. To validate that the transfer actually took place, go to your blob storage account, and you should see the file created. You can even download the file to make sure it actually has data in it. If you don’t change the name and run it again, the data will be overwritten and no error will occur, so feel free to run this as much as you like. Another option for an immediate run is sketched below.
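Rather than clicking through the slices, another option is to deploy the pipeline with a short active period that has already started, so a slice is produced and processed right away. This is a sketch of the tail of the pipeline properties under that assumption; the dates are examples only and should be replaced with a recent window.

"start": "2016-08-07T00:01:00Z",
"end": "2016-08-07T02:01:00Z",
"isPaused": false,
"hubName": "GingerDataFactoryTest_hub",
"pipelineMode": "Scheduled"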

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Factory – Getting Started

For those unfamiliar with Azure Data Factory, it is the application in Azure which moves data and calls processes to use the data. In a lot of ways this makes Data Factory a lot like SSIS. In addition to moving data around, Data Factory also contains monitoring tools much like those found in a SQL Server Agent job. If you add those things together you get a pretty good picture of what Data Factory does.

Setting up Data Factory in Azure

Data Factory is a resource which lives in a resource group in Azure. To add it to the list of available resources, click on the plus button in the Azure portal and search for Data Factory. Select the icon and a creation window will appear. Click on the Create button at the bottom of the screen to enter the setup information. First pick a name for the service, then select from the dropdown box the appropriate subscription from the options you have available. Like everything else in Azure, the service exists within a resource group; either pick an existing resource group to use, or create a new one. Generally speaking I will create a new one, so if for any reason later on I want to destroy what I have done, I can just delete the resource group. Pick the location based on two factors. First, Azure Data Factory is not available everywhere, so you are limited to the regions where it is offered; if you pick one where it isn’t available, you will get an error message letting you know why you cannot create the resource. Second, whenever possible within Azure, pick the same region where your data lives. There are charges within Azure if you move data across regions and no charge if you stay in the same region, so you may want to go look at where the data which will be used in Data Factory lives before deciding where to put it. I always check the Pin to Dashboard option so that I can find the resource later, but it is not required and can be done later. Click on the Create button to create a Data Factory resource. If you have selected Pin to Dashboard, you will see a little window which says Deploying Data Factory. This little window goes away once the deployment is completed, and you will have an entry in the list of resources for Data Factory.

Data Factory Tiles

Clicking on the Data Factory resource icon will bring up the following window. Initially the only button you will care about is the one on the far left, Author and Deploy. This is where you will create the Data Factory components. The authoring is all done in JSON. It is not the most intuitive language in the world and is very picky about brackets. There is some intellisense, and when you click on a bracket, it will highlight the bracket which it lines up with, which is somewhat helpful. In my experience, JSON can be frustrating, as it is rather picky. I don’t think my experience is unique, because Microsoft has created two tiles which write the JSON components for you, Copy data and Sample pipelines. The Copy data function has a wizard to walk through gathering the data required to write the JSON. I have had mixed success in getting it to work; sometimes it did, and sometimes I got an error. It is in preview, so I imagine over time the performance will improve. Sample pipelines has a few samples of tasks you might want to complete that can be deployed in your environment. Diagram view is very important, as it opens a window on the right hand side of the screen showing you a picture of what has been created in JSON. Monitor and Manage brings up an entirely new tab, which is welcomed by those of us who are tired of sliding to the right to see new windows; this screen shows a diagram and what has run. Metrics and Operations opens a window showing what has run.

I hope you found this post handy for creating an Azure Data Factory resource. In a later post I will discuss what you can do with Azure Data Factory and provide more detail around the Contents section of this screen and discuss Datasets, Pipelines and Linked Services. Please subscribe to my blog to be notified of my latest posts.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


2015: Year End Wrap up for Releases and More

As 2015 draws to a close, I started thinking back about everything that has happened this year. Technically this has been a big year, as many new applications were released. Here are just some of them, with links included to provide more detail.

This short list could be a lot longer as it doesn’t count the number of updates released to Power BI, which occur several times a month, the CTP releases for SQL Server 2016, the new web version of BIML, or PowerShell. It’s really hard to keep up with everything that is changing. It’s a good thing that so many people are willing to help others learn how through speaking and blogs which make learning new things easier.

Community Involvement in 2015

Keeping up with all of these events is difficult, especially given the pace of releases. I spend a lot of time reading various blogs, watching videos and going to hear people speak. I also have been able to talk about topics of particular interest, mainly Power BI and Machine Learning. This year I spoke at a number of different events, including Speaker Idol, two different user groups, seven webinars, five SQL Saturdays and other tech events. I’ve got a number of engagements on the books for next year, including PASS BA Con and SQL Saturday #461 – Austin. 2016 is shaping up to be busy too and hopefully our paths will cross. I list all of my speaking events on my Engagement Page and I hope that you might take a look at it from time to time if you are interested in catching up in person sometime. Next year I am hoping my list of speaking engagements changes somewhat, as I plan on trying harder to get accepted to speak at events where I submitted and was turned down in 2015. On a more positive note, views of my blog are up 1000%, and the number of website subscribers has more than doubled. Thank you very much for continuing to read this site and I hope you find my thoughts helpful. I posted once a week this year, which I thought was pretty good until I talked to Ken Fischer b | t who blogs twice a week. I’ll have to try harder next year. If you think of a topic you think would make a good blog post, let me know as I am always interested in feedback.

Keeping Up the Pace in 2016

Next year there will be no slowdown in the things to learn, as SQL Server 2016 is going to be released. Although the exact date has not been announced, my sources tell me to look for it around May-June. The next release of SQL Server is going to be huge, as it will include new tools Microsoft added to integrate Big Data and open source platforms with SQL Server. PolyBase, JSON and R are all going to be part of SQL Server. Personally, I find the R integration the most interesting. Datazen and SSRS are going to be integrated in the next release too, which should really increase the implementation of mobile reporting visualizations.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Introduction to Cortana Analytics

Microsoft previewed Cortana Analytics on July 13, 2015, and since then, they have published a lot of information on their site about it. Based on what I’ve seen on the internet, there appears to be a lot of confusion as to what Cortana Analytics is. This is completely understandable when you consider the number of different products the name Cortana has represented for Microsoft. My favorite is the image with the picture of a blue girl, which is from the Xbox game Halo 3. A video game character was the first place Microsoft used the name Cortana, back in 2007. At the Microsoft BUILD Developer Conference in April 2014, the name Cortana was used for the Microsoft version of Apple’s Siri phone application. If you are interested in hearing about it, I’ve included a link to the Channel 9 video here where they talk about Cortana. Finally, a year later Microsoft comes out with a product called Cortana Analytics. No wonder people are confused.

Cortana Analytics is not a Product

Cortana Analytics: the bow tying different applications together

To help bring clarity to what Cortana Analytics is and is not, I wanted to start out with what I think is the most confusing point. Cortana Analytics is not a product, but a name given to a bunch of other applications which are designed to work together. In essence, Microsoft tied a bow around a bunch of applications and called the bow Cortana Analytics. Here’s an example scenario. Start by sending water meter data from the physical meters to the cloud, where you aggregate, analyze and store it, and end up with a Power BI application on your phone showing you a visualization of some aspect of the data. To make this happen from a technical perspective using Microsoft’s tools, one would probably need to create an Event Hub, run a Streaming Analytics process, use Data Factory to call a Machine Learning experiment, migrate the data to an Azure storage account of some kind, and then create a Power BI report to be sent to your phone. All of that is Cortana Analytics. It is not one product, but a big bow tying all of the applications they have designed to work together under one name. Power BI is part of it. On that note, I recently saw Microsoft do a demo with Power BI where they integrated Cortana-like functionality into Power BI: you could talk to Power BI and it displayed the information it was asked for. I have no idea when this will be released, but it sure was a neat demonstration. In this demo, they mentioned they were adding Cortana functionality to Power BI, which really didn’t help the confusion level with the name.

Cortana Analytics Web Presentation

I recently recorded a video presentation on Cortana Analytics where I described in greater detail the components which make up Cortana Analytics and how they work together. That video is available here. As I am working more with the components which make up Cortana Analytics, such as Machine Learning and Power BI, I will definitely be devoting more blog posts to the topic, so please subscribe to my blog if you are interested in learning more about it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Azure Data Lake: Why you might want one

On April 29, 2015, Microsoft announced they were offering a new product, Azure Data Lake. For those of us who know what a data lake is, one might have thought that having a new data lake product was perhaps redundant, because Microsoft already supported data lakes with HDInsight and Hadoop. To understand why you might want a separate product, let’s look at what a data lake is. I think the best definition of a data lake that I read recently was here. Here’s the TL;DR version: “A ‘data lake’ is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed.” So here’s the question: one can already spin up an HDInsight Hadoop cluster on Azure and put all of your data there, which means you can already create a data lake. Since that is the case, why did Microsoft go and create a new product?

Hardware Optimization and the Data Lake

If you look at Microsoft’s most recent Azure releases, you’ll see they are releasing products designed to operate together. Service Bus, Event Hubs, Streaming Analytics, Machine Learning and Data Factory are designed to process lots of data, especially a lot of short pieces of data, like vehicle GPS messages or other types of real-time status messages. In reading the product release for Azure Data Lake, they highlight its ability to store and, more importantly, retrieve this kind of data. The difference between the HDInsight already on Azure and the Data Lake product is the hardware dedicated to the storage and the integration designed to improve access to the data. Data Factory is designed to move your data in the cloud to anywhere, including a data lake. If you look at the graphic Microsoft provides to illustrate what Data Factory is designed to integrate, the rest of the outputs listed have products associated with them. Now there is a product associated with the data lake too. Data lakes are designed to store all data, but unlike a database operational data store, data lakes are designed to have the schema applied when the data is read, not when the data is written. This allows for faster writing of the data, but it does tend to make accessing the data slower. The Azure Data Lake hardware, according to the release, is designed to address this issue by providing computing power designed for massively parallel processing to provide the data when needed, which would be on the reading and analysis of the data, not when it is written. This sort of targeted computing power differs from the HDInsight Hadoop offering, which uses a standard hardware model for storage and access. By tailoring the hardware to meet the needs of the specific type of data stored, in theory this will greatly improve performance, which will increase the adoption of not only the Azure Data Lake, but the tools to analyze and collect the data too. It’s going to be interesting to see how the marketplace responds, as this could really push massive amounts of data to the Azure cloud. Time will tell.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


What is a Modern Data Warehouse?

As I was honored enough to be selected to give a precon on the Internals of the Modern Data Warehouse, I thought that I would take the time to explain why I felt drawn to the topic. There are a lot of places that haven’t given much thought to the changes in technology which have happened over the last few years. The major feature upgrades to SQL Server in 2012 and 2014 have meant that they can use columnstore indexes, which make things faster, and maybe get better High Availability. While those things are certainly valuable improvements, there is a lot more that you can do to derive value from your data, and companies want more than just a well-organized, running data warehouse.

Data is a Valuable Asset

In 2010, Borders Group Inc. was allowed by the Federal Trade Commission to sell their customer information to Barnes and Noble as part of their bankruptcy sale of their assets. In 2015, RadioShack is doing the same thing. Businesses understand that data is valuable and they are interested in using it to drive decision making. Amazon, Netflix and Target are well known for their use of customer information to drive sales, but they are far from the only ones doing this. This is one of the bigger trends identified recently in the business press. The heads of companies are now looking for their data teams to do more with their data so that they too can have the dream information systems they are reading about.

Total Destruction of the Existing DW is Not Required

While a lot of the time it might be nice to level everything and start over, that is not always an option. The major reason for this is that the data warehouse environment already in place has a lot of value. You want to add to the value already there, not destroy what you have. Also, it would take a long time to recreate the environment, and no one is patient enough to wait for that. Alternatively, you could expand into areas of new technology as your data grows. Perhaps this means you archive some of your data from your database to a Hadoop cluster instead of backing up the data in some far off location. This would allow you to use Sqoop to bring the data back when you need it, providing ready access to the data. Perhaps you want to provide the users more self-service BI capabilities, moving the data analysis into the hands of the people who are more familiar with the data? You could add the capabilities of Power View in Excel, Power Designer or Tableau to your environment.

Incorporating Social Media Information

The business world no longer operates only on a batch cycle. More and more companies want to know what is being said about them so they can respond appropriately. With tools like Azure Event Hubs, Data Factory, Streaming Analytics, and Machine Learning, this isn’t as hard to do as it might sound. We’ll review these products so that attendees will understand how these tools can provide greater insight not only into their own data, but also the data building up about them outside of the company firewall.

For More Information

I really hope you can join me in Huntington Beach on April 10 for a full day of exploring these concepts. I always look forward to events like the precon and of course SQL Saturday #389 – Huntington Beach which is the following day.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur