There are times when data resides in Azure Blob Storage but is needed in a SQL database. For example, if you run a Machine Learning experiment in Data Factory, the results are stored in Azure Blob Storage, and for analysis purposes it may make a lot more sense to move the data to a SQL database. Moving data around in Data Factory means writing JSON. In this example we will be using an Azure SQL DB, but it is not essential that the data be stored in Azure. An on-premises SQL Server could also be used as long as a gateway is added for the connection; the other steps would be the same.
There are five different Data Factory elements required to move data from an Azure blob to a database: a pipeline for the data, a data set containing the definition for the blob, a linked service for the blob, a data set containing a definition for the SQL data, and a linked service to connect to the SQL database.
JSON Data Service
The data to be moved to SQL is stored in a blob storage container called OutputML, and both the linked service and the data set for that blob are included in a previous post on running an ML experiment. In the data set below, the JSON creates the field definitions to be written to a SQL database table called CensusMLOutput. Data Factory has fewer data types than SQL does, meaning the JSON here doesn’t exactly match the table definition, but the less granular data types are accepted by SQL.
{
  "name": "OutputML",
  "properties": {
    "structure": [
      {
        "name": "Age",
        "type": "Int32"
      },
      {
        "name": "workclass",
        "type": "String"
      },
      {
        "name": "education-num",
        "type": "Int32"
      },
      {
        "name": "marital-status",
        "type": "String"
      },
      {
        "name": "occupation",
        "type": "String"
      },
      {
        "name": "relationship",
        "type": "String"
      },
      {
        "name": "race",
        "type": "String"
      },
      {
        "name": "sex",
        "type": "String"
      },
      {
        "name": "hours-per-week",
        "type": "Int32"
      },
      {
        "name": "native-country",
        "type": "String"
      },
      {
        "name": "Scored Labels",
        "type": "Int32"
      },
      {
        "name": "Scored Probabilities",
        "type": "Decimal"
      }
    ],
    "published": false,
    "type": "AzureSqlTable",
    "linkedServiceName": "LinkedServiceOutput",
    "typeProperties": {
      "tableName": "CensusMLOutput"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "external": false,
    "policy": {}
  }
}
JSON for Linked Service Output
The data set defined above references a linked service named LinkedServiceOutput. This JSON contains the information needed to connect to the database holding the table that the copy writes to.
{
  "name": "LinkedServiceOutput",
  "properties": {
    "description": "",
    "hubName": "GingerDataFactoryTest_hub",
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Data Source=jytr4gph.database.windows.net;Initial Catalog=MLData;Integrated Security=False;User ID=gingerg;Password=**********;Connect Timeout=30;Encrypt=True"
    }
  }
}
The code includes my user ID and a password, which is encrypted when the linked service is saved. Now that we have the data components defined, all that is required is an Azure Data Factory pipeline to move the data.
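For the on-premises SQL Server variation mentioned at the start of this post, the linked service is the piece that changes. The following is only a sketch of what that might look like, assuming the same version of the Data Factory JSON schema used here, where the linked service type is OnPremisesSqlServer and the gateway is named in gatewayName. The values in angle brackets are placeholders and not part of this example; fill them in for your own server and gateway.

```json
{
  "name": "LinkedServiceOutput",
  "properties": {
    "description": "",
    "type": "OnPremisesSqlServer",
    "typeProperties": {
      "connectionString": "Data Source=<server name>;Initial Catalog=<database name>;Integrated Security=False;User ID=<user>;Password=<password>;",
      "gatewayName": "<gateway name>"
    }
  }
}
```

Keeping the name LinkedServiceOutput means the data set's linkedServiceName reference does not have to change. The data set's type may also need to switch from AzureSqlTable to SqlServerTable for an on-premises table, so it is worth checking that against the documentation before deploying.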
JSON Data Factory Pipeline to Move Data to SQL
The pipeline PipelineCopyMLOutput is pretty straightforward, as it defines the action which should take place, a copy, and implements it. One thing to note is that, unlike copying a csv file, the data in a table is appended, meaning every time this pipeline runs, more data will be added to the table. This code does not contain anything to prevent data from being duplicated, which will happen if the input does not change.
{
  "name": "PipelineCopyMLOutput",
  "properties": {
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "BlobSource",
            "skipHeaderLineCount": 1
          },
          "sink": {
            "type": "SqlSink",
            "writeBatchSize": 0,
            "writeBatchTimeout": "00:00:00"
          }
        },
        "inputs": [
          {
            "name": "OutputDataSetBlob"
          }
        ],
        "outputs": [
          {
            "name": "OutputML"
          }
        ],
        "policy": {
          "timeout": "01:00:00",
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "style": "StartOfInterval"
        },
        "scheduler": {
          "frequency": "Hour",
          "interval": 1
        },
        "name": "Copy Activity"
      }
    ],
    "start": "2016-08-24T16:44:00Z",
    "end": "2016-08-25T19:00:00Z",
    "isPaused": true,
    "hubName": "GingerDataFactoryTest_hub",
    "pipelineMode": "Scheduled"
  }
}
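If duplicated rows are a concern, one possible approach is to have the sink clear out the table before each copy. The fragment below is a sketch rather than something this pipeline uses: it shows only the sink portion of the copy activity above, with a sqlWriterCleanupScript property added to the SqlSink. It assumes that every run should completely replace the contents of CensusMLOutput, which may not be what you want if the table is meant to accumulate history.

```json
{
  "sink": {
    "type": "SqlSink",
    "sqlWriterCleanupScript": "DELETE FROM CensusMLOutput",
    "writeBatchSize": 0,
    "writeBatchTimeout": "00:00:00"
  }
}
```

A narrower DELETE statement, for example one filtered on a load date column, would keep earlier results in place, but the table in this example has no such column, so that would require a change to the table design.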
To run all of this JSON, you can wait for it to run on schedule or run it ad hoc, which I detail in this post.
Data Factory Workflow
Combining all of the Data Factory components included in this and previous posts, the entire workflow diagram is shown below. In the first pipeline, data is copied from the database to blob storage. Next, the blob storage data is used to run an Azure ML experiment, which outputs data to blob storage. Lastly, the results from the experiment are copied to a database. Notice all of the lovely green checks in the diagram.
This blog series on Data Factory has covered everything from creating the Azure components needed to using Data Factory to run an ML web service and sending the results to the database. In my next and last post for a while on Data Factory, I will be discussing troubleshooting, an essential process in getting all the code to work. To be notified when new posts appear, please subscribe to my blog. I hope that you have found this to be useful. If so, please leave me comments or message me on Twitter, as I would love to hear what others are doing with Data Factory.
Yours Always
Ginger Grant
Data aficionado et SQL Raconteur