Analyzing JSON in U-SQL

U-SQL has built-in extractors for parsing text, comma-delimited, and tab-delimited files, but once again parsing JSON becomes problematic. The solution is built into U-SQL's extensibility: write some C# code to extend it, or use someone else's C# code that extends U-SQL. Since I wanted to parse JSON, fortunately there are sample libraries available on GitHub containing the code required to do it. Download the GitHub package and open the Microsoft.Analytics.Samples project in Visual Studio. When I did this the first time, there was a problem loading the Newtonsoft.Json reference, so I right-clicked on the references and downloaded the missing parts again. Build the solution and check out the output in the directory …Examples\DataFormats\Microsoft.Analytics.Samples.Formats\bin\Debug\ . There will be two DLLs, Microsoft.Analytics.Samples.Formats.dll and Newtonsoft.Json.dll. These DLLs then need to be registered in Data Lake Analytics, and locally as well if you choose to run your U-SQL locally. Since the goal at some point is to run from within Data Lake Analytics, you will need to copy both of these DLLs to the data lake. I created a folder for the DLLs called Assemblies, and ran this command:


USE DATABASE [master];
CREATE ASSEMBLY [Newtonsoft.Json] FROM @"/Assemblies/Newtonsoft.Json.dll";
CREATE ASSEMBLY [Microsoft.Analytics.Samples.Formats] FROM @"/Assemblies/Microsoft.Analytics.Samples.Formats.dll";

Notice I told the U-SQL where to find the DLLs: in the Assemblies folder. This step only needs to be completed once per data lake. After this job runs successfully, the DLLs which allow the JSON to be parsed can be referenced in other scripts.
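
If you ever rebuild the sample project and upload new copies of the DLLs, the existing assemblies need to be dropped before they can be re-created. This is a minimal sketch, assuming the refreshed DLLs were copied back to the same Assemblies folder:

USE DATABASE [master];
// Sketch only: drop the dependent assembly first, then Newtonsoft.Json, then re-register both.
DROP ASSEMBLY IF EXISTS [Microsoft.Analytics.Samples.Formats];
DROP ASSEMBLY IF EXISTS [Newtonsoft.Json];
CREATE ASSEMBLY [Newtonsoft.Json] FROM @"/Assemblies/Newtonsoft.Json.dll";
CREATE ASSEMBLY [Microsoft.Analytics.Samples.Formats] FROM @"/Assemblies/Microsoft.Analytics.Samples.Formats.dll";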

Here is my sample JSON, which I have copied to /Samples/Data/TestNew.json in the Data Lake:

{
    "appInstanceId": "357ced1e-cf05-459c-9317-794bq24f61c2",
    "firmwareVersion": "1.0.2.4",
    "serialNumber": "254542-694967",
    "Side": "0",
    "Latitude": "33.8848744",
    "Longitude": "-128.403276",
    "GeneratedDate": "2016-10-04T21:18:19Z"
}

Now that I have added the JSON to the Data Lake and the assemblies have been registered, I can write some U-SQL to parse the JSON. First I will need to reference the libraries, then create a schema, as there is no schema stored in a Data Lake. After those steps are completed, it's possible to write SQL to query a JSON file. There is no UI to look at the results, so the results will be written to a file. I am going to output the data to a csv file called JSONoutput.csv. Here's the code to do that.

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @infile string="/Samples/Data/TestNew.json";

@logSchema =
EXTRACT name string
, appInstanceId string
, firmwareVersion string
, serialNumber string
, Side string
, Latitude float
, Longitude float
FROM @infile
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

@testthis = SELECT appInstanceId
, COUNT(*) AS LocationCount
FROM @logSchema
GROUP BY appInstanceId;

OUTPUT @testthis
TO "/Samples/Data/JSONoutput.csv"
USING Outputters.Csv();
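
The sample file above is a single flat JSON object, so the default extractor is enough. If a file instead nests the readings inside an array, the same sample library can still be used: the JsonExtractor constructor accepts a JSONPath expression telling it which node to treat as the rowset. The following is only a sketch, assuming a hypothetical file /Samples/Data/TestNested.json where each reading sits inside a top-level "readings" array; the rest of the script would stay the same.

// "readings[*]" is a JSONPath expression selecting each object in the readings array.
@nestedSchema =
EXTRACT appInstanceId string
, serialNumber string
, Latitude float
, Longitude float
FROM "/Samples/Data/TestNested.json"
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("readings[*]");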

[Screenshot: running the U-SQL job from Visual Studio]

Using Visual Studio, I am running the U-SQL job. There isn't much data to parse, and you can see in the summary windows that it took 21 seconds to prepare and 33 seconds to run.

When I go to the web and look at the Data Lake Analytics page, I can also see that the job completed. I have noticed that the job status shows up at pretty close to the same time on the web as it does in Visual Studio.

[Screenshot: the completed job on the Azure Data Lake Analytics page]

Clicking on the bar graph representing today's jobs will allow me to select the job which ran, showing the same screen that appears in Visual Studio.

Thanks to Erik Zwiefel and Mark Vaillancourt, both of Microsoft, for helping me figure out the process for using JSON in Data Lake Analytics, as I didn't understand the steps required to parse JSON. I hope this blog makes it possible for you to figure out how to make it work.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Using Visual Studio for U-SQL Data Analytics Jobs

The pricing for U-SQL is based upon the number of Analytics Units and completed jobs. To decrease the amount of money being spent, it would be most efficient if only working jobs ran in Azure, not the 27 iterations it took to debug them. Fortunately, all of the debugging can be performed locally, and only working jobs need to be run in Azure. Another thing you may notice if you are exclusively using the Azure Portal for running Data Lake Analytics jobs is that there is no way to actually save a job. Once the job is completed, you can review it and then click on the View Script button. Don't rely on that button though, because for reasons unknown the View Script button is sometimes not enabled, meaning it is not possible to see what ran.

Data Lake Analytics Setup for Visual Studio

There are a few steps required before any code is run. If the Data Lake Analytics Tools are not installed within Visual Studio, download and install them. When the tools are installed, the menu item Data Lake appears in Visual Studio. The second step is to model your PC with the same file structure as your data lake. The default location where the Data Lake tools will look for your data is C:\Users\<<insertyourname>>\AppData\Local\USQLDataRoot . What this means is that if you have folders and subfolders created in your data lake, your PC needs to have the same structure, including the data.
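
To see why the folder structures need to match, here is a small sketch, assuming the default USQLDataRoot location and the TestNew.json file from the previous post:

// Paths in a U-SQL script are always written relative to the data root.
DECLARE @infile string = "/Samples/Data/TestNew.json";
// With the run context set to (Local), this resolves on disk to
//   C:\Users\<<insertyourname>>\AppData\Local\USQLDataRoot\Samples\Data\TestNew.json
// With the context set to the Azure account, the same path points at the file in the
// data lake, so the script itself does not need to change between environments.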

Running Data Lake Jobs Locally

If you take a look at the screen picture of Visual Studio with the Data Lake tools installed, you will notice a series of buttons at the top of the screen. The middle drop-down is currently set to (Local). That drop-down box at the top of the screen will allow you to run the job against either your Azure Data Lake Analytics instance or locally. If it is run locally, there will be no charges incurred on Azure. Also, unlike the portal, in Visual Studio you can of course save the U-SQL file.

[Screenshot: the U-SQL script open in Visual Studio with the (Local) context selected]


When the context is switched to the Data Lake Analytics instance in Azure, you will probably want to check out the Submit button. There is only one option, Advanced. In this window, you can change the job name. It defaults to the name of the script being run, but if you are running the same script over and over, you may wish to change this name so that the different instances can be identified. Parallelism can also be set to the value that is actually being used in the job. Take a look at the job view, which is the tab to the left of Script; this screen shows the processes in use when the job is run, so set the value accordingly. You will be charged for the Parallelism value that is set, not the amount actually used. For example, a job submitted with a Parallelism of 10 is billed for all 10 units even if it only ever uses 2, so setting a lower value can decrease the cost of running a job.

The tab on the far left shows the job with the same view shown for the job in the Azure Portal. That screen is shown below.

[Screenshot: the job view in Visual Studio]

Running in Visual Studio also has the benefit of less switching between screens than the Azure Portal, which is another reason to develop here. Now that I have this environment set up, I plan on writing all of my Data Lake Analytics jobs here, as I find the development environment works better for me. Let me know what you think of it by commenting below. If you are interested in finding out more about running Data Lake Analytics jobs, especially if you are trying to parse JSON, please subscribe to my blog, as that topic will be covered in an upcoming post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur