Analyzing JSON in U-SQL

In U-SQL there are built-in extractors for parsing text, comma-delimited, or tab-delimited files. Once again, parsing JSON becomes problematic. U-SQL does have a solution built in: write some C# code to extend it, or use someone else’s C# code to extend U-SQL. Since I wanted to parse JSON, fortunately there are libraries available on GitHub containing what is required to do it. Download the GitHub package and open up the Microsoft.Analytics.Samples project in Visual Studio. When I did this the first time, there was a problem loading the Newtonsoft.Json reference, so I right-clicked on the references and downloaded the missing parts again.

Build the solution and check out the output in the directory …Examples\DataFormats\Microsoft.Analytics.Samples.Formats\bin\Debug\ . There will be two DLLs, Microsoft.Analytics.Samples.Formats.dll and Newtonsoft.Json.dll. These DLLs then need to be registered in Data Lake Analytics, and locally if you choose to run your U-SQL locally. Since the goal is eventually to run from within Data Lake Analytics, you will need to copy both of these DLLs to the data lake. I created a folder for the DLLs called Assemblies, and ran this command:


USE DATABASE [master];
CREATE ASSEMBLY [Newtonsoft.Json] FROM @"/Assemblies/Newtonsoft.Json.dll";
CREATE ASSEMBLY [Microsoft.Analytics.Samples.Formats] FROM @"/Assemblies/Microsoft.Analytics.Samples.Formats.dll";

Notice I told U-SQL where to find the DLLs: in the Assemblies folder. This step only needs to be completed once per data lake, because the assemblies stay registered in the master database. After this job runs successfully, the DLLs which allow the JSON to be parsed can be referenced.
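If there is a chance the registration job gets submitted a second time, the IF NOT EXISTS clause of CREATE ASSEMBLY keeps it from failing on an assembly that is already registered. This is only a sketch of that variation, assuming the same master database and /Assemblies folder described above.

```usql
USE DATABASE [master];

// Register each DLL only if it is not already in the catalog.
// Newtonsoft.Json comes first because the samples assembly depends on it.
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json]
FROM @"/Assemblies/Newtonsoft.Json.dll";

CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats]
FROM @"/Assemblies/Microsoft.Analytics.Samples.Formats.dll";
```

Note that IF NOT EXISTS will not replace an older copy of a DLL; if you rebuild the project, the existing assembly has to be dropped before the new one can be registered.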

Here is my sample JSON, which I have copied to /Samples/Data/TestNew.json in the Data Lake:

{
"appInstanceId": "357ced1e-cf05-459c-9317-794bq24f61c2",
"firmwareVersion": "1.0.2.4",
"serialNumber": "254542-694967",
"Side": "0",
"Latitude": "33.8848744",
"Longitude": "-128.403276",
"GeneratedDate": "2016-10-04T21:18:19Z"
}

Now that I have added the JSON to the Data Lake and the assemblies have been registered, I can write some U-SQL to parse the JSON. First I will need to reference the libraries, then create a schema, as there is no schema for a Data Lake. After those steps are completed, it’s possible to write SQL to query a JSON file. There is no UI to look at the results, so the results will be written to a file. I am going to output the data to a csv file called JSONoutput.csv. Here’s the code to do that.

// Load the two assemblies registered earlier.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @infile string = "/Samples/Data/TestNew.json";

// Define the schema to apply when the JSON file is read.
@logSchema =
    EXTRACT name string    // not present in the sample JSON above
    , appInstanceId string
    , firmwareVersion string
    , serialNumber string
    , Side string
    , Latitude float
    , Longitude float
    FROM @infile
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

// Count the rows for each appInstanceId.
@testthis =
    SELECT appInstanceId
    , COUNT(*) AS LocationCount
    FROM @logSchema
    GROUP BY appInstanceId;

// Write the result to a csv file in the Data Lake.
OUTPUT @testthis
TO "/Samples/Data/JSONoutput.csv"
USING Outputters.Csv();
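Once the schema is in place, the same rowset can be queried in other ways. The sketch below writes the individual rows instead of a count, filters on one firmware version, and adds a header row to the csv. The output file name JSONdetail.csv is made up for this example, and the filter value is simply the one from the sample JSON. Because U-SQL expressions are C#, the comparison uses == rather than a single =.

```usql
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @infile string = "/Samples/Data/TestNew.json";
// Hypothetical output path, chosen for this sketch only.
DECLARE @outfile string = "/Samples/Data/JSONdetail.csv";

@log =
    EXTRACT appInstanceId string,
            firmwareVersion string,
            serialNumber string
    FROM @infile
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

// Keep only the rows reporting one firmware version (C# equality syntax).
@detail =
    SELECT appInstanceId,
           firmwareVersion,
           serialNumber
    FROM @log
    WHERE firmwareVersion == "1.0.2.4";

// outputHeader writes the column names as the first line of the file.
OUTPUT @detail
TO @outfile
USING Outputters.Csv(outputHeader : true);
```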

[Image: vsrunjson]

Using Visual Studio, I am running the U-SQL job. There isn’t much data to parse, and you can see in the summary window that it took 21 seconds to prepare and 33 seconds to run.

When I go to the web and look at the Data Lake Analytics page, I can also see that the job completed. I have noticed that the job appears at pretty much the same time on the web and in Visual Studio.

[Image: azuredlscreen]

Clicking on today’s bar in the bar graph allows me to select the job which ran, showing the same screen as appears in Visual Studio.

Thanks to Erik Zwiefel and Mark Vaillancourt, both of Microsoft, for helping me figure out the process to use JSON in Data Lake Analytics, as I didn’t understand the steps which are required to parse JSON. I hope this blog makes it possible for you to figure out how to make it work.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

U-SQL and Azure Data Lake Analytics

There are a number of different SQL flavors (HQL, PL/SQL, MySQL, U-SQL, T-SQL), all of which are derivatives of ANSI SQL, which is, I suppose, in today’s parlance, A-SQL. Many people have not heard of U-SQL, which Microsoft introduced on September 28, 2015. Since the announcement was in the Visual Studio Blog, a number of data people may have missed it. U-SQL is meant to combine the ease of SQL with the functionality of C# to create a language which can process any kind of data, like videos or text, by making it possible to customize the code and scale without limit. This is very useful if, for example, all of the data is stored in an Azure Data Lake.
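To give a feel for what combining SQL with C# means in practice, here is a small sketch in which ordinary C# methods are called inside a SELECT. The rows, column names, and output path are all invented for the illustration; nothing here comes from a real data set.

```usql
// Inline sample rows, invented for this sketch.
@sales =
    SELECT *
    FROM (VALUES
            ("downtown", "2016-10-04"),
            ("uptown",   "2016-10-05")
         ) AS T(SaleLocation, SaleDate);

// The SQL shape is familiar; the expressions are plain C#.
@result =
    SELECT SaleLocation.ToUpper() AS SaleLocation,
           DateTime.Parse(SaleDate).DayOfWeek.ToString() AS SaleDay
    FROM @sales;

// Hypothetical output path.
OUTPUT @result
TO "/Samples/Data/Output/CSharpDemo.csv"
USING Outputters.Csv();
```

The string methods and DateTime.Parse are the standard .NET ones, which is why anyone who knows a little C# can extend a query without learning a new function library.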

Using U-SQL in Azure Data Lake Analytics

In my previous series on Stream Analytics, I wrote queries in the Stream Analytics query language. Those queries didn’t look much different from ANSI SQL, which is sort of the point of porting the functionality to a different yet familiar language. U-SQL takes the same approach, and the application which heavily uses it is Azure Data Lake. Data Lake Store is compatible with HDFS, so it works with HDInsight, but you don’t need to write Hive to query the data, as U-SQL will do it. Like Hive, U-SQL can be used to create a schema on top of some data, and then query it.

For example, to write a query on this tab-delimited file stored in a Data Lake, I would need to create the data definition for the data, then I could easily write a statement to query it.

[Image: PopsicleDataLake]

@searchlog =
    EXTRACT SaleDate string,
            SaleLocation string,
            Lemon int,
            Orange int,
            Temperature int,
            Leaflets int,
            Price string
    FROM "Samples/Data/Popsicle.tsv"
    USING Extractors.Tsv();


@testthis = SELECT SaleLocation
, COUNT(*) AS LocationCount
FROM @searchlog
GROUP BY SaleLocation;


OUTPUT @testthis
TO "Samples/Data/Output/SaleLocCount.csv"
USING Outputters.Csv();

In this U-SQL code, I am creating a structure for the data, querying some fields, and writing the output to another file. Make sure that you don’t forget the semicolons, as leaving them out will cause errors. Also, if any of your fields are blank, you will have to code for that as well. From within Data Lake Analytics, the U-SQL is run as a job, creating a new file. Note the time that it took to finish the job.
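One way to code for blank fields is to declare the numeric columns as nullable types, so that an empty field loads as null instead of stopping the job. The sketch below assumes the same Popsicle.tsv layout as above; the totals query and the output file name SaleLocTotals.csv are made up for the illustration.

```usql
@searchlog =
    EXTRACT SaleDate string,
            SaleLocation string,
            Lemon int?,         // int? lets a blank field load as null
            Orange int?,
            Temperature int?,
            Leaflets int?,
            Price string
    FROM "Samples/Data/Popsicle.tsv"
    USING Extractors.Tsv();

// The C# ?? operator substitutes 0 wherever a value is null.
@totals =
    SELECT SaleLocation,
           SUM(Lemon ?? 0) AS LemonSold,
           SUM(Orange ?? 0) AS OrangeSold
    FROM @searchlog
    GROUP BY SaleLocation;

// Hypothetical output path.
OUTPUT @totals
TO "Samples/Data/Output/SaleLocTotals.csv"
USING Outputters.Csv();
```

Whether a blank should count as zero or be left out of the total depends on what the data means, so treat the ?? 0 as one possible choice rather than a rule.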

[Image: USQLJob]

The reason data is stored in a Data Lake is to provide a single storage location for the data, which will be used in analytics. U-SQL provides a powerful tool for getting the data out.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur