Databricks is a recent addition to Azure that is greatly influencing the technology choices people make when determining how to process data. Prior to the introduction of Databricks to Azure in March of 2018, if you had a lot of unstructured data stored in HDFS clusters and wanted to analyze it in a scalable fashion, the choice was Data Lake, using U-SQL with Data Lake Analytics. With the introduction of Databricks, there is now a choice between Data Lake Analytics and Databricks for analyzing data.
Analyzing Data with Data Lake Analytics
Data Lake Analytics offers many of the same features as Databricks: you can write code to analyze data, and the analysis can be automatically parallelized to scale. Microsoft has released a new version of Data Lake, which they are calling Data Lake Storage Gen2, to improve the performance of analysis performed with Data Lakes. The difference between the old version and the new one is the addition of a hierarchical namespace on top of Azure Blob Storage, which provides an indexing capability, meaning operations can be performed on a directory rather than by enumerating through all of the data. Data stored within a Data Lake can be accessed just like HDFS, and Microsoft has provided a new driver for accessing data in a Data Lake which can be used with SQL Data Warehouse, HDInsight, and Databricks. With Data Lake Analytics, the data analysis is designed to be performed in U-SQL. While it supports R and Python libraries, users of the technology will need to get up to speed on U-SQL, which is a lot like C#. Since U-SQL is only a few years old, there are not many people who are familiar with it.
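To give a feel for the new driver, here is a minimal sketch of reading Gen2 data from a Databricks Python notebook over the abfss:// scheme. The storage account name (mylake), container (raw), file path, and secret scope names are placeholder assumptions for illustration, not values from this article.

```python
# In a Databricks notebook, `spark` (the SparkSession) and `dbutils`
# are predefined; no imports are needed for this sketch.

# Authenticate to the storage account with an account key pulled from a
# secret scope (the scope and key names here are hypothetical). OAuth via
# a service principal is also supported.
spark.conf.set(
    "fs.azure.account.key.mylake.dfs.core.windows.net",
    dbutils.secrets.get(scope="storage", key="mylake-account-key")
)

# The abfss:// scheme is the Gen2 driver; paths address the hierarchical
# namespace directly, so reading a directory does not require enumerating
# every blob underneath it.
df = spark.read.csv(
    "abfss://raw@mylake.dfs.core.windows.net/sales/2018/",
    header=True,
    inferSchema=True,
)
df.show(5)
```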
Analyzing Data with Databricks
When analyzing data with Databricks, there are three different languages you can use: R, Scala, and Python. Data can be read in from a variety of Azure Storage options, including Blob Storage and Data Lake, and, by using a JDBC connection, you can also connect to Azure SQL DB as well as Azure SQL Data Warehouse. Since there are three supported languages, there is no reason to learn a new one, as most data people are already very familiar with at least one of R, Scala, or Python.
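As a rough sketch of the JDBC route, the following reads a table from Azure SQL DB into a DataFrame from a Databricks Python notebook. The server, database, table, column, and credential names are placeholder assumptions.

```python
# Hypothetical Azure SQL DB endpoint; Databricks clusters ship with the
# SQL Server JDBC driver, so no extra library install is assumed here.
jdbc_url = (
    "jdbc:sqlserver://myserver.database.windows.net:1433;"
    "database=mydb;encrypt=true"
)

df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.Customers")   # hypothetical table
      .option("user", "analyst")            # hypothetical SQL login
      .option("password",
              dbutils.secrets.get(scope="sql", key="analyst-pw")))
df = df.load() if False else df.load()  # load() executes the read

# Once loaded, the same DataFrame API applies no matter where the data
# came from (assumes a Region column exists in the hypothetical table).
df.groupBy("Region").count().show()
```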
In addition to the ability to develop code, Databricks offers some other features which are not found in Data Lake Analytics. Most projects anticipate that people are going to be working in teams and will need an environment to share code and version it. This capability is baked into Azure Databricks, as it provides an environment for sharing code with others and natively saving it to a GitHub repository. The development environment is a notebook interface, similar to Jupyter Notebooks, which provides a great way to document the code and include data samples at the same time. Databricks also includes a job scheduler component, so work created in Databricks can use a native scheduler which has the ability to retry and send configurable messages on error or completion (see the sketch below). These additional features, plus the ability to code in languages which are already widely used in the industry, give Databricks the edge when determining which technology to use going forward.
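Here is a minimal sketch of what wiring up that native scheduler can look like through the Databricks Jobs REST API (version 2.0), including the retry and notification settings. The workspace URL, token source, notebook path, cluster sizing, and email addresses are all placeholder assumptions.

```python
import os
import requests

# Hypothetical workspace URL and a personal access token supplied via an
# environment variable.
workspace = "https://eastus.azuredatabricks.net"
token = os.environ["DATABRICKS_TOKEN"]

job_spec = {
    "name": "nightly-sales-load",
    "new_cluster": {
        "spark_version": "5.5.x-scala2.11",   # example runtime label
        "node_type_id": "Standard_DS3_v2",    # example Azure node type
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Shared/nightly_sales_load"},
    "max_retries": 3,                          # retry a failed run up to 3 times
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # 2:00 AM daily
        "timezone_id": "America/Phoenix",
    },
    "email_notifications": {
        "on_failure": ["team@example.com"],    # configurable message on error
        "on_success": ["team@example.com"],    # ... and on completion
    },
}

resp = requests.post(
    f"{workspace}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json())   # returns the new job_id on success
```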
Yours Always
Ginger Grant
Data aficionado et SQL Raconteur