On April 29, 2015 Microsoft announced they were offering a new product Azure Data Lake. For those of us who know what a data lake is, one might have thought that having a new data lake product was, perhaps redundant, because Microsoft already supported data lakes with HDInsight and Hadoop. To understand why you might want a separate product, let’s look at what a data lake is. I think the best definition of a data lake that I read recently was here. Here’s the TL;DR version “A ‘data lake’ is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed.” Ok so here’s the question, one can spin up an HDInsight Hadoop cluster on Azure and put all of your data there, which means you can already create a data lake. Since you can already create a data lake, why did Microsoft go and create a new product?
Hardware Optimization and the Data Lake
If you look at Microsoft’s most recent Azure release, you’ll see they are releasing products designed to operate together. Service Bus, Event Hubs, Streaming Analytics, Machine Learning and Data Factory are designed to process lots of data, especially a lot of short pieces of data, like Vehicle GPS messages, or other types of real time status messages. In reading the product release for Azure Data Lake, they highlight it’s ability to store and more importantly retrieve this kind of data. The difference between the HDInsight already on Azure and the Data Lake product is the hardware dedicated to make the storage and the integration designed to improve access to the data. Data Factory is designed to move your data in the cloud to anywhere, including a data lake. If you look at the graphic Microsoft provides to illustrate what Data Factory is designed to integrate, the rest of the outputs listed have products associated with them. Now there is a product associated with the data lake too. Data lakes are designed to store all data, but unlike a database operational data store, data lakes are designed to have the database schema applied when the data is read, not when the data is written. This allows for faster writing of the data, but it does tend to make accessing the data slower. The Azure Data Lake hardware, according to the release, is designed to address this issue by providing computing power designed for massively parallel processing to provide the data when needed, which would be on the reading and analysis of the data, not when it is written. This sort of targeted computing power differs from the HDInsight Hadoop offering, which is uses a standard hardware model for storage and access. By tailoring the hardware to meet the needs of the specific type of data stored, in theory this will greatly improve performance, which will increase the adoption of not only the Azure Data Lake, but the tools to analyze and collect the data too. It’s going to be interesting to see how the marketplace responds as this could really push massive amounts of data to the Azure cloud. Time will tell.
Yours Always
Ginger Grant
Data aficionado et SQL Raconteur