Reviewing Azure Databricks and Data Lake Analytics

Databricks is a recent addition to Azure that is greatly influencing the technology choices people make when deciding how to process data.  Prior to the introduction of Databricks to Azure in March of 2018, if you had a lot of unstructured data stored in HDFS clusters and wanted to analyze it in a scalable fashion, the choice was Data Lake with U-SQL in Data Lake Analytics.  With the introduction of Databricks, there is now a choice between Data Lake Analytics and Databricks for analyzing data.

Analyzing Data with Data Lake Analytics

Data Lake Analytics offers many of the same features as Databricks.  You can write code to analyze data, and the analysis can be automatically parallelized to scale.  Microsoft has released a new version of Data Lake, called Data Lake Storage Gen2, to improve the performance of analysis performed with Data Lakes.  The difference between the old version and the new one is that Gen2 adds a hierarchical namespace to Azure Blob Storage.  The namespace provides an indexing capability, which means that an operation can be performed on a directory rather than by enumerating through all of the data.  Data stored within a Data Lake can be accessed just like HDFS, and Microsoft has provided a new driver for accessing data in a Data Lake which can be used with SQL Data Warehouse, HDInsight and Databricks.  With Data Lake Analytics, the data analysis is designed to be performed in U-SQL.  While it supports R and Python libraries, users of the technology will need to get up to speed on U-SQL, which is a lot like C#.  Since U-SQL is only a few years old, there are not many people who are familiar with it.
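To make the point about directory operations concrete, here is a toy Python sketch.  It is not how Azure Storage is implemented.  The two functions and the file names are made up for illustration, and the only thing the sketch tries to show is why renaming a directory touches every object when the "directory" is just a key prefix, but only one entry when directories are real nodes.

```python
# Toy model only: an illustration of the idea, not Azure's implementation.

def rename_dir_flat(blobs, old_prefix, new_prefix):
    """Flat namespace: a 'directory' is only a key prefix, so every
    matching object key has to be visited and rewritten."""
    touched = 0
    renamed = {}
    for key, data in blobs.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
            touched += 1
        renamed[key] = data
    return renamed, touched


def rename_dir_hierarchical(tree, old_name, new_name):
    """Hierarchical namespace: the directory is a single entry in its
    parent, so one update moves everything underneath it."""
    tree[new_name] = tree.pop(old_name)
    return tree, 1


# Hypothetical data: 1000 empty files under one directory.
flat = {f"raw/2018/file{i}.csv": b"" for i in range(1000)}
flat, flat_ops = rename_dir_flat(flat, "raw/2018/", "raw/archive/")

tree = {"2018": {f"file{i}.csv": b"" for i in range(1000)}}
tree, tree_ops = rename_dir_hierarchical(tree, "2018", "archive")

print("flat namespace updates:", flat_ops)
print("hierarchical namespace updates:", tree_ops)
```

The flat version does work proportional to the number of files, while the hierarchical version does the same amount of work no matter how many files sit in the directory.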

Analyzing Data with Databricks

When analyzing data with Databricks, there are three different languages which you can use: R, Scala, and Python.  Data can be read in from a variety of Azure storage options, including Blob Storage and Data Lake, or over a JDBC connection.  You can also connect to Azure SQL DB, as well as Azure SQL Data Warehouse.  Since three languages are supported, there is usually no need to learn a new one, as most people are already familiar with at least one of them.

In addition to the ability to develop code, Databricks offers some other features which are not found in Data Lake Analytics.  Many projects anticipate that people are going to be working in teams and will need an environment to share code and version it.  This capability is baked into Azure Databricks, as it provides an environment for sharing code with others and natively saving it to a GitHub repository.  The development environment is Jupyter Notebooks, which provide a great way to document the code and include data samples, all at the same time.  Databricks also includes a job scheduling component, so work created in Databricks can use a native scheduler which has the ability to retry and to send configurable messages on error or completion.  These additional features, plus the ability to code in a language which is already widely used in the industry, give Databricks the edge in determining which technology to use going forward.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Limitations in Time Series Data Analysis and the growth of Advanced Data Analytics

As someone who regularly analyzes data, I have done my share of time series analysis to determine trends over time.  I am struck by the fallibility of this sort of analysis.  For those who are unfamiliar with this type of analysis, time series analysis is performed to try to identify patterns in the noise of data to help predict future trends through the use of algorithms like ARIMA.  As a kid I remember hearing, in a fast announcer voice, the following text: “Past performance is no indication of future results”.  As a matter of fact, the SEC requires mutual funds to give this statement to all of their investors.  Yet I get asked to predict the future from past data anyway.  While I enjoy working with data and using advanced analysis techniques including R and Python, I think it is important to realize the limitations of this sort of analysis.  It is considered a good experiment in Machine Learning if you are 85% right.  That is not good enough for a self-driving car, as running people over 15% of the time is generally not considered acceptable.  There are times when looking at the future that the data is not going to provide an answer.  When looking to find answers in data, that is something people need to keep in mind.  While you can find some answers in data, other answers will require prognostication or plain old guessing.
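Here is a small Python sketch of that limitation.  It does not use ARIMA itself.  It fits AR(1), a much simpler relative (the same thing as ARIMA(1,0,0)), by ordinary least squares, and every number in it is made up.  The helper names fit_ar1 and forecast are mine, not from any library.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    x = series[:-1]
    y = series[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    c = mean_y - phi * mean_x
    return c, phi


def forecast(c, phi, last, steps):
    """Roll the fitted equation forward from the last observed value."""
    out = []
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out


# Made-up history that rises steadily by 2 each period.
history = [100, 102, 104, 106, 108, 110, 112, 114]
c, phi = fit_ar1(history)
predicted = forecast(c, phi, history[-1], 3)

# Made-up "future" in which the pattern breaks.
actual = [90, 85, 80]
errors = [p - a for p, a in zip(predicted, actual)]

print("predicted:", predicted)
print("actual:   ", actual)
print("errors:   ", errors)
```

The model can only continue the pattern it was shown, so it keeps climbing while the made-up actual values fall.  Nothing in the history could have told it otherwise.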

Impact on Technology realized in 2026

http://michaeljswart.com/2016/06/t-sql-tuesday-079-its-2016/comment-page-1/#comment-186750

Data analysis is all about pattern matching, and while I don’t find it to be infallible, looking at a wide set of data has led me to plan accordingly.  While I am no Faith Popcorn, my analysis of what I see in the marketplace has led me to make some changes in my own life, as I believe change is coming to the industry.  Adam Machanic’s prompting for looking ahead to 2016 has provided the impetus to publish these theories.  What I see in the marketplace is that the tools which are used to support databases are improving.  I see the ability of software to provide relevant hints and automate the tuning of database queries and performance continuing to improve, meaning there will be less of a need to employ people to perform this task.  With databases being pushed more and more to the cloud and to managed services, I see less and less need to employ many people in DBA roles.  Where I see the industry moving is towards more people being employed in analyzing the data to determine meaning from it.  I see very little data analysis being performed with R in 2026, with most analysis being performed in Python.  This means that if you are looking ahead, and are employed in an area where people are being supplemented with tools, the time is now to learn skills in areas where there is growth.  If you have been thinking about learning data science, Python and advanced analytics tools, now is the time to start so that you will be prepared for the future.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Using Data Analysis to Pick Super Bowl Winners

I know that there is no way to compete with the major sports networks in the compilation of statistics about the two teams playing in the Super Bowl. Instead I am going to focus on one feature, self-interest. Like many people, I have money in the stock market and I want my investments to make money next year. For this reason, I am an unqualified supporter of the Atlanta Falcons in the 2017 Super Bowl. The single data point I am using for my analysis is the fact that the Falcons are an NFC team, and when the NFC wins the stock market goes up. Go Falcons!

Correlation without Causation

“Correlation does not imply causation” is a common phrase in statistics and data analysis. It means that just because two variables move in relation to one another does not mean that there is a cause and effect relationship between the two, even though it may seem like it. Just because it rained when I washed my car does not mean that I can control the precipitation patterns in the desert based on my propensity to visit the car wash. You may be thinking that an NFC team winning the Super Bowl and the stock market going up is an example of correlation without causation. After all, the NFL does not control the worldwide financial markets. If you look at the data though, 80% of the time the markets go up when the NFC wins. That is 50 years of data that supports the idea that the winner does impact the market. Why might that be? Perhaps it follows Quantum Mechanics.
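For anyone who wants to see correlation without causation in numbers, here is a short Python sketch.  The two series are made up and have nothing to do with football or the stock market.  The pearson helper is written out by hand, and comparing period-to-period changes is simply one common sanity check that I am adding here, not a formal test.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def changes(xs):
    """Period-to-period differences, which strip out a shared trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


# Two made-up, unrelated series that both happen to drift upward.
series_a = [10, 12, 15, 15, 18, 21, 22, 26]
series_b = [3, 5, 4, 7, 8, 8, 11, 12]

r_levels = pearson(series_a, series_b)
r_changes = pearson(changes(series_a), changes(series_b))

print("correlation of the raw values:", round(r_levels, 2))
print("correlation of the changes:   ", round(r_changes, 2))
```

The raw values look strongly related simply because both drift upward over time.  Once you compare the period-to-period changes instead, the apparent relationship does not hold up, which is the kind of thing to check before reading cause and effect into two lines that rise together.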

Observer Effect of Quantum Mechanics

When studying physics, specifically quantum mechanics, researchers noticed that the act of observation changed the results. This is something commonly looked at when creating forecasts. Are the forecasts correct because the models are correct, or because people believe them enough to make it happen? The Super Bowl winner's impact on the stock market is well known. Perhaps it is for this reason that it becomes a self-fulfilling prophecy. This is the central belief behind many self-help ideas: if you believe it will happen and work to make it happen, it will happen. Whatever the reason, one cannot ignore 50 years of data.

Perhaps Patriots fans may think that I am pulling a lot of esoteric facts out of the air because I want the Patriots to lose. In all seriousness though, it is all about the data, and the observable effects of data knowledge. If you are watching the game and your team did not make the playoffs, and you are wondering who to root for because you do not care about the winner, perhaps this post helped you to decide.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur