In the past few months, I have been examining Azure Synapse and what it can do. When it was released in November of 2019, the first functionality rolled out was an update of Azure SQL DW. For this reason, many people think that Synapse is just an improved version of a cloud data warehouse. Microsoft did improve SQL DW when it moved it to Synapse. The biggest architectural design change is the separation of storage from compute, a theme common to many cloud projects, which allows compute power to be increased when need dictates and scaled down when computing needs change. Within Synapse, resources are allocated as pools: you can define a SQL pool to run a data warehouse and later change the compute to a different resource. You will still need to partition your DW, as large datasets require partitioning to perform well. Subsequently, Microsoft released Azure Synapse Studio as a container for a larger environment of tools and the notebooks to interact with them.
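As a rough sketch of what partitioning and distribution look like in a Synapse SQL pool, the following creates a hash-distributed, date-partitioned fact table (the table and column names are illustrative, not from any real workload):

```sql
-- Hypothetical fact table in a dedicated SQL pool.
-- DISTRIBUTION spreads rows across compute nodes; PARTITION splits
-- the data by date so large scans can skip irrelevant ranges.
CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT        NOT NULL,
    CustomerKey INT           NOT NULL,
    SaleDate    DATE          NOT NULL,
    Amount      DECIMAL(18,2) NULL
)
WITH
(
    DISTRIBUTION = HASH (CustomerKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( SaleDate RANGE RIGHT FOR VALUES
        ('2019-01-01', '2020-01-01') )
);
```

Because compute is separate from storage, the pool holding this table can be paused or resized without touching the data itself.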
Non-Data Warehouse Elements of Azure Synapse
To me, the more interesting parts of Azure Synapse have nothing to do with data warehouses. Azure Synapse also contains the ability to query files stored in Azure Data Lake Gen 2 as if they were SQL tables. This is a great way to analyze large amounts of data without first cleaning it up and loading it into a relational environment. Within Synapse you can formulate a query using syntax for selecting parts of files, providing the ability to look at many files as if they were one. You can also create processes which bring data into your Synapse environment using orchestration, which people who are familiar with Azure Data Factory will find very familiar. Synapse also contains the ability to analyze data in Cosmos DB without doing ETL or moving the data at all, using a scalable architecture which will not impact the transactions being processed simultaneously on the same Cosmos DB.
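A minimal sketch of querying data lake files as if they were a table, using OPENROWSET from a serverless SQL pool (the storage account, container, and folder path are placeholders you would replace with your own):

```sql
-- Hypothetical query over Parquet files in Azure Data Lake Gen 2.
-- The wildcard lets many files be read as if they were one table.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales_rows;
```

No tables are created and no data is loaded first; the query runs directly against the files in the lake.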
Azure Synapse and Spark
By far the most interesting component of Azure Synapse is the Spark connection. Microsoft has added the ability to create Spark pools within Azure Synapse. To be honest, I was somewhat surprised that this functionality was included here first and not in Azure Machine Learning, where to use Spark you need to access clusters created in Databricks. Spark provides the ability to dynamically scale resources when running processes, which is very handy when writing machine learning code that can really use the performance improvements Spark brings. Because this is Microsoft's Spark, you can also write your code in .NET if you like, in addition to the more common Spark languages: Scala, R, and Python. You can also incorporate the AutoML API created for Azure Machine Learning in R and Python, so that you can use the power of Azure to select your algorithm and hyperparameters instead of spending time doing it yourself.
Getting up to Speed with Synapse
There is a lot to learn when it comes to Synapse, as it combines many different components into one environment. As more and more data is migrated to the cloud, Synapse is uniquely designed to handle big data components containing raw data and managed data lakes, as well as more traditional data warehouse needs. It can also be the location where all of the data is processed, secured, cleaned, and analyzed using machine learning. There is a lot to cover, and since it is new, there are not many places yet where you can learn more about it. If you are interested in a deep dive on Azure Synapse and how to use it in a Modern Data Warehouse, sign up for my precon at PASS Summit 2020, where I will cover the topic in depth.
Data aficionado et SQL Raconteur