Over the past few months, I have been examining Azure Synapse and what it can do. When it was first released in November of 2019, the first functionality rolled out was an update of Azure SQL DW. For this reason, many people think that Synapse is just an improved version of a cloud data warehouse. Microsoft did improve SQL DW when it moved it to Synapse. The biggest architectural design change is the separation of storage from compute, a theme common to many web projects, which allows the compute power to be increased when need dictates and scaled back down when demand drops. Within Synapse, resources are allocated as pools: you can define a SQL pool to run a data warehouse and later change its compute to a different size. You will still need to partition your DW, as large datasets require partitioning to perform well. Subsequently, Microsoft released Azure Synapse Studio as a container for a larger environment of tools, along with notebooks to interact with them.
Non-Data Warehouse Elements of Azure Synapse
To me, the more interesting parts of Azure Synapse have nothing to do with data warehouses. Azure Synapse can also query files stored in Azure Data Lake Gen 2 as if they were SQL tables. This is a great way to analyze large data without first cleaning it up and putting it into a relational environment. Within Synapse you can formulate a query using syntax for selecting parts of files, which lets you look at many files as if they were one. You can also create processes which bring data into your Synapse environment using Orchestration, a process that people who know Azure Data Factory will find very familiar. Synapse can also analyze data in Cosmos DB without doing ETL or moving the data at all, using a scalable architecture which will not impact the transactions being processed simultaneously on the same Cosmos DB.
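As a rough illustration of querying many lake files as if they were one, here is a minimal sketch that builds the kind of T-SQL a Synapse serverless SQL pool accepts through OPENROWSET. The helper name, the storage URL and the `result` alias are all made up for this example, and it only covers Parquet files; treat it as a starting point rather than the one way to do it.

```python
def build_openrowset_query(storage_url: str, top: int = 100) -> str:
    """Build a T-SQL query that reads Parquet files from a data lake.

    storage_url is a hypothetical path; a wildcard such as *.parquet
    lets one query span many files as if they were a single table.
    """
    # Reject quotes so the URL cannot break out of the string literal.
    if "'" in storage_url:
        raise ValueError("storage_url must not contain single quotes")
    if top < 1:
        raise ValueError("top must be a positive integer")

    return (
        f"SELECT TOP {int(top)} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{storage_url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [result];"
    )


if __name__ == "__main__":
    # Hypothetical storage account, container and folder layout.
    url = "https://example.dfs.core.windows.net/data/sales/year=*/*.parquet"
    print(build_openrowset_query(url, top=10))
```

You would paste the printed query into a SQL script in Synapse Studio, or send it over whatever SQL connection you already use. The files stay where they are in the lake.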
Azure Synapse and Spark
By far the most interesting component of Azure Synapse is the Spark connection. Microsoft has added the ability to create Spark pools in Azure Synapse. To be honest, I was somewhat surprised that this functionality was included here first and not in Azure Machine Learning, where to use Spark you need to access clusters created in Databricks. Spark provides the ability to dynamically scale resources when running processes. This is very handy when writing machine learning code, which can really use the performance improvements Spark brings. Because this is Microsoft’s Spark, you can also write your code in .NET if you like, in addition to the more common Spark languages: Scala, R or Python. You can also incorporate the AutoML API created for Azure Machine Learning in R and Python, so that you can use the power of Azure to select your algorithm and hyperparameters instead of spending time doing it yourself.
Getting up to Speed with Synapse
There is a lot to learn when it comes to Synapse, as it combines many different components into one environment. As more and more data is migrated to the cloud, Synapse is designed to handle big data components containing raw data and managed data lakes as well as more traditional data warehouse needs. It can also be the location where all of the data is processed, secured, cleaned and analyzed using Machine Learning. There is a lot to cover, and since it is new, there are not yet many places where you can learn more about it. If you are interested in a deep dive on Azure Synapse and how to use it in a Modern Data Warehouse, sign up for my precon at PASS Summit 2020, where I will cover the topic in depth.
Data aficionado et SQL Raconteur
As I was honored to be selected to give a PreCon on the Internals of the Modern Data Warehouse, I thought that I would take the time to explain why I felt drawn to the topic. There are a lot of places that haven’t given much thought to the changes in technology which have happened over the last few years. For them, the major feature upgrades to SQL Server in 2012 and 2014 have meant columnstore indexes, which make things faster, and perhaps better High Availability. While those are certainly valuable improvements, there is a lot more that you can do to derive value from your data, and companies want more than just a well-organized, running data warehouse.
Data is a Valuable Asset
In 2011, Borders Group Inc. was allowed by the Federal Trade Commission to sell its customer information to Barnes and Noble as part of the bankruptcy sale of its assets. In 2015, RadioShack is doing the same thing. Businesses understand that data is valuable, and they are interested in using it to drive decision making. Amazon, Netflix and Target are well known for their use of customer information to drive sales, but they are far from the only ones doing this. This is one of the bigger trends identified recently in the business press. The heads of companies are now looking for their data teams to do more with their data so that they too can have the dream information systems they are reading about.
Total Destruction of the Existing DW is Not Required
While it might often be nice to level everything and start over, that is not always an option. The major reason is that the data warehouse environment already in place has a lot of value. You want to add to the value already there, not destroy what you have. It would also take a long time to recreate the environment, and no one is patient enough to wait for that. Instead, you could expand into areas of new technology as your data grows. Perhaps this means you archive some of the data from your database to a Hadoop cluster instead of backing it up in some far-off location. This would allow you to use Sqoop to bring the data back when you need it, providing ready access to the data. Perhaps you want to give the users more self-service BI capabilities, moving the data analysis into the hands of the people who are more familiar with the data. You could add the capabilities of Power View in Excel, Power Designer or Tableau to your environment.
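To make the Sqoop idea a little more concrete, here is a small sketch that assembles a `sqoop export` command, which copies files from HDFS back into a relational table. The server, database, table, user and HDFS directory are all invented for illustration, and your cluster will likely need more options (a JDBC driver on the classpath, delimiters and so on), so take it as an outline only.

```python
import shlex


def build_sqoop_export(jdbc_url: str, username: str, table: str, export_dir: str) -> list:
    """Return the argument list for a basic `sqoop export` run.

    All values passed in are hypothetical placeholders; the target table
    must already exist in the database before the export is run.
    """
    return [
        "sqoop", "export",
        "--connect", jdbc_url,       # JDBC connection string for the database
        "--username", username,
        "-P",                        # prompt for the password on the console
        "--table", table,            # destination table in the database
        "--export-dir", export_dir,  # HDFS directory holding the archived rows
    ]


if __name__ == "__main__":
    cmd = build_sqoop_export(
        jdbc_url="jdbc:sqlserver://dwserver.example.com:1433;databaseName=SalesDW",
        username="etl_user",
        table="FactSales_Archive",
        export_dir="/archive/factsales/2012",
    )
    # Print the command so it can be reviewed before running it on the cluster.
    print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be handed straight to `subprocess.run` on a machine where Sqoop is installed.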
Incorporating Social Media Information
The business world no longer operates only on a batch cycle. More and more companies want to know what is being said about them so they can respond appropriately. With tools like Azure Event Hubs, Data Factory, Stream Analytics, and Machine Learning, this isn’t as hard to do as it might sound. We’ll review these products so that attendees will understand how these tools can provide greater insight not only into their own data, but also into the data building up about them outside of the company firewall.
For More Information
I really hope you can join me in Huntington Beach on April 10 for a full day of exploring these concepts. I always look forward to events like the precon and of course SQL Saturday #389 – Huntington Beach which is the following day.
It’s funny the different meanings words have when you put them in a different order, a point which anyone who has imitated the dialect of Yoda can tell you. I find words fascinating, as they are not static but have meanings which change over time. For example, the Iron Maiden meant something totally different before there were electric guitars. Thinking of words and things changing, as one year closes and another begins, I start to evaluate past and future topics. Earlier this year, I held an informal poll on Twitter to find out how long people tend to talk on the same topic. The answers were quite varied. Some people keep talking about the same topic as long as there seems to be interest in hearing about it. That way you can get to be a really good speaker on that topic. Another speaker feels obligated to create a new topic each time out to give himself a challenge. The answer that I personally related to was to keep talking about a topic until you are tired of hearing about it, which takes about a year.
SQL Saturday Albuquerque
My first upcoming engagement for 2015 will be at SQL Saturday Albuquerque, where I will be talking about SSIS. I generally talk about things I am interested in or presently working on, and having worked on a lot of ETL recently, I thought that it would be an interesting topic which most people would find helpful. As a consultant, I see a lot of code and wonder why parts of it were written that way. One big reason is that someone thought the design was a good one. Since that is a subjective decision, I thought it might be helpful to clarify design decisions with facts so that people would be able to employ good logic for their design decisions.
Technology changes and their Impact on Data Development
Another topic which really interests me is the change that new technologies are bringing to the database world. With the increased implementation of Hadoop and the cloud, the way data is both stored and used is really changing. Predictive Analytics, Machine Learning, cloud implementations and interactive data visualizations are changing what people expect from the way their data is stored and used. Expectations for data professionals are increasing as the business looks away from the HiPPO (the highest paid person's opinion) and towards the knowledge to be found in data they have gathered themselves or integrated from public sources.
Modern Data Warehouse
I have the pleasure of assisting in a day-long session on Architecting the Modern Data Warehouse. During this one-day session we will be showing how to use new technology such as HDInsight and Machine Learning to implement a modern data warehouse. Instead of just talking about new technologies, we will be putting them to use to show how they can be used today. I’m really looking forward to it.
If you are able to attend any of these or any upcoming sessions, please stop by and introduce yourself as I would love to meet readers of my blog in person.