Articles

Azure Data Lake: Why you might want one

On April 29, 2015, Microsoft announced a new product, Azure Data Lake. For those of us who already know what a data lake is, a new data lake product might seem redundant, because Microsoft already supported data lakes with HDInsight and Hadoop. To understand why you might want a separate product, let’s look at what a data lake is. The best definition of a data lake I’ve read recently is here. Here’s the TL;DR version: “A ‘data lake’ is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed.” So here’s the question: you can already spin up an HDInsight Hadoop cluster on Azure and put all of your data there, which means you can already create a data lake. If that’s the case, why did Microsoft go and create a new product?

Hardware Optimization and the Data Lake

If you look at Microsoft’s most recent Azure releases, you’ll see they are releasing products designed to operate together. Service Bus, Event Hubs, Streaming Analytics, Machine Learning and Data Factory are designed to process lots of data, especially a lot of short pieces of data, like vehicle GPS messages or other types of real-time status messages. The product release for Azure Data Lake highlights its ability to store and, more importantly, retrieve this kind of data.

[Image: DataFactory]

The difference between the HDInsight already on Azure and the Data Lake product is the hardware dedicated to the storage and the integration designed to improve access to the data. Data Factory is designed to move your data in the cloud to anywhere, including a data lake. If you look at the graphic Microsoft provides to illustrate what Data Factory is designed to integrate, the rest of the outputs listed have products associated with them. Now there is a product associated with the data lake too.

Data lakes are designed to store all data, but unlike a database or an operational data store, data lakes are designed to have the schema applied when the data is read, not when the data is written. This allows for faster writing of the data, but it does tend to make reading the data slower. The Azure Data Lake hardware, according to the release, is designed to address this issue by providing computing power built for massively parallel processing, applied when the data is read and analyzed, not when it is written. This sort of targeted computing power differs from the HDInsight Hadoop offering, which uses a standard hardware model for storage and access. By tailoring the hardware to the needs of the specific type of data stored, in theory this will greatly improve performance, which will increase the adoption of not only Azure Data Lake but the tools to analyze and collect the data too. It’s going to be interesting to see how the marketplace responds, as this could really push massive amounts of data to the Azure cloud. Time will tell.
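To make the schema-on-read idea concrete, here is a minimal sketch in Python (the file contents and field names are made up for illustration) showing raw messages landing in storage untouched and the schema only being applied when someone reads them:

    import csv
    import io

    # Raw messages land in the lake exactly as they arrive -- no schema enforced on write.
    raw_messages = [
        "2015-04-29T10:01:00,truck-17,33.4484,-112.0740,62",
        "2015-04-29T10:01:05,truck-09,33.4512,-112.0698,58",
    ]

    # The schema is applied only at read time, when an analyst decides what the fields mean.
    schema = ["event_time", "vehicle_id", "latitude", "longitude", "speed_mph"]

    def read_with_schema(lines, field_names):
        """Parse raw CSV lines into dictionaries using a schema chosen at read time."""
        reader = csv.reader(io.StringIO("\n".join(lines)))
        return [dict(zip(field_names, row)) for row in reader]

    for record in read_with_schema(raw_messages, schema):
        print(record["vehicle_id"], record["speed_mph"])

Writing is just an append of whatever arrives; all of the interpretation, and therefore most of the computing cost, happens at read time, which is exactly where the Data Lake release says the massively parallel hardware is aimed.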

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


What is a Modern Data Warehouse?

Since I was honored to be selected to give a PreCon on the Internals of the Modern Data Warehouse, I thought I would take the time to explain why I felt drawn to the topic. A lot of places haven’t given much thought to the changes in technology that have happened over the last few years. The major feature upgrades in SQL Server 2012 and 2014 meant they could add columnstore indexes, which make queries faster, and perhaps get better High Availability. While those are certainly valuable improvements, there is a lot more you can do to derive value from your data, and companies want more than just a well-organized, running data warehouse.

Data is a Valuable Asset

In 2010, the Federal Trade Commission allowed Borders Group Inc. to sell its customer information to Barnes and Noble as part of the bankruptcy sale of its assets. In 2015, RadioShack is doing the same thing. Businesses understand that data is valuable, and they are interested in using it to drive decision making. Amazon, Netflix and Target are well known for their use of customer information to drive sales, but they are far from the only ones doing this. It is one of the bigger trends identified recently in the business press. The heads of companies are now looking to their data teams to do more with their data so that they too can have the dream information systems they are reading about.

Total Destruction of the Existing DW is Not Required

While it might be nice to level everything and start over, that is not always an option. The major reason is that the data warehouse environment already in place has a lot of value. You want to add to the value already there, not destroy what you have. It would also take a long time to recreate the environment, and no one is patient enough to wait for that. Alternatively, you could expand into areas of new technology as your data grows. Perhaps this means you archive some of your data from your database to a Hadoop cluster instead of backing up the data in some far-off location. This would allow you to use Sqoop to bring the data back when you need it, providing ready access to the data. Perhaps you want to give users more self-service BI capabilities, moving the data analysis into the hands of the people who are most familiar with the data? You could add the capabilities of Power View in Excel, Power Designer or Tableau to your environment.

Incorporating Social Media Information

The business world no longer operates only on a batch cycle. More and more companies want to know what is being said about them so they can respond appropriately. With tools like Azure Event Hubs, Data Factory, Streaming Analytics, and Machine Learning, this isn’t as hard to do as it might sound. We’ll review these products so that attendees understand how these tools can provide greater insight not only into their own data, but also into the data building up about them outside of the company firewall.
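As a taste of how little code the ingestion side takes, here is a minimal sketch using the current azure-eventhub Python SDK (the connection string, hub name, and message are placeholders, not anything from the PreCon material) that pushes a single social media mention into an Event Hub:

    from azure.eventhub import EventHubProducerClient, EventData

    # Placeholder values -- supply your own namespace connection string and hub name.
    CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."
    EVENT_HUB_NAME = "social-mentions"

    producer = EventHubProducerClient.from_connection_string(
        CONNECTION_STR, eventhub_name=EVENT_HUB_NAME
    )

    # Each mention collected from outside the firewall becomes one small event.
    with producer:
        batch = producer.create_batch()
        batch.add(EventData('{"source": "twitter", "text": "Great service today!"}'))
        producer.send_batch(batch)

From there, Streaming Analytics or Machine Learning can pick the events up without the sender needing to know anything about the downstream processing.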

For More Information

I really hope you can join me in Huntington Beach on April 10 for a full day of exploring these concepts. I always look forward to events like the PreCon and, of course, SQL Saturday #389 – Huntington Beach, which is the following day.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Introduction to Hadoop Presentation Follow-up

Thank you so much to everyone who was able to attend my webinar: http://pragmaticworks.com/Training/FreeTraining/ViewWebinar/WebinarID/676. (If you weren’t able to attend, you can always click on the link for a recording.)

It’s always hard to talk about Hadoop, as the subject is so broad that there were a lot of things I had to leave out, so it is fortunate that I have this blog to discuss the topics I wasn’t able to cover. I thought I would take this time to respond to the questions I received.

Presentation Q & A

Do you need to learn Java in order to develop with Hadoop?

No. If you wish to develop for Hadoop in the cloud with HDInsight, you have the option of developing with .NET. If you are working in the Linux environments, which is where a lot of Hadoop development happens, you will need to learn Java.
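Worth a footnote here: Hadoop streaming also lets you write the map and reduce steps in scripting languages. This wasn’t covered in the webinar, but as an illustrative sketch, a word-count mapper in Python is only a few lines (the matching reducer, which sums the counts, is omitted):

    #!/usr/bin/env python3
    # Minimal Hadoop streaming mapper: emits (word, 1) pairs from stdin.
    # Illustrative sketch only -- a matching reducer would sum the counts per word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            # Hadoop streaming expects tab-separated key/value pairs on stdout.
            print(f"{word}\t1")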

Do you know of any courses or sessions available where you can learn about Big Data or Hadoop?

My friend Josh Luedeman is going to be teaching an online class on Big Data next year. If you don’t want to wait that long, I recommend checking out a code camp in your area, such as Desert Code Camp, where they are offering courses in Azure, or a SQL Saturday, especially the BI editions.

How do you recommend a person with a BI background in SQL get started in learning Hadoop and where can I get the VMs?

The two ways I recommend for a person with a BI background to get started with Hadoop are either through a Hortonworks VM or in Microsoft’s Azure cloud with HDInsight. Hortonworks provides a VM, and Microsoft’s environment is hosted on their cloud. As the company that Microsoft partnered with to develop their Hadoop offerings, Hortonworks has very good documentation targeted at people who come from more of a Microsoft BI stack background. If you choose to go with HDInsight, there is a lot of really good documentation and video training available as well.

How do you compare Hadoop with the PDW?

While Hadoop and Microsoft’s PDW, which they now call APS, were both designed to handle big data, the approaches are wildly different. Microsoft built the APS to handle the larger data requirements of people who have structured data, mostly housed in SQL Server. Hadoop was developed in an open source environment to handle unstructured data.

How can I transfer data into HDInsight?

This is a great question, which I promise to devote an entire blog post to very soon, so I’ll give you the Reader’s Digest version here. There are a number of ways you can transfer data into HDInsight. The first step is to transfer the data into the Azure cloud, which you can do via SSIS with a minor modification of the process I blogged about earlier here. The other methods you could use to transfer data are secured FTP or PowerShell, where you call the same REST API used to provision an HDInsight cluster. There is also a UI within HDInsight you can use to transfer data as well.
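One more scripted alternative, not covered in the webinar: HDInsight reads its data from the Azure storage account behind the cluster, so anything that can land a file in that storage will work. Here is a minimal sketch using the current azure-storage-blob Python SDK, with the connection string, container, and file names as placeholders:

    from azure.storage.blob import BlobServiceClient

    # Placeholder values -- use the storage account backing your HDInsight cluster.
    CONNECTION_STR = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
    CONTAINER = "hdinsight-container"

    service = BlobServiceClient.from_connection_string(CONNECTION_STR)
    blob = service.get_blob_client(container=CONTAINER, blob="incoming/sales_2015.csv")

    # Upload a local file; cluster jobs can then read it from the container.
    with open("sales_2015.csv", "rb") as data:
        blob.upload_blob(data, overwrite=True)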

I really appreciate the interest in the Webinar.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

From the clouds, data – DIY Guide to using SSIS with Azure

My apologies to Jack Johnson for lifting his song title for this blog, but I couldn’t help it. I thought it might be useful to discuss the how-tos of data migration to the cloud, which reminded me of all the cloud songs I know. I always thought the cloud metaphor was sort of an odd name for a remote data center, but that is probably why I am not in the business of naming things. But I digress. Back to the topic at hand. Moving data with SSIS to databases hosted on Microsoft’s Azure cloud requires some different steps than just moving data around on premises. Since things on the cloud change quite quickly, I thought I would demonstrate what this currently looks like using SSDT in Visual Studio 2012.

Breaking through the Wall

Before getting started with SSDT, there are some configuration steps to complete on Azure. The Windows Azure SQL Database firewall must be set up to allow connections from your IP, or your data will never get to the cloud. There is an option on the database screen to “Connect to your database”. If the firewall is not opened for the IP and port in use, you will get this message.

[Image: AzureDBScreenConfigFirewall-message]

Selecting Yes is needed to be able to connect to the Azure database via SSIS.

One might also want to validate that the right drivers are loaded in SSDT. If you are running Visual Studio 2012 or 2013, no worries, as the drivers are already there, but for earlier versions new drivers may be required. If one is planning on loading data to an Azure SQL Server database, ODBC or ADO.NET are the connections needed for Azure. The old data connection standby, OLE DB, is going to be left in the toolbox like last year’s dress, as it won’t work for the cloud. Much like fashion, everything old is new again, so ODBC is once again the “It” connection. You can use ADO.NET too, but I won’t be using it here.

The next step in the process is getting the connection information needed to connect to the Azure database. Microsoft made this step quite easy. Look on the Azure Database screen where I’ve pasted a pink arrow. Click there.

[Image: AzureDBScreenFull-EnabledCropped]

 

This makes it so easy, as a screen pops up with all the connection information you need. You just need to copy the ODBC section and remember what your password is, as you will need to enter it.

[Image: AzureODBCConnectionInfo]
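If you want to sanity-check that connection string before wiring it into SSIS, a quick script outside of SSIS works too. Here is a minimal sketch using Python’s pyodbc, where the driver name, server, database, and credentials are placeholders to be replaced with the values copied from the Azure screen:

    import pyodbc

    # Placeholder values -- substitute the ODBC details copied from the Azure portal.
    conn_str = (
        "DRIVER={SQL Server Native Client 11.0};"
        "SERVER=tcp:<yourserver>.database.windows.net,1433;"
        "DATABASE=<yourdatabase>;"
        "UID=<yourlogin>@<yourserver>;"
        "PWD=<yourpassword>;"
        "Encrypt=yes;"
    )

    # If the firewall rule and credentials are correct, this round trip succeeds.
    with pyodbc.connect(conn_str) as conn:
        row = conn.cursor().execute("SELECT @@VERSION").fetchone()
        print(row[0])

If this fails, it is usually the firewall rule or the password, and fixing it here is faster than debugging it inside a package.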

While we are still in Azure, I thought it would be a good idea to show where the SSIS package we will be creating will put the data. Here’s the structure of the destination table.

[Image: AzureDBScreenTable]

 

Here’s the query screen showing that right now the table is empty.

[Image: AzureDBScreenTable-NoData]

SSIS Package Transferring Data to Azure

After you have all of the information you need from Azure, it is relatively simple to create an SSIS package that transfers the data: an OLEDB connection for the on-premises database and an ODBC connection to Azure, using the information copied from the Azure database connection screen.

[Image: AzureDBSSISRunning]

Going back to Azure, you can see 19,972 rows were added.

[Image: AzureDBScreenTable-Data]

One word of caution: as you can see here in the progress log, adding data can be a very slow process.

[Image: AzureDBSSISProgress]

I highlighted the Elapsed time in red so that it would be easy to see that a simple file transfer took over two minutes.

Location, Location

One important thing to consider is where you are going to be moving your data. I demonstrated what I think may be the more common scenario, where the data is not in the cloud and you want to move it to the cloud. Microsoft refers to this as Hybrid Data Movement. Of course this may not be the case. If you are running SQL Server on a virtual machine in the cloud, it may make a lot more sense to run SSIS on that virtual machine. If that is the case, for optimal performance, locate the SSIS VM in the same data center as the database, because otherwise, due to bandwidth and network latency, it will be slower.

When transmitting data around the cloud, whether from on premises to the cloud or from one server to another in the cloud, you might want to consider compressing the data prior to sending, if at all possible, to decrease the size of the data being transmitted. It may be faster to extract the data you want on premises and transmit a compressed file to be applied on the cloud server. This can get more complicated, as it requires setting up a Secure FTP server to transmit the files, which then have to be applied. Natively, SSIS doesn’t have a compression tool, but there are third-party products, such as Task Factory, which will allow you to not only compress the output but send it to your VM via Secured FTP from within the SSIS package.
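If you would rather not buy a third-party component, a small script step can handle the compression before the transfer. Here is a minimal sketch using Python’s standard-library gzip module, with the file names as placeholders for whatever your extract produces:

    import gzip
    import shutil

    # Placeholder file names -- the extract produced on premises and its compressed copy.
    source_file = "customer_extract.csv"
    compressed_file = "customer_extract.csv.gz"

    # Compress the extract before sending it to the cloud VM over Secure FTP.
    with open(source_file, "rb") as src, gzip.open(compressed_file, "wb") as dst:
        shutil.copyfileobj(src, dst)

    print(f"Compressed {source_file} to {compressed_file}")

The smaller the file crossing the wire, the less the bandwidth and latency between data centers will hurt you.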

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur