
SQL Server 2016 and PolyBase

The next release of SQL Server, SQL Server 2016, continues a convention employed in previous releases, which I call the Cadillac release system. At General Motors, new features were traditionally offered first on their most luxurious brand, Cadillac, and if those features proved successful, they were rolled out to Buick and the rest of the product lines. Microsoft does the same thing. Their ‘Cadillac’ is the PDW [Parallel Data Warehouse], Microsoft’s data appliance. One notable example of this release model was the addition of column store indexes to SQL Server. Column store indexes were first available on the PDW, or APS as it is now known, and Microsoft later added column store indexes to SQL Server 2012.

Later that same year, at the PASS Summit 2012, I heard about a really neat feature available in the PDW: PolyBase. The recording I heard is available here, where Dr. David DeWitt of Microsoft explained PolyBase in great detail. I have been waiting to hear that PolyBase would be released to SQL Server ever since. On May the Fourth, 2015, Microsoft announced the preview release of SQL Server 2016. Listed in the release announcement was the feature I’d been waiting for: PolyBase.

Sqoop Limitations

PolyBase provides the ability to integrate a Hadoop cluster with SQL Server, which will allow you to query the data in a Hadoop cluster from SQL Server. While the Apache ecosystem provides the Sqoop application to integrate Hadoop with relational databases, it wasn't really enough. With Sqoop, the data is actually moved from the Hadoop cluster into SQL Server, or the relational database of your choice. This is problematic because you need to know, before you run Sqoop, that you have enough room within your database to hold all the data. I learned this the hard way when I ran out of space playing with Sqoop and SQL Server. From a performance perspective, this kind of data transfer is also, shall we say, far from optimal. Another way to look at Sqoop is that it provides the Hadoop answer to SSIS. After all, Sqoop is performing a data move, just like SSIS code. The caveat is that SSIS is generally faster than Sqoop, and provides a greater feature set too.
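
As a practical aside, before kicking off a Sqoop load into SQL Server it is worth checking how much free space the destination database actually has. A minimal T-SQL sketch, run in the target database (the arithmetic assumes the standard 8 KB pages):

-- Free space per data file in the current database (sizes reported in MB).
SELECT  name AS file_name,
        size / 128 AS size_mb,
        size / 128 - CAST(FILEPROPERTY(name, 'SpaceUsed') AS INT) / 128 AS free_mb
FROM    sys.database_files;

If the free space is nowhere near the size of the Hadoop data you are about to pull over, stop and grow the files, or rethink the transfer, before running Sqoop.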

PolyBase – Hadoop Integration with SQL Server

Unlike Sqoop, PolyBase does not load data into SQL Server. Instead it provides SQL Server with the ability to query Hadoop while leaving the data in HDFS. Since Hadoop is schema-on-read, within SQL Server you define the schema to apply to your data stored in Hadoop. Once the table schema is known, PolyBase provides the ability to query data that lives outside of SQL Server from within SQL Server. Using PolyBase it is possible to integrate data from two completely different file systems, providing the freedom to store the data in either place. No longer will people automatically equate retrieving data from Hadoop with MapReduce. With PolyBase, all of the SQL knowledge accumulated by millions of people becomes a useful tool for retrieving valuable information from Hadoop with SQL. This is a very exciting development which I think will encourage more Hadoop adoption and, better yet, integration with existing data. I am really looking forward to SQL Server 2016.
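
To make that concrete, here is a minimal sketch of what defining and querying an external table looks like in T-SQL. The data source name, cluster address, file layout, and table definition are all hypothetical and would need to match your own environment; it also assumes PolyBase has been installed and Hadoop connectivity configured.

-- Point SQL Server at the Hadoop cluster (the address is a placeholder).
CREATE EXTERNAL DATA SOURCE MyHadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.0.0.10:8020');

-- Describe how the files are laid out.
CREATE EXTERNAL FILE FORMAT CommaDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- Apply a schema to the files; the data itself stays in HDFS.
CREATE EXTERNAL TABLE dbo.SensorReadings
(
    DeviceId     INT,
    ReadingTime  DATETIME2,
    Temperature  FLOAT
)
WITH (LOCATION = '/data/sensors/', DATA_SOURCE = MyHadoopCluster, FILE_FORMAT = CommaDelimited);

-- From here it is just T-SQL, even though the rows are read from Hadoop.
SELECT DeviceId, AVG(Temperature) AS AvgTemperature
FROM dbo.SensorReadings
GROUP BY DeviceId;

The external table can also be joined to ordinary SQL Server tables, which is where the integration of the two worlds really shows up.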

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Azure ML, SSIS and the Modern Data Warehouse

Recently I was afforded the opportunity to speak at several different events, all of which I thoroughly enjoyed. I was able to speak on Azure Machine Learning first at the Arizona SQL Server Users Group meeting. I really appreciate all who attended, as we had quite a crowd. Since the meeting is held practically on Arizona State University’s Tempe campus, it was great to see a number of students attending, most likely due to Ram’s continued marketing efforts on meetup.com. After talking to him about it, I was impressed at his success at improving attendance by promoting the event on Meetup, and wonder if many SQL Server User Groups have experienced the same benefits. If you have, please let me know. Thanks, Joe, for taking a picture of the event too.

Modern Data Warehousing Precon

The second event where I had the opportunity to talk about technology was the precon at SQL Saturday in Huntington Beach, where I spoke about Modern Data Warehousing. It was a real honor to be selected for this event, and I really enjoyed interacting with all of the attendees. Special thanks to Alan Faulkner for his assistance. We discussed the changing data environment, including cloud-based storage, analytics, Hadoop, handling ever-increasing amounts of data from different sources, and the increasing demands of users, and we reviewed technology solutions that demonstrate ways to resolve these issues in attendees' own environments.

Talking and More Importantly Listening

The following day was SQL Saturday #389 in Huntington Beach. Thanks to Andrew, Laurie, Thomas and the rest of the volunteers for making this a great event, as I know a little bit about the work that goes into planning and pulling off an event like this. My sessions, Predicting the Future with Machine Learning (on Azure ML) and Top 10 SSIS Tuning Tricks, were both selected and I had a great turnout at both. To follow up on a question I received during my SSIS session: Balanced Data Distributor was first released as a new SSIS transform for SQL Server 2008 and 2008 R2, so you can use it for versions prior to SQL Server 2012. I’ve posted more information about it here. I also got a chance to meet a real live data scientist, the first time that has happened. Not only did I get a chance to speak but a chance to listen. I really enjoyed the sessions from Steve Hughes on Building a Modern Data Warehouse and Analytics Solution in Azure, Kevin Kline's session, and Julie Koesmarno on Interactive & Actionable Data Visualisation With Power View. As always, it’s wonderful to get a chance to visit in person with the people whose technical expertise I read. In addition to listening to technical jokes which people outside of the SQL community would not find humorous, it’s great to discuss technology with other practitioners. Thanks to Mr. Smith for asking me a question to which I didn’t know the answer, which I now feel compelled to go find. I’ll be investigating the scalability of Azure ML and R so that I will have an answer for him the next time I see him. I really enjoy the challenge of not only investigating and applying new technology but figuring out how to explain what I’ve learned. I look forward to the opportunity to present again, and when I do I’ll be sure to update this site so hopefully I get a chance to meet the people who read this.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

What is a Modern Data Warehouse?

As I was honored to be selected to give a precon on the internals of the Modern Data Warehouse, I thought that I would take the time to explain why I felt drawn to the topic. There are a lot of places that haven’t given much thought to the changes in technology which have happened over the last few years. To them, the major feature upgrades to SQL Server in 2012 and 2014 have meant that they can use column store indexes, which make things faster, and perhaps get better High Availability. While those are certainly valuable improvements, there is a lot more that you can do to derive value from your data, and companies want more than just a well-organized, running data warehouse.

Data is a Valuable Asset

In 2011, Borders Group Inc. was allowed by the Federal Trade Commission to sell its customer information to Barnes and Noble as part of the bankruptcy sale of its assets. In 2015, RadioShack is doing the same thing. Businesses understand that data is valuable and they are interested in using it to drive decision making. Amazon, Netflix and Target are well known for their use of customer information to drive sales, but they are far from the only ones doing this. This is one of the bigger trends identified recently in the business press. The heads of companies are now looking for their data teams to do more with their data so that they, too, can have the dream information systems they are reading about.

Total Destruction of the Existing DW is Not Required

While a lot of the time it might be nice to level everything and start over, that is not always an option. The major reason for this is that the data warehouse environment already in place has a lot of value. You want to add to the value already there, not destroy what you have. It would also take a long time to recreate the environment, and no one is patient enough to wait for that. Alternatively, you can expand into areas of new technology as your data grows. Perhaps this means you archive some of your data from your database to a Hadoop cluster instead of backing up the data in some far-off location. This would allow you to use Sqoop to bring the data back when you need it, providing ready access to the data. Perhaps you want to provide the users more self-service BI capabilities, moving the data analysis into the hands of the people who are most familiar with the data? You could add the capabilities of Power View in Excel, Power Designer or Tableau to your environment.

Incorporating Social Media Information

The business world no longer operates only on a batch cycle. More and more companies want to know what is being said about them so they can respond appropriately. With tools like Azure Event Hubs, Data Factory, Stream Analytics, and Machine Learning, this isn’t as hard to do as it might sound. We’ll review these products so that attendees understand how these tools can provide greater insight not only into their own data, but into the data building up about them outside of the company firewall.

For More Information

I really hope you can join me in Huntington Beach on April 10 for a full day of exploring these concepts. I always look forward to events like the precon and, of course, SQL Saturday #389 – Huntington Beach, which is the following day.

 

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Introduction to Hadoop Presentation Follow-up

Thank you so much to everyone who was able to attend my webinar: http://pragmaticworks.com/Training/FreeTraining/ViewWebinar/WebinarID/676. (If you weren’t able to attend, you can always click on the link for a recording.)

It’s always hard to talk about Hadoop, as the subject is so broad that there were a lot of things I had to leave out, so it is fortunate that I have this blog to discuss the topics I wasn’t able to cover. I thought that I would take this time to respond to the questions I received.

Presentation Q & A

Do you need to Learn Java in order to develop with Hadoop?

No. If you wish to develop for Hadoop in the cloud with HDInsight, you have the option of developing with .NET. If you are working in a Linux environment, which is where a lot of Hadoop development is being done, you will need to learn Java.

Do you know of any courses or sessions available where you can learn about Big Data or Hadoop?

My friend Josh Luedeman is going to be teaching an online class on Big Data next year. If you don’t want to wait that long, I recommend checking out a code camp in your area, such as Desert Code Camp, where they are offering courses in Azure, or SQL Saturday, especially the BI editions.

How do you recommend a person with a BI background in SQL get started in learning Hadoop and where can I get the VMs?

The two ways I recommend for a person with a BI background to get involved with Hadoop are either through a Hortonworks VM or in Microsoft’s Azure cloud with HDInsight. Hortonworks provides a VM, and Microsoft’s environment is hosted in their cloud. As the company that Microsoft partnered with to develop their Hadoop offerings, Hortonworks has very good documentation targeted at people who come from a Microsoft BI stack background. If you choose to go with HDInsight, there is a lot of really good documentation and video training available as well.

How do you compare Hadoop with the PDW?

While Hadoop and Microsoft’s PDW, which they now call APS, were both designed to handle big data, the approaches are wildly different. Microsoft built the APS to handle the larger data requirements of people who have structured data, mostly housed in SQL Server. Hadoop was developed in an open source environment to handle unstructured data.

How can I transfer data into HD Insight?

This is a great question, which I promise to devote an entire blog post to very soon. I’ll give you the Reader’s Digest version here. There are a number of ways you can transfer data into HDInsight. The first step is to transfer the data into the Azure cloud, which you can do via SSIS with a minor modification of the process I blogged about earlier here. The other methods you could use to transfer data are via secured FTP or by using PowerShell, where you will need to call the same REST API that is used to provision an HDInsight cluster. There is also a UI within HDInsight you can use to transfer data as well.

I really appreciate the interest in the Webinar.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

H.I.P.P.O or How decisions are made without data

One of the reasons I enjoy working with data, and especially the new analytical models, is that when you have lots of data, you can analyze it to make decisions which may go against many preconceived notions. One of the data projects I worked on in the past gathered all of the call center data from the phone switches and matched it against the number of payments received from the people who were called. When the results were analyzed, the business chose to make different decisions than it had in the past. Previously the call centers were evaluated by the number of calls made, not by what happened when someone was called. As a result of the change in evaluation methodology, some call centers were closed, some managers were promoted, and other managers were fired. Absent data, decisions are made which can be called into question.

Decisions made without Data are really H.I.P.P.O.

In the absence of data, decisions are often made using the H.I.P.P.O. method, which stands for HIghest Paid Person’s Opinion. When the data was gathered and displayed in a transparent manner, the managers knew they were underperforming, knew the consequences, and were highly motivated to improve. HR felt confident that the reasons for letting people go were not going to be challenged, so they felt free to act as well.

Data Removes Ambiguity in the Decision Process

There are many examples where providing accurate, clear data removes the questions people have regarding decisions. One other example which comes to mind was the question regarding the selection of speakers at the upcoming PASS Summit. As a disclaimer, I did not submit, so I was not surprised when I was not selected. PASS released the speaker feedback, providing the data people needed to understand the criteria for acceptance. Gathering data and developing systems to accurately display it isn’t just a task undertaken because people like the technology; it is the means by which transparency and sound decision making become possible. So next time you are asked to create a report, think about what you are really doing. It’s not an exercise in SSRS; you are providing tools which allow decisions to be made based on facts, not HIPPOs.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

The Scoop on Sqoop

In the weeks following my talks about Big Data at Desert Code Camp and SQL Saturday in Detroit, I have been receiving inquiries at my blog regarding Sqoop, so I thought that I might get more specific about how it works. Sqoop is part of the Apache Borg-like collective of tools, and it was created to work with databases, any databases. Lots of people have databases and like them. Databases are really good ways to store data. Just think: if Oracle had been cheaper and faster, Hadoop might never have been created, because Hadoop was created to solve those problems. I guess, at least in this situation, resistance was far from futile, but I digress. Let’s say you have some data which you would like to load up into your SQL database. Since you are picking the data to load up into SQL Server, I am expecting you are picking data which is already structured.

A while ago I worked on a GPS tracking application. We collected data on trucks every 10 seconds, which means that we were collecting a lot of data. To decrease the data in the database, the data was archived off after 30 days. If I were working there now, I would recommend that the data be archived to HDFS. You could store it very cheaply that way and, using Sqoop, load the data back again if someone threatened to sue or something worse…
Here’s how you make an archive like that work using Sqoop and HDFS:
1. Create an HDFS datastore
2. Load the drivers for SQL Server, because only the MySQL drivers are included
3. Run the Sqoop command
4. Sqoop extracts the data and inserts it into HDFS
Ok, let’s say you want the data back. The trickiest part is getting back only the data you are interested in and not everything you have. You can run out of space in SQL Server by loading all of this data up, so be careful. First you need to know some information about SQL Server. Run this query on your destination:
SELECT CONNECTIONPROPERTY('net_transport') AS net_transport
     , CONNECTIONPROPERTY('local_tcp_port') AS tcp
     , CONNECTIONPROPERTY('client_net_address') AS client_net_address;

If it comes back with something other than TCP for the transport, go into SQL Server Configuration Manager and enable TCP. You will need this information to know what to put in the connection string below. I am, of course, assuming that you have already created a SQL login called hadoop with a password of bigdata.

sqoop import --connect "jdbc:sqlserver://192.168.138.1:1433;database=AdventureWorks;username=hadoop;password=bigdata" --table <YourTableName>

Assuming you kicked this off in the right path and all, congratulations, you have just used Sqoop!

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Hadoop Tools Peg Board

 

Looking at all of the tools available for Hadoop reminds me of the work area in my Grandad’s basement. There he had a giant pegboard, ok, maybe it just seemed big because I wasn’t, and he had all these tools on it. Different kinds of hammers, screwdrivers and saws, and things I couldn’t identify. At first glance Hadoop looks a lot like that. There are lots of tools available, but you will get better results when you know when to use the claw hammer versus the ball-peen variety. Sometimes the difference between tools is not so obvious, like between Hive and Pig. Other times the difference is substantial, for example the difference between Hive and Impala.

Big Data is an overarching term which can describe anything from a bunch of websites to vehicle GPS tracking information which arrives every 10 seconds. Due to the cheaper cost of storage, businesses want to save everything, and they are relying on the data people they employ to extract the answers they want from this reservoir of data, whenever the mood strikes them. In much of the recent literature, this is known as the Cake-in-the-Lake paradigm. The data stored in HDFS is a giant pool, or lake, and the data requested is the cake. I have to digress and wonder who comes up with these metaphors. The useful information is the cake, and you need to go diving in the lake to find it. In this metaphor you are searching for soggy pastry. Wouldn’t it make more sense to go pearl diving for good information? I guess since “Pearl”, or really “Perl”, has already been taken as a name, someone thought a rhyme which evokes mental images of ruined baked goods would be better. Putting aside the metaphors, there are a number of tools and ways to get the good stuff out of the accumulated data pile.

As I am a database person, the tool which has most intrigued me is Cloudera’s Impala. No longer just your father’s Chevy, this tool is a full-on SQL engine sitting on top of an HDFS file system. This is very attractive, due to its high coolness potential, as it allows users to write real ANSI SQL statements on top of Hadoop. Ok, here’s my question: when is that going to work? One of the big things stored in Hadoop is unstructured data. As I recall, the reason you don’t put unstructured data in a database is that the structure of said data does not lend itself to a formalized schema. Think about the structure of a series of web pages. What kind of schema are you going to impose upon that? It won’t work out well. If, on the other hand, the data in the HDFS file structure is a large set of semi-structured data, like sensor data, or data which is inherently structured, Impala could be a good solution. Unfortunately, there is no one tool which will work for everything. If you need to parse through social media posts to find trending instances of people interested in buying a house for the first time in various parts of the country, you may have to use MapReduce. MapReduce is a batch process and a pain to write, so a lot of tools exist precisely so you don’t have to use it, but depending on what you are being asked to do, breaking out a MapReduce program sometimes remains the best solution.
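
For the semi-structured case, the appeal is easy to see. Here is a minimal sketch of the kind of SQL you could run with Impala (or Hive) over delimited sensor files already sitting in HDFS; the table name, columns, and path are hypothetical and only illustrate the pattern:

-- Lay a schema over delimited files that already live in HDFS;
-- nothing is loaded or moved, the files are read in place at query time.
CREATE EXTERNAL TABLE sensor_readings (
    device_id     STRING,
    reading_time  TIMESTAMP,
    temperature   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sensors';

-- From there, ordinary SQL applies.
SELECT device_id, AVG(temperature) AS avg_temperature
FROM sensor_readings
GROUP BY device_id;

The same files remain available to Pig, MapReduce, or anything else in the cluster, which is part of what makes the schema-on-read approach attractive.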

With Hadoop, which tool you can use is greatly influenced by what you are storing. Big data or small, you will still need to take a look at it to determine how to categorize what is inside before taking that tool off the pegboard. And if you are going to be playing around with Hadoop, you are more than likely going to need to know how to use more than one tool.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur