Introduction to Hadoop Presentation Follow-up

Thank you so much to everyone who was able to attend my webinar http://pragmaticworks.com/Training/FreeTraining/ViewWebinar/WebinarID/676 . (If you weren't able to attend, you can always click on the link for a recording.)

It's always hard to talk about Hadoop because the subject is so broad; there were a lot of things I had to leave out, so it is fortunate that I have this blog to discuss the topics I wasn't able to cover. I thought I would take this time to respond to the questions I received.

Presentation Q & A

Do you need to Learn Java in order to develop with Hadoop?

Not necessarily. If you wish to develop for Hadoop in the cloud with HDInsight, you have the option of developing with .NET. If you are working in a Linux environment, which is where a lot of Hadoop development is being done, you will need to learn Java.

Do you know of any courses or sessions available where you can learn about Big Data or Hadoop?

My friend Josh Luedeman is going to be teaching an online class on Big Data next year. If you don't want to wait that long, I recommend checking out a code camp in your area, such as Desert Code Camp, where they are offering sessions on Azure, or a SQL Saturday, especially the BI editions.

How do you recommend a person with a BI background in SQL get started in learning Hadoop and where can I get the VMs?

The two ways I recommend for a person with a BI background to get involved with Hadoop are either through a Hortonworks VM or in Microsoft's Azure cloud with HDInsight. Hortonworks provides a VM, and Microsoft's environment is hosted in its cloud. As the company Microsoft partnered with to develop its Hadoop offerings, Hortonworks has very good documentation targeted at people who come from a Microsoft BI stack background. If you choose to go with HDInsight, there is a lot of really good documentation and video training available as well.

How do you compare Hadoop with the PDW?

While Hadoop and Microsoft's PDW, which they now call APS, were both designed to handle big data, the approaches are wildly different. Microsoft built the APS to handle the larger data requirements of people who have structured data, mostly housed in SQL Server. Hadoop was developed in an open source environment to handle unstructured data.

How can I transfer data into HD Insight?

This is a great question, which I promise to devote an entire blog post to very soon. I'll give you the Reader's Digest version here. There are a number of ways you can transfer data into HDInsight. The first step is to transfer the data into the Azure cloud, which you can do via SSIS, with a minor modification of the process I blogged about earlier here. The other methods you could use to transfer data are via secure FTP or by using PowerShell, where you call the same REST API you use to provision an HDInsight cluster. There is also a UI you can use within HDInsight to transfer data as well.
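To make that first step a little more concrete, here is a hedged sketch of copying a local file into the Azure blob storage container that backs an HDInsight cluster. It uses the Azure command line rather than the PowerShell route mentioned above, and the storage account, key, container, and file names are all placeholders rather than anything from a real environment.

# Assumes the Azure CLI is installed and the storage account already exists.
# Account name, key, container, and file paths below are hypothetical.
az storage blob upload \
    --account-name mystorageaccount \
    --account-key "<storage-account-key>" \
    --container-name mycontainer \
    --file ./sales.csv \
    --name data/sales.csv

Once the file is sitting in that container, the HDInsight cluster can see it through its wasb:// file system just like anything else stored in HDFS.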

I really appreciate the interest in the Webinar.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

The Scoop on Sqoop

In the weeks following my talks at Desert Code Camp and SQL Saturday in Detroit about Big Data, I have been receiving inquiries on my blog regarding Sqoop, so I thought that I might get more specific about how it works. Sqoop is part of the Apache borg-like collective of tools, and it was created to work with databases, any databases. Lots of people have databases and like them. Databases are really good ways to store data. Just think: if Oracle had been cheaper and faster, Hadoop may never have been created, because Hadoop was created to solve those problems. I guess at least in this situation resistance was far from futile, but I digress. Let's say you have some data which you would like to load into your SQL database. Since you are picking the data to load into SQL Server, I am expecting you are picking data which is already structured.

A while ago I worked on a GPS tracking application. We collected data on trucks every 10 seconds, which means we were collecting a lot of data. To decrease the data in the database, the data was archived off after 30 days. If I were working there now, I would recommend that the data be archived to HDFS. You could store it very cheaply that way and, using Sqoop, load the data back again if someone threatened to sue or something worse…
Here's how you make an archive like that work using Sqoop and HDFS (a sketch of the commands follows the list):
1. Create an HDFS datastore
2. Load the JDBC drivers for SQL Server, because out of the box they only give you MySQL
3. Run the Sqoop command
4. This extracts the data and inserts it into HDFS
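Here is a minimal sketch of what those four steps can look like from the command line. The database name, table name, target directory, and date filter are hypothetical; the IP address, user id, and password match the example later in this post. Treat it as a starting point rather than a finished script.

# Assumes Sqoop is installed and the Microsoft JDBC driver for SQL Server
# has already been downloaded. Names below are placeholders.

# Step 1: create a directory in HDFS to hold the archive
hadoop fs -mkdir -p /archive/gps_tracking

# Step 2: put the SQL Server JDBC driver jar where Sqoop can find it
cp sqljdbc4.jar $SQOOP_HOME/lib/

# Steps 3 and 4: run Sqoop, which extracts the rows and writes them to HDFS
sqoop import \
  --connect "jdbc:sqlserver://192.168.138.1:1433;database=Tracking;username=hadoop;password=bigdata" \
  --table GpsHistory \
  --where "ReadingDate < '2014-01-01'" \
  --target-dir /archive/gps_tracking \
  -m 1

The -m 1 at the end runs a single mapper, which keeps Sqoop from asking for a split column; for a big table you would raise that number and tell it which column to split on.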
Ok, let's say you want the data back. The trickiest part is getting back only the data you are interested in and not everything you have. You can run out of space in SQL Server by loading all of this data up, so be careful. First you need to know some information about SQL Server. Run this query on your destination server:
Select CONNECTIONPROPERTY('net_transport') as net_transport
, CONNECTIONPROPERTY('local_tcp_port') as tcp
, CONNECTIONPROPERTY('client_net_address') as client_net_address

If net_transport comes back as something other than TCP, go into SQL Server Configuration Manager to change it to TCP. You will need the port and address information to know what to put in the connection string below. I am of course assuming that you have already created a SQL user id called hadoop with a password of bigdata.

sqoop import --connect "jdbc:sqlserver://192.168.138.1:1433;database=AdventureWorks;username=hadoop;password=bigdata" --table

Assuming you kicked this off in the right path and all, congratulations, you have just used Sqoop!
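To bring the archived rows back into SQL Server, the companion command is sqoop export (sqoop import moves data in the other direction, from the database into HDFS). Here is a hedged sketch, assuming the archived files are still sitting in the HDFS directory from the earlier example and that an empty destination table already exists; the database, table, and directory names are placeholders.

# Push the archived files from HDFS back into an existing, empty SQL Server table
sqoop export \
  --connect "jdbc:sqlserver://192.168.138.1:1433;database=Tracking;username=hadoop;password=bigdata" \
  --table GpsHistory_Restored \
  --export-dir /archive/gps_tracking \
  -m 1

Since export pushes whatever is in the directory, the way to bring back only the rows you care about is to filter them into a separate directory first, with Hive or Pig for example, and export that directory instead.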

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Hadoop Tools Pegboard


Looking at all of the tools available for Hadoop reminds me of the work area in my Grandad's basement. There he had a giant pegboard, ok maybe it just seemed big because I wasn't, and he had all these tools on it: different kinds of hammers, screwdrivers, saws, and things I couldn't identify. At first glance Hadoop looks a lot like that. There are lots of tools available, but you will get better results when you know when to use the claw hammer versus the ball-peen variety. Sometimes the difference between tools is not so obvious, as with Hive and Pig. Other times the difference is substantial, for example between Hive and Impala.

Big Data is an overarching term which can describe anything from a bunch of websites to vehicle GPS tracking information which arrives every 10 seconds. Due to the cheaper cost of storage, businesses want to save everything, and they are relying on the data people they employ to extract the answers they want from this reservoir of data, whenever the mood strikes them. In much of the recent literature, this is known as the Cake-in-the-Lake paradigm. The data stored in HDFS is a giant pool, or lake, and the data requested is the cake. I have to digress and wonder who comes up with these metaphors. The useful information is the cake, and you need to go diving in the lake to find it; in this metaphor you are searching for soggy pastry. Wouldn't it make more sense to go pearl diving for good information? I guess since "Pearl", or really "Perl", has already been taken as a name, someone thought a rhyme which evokes mental images of ruined baked goods would be better. Putting aside the metaphors, there are a number of tools and ways to get the good stuff out of the accumulated data pile.

As I am a database person, the tool which has most intrigued me is Cloudera's Impala. No longer just your father's Chevy, this tool is a full-on SQL query engine on top of an HDFS file system. This is very attractive due to its high coolness potential, as it allows users to write real ANSI SQL statements on top of Hadoop. Ok, here's my question: when is that going to work? One of the big things stored in Hadoop is unstructured data, and the reason you don't put unstructured data in a database is that its structure does not lend itself to a formalized schema. Think about the structure of a series of web pages. What kind of schema are you going to impose upon that? It won't work out well. If, on the other hand, the data in the HDFS file structure is a large set of semi-structured data like sensor readings, or data which is inherently structured, Impala could be a good solution. Unfortunately, there is no one tool which will work for everything. If you need to parse through social media posts to find trending instances of people interested in buying a house for the first time in various parts of the country, you may have to use MapReduce. MapReduce is a batch process and a pain to write, so a lot of tools exist to keep you from having to use it directly, but depending on what you are being asked to do, breaking out a MapReduce program can still be the best solution.
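To make the semi-structured case concrete, here is a hedged sketch of what querying sensor data through Impala might look like from the command line. The table and column names are made up, and it assumes an Impala daemon is running on the local host and that someone has already defined an external table over the files in HDFS.

# Hypothetical query against a table previously defined over HDFS files,
# for example with CREATE EXTERNAL TABLE in impala-shell or Hive.
impala-shell -q "
  SELECT device_id, AVG(reading) AS avg_reading
  FROM sensor_readings
  WHERE reading_date >= '2014-01-01'
  GROUP BY device_id
  ORDER BY avg_reading DESC
  LIMIT 10;"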

With Hadoop, what tool you can use is greatly influenced by what you are storing. Big data or small, you will still need to take a look at it to determine how to categorize what is inside before taking a tool off the pegboard. And if you are going to be playing around with Hadoop, you are more than likely going to need to know how to use more than one tool.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur