Looking at all of the tools available for Hadoop reminds me of the work area in my Grandad’s basement. There he had a giant pegboard, OK, maybe it just seemed big because I wasn’t, and he had all these tools on it: different kinds of hammers, screwdrivers, saws, and things I couldn’t identify. At first glance Hadoop looks a lot like that. There are lots of tools available, but you will get better results when you know when to use the claw hammer versus the ball-peen variety. Sometimes the differences between tools are not so obvious, as with Hive and Pig. Other times the differences are substantial, for example between Hive and Impala.
Big Data is an overarching term which can describe anything, from a bunch of websites to vehicle GPS tracking information collected every 10 seconds. Because storage has become so cheap, businesses want to save everything, and they are relying on the data people they employ to extract the answers they want from this reservoir of data, whenever the mood strikes them. In much of the recent literature, this is known as the Cake-in-the-Lake paradigm. The data stored in HDFS is a giant pool, or lake, and the data requested is the cake. I have to digress and wonder who comes up with these metaphors. The useful information is the cake, and you need to go diving in the lake to find it. In this metaphor you are searching for soggy pastry. Wouldn’t it make more sense to go pearl diving for good information? I guess since “Pearl,” or really “Perl,” has already been taken as a name, someone thought a rhyme which evokes mental images of ruined baked goods would be better. Putting aside the metaphors, there are a number of tools and ways to get the good stuff out of the accumulated data pile.
As I am a database person, the tool which has most intrigued me is Cloudera’s Impala. No longer just your father’s Chevy, this tool puts a full-on SQL engine on top of the HDFS file system. This is very attractive, with high coolness potential, as it allows users to write real ANSI SQL statements on top of Hadoop. OK, here’s my question: when is that going to work? One of the big things stored in Hadoop is unstructured data. As I recall, the reason you don’t put unstructured data in a database is that the structure of said data does not lend itself to a formalized schema. Think about the structure of a series of web pages. What kind of schema are you going to impose upon that? It won’t work out well. If, on the other hand, the data in the HDFS file structure is a large set of semi-structured data, like sensor readings, or data which is inherently structured, Impala could be a good solution. Unfortunately, there is no one tool which will work for everything. If you need to parse through social media posts to find trending instances of people interested in buying a house for the first time in various parts of the country, you may have to use MapReduce. MapReduce is a batch process and a pain to write, so a lot of tools exist precisely so you don’t have to use it. Even so, depending on what you are being asked to do, breaking out a MapReduce program sometimes remains the best solution.
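To make the Impala case concrete, here is a minimal sketch of the kind of thing it allows, assuming a hypothetical directory of comma-delimited sensor readings already sitting in HDFS. The table name, columns, and path below are made up for illustration; the point is that everything is plain SQL, not a hand-written batch job.

-- Hypothetical example: expose a directory of CSV sensor files in HDFS as a queryable table
CREATE EXTERNAL TABLE sensor_readings (
  device_id   STRING,
  reading_ts  TIMESTAMP,
  temperature DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sensors/';

-- Ordinary ANSI-style SQL against those files, no MapReduce code required
SELECT device_id, AVG(temperature) AS avg_temp
FROM sensor_readings
WHERE reading_ts >= CAST('2015-01-01' AS TIMESTAMP)
GROUP BY device_id
ORDER BY avg_temp DESC
LIMIT 10;

Try imposing an equivalent schema on a pile of raw web pages or social media posts and the approach falls apart, which is exactly the dividing line described above.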
With Hadoop, which tool you can use is greatly influenced by what you are storing. Big data or small, you will still need to take a look at it and determine how to categorize what is inside before taking a tool off the pegboard. And if you are going to be playing around with Hadoop, you are more than likely going to need to know how to use more than one tool.
Yours Always
Ginger Grant
Data aficionado et SQL Raconteur