Articles for the Month of April 2015

Azure Data Lake: Why you might want one

On April 29, 2015, Microsoft announced a new product, Azure Data Lake. For those of us who know what a data lake is, a new data lake product might have seemed redundant, because Microsoft already supported data lakes with HDInsight and Hadoop. To understand why you might want a separate product, let's look at what a data lake is. I think the best definition of a data lake that I read recently was here. Here's the TL;DR version: "A 'data lake' is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed." Ok, so here's the question: you can already spin up an HDInsight Hadoop cluster on Azure and put all of your data there, which means you can already create a data lake. Since that's the case, why did Microsoft go and create a new product?

Hardware Optimization and the Data Lake

If you look at Microsoft's most recent Azure releases, you'll see they are releasing products designed to operate together. Service Bus, Event Hubs, Streaming Analytics, Machine Learning and Data Factory are designed to process lots of data, especially a lot of short pieces of data, like vehicle GPS messages or other types of real-time status messages. The product release for Azure Data Lake highlights its ability to store and, more importantly, retrieve this kind of data. The difference between the HDInsight already on Azure and the Data Lake product is the hardware dedicated to the storage and the integration designed to improve access to the data. Data Factory is designed to move your data in the cloud to anywhere, including a data lake. If you look at the graphic Microsoft provides to illustrate what Data Factory is designed to integrate, the rest of the outputs listed have products associated with them. Now there is a product associated with the data lake too.

Data lakes are designed to store all data, but unlike a database or operational data store, they are designed to have the schema applied when the data is read, not when the data is written. This allows for faster writing of the data, but it does tend to make accessing the data slower. The Azure Data Lake hardware, according to the release, is designed to address this issue by providing massively parallel processing power where it is needed: on the reading and analysis of the data, not when it is written. This sort of targeted computing power differs from the HDInsight Hadoop offering, which uses a standard hardware model for storage and access. By tailoring the hardware to the specific type of data stored, Microsoft should, in theory, greatly improve performance, which will increase adoption of not only the Azure Data Lake but the tools to analyze and collect the data too.
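To make schema-on-read concrete, here is a minimal sketch in Python. The field names and events are hypothetical, and plain Python stands in for what an engine like Hadoop or the Data Lake service does at scale: raw records land in storage exactly as they arrive, and a schema is applied only when a reader asks for specific fields.

```python
import json

# Raw events land in the lake as-is; schema-on-write would have
# forced both records into one shape before storing them.
raw_events = [
    '{"vehicle_id": "V1", "lat": 33.45, "lon": -112.07, "ts": "2015-04-29T10:00:00"}',
    '{"vehicle_id": "V2", "speed_kph": 88, "ts": "2015-04-29T10:00:05"}',
]

def read_with_schema(lines, fields):
    """Apply a schema only at read time: pull out the fields this
    analysis needs, filling None for anything a record never captured."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

rows = list(read_with_schema(raw_events, ["vehicle_id", "ts", "speed_kph"]))
```

Writing is cheap because nothing is validated or reshaped on the way in; the cost shows up at read time, which is exactly where the Data Lake release says the parallel compute is aimed.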
It’s going to be interesting to see how the marketplace responds as this could really push massive amounts of data to the Azure cloud. Time will tell.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

What Does Analytics Mean?

A lot of words get used in technology, and after a little while no one bothers to mention what a word means. That’s too bad when the definition of a word gets changed, but that’s not the case with analytics. I found out that analytics is not a new word. It was coined in the 16th century to describe trigonometry, which makes me even more surprised that WordPress’ spell checker always puts a red line under it as a misspelled or unknown word. I had someone tell me recently that they really weren’t sure what it was supposed to mean. Wikipedia says “Analytics is the discovery and communication of meaningful patterns in data”. That’s what we as data professionals are doing when we provide data in a manner which answers questions, such as providing KPIs, machine learning algorithms, or visualizations. It’s not enough to be the keepers of the data library; data should also be used to provide meaning. Here’s another reason to work on analytics: the dollars the trade press is predicting will be spent on business analytics by 2018.

Steps to Providing Analytics

When describing the process for providing analytics, I am sure many people will recognize parts of the process, as they are engaged in them now. The first step is to understand the data. Understanding the data means not only having knowledge of the structure of the data, which obviously will be necessary to select it, but also knowing how the business uses the data. Which fields contain the data they actually use? The second step is preparing the data, including determining what data to include. Do you have all of the data you need to do the analysis? If the answer to that question is no, the analytic process will stop. You may have to exclude some data if it is incomplete or of dubious quality.
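As a small illustration of the preparation step, one might filter out incomplete records and keep track of how much data was excluded. The records and field names here are invented for the sketch:

```python
# Hypothetical order data; two records are incomplete or of dubious quality.
orders = [
    {"order_id": 1, "amount": 25.00, "region": "West"},
    {"order_id": 2, "amount": None,  "region": "West"},   # missing amount
    {"order_id": 3, "amount": 40.00, "region": None},     # missing region
]

# Keep only the records complete enough to analyze.
required = ("amount", "region")
usable = [o for o in orders if all(o.get(f) is not None for f in required)]
excluded = len(orders) - len(usable)
```

Tracking `excluded` matters: if too much of the data has to be thrown out, that is the point where the analytic process stops until better data can be collected.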

Once one has the needed data, it’s time to start the third step, data modeling. Modeling is where you categorize the data and make various decisions regarding it. For example, if a person is wearing a blue shirt and tan pants, is looking at the laptops, and happens to be in Best Buy, you have found an employee. Determining whether your model actually works is the next step, evaluation. Generally speaking, the evaluation will include items where you already know the outcome. For example, if you are trying to predict when your website volume will increase, you want to look at the historical events that made that happen. Marketing people do this to determine if their ad campaigns were successful, for example.
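The modeling and evaluation steps can be sketched with the Best Buy example above. This is a toy rule-based "model" checked against observations where the outcome is already known; all of the data and names are invented for illustration:

```python
def is_employee(person):
    # Toy model: the blue-shirt-and-tan-pants rule from the text.
    return person["shirt"] == "blue" and person["pants"] == "tan"

# Evaluation uses historical observations with known outcomes.
labeled = [
    ({"shirt": "blue", "pants": "tan"},   True),
    ({"shirt": "blue", "pants": "jeans"}, False),
    ({"shirt": "red",  "pants": "tan"},   False),
    ({"shirt": "blue", "pants": "tan"},   True),
    ({"shirt": "blue", "pants": "tan"},   False),  # a customer who happens to match
]

correct = sum(is_employee(p) == outcome for p, outcome in labeled)
accuracy = correct / len(labeled)
```

The last record is the interesting one: the model misclassifies a customer dressed like an employee, and it is exactly this kind of miss, surfaced by evaluating against known outcomes, that tells you whether the model needs more refinement before deployment.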

The Dynamic Analytical Process

After the model is created and successfully tested and evaluated, it’s time to deploy it and monitor the outcomes. One thing to remember about complex analytical models is that they will probably change. One example is an analytical model many people are familiar with: the FICO score. FICO scores were created to predict credit risk. They have been tweaked quite a lot since the latest real estate crash showed that a high FICO score, indicating someone paid credit cards on time, was a lousy predictor of whether or not that same person would default on a mortgage. Netflix changes the movies it recommends when new movies come out. Things change all the time, so working on analytics means the work is never “done”. All the better for those of us who enjoy data analytics.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Azure ML, SSIS and the Modern Data Warehouse

Recently I was afforded the opportunity to speak at several different events, all of which I thoroughly enjoyed. I spoke on Azure Machine Learning first at the Arizona SQL Server Users Group meeting. I really appreciate all who attended, as we had quite a crowd. Since the meeting is held practically on Arizona State University’s Tempe campus, it was great to see a number of students attending, most likely due to Ram’s continued marketing efforts on meetup.com. After talking to him about it, I was impressed at his success at improving attendance by promoting the event on Meetup, and I wonder if many SQL Server User Groups have experienced the same benefits. If you have, please let me know. Thanks, Joe, for taking a picture of the event too.

Modern Data Warehousing Precon

The second event where I had the opportunity to talk about technology was the precon at SQL Saturday in Huntington Beach, where I spoke about Modern Data Warehousing. It was a real honor to be selected for this event, and I really enjoyed interacting with all of the attendees. Special thanks to Alan Faulkner for his assistance. We discussed the changing data environment, including cloud-based storage, analytics, Hadoop, handling ever-increasing amounts of data from different sources, and the increasing demands of users, and we reviewed technology solutions that demonstrate ways to resolve these issues in attendees’ environments.

Talking and More Importantly Listening

The following day was SQL Saturday #389 in Huntington Beach. Thanks to Andrew, Laurie, Thomas and the rest of the volunteers for making this a great event, as I know a little bit about the work that goes into planning and pulling off an event like this. My sessions, Azure ML: Predicting the Future with Machine Learning and Top 10 SSIS Tuning Tricks, were both selected, and I had a great turnout for both. To follow up on a question I received during my SSIS session: Balanced Data Distributor was first released as a new SSIS transform for SQL Server 2008 and 2008 R2, so you can use it in versions prior to SQL Server 2012. I’ve posted more information about it here. I also got a chance to meet a real live data scientist, the first time that has happened. Not only did I get a chance to speak, but also a chance to listen. I really enjoyed the sessions from Steve Hughes on Building a Modern Data Warehouse and Analytics Solution in Azure, Kevin Kline’s session, and Julie Koesmarno on Interactive & Actionable Data Visualisation With Power View. As always, it’s wonderful to get a chance to visit in person with the people whose technical expertise I read. In addition to hearing technical jokes which people outside of the SQL community would not find humorous, it’s great to discuss technology with other practitioners. Thanks to Mr. Smith for asking me a question to which I didn’t know the answer, which now I feel compelled to go find. I’ll be investigating the scalability of Azure ML and R so that I will have an answer for him the next time I see him. I really enjoy the challenge of not only investigating and applying new technology but also figuring out how to explain what I’ve learned. I look forward to the opportunity to present again, and when I do I’ll be sure to update this site so hopefully I get a chance to meet the people who read this.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Musing about Microsoft’s Acquisition of Datazen and Power BI

Microsoft just announced that they have bought Datazen, a mobile data visualization product. While I have no idea what Microsoft is actually going to do with the Datazen product, I couldn’t resist the chance to speculate about it. In earlier posts, I’ve talked about the evolution of what Power BI was before Power BI Designer was released and what Power BI is now. Since then I have been working on creating new Power BI dashboards. The process left me, shall we say, underwhelmed. The tools in Excel allow for much greater flexibility and more options than the new Power BI. Now, to be fair, the new Power BI was released December 18th, 2014, so it’s not possible for it to contain all of the rich features and functionality that the Excel tools do. That’s all well and good, but what it won’t do led to some frustration. If the new Power BI was the way Microsoft was going to climb to the top of the Gartner BI visualization charts, I didn’t think it was going to do the trick.

Anyone Still Using Lotus 123?

The one thing I kept thinking about when looking at the new Power BI is that there has to be a part of the plan I’m not getting. I didn’t see how this product would have the features and functionality needed by the time the reviews came around again next February. Looking back in time, I couldn’t help thinking of a time when Microsoft was battling it out in another space: spreadsheets. When Excel first came out, the big leader in the space was Lotus 123, which has since disappeared. (If you are running it where you work, please post a comment to let me know, because I think Lotus 123 is gone.) The reason for Microsoft’s dominance in spreadsheets was that Excel got a lot better at providing spreadsheets the way people wanted to use them.

Datazen, Hopefully Not the Next ProClarity

Microsoft’s purchase of Datazen looks to be a way to leverage a product with some really cool features to enhance the capabilities of Power BI. Datazen is a mobile application, but it has some good-looking visualizations which hopefully could be incorporated into Power BI. There’s only one thing that may be a reason for pause. In 2006, Microsoft made another acquisition: they bought a company called ProClarity. ProClarity had some really neat features, some of which were included in PerformancePoint, but for the most part the application was killed. I hope that history is not a guide in the purchase of Datazen, because Datazen has some great visualizations which could really help the new Power BI, and it would be good if Microsoft could figure out how to merge those features into the new Power BI to improve their position in the data visualization marketplace. I look forward to seeing how the Datazen features are merged into Microsoft’s data visualization components.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Motivation

Speaking for myself, sometimes I have a hard time getting motivated. I know that I need to get a bunch of work done, and I find myself mesmerized by the internet as pet pictures or the news or twitter momentarily provide really compelling reasons not to work on the list of things I have to get done. Eventually I pull my head out and start getting things accomplished. I seek out articles with motivational tips too. One of the best tips I read went something like this: the inhabitants of Planet Kardashian will exist whether or not you are aware of their foibles. (They have planets now? Star Trek foretold a reality show?) What I took from the tip is: what’s going on in other places will continue to go on whether you know about it or not, so you can find out about it after your work is done. Some days that works well; other days it’s more of a goal. I write to-do lists, place sticky notes around where I can’t help but see them, and engage in most of the other tricks I’ve read about to motivate myself. Sometimes it is not enough to push myself; I need an outside force.

External Incentive

People can provide a big external incentive. For an example of this, check out how hard people sometimes try to impress people they will never see again at stop lights. I know that I have been guilty of similar behavior, just not at stoplights. Being a part of an online community helps in finding motivation, as there are other people trying to do the same thing that you are. Motivation can come from anywhere, from a blog or even twitter. I found motivation in both places. After reading Ed Leighton-Dick’s post, I found an external motivator. His blog also showed me how powerful a post can be. Thanks to twitter, a lot of people saw his post, and a number of people in the SQL Server community have posted links and written their own blogs in support of his efforts. As I am sure Psy can attest, one can never know how much people are going to respond to what you put out on the internet, so kudos to Ed for being the Psy of the SQL Server community. A number of people are now finding themselves motivated to bring their thoughts out of their heads and onto the keyboard. Sharing knowledge will help us all get smarter and better at our jobs. If you happen to be on twitter and see an interesting blog post with the hashtag #SQLNewBlogger, thank Ed, as he helped make it happen.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur