Articles for the Month of March 2015

What is a Modern Data Warehouse?

As I was honored enough to be selected to give a PreCon on the Internals of the Modern Data Warehouse, I thought that I would take the time to explain why I felt drawn to the topic. A lot of organizations haven’t given much thought to the changes in technology which have happened over the last few years. To them, the major feature upgrades to SQL Server in 2012 and 2014 mean column store indexes, which make queries faster, and maybe better High Availability. While those are certainly valuable improvements, there is a lot more that you can do to derive value from your data, and companies want more than just a well-organized, running data warehouse.

Data is a Valuable Asset

In 2011, Borders Group Inc. was allowed by the Federal Trade Commission to sell its customer information to Barnes and Noble as part of the bankruptcy sale of its assets. In 2015, RadioShack is doing the same thing. Businesses understand that data is valuable and they are interested in using it to drive decision making. Amazon, Netflix and Target are well known for their use of customer information to drive sales, but they are far from the only ones doing this. This is one of the bigger trends identified recently in the business press. The heads of companies are now looking for their data teams to do more with their data so that they too can have the dream information systems they are reading about.

Total Destruction of the Existing DW is Not Required

While a lot of the time it might be nice to level everything and start over, that is not always an option. The major reason for this is that the data warehouse environment already in place has a lot of value. You want to add to the value already there, not destroy what you have. Also, it would take a long time to recreate the environment, and no one is patient enough to wait for that. Instead, you could expand into areas of new technology as your data grows. Perhaps this means you archive some of your data from your database to a Hadoop cluster instead of backing up the data in some far off location. This would allow you to use Sqoop to bring the data back when you need it, providing ready access to the data. Perhaps you want to provide the users more self-service BI capabilities, moving the data analysis into the hands of the people who are more familiar with the data. You could add the capabilities of Power View in Excel, Power Designer or Tableau to your environment.

Incorporating Social Media Information

The business world does not operate only on a batch cycle. More and more companies want to know what is being said about them so they can respond appropriately. With tools like Azure Event Hubs, Data Factory, Stream Analytics, and Machine Learning, this isn’t as hard to do as it might sound. We’ll review these products so that attendees will understand how these tools can provide greater insight not only into their own data, but also into the data building up about them outside of the company firewall.

For More Information

I really hope you can join me in Huntington Beach on April 10 for a full day of exploring these concepts. I always look forward to events like the precon and, of course, SQL Saturday #389 – Huntington Beach, which is the following day.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Complex Data Analysis and Azure Machine Learning Presentation Wrap Up

Thank you to all of the people who signed up for my webinar on Data Analysis with Azure Machine Learning [ML]. I hope after watching it that you find reasons to agree that the most important thing you need to know to get started in Machine Learning is not Math, but good knowledge of the data you want to analyze. There’s no reason not to investigate, as Azure Machine Learning is free. In order to take more time with the questions after the presentation than the webinar format allowed, I am posting my answers here, where I am able to answer them in greater detail.

How would one choose a subset of data to “train” the model? For example, would I choose a random 1000 rows from my data set?

It is important to select a subset of data which is representative of the data which you wish to evaluate. Sometimes a random 1000 rows will do that, and other times you will need to use other criteria, like transactions throughout a given date range, to get a more representative sample. It all comes down to knowing your data well enough to know that the data used for training and testing is similar to what you will ultimately be using for analysis.
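To make the two options concrete, here is a minimal sketch in plain Python, outside of Azure ML. The `transactions` list, its fields, and both helper functions are hypothetical and exist only for illustration; the point is just that a random sample and a date-range sample are two different ways of picking rows, and which one is representative depends on your data.

```python
import random
from datetime import date


def random_sample(rows, k, seed=0):
    """Pick k rows at random (all rows if fewer than k exist)."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(rows, min(k, len(rows)))


def date_range_sample(rows, start, end):
    """Keep every row whose date falls within [start, end]."""
    return [row for row in rows if start <= row["date"] <= end]


# Hypothetical data set: 5000 made-up transactions spread across 2014.
transactions = [
    {"id": i, "date": date(2014, 1 + i % 12, 1 + i % 28), "amount": 10.0 + i}
    for i in range(5000)
]

# Option 1: a random 1000 rows.
train_random = random_sample(transactions, 1000)

# Option 2: every transaction in the fourth quarter.
train_q4 = date_range_sample(transactions, date(2014, 10, 1), date(2014, 12, 31))
```

If your sales are seasonal, for example, the random sample mixes all months together, while the date-range sample keeps only the period you care about.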

Do you have to rerun or does it save results?

Each time you run an experiment, the data has to be run through it again, as Azure ML does not save the results from the previous run.

Does Azure ML use the same logic as data mining?

In a word, no. If you look at the algorithms used for data mining you will see they overlap with some of the models available in Azure ML. Azure ML provides a richer set of models, plus a greater ability to either call models created by others or write custom models.

How much does Azure ML cost?

There is no cost for Azure ML. You can sign up and use it for free.  Click here for more information on Azure ML.

If I am using Data Factory, can I use Azure ML?

Data Factory added the ability to call Azure ML in December, providing another place to incorporate Azure ML analytics. When an Azure experiment is complete, it is published as a web service so that the experiment can be called by any program which chooses to call it. Using the Azure ML experiments from directly within Data Factory decreases the need to write custom code, while allowing the logic to be incorporated into routine data collection processes.

http://azure.microsoft.com/blog/2014/12/16/azure-data-factory-updates-integration-with-azure-machine-learning-2/
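For readers curious what "any program which chooses to call it" might look like outside of Data Factory, here is a rough sketch in Python. Treat everything specific in it as an assumption: the endpoint URL and API key are placeholders, the column names are made up, and the JSON layout (`Inputs`, `input1`, `ColumnNames`, `Values`, `GlobalParameters`) is the shape I would expect for a request-response call, so check the sample code on your own web service's page before relying on it. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_scoring_request(endpoint_url, api_key, column_names, rows):
    """Build (but do not send) a POST request for a published experiment.

    The payload layout below is an assumption; compare it with the
    sample request shown for your own web service.
    """
    payload = {
        "Inputs": {
            "input1": {"ColumnNames": column_names, "Values": rows},
        },
        "GlobalParameters": {},
    }
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(
        endpoint_url, data=body, headers=headers, method="POST"
    )


# Placeholder endpoint, key, and columns: replace with your own values.
request = build_scoring_request(
    "https://example.invalid/score",
    "YOUR_API_KEY",
    ["age", "income"],
    [["42", "55000"]],
)

# To actually call the service you would pass the request to
# urllib.request.urlopen(request) and read the JSON response.
```

Calling the experiment from Data Factory saves you from writing and maintaining this kind of plumbing yourself.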

If you have more questions about Azure ML, or would like to see me present on the topic live and you live in Southern California, I hope you can attend SQL Saturday #389 – Huntington Beach, where I will be presenting on Azure ML and Top ten SSIS tips. I hope to see you there.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Math and Machine Learning

I had an interesting conversation with someone at SQL Saturday Phoenix, an event that I am happy I was able to attend, regarding knowing math and getting started in Machine Learning. As someone who had majored in Math in college, he was sure that you had to know a lot of math to do Machine Learning. While I know that having really good math skills can always be helpful when creating statistical models based on probability, a big part of Machine Learning, I do not believe that you need to know a lot of math to do Azure Machine Learning [ML].

Azure Machine Learning and Throwing Spaghetti Against the Wall

For those of you who cook, you may have heard of an old school way of testing to see if the spaghetti is done. You throw the spaghetti against the wall and if it sticks, the pasta is done. If it falls right off, keep the spaghetti in the pot for a while longer. Testing machine learning models is similar, but instead of throwing the computer against the wall, you keep on testing using the large number of models available in Azure ML. Once you have determined the classification of your data, there are a number of different models for that classification which you can try without knowing all of the statistical formulas behind each model. I have listed all of the models from Azure ML here so that you can take a look at the large number of models available. By taking a representative sample of your data and testing all of the related models, determining which one will provide the best result is not terribly difficult, because you do not have to understand the underlying math needed to run the model. Instead you need to learn how to read a ROC curve, which I included in my last blog post. While you can pick the appropriate model by having a deep understanding of the formula behind each model, you can achieve similar results by running all of the models and selecting the model based on the data.
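The "run them all and compare" idea can be sketched in a few lines of plain Python. This is not how Azure ML does it internally; it is only an illustration. I am summarizing each ROC curve with a single number, the area under the curve (AUC), and the labels, the model names, and the scores below are made up. In practice the scores would come from each model you tried on your sample data.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by comparing every
    positive row with every negative row (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative row")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def pick_best(labels, candidates):
    """Score every candidate model and return the best name plus all results."""
    results = {name: auc(labels, scores) for name, scores in candidates.items()}
    best = max(results, key=results.get)
    return best, results


# Made-up actual outcomes and made-up scores from two hypothetical models.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
candidates = {
    "model_a": [0.9, 0.2, 0.8, 0.7, 0.3, 0.1, 0.6, 0.4],
    "model_b": [0.6, 0.5, 0.4, 0.7, 0.3, 0.6, 0.5, 0.2],
}

best, results = pick_best(labels, candidates)
```

An AUC of 1.0 means the model ranked every positive row above every negative row, while 0.5 is no better than guessing, so the spaghetti that sticks is the model with the highest number.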

Advanced Statistical Analysis and Azure ML

Azure ML contains a lot of good tools to get started if you do not have a data scientist background, which recruiters lament not enough people do. But why would you use Azure ML if you have already coded a bunch of R modules to analyze your data? Because Azure ML can call those modules as well, and it provides a framework to raise visibility and share those modules with people within your organization or the world, if you prefer.

How to Pick the Right Model

I am going to demonstrate how to pick the right model in an upcoming webinar, as it is probably easier to explain that way than in a blog post. If you want to see how to determine which model to use without knowing a lot of Math, I hope you take the time to attend. Azure ML offers the ability to integrate analysis into your data environment without having to be a data scientist, while providing advanced features to accommodate those really good at math, which I will be talking about in an upcoming Preconvention event for SQL Saturday in Huntington Beach. If you happen to be in Southern California on April 10th, I hope you will be able to attend that event.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur