Introduction to Databricks

As I have been doing some work with Databricks, I thought it would make sense to start writing about it. Databricks is a scalable environment for running R, Python and Scala code in the cloud. It currently runs on either AWS or Microsoft's Azure cloud. For those of you who are budget-minded when it comes to learning new tools, there is also a free tier, available at Community.cloud.databricks.com. It has somewhat limited compute capacity, but if you are just starting out you may find it helpful.

Backstory

Databricks is a commercial implementation of Apache Spark, which is part of the Hadoop ecosystem and was created as a replacement for MapReduce. Many of the people who worked on that open source project were students at Berkeley, where Apache Spark was created, and it was donated to Apache in 2013. Like many development projects, after it was completed the team had some ideas on how to improve the code. This time they decided not to make it open source but to make it a commercial product, so they could earn some money for their development efforts. In April of 2017 Databricks was released on AWS, and in March 2018 it was released on Azure.

Creating an Azure Databricks Service

Creating a Databricks service is very straightforward, as there are only a few things you need to complete when creating a new instance. The location becomes very important if you are looking at higher-performing instance types, which may not be available in all locations. Security considerations are also important if sensitive information will be stored and accessed, although that mostly comes into play when you are working on a company project rather than learning. If you are just getting started, don't worry about high-end hosting options, as you most likely will not need them, and most of the compute options are available in most data centers. As always in Azure, you want to host your Databricks service in the same location as your data so you will not need to pay to transfer data between data centers.

The Pricing Tier contains three options: Standard, Premium and Trial (Premium for 14 days). The trial is pretty self-explanatory and is a great way to get started using Databricks. There are of course a few differences between Standard and Premium. Premium has extra features needed for teams, including role-based rights for the components of Databricks, and if you want ODBC authentication and audit logs you will need to use Premium. For more information on the cost of the Databricks pricing tiers, check out Microsoft's pricing page.

Once you have an instance created, you can start using Databricks. The application is contained within a managed instance, so once you launch Databricks you will be in their environment, which looks the same as the free edition.


Clusters, Notebooks and Data

These three components are the most important parts of Databricks, as they provide the compute power, the place where you write code, and the information you work with, respectively. These components are separated in Databricks to improve scaling and to provide a familiar environment for creating and running code.

Cluster

The cluster is the most important Databricks element, as it contains the compute. It is also the part of Databricks which can greatly increase your bill, as the more resources you use to run code, the more money it costs. One nice thing is that by default clusters terminate after 120 minutes of inactivity. I generally drop this to 20 minutes: if I am using the cluster it naturally will not terminate, but if I am not using it, I want the charges to stop. You can also spin clusters up automatically to run jobs, so that they are only in use when a job needs them. More about that in another post.
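If you want to script that auto-termination setting rather than click through the UI, the Databricks REST API exposes it as autotermination_minutes on the cluster definition. Here is a minimal Python sketch; the workspace URL, token, cluster name and node type are all placeholders you would substitute with your own values.

```python
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "dapiXXXXXXXXXXXX"                                   # placeholder personal access token

# Minimal cluster spec; autotermination_minutes is the setting discussed above.
cluster_spec = {
    "cluster_name": "dev-cluster",
    "spark_version": "5.3.x-scala2.11",  # list valid values via GET /api/2.0/clusters/spark-versions
    "node_type_id": "Standard_DS3_v2",   # an Azure VM size available in your region
    "num_workers": 2,
    "autotermination_minutes": 20,       # stop billing after 20 idle minutes
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # returns {"cluster_id": "..."} on success
```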

Notebooks

Databricks Notebook Import

There are three supported languages in Databricks, R, Scala and Python, and within Databricks all of these languages are written in notebooks. You don't have to write your code in the environment; you can write it locally and then import it. However, if you want to export your notebook and then run it locally, it gets trickier. Natively, all of the notebooks in Databricks are saved as .dbc files, which you can't read from anywhere else. Fortunately there is a workaround to format the notebook files as .ipynb files, which can be read by any notebook tool. Dave Wentzel from Microsoft has an elegant solution for converting .dbc to .ipynb, which he includes in his blog here.
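If you would rather script the export, the Databricks Workspace API can also hand a Python notebook back in .ipynb form directly. Below is a minimal sketch; the workspace URL, token and notebook path are placeholders, and the JUPYTER format only applies to Python notebooks.

```python
import base64
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "dapiXXXXXXXXXXXX"                                   # placeholder personal access token

# Ask the Workspace API for the notebook in Jupyter (.ipynb) format
# instead of the default .dbc archive.
resp = requests.get(
    f"{HOST}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"path": "/Users/me@example.com/MyNotebook", "format": "JUPYTER"},
)
resp.raise_for_status()

# The notebook content comes back base64-encoded.
with open("MyNotebook.ipynb", "wb") as f:
    f.write(base64.b64decode(resp.json()["content"]))
```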

Data

You have a lot of options with data. You can import a dataset into your environment to play with, or you can connect to just about anything you can think of. Once you start making data connections is when you stop using the Community Edition, as you will want to use the Azure version to connect to various data resources like Azure SQL and blob storage. More on how to do that in an upcoming post.
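As a preview, here is roughly what reading a file from Azure Blob Storage looks like from a Databricks notebook, where spark and dbutils are provided for you. The storage account, container, secret scope and file name below are all placeholder values, so treat this as a sketch rather than a finished recipe.

```python
# Placeholder names: substitute your own storage account and container.
storage_account = "mystorageacct"
container = "data"

# Make the storage account key available to the cluster; pulling it from a
# secret scope avoids hard-coding the key in the notebook.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    dbutils.secrets.get(scope="demo", key="storage-key"),
)

# Read a CSV file straight out of blob storage into a Spark dataframe.
df = spark.read.csv(
    f"wasbs://{container}@{storage_account}.blob.core.windows.net/sales.csv",
    header=True,
    inferSchema=True,
)
display(df)  # display() is a Databricks notebook convenience function
```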

If you are interested in hearing more about Databricks and are in Chicago, I am teaching an all-day class as part of SQL Saturday Chicago and would love to have you attend. More information on that class is here.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Science with Python

KDnuggets Data Science/Machine Learning Poll

For those of you who might have missed it, the website KDnuggets released its latest internet survey on data science tools, and Python came out ahead, again. Python has continued to gain as a tool that people are using for data science. The article accompanying the graphic is very interesting, as it brings up two data-related points. The first is that the survey only had "over 2300 votes" and "…one vendor – RapidMiner – had a very active campaign to vote in KDnuggets poll". This points to the fallacy of relying completely on an insufficiently sized data set, as it is possible to skew the results, which is true both for surveys and for data science projects. Looking at the remaining results, one other thing strikes me as interesting: Anaconda is a Python distribution and scikit-learn is a Python library, and TensorFlow can be used from either R or Python. This tends to strengthen the argument for more use of R or Python over RapidMiner. The survey also made me want to check out RapidMiner.

Thoughts around Rapid Miner for Machine Learning

While I have not had enough time to fully analyze RapidMiner, I thought I would give my initial impressions here and do a more detailed review in another post. RapidMiner scored well in the KDnuggets poll, and it also ranked highly on the 2018 Gartner Magic Quadrant for Data Science Platforms. RapidMiner is trying to be a tool not only for data scientists but for business analysts as well. The UI is pretty intuitive, which is good because the help is not what it should be. I was also less than impressed with its data visualization capabilities, as R and Python both provide much better visuals. Of course, I used the free version of the software, which works but is limiting. It looks like a lot of the new features are going to be available only in the paid version, which decreases my desire to really learn this tool.

Machine Learning Tools

Recently I have given a number of talks on Python in SQL Server, literally all around the world, including in Christchurch, New Zealand, and in Brisbane, Australia, tomorrow and Saturday, June 2. As R was written in New Zealand, I thought it would be the last place where people would be looking to use Python for data science, but several of the attendees of my precon on Machine Learning for SQL Server told me that where they worked, Python was being used to solve data science problems. Of course this is an anecdotal sample, not a statistically significant one, but that doesn't keep it from being interesting. The demand for Python training continues to increase now that Microsoft has incorporated Machine Learning Services into SQL Server, which is why I am working on a Machine Learning Services blog series with SQLServerCentral. The first two posts have been released. Let me know what you think of them.
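For readers who have not tried Python in SQL Server yet, the entry point is the sp_execute_external_script stored procedure that ships with SQL Server 2017 Machine Learning Services. The sketch below calls it from Python through pyodbc; the server and database names are placeholders, and the feature has to be installed and enabled on the instance first.

```python
import pyodbc

# Placeholder connection details.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Demo;Trusted_Connection=yes;"
)

# sp_execute_external_script runs the embedded Python on the server.
# InputDataSet and OutputDataSet are the procedure's default dataframe names.
tsql = """
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import pandas as pd
OutputDataSet = pd.DataFrame({"squared": InputDataSet["n"] ** 2})
',
    @input_data_1 = N'SELECT n FROM (VALUES (1),(2),(3)) AS t(n)'
WITH RESULT SETS ((squared INT));
"""

for row in conn.cursor().execute(tsql).fetchall():
    print(row.squared)  # 1, 4, 9
```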

Upcoming Events

I am looking forward to talking about Machine Learning with SQL Server in Brisbane, both at an intense day-long session and at a one-hour session on Implementing Python in SQL Server 2017 at SQL Saturday #713 – Brisbane, Australia. I look forward to seeing you there. For those who can't make it, well, hopefully our paths will cross at a future event.


Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Power BI – Beyond the Basics

When helping clients recently with their Power BI implementations, I have noticed that there are some areas which continue to generate a lot of questions. While it is easy to find a plethora of information about getting started with Power BI, when it comes to implementing a solution, the information is scarce. How do you handle releases? Should an implementation contain only one data model? Is Power BI's data secured in the cloud? Is Office 365 required to use Power BI? Do you have to have Power BI Premium to run Power BI locally?

Advanced Power BI Techniques in Norway

While I have discussed some best practice techniques on my blog, as usual, new features released in Power BI have a tendency to change some of the available options.

Norway Parliament Building in Oslo

For example, App Workspaces, the updated take on Content Packs released a few months ago, now offer a new method for releasing not only dashboards but also the reports behind them, plus the ability to easily migrate sources. I am excited that I will have the opportunity to discuss the answers to the questions I have received by doing a full day of training at SQL Saturday Oslo. I am looking forward to visiting Oslo, which is home to the best-preserved Viking ship, an opera house designed to be walked on, and the home of the guy who painted The Scream. If you happen to reside somewhere where it is possible to make the journey to Norway, please register to attend this full day of interactive training. We will cover all of these items and go into detail about Power BI administration, security, new features, and design techniques which will improve your Power BI implementation.

For those of you who are unable to attend, I feel obliged to answer some of the questions I posed earlier. Implementations generally require more than one data model. Power BI data is encrypted both in transit and at rest. You do not need to have Office 365 to run Power BI. Power BI can be run locally with Power BI Report Server, which is part of SQL Server 2016 Enterprise with Software Assurance, and you do not need to sign up for Power BI Premium to install it.

I hope to see you in Norway.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


2015: Year End Wrap up for Releases and More

As 2015 draws to a close, I started thinking back on everything that has happened this year. Technically this has been a big year, as many new applications were released. Here are just some of them, with links included to provide more detail.

This short list could be a lot longer, as it doesn't count the number of updates released to Power BI, which occur several times a month, the CTP releases for SQL Server 2016, the new web version of BIML, or PowerShell. It's really hard to keep up with everything that is changing. It's a good thing that so many people are willing to help others learn through speaking and blogging, which make learning new things easier.

Community Involvement in 2015

Keeping up with all of these events is difficult, especially given the pace of releases. I spend a lot of time reading various blogs, watching videos and going to hear people speak. I have also been able to talk about topics of particular interest, mainly Power BI and Machine Learning. This year I spoke at a number of different events, including Speaker Idol, two different user groups, seven webinars, five SQL Saturdays and other tech events. I've got a number of engagements on the books for next year, including PASS BA Con and SQL Saturday #461 – Austin. 2016 is shaping up to be busy too, and hopefully our paths will cross. I list all of my speaking events on my Engagement Page, and I hope you might take a look at it from time to time if you are interested in catching up in person sometime. Next year I am hoping my list of speaking engagements changes somewhat, as I plan on trying harder to get accepted to speak at events where I submitted and was turned down in 2015. On a more positive note, views of my blog are up 1000%, and the number of website subscribers has more than doubled. Thank you very much for continuing to read this site, and I hope you find my thoughts helpful. I posted once a week this year, which I thought was pretty good until I talked to Ken Fischer b | t, who blogs twice a week. I'll have to try harder next year. If you think of a topic that would make a good blog post, let me know, as I am always interested in feedback.

Keeping Up the Pace in 2016

Next year there will be no slowdown in the things to learn, as SQL Server 2016 is going to be released. Although the exact date has not been announced, my sources tell me to look for it around May or June. The next release of SQL Server is going to be huge, as it will include new tools Microsoft added to integrate Big Data and open source platforms with SQL Server. PolyBase, JSON and R are all going to be part of SQL Server. Personally, I find the R integration the most interesting. Datazen and SSRS are going to be integrated into the next release too, which should really increase the implementation of mobile reporting visualizations.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Azure ML, SSIS and the Modern Data Warehouse

Recently I was afforded the opportunity to speak at several different events, all of which I thoroughly enjoyed. I spoke on Azure Machine Learning first at the Arizona SQL Server Users Group meeting, and I really appreciate all who attended, as we had quite a crowd. Since the meeting is held practically on Arizona State University's Tempe campus, it was great to see a number of students attending, most likely due to Ram's continued marketing efforts on meetup.com. After talking to him about it, I was impressed by his success at improving attendance by promoting the event on Meetup, and I wonder if many SQL Server user groups have experienced the same benefits. If you have, please let me know. Thanks, Joe, for taking a picture of the event too.

Modern Data Warehousing Precon

The second event where I had the opportunity to talk about technology was the precon at SQL Saturday Huntington Beach, where I spoke about Modern Data Warehousing. It was a real honor to be selected for this event, and I really enjoyed interacting with all of the attendees. Special thanks to Alan Faulkner for his assistance. We discussed the changing data environment, including cloud-based storage, analytics, Hadoop, handling ever-increasing amounts of data from different sources, and the increasing demands of users, and we reviewed technology solutions that demonstrate ways to resolve these issues in attendees' environments.

Talking and More Importantly Listening

The following day was SQL Saturday #389 in Huntington Beach. Thanks to Andrew, Laurie, Thomas and the rest of the volunteers for making this a great event, as I know a little bit about the work that goes into planning and pulling off an event like this. Both of my sessions, Azure ML: Predicting the Future with Machine Learning and Top 10 SSIS Tuning Tricks, were selected, and I had a great turnout at both. To follow up on a question I received during my SSIS session: the Balanced Data Distributor was first released as a new SSIS transform for SQL Server 2008 and 2008 R2, so you can use it in versions prior to SQL Server 2012. I've posted more information about it here. I also got a chance to meet a real live data scientist, the first time that has happened.

Not only did I get a chance to speak, but also a chance to listen. I really enjoyed the sessions from Steve Hughes on Building a Modern Data Warehouse and Analytics Solution in Azure, from Kevin Kline, and from Julie Koesmarno on Interactive & Actionable Data Visualisation With Power View. As always, it's wonderful to visit in person with the people whose technical expertise I read. In addition to hearing technical jokes which people outside of the SQL community would not find humorous, it's great to discuss technology with other practitioners. Thanks to Mr. Smith for asking a question to which I didn't know the answer and which I now feel compelled to go find; I'll be investigating the scalability of Azure ML and R so that I will have an answer for him next time I see him. I really enjoy the challenge of not only investigating and applying new technology but also figuring out how to explain what I've learned. I look forward to the opportunity to present again, and when I do I'll be sure to update this site so I hopefully get a chance to meet the people who read it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Complex Data Analysis and Azure Machine Learning Presentation Wrap Up

Thank you to all of the people who signed up for my webinar on Data Analysis with Azure Machine Learning [ML]. I hope that after watching it you find reasons to agree that the most important thing you need to know to get started in Machine Learning is not math, but good knowledge of the data you want to analyze. There's no reason not to investigate, as Azure Machine Learning is free. In order to take more time with the questions than the webinar format allowed, I am posting my answers here, where I am able to answer them in greater detail.

How would one choose a subset of data to “train” the model? For example, would I choose a random 1000 rows from my data set?

It is important to select a subset of data which is representative of the data you wish to evaluate. Sometimes a random 1000 rows will do that, and other times you will need to use other criteria, like transactions throughout a given date range, to get a more representative sample. It all comes down to knowing your data well enough to know that the data used for training and testing is similar to what you will ultimately be using for analysis.
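To make that concrete, here is a small pandas/scikit-learn sketch contrasting a purely random sample with a stratified split that preserves the label mix. The file name and the label column are placeholders for whatever data you are working with.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")  # placeholder dataset

# A purely random sample: fine when the data is homogeneous, but it can
# under-represent rare classes or unusual date ranges.
random_sample = df.sample(n=1000, random_state=42)

# A stratified split keeps the proportion of each label the same in the
# training and test sets, which is one way to stay representative.
train, test = train_test_split(
    df,
    test_size=0.3,
    stratify=df["label"],  # placeholder label column
    random_state=42,
)
print(train["label"].value_counts(normalize=True))
print(test["label"].value_counts(normalize=True))
```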

Do you have to rerun or does it save results?

The process of creating an experiment requires that you re-run the data for each run, as it does not save results.

Does Azure ML use the same logic as data mining?

In a word, no. If you look at the algorithms used for data mining, you will see that they overlap with some of the models available in Azure ML. Azure ML provides a richer set of models, plus a greater ability to either call models created by others or write custom models.
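As an illustration of the custom-model side, Azure ML Studio's Execute Python Script module expects an entry point shaped like the sketch below: dataframes in, a tuple of dataframes out. The scoring logic here is a made-up placeholder; only the azureml_main contract comes from the product.

```python
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # dataframe1 arrives from the module's first input port.
    # Placeholder "model": flag rows whose amount is above the median.
    dataframe1["score"] = dataframe1["amount"] > dataframe1["amount"].median()
    # Whatever is returned flows out of the module's output port,
    # as a tuple of dataframes.
    return dataframe1,
```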

How much does Azure ML cost?

There is no cost for Azure ML. You can sign up and use it for free.  Click here for more information on Azure ML.

If I am using Data Factory, can I use Azure ML ?

Data Factory added the ability to call Azure ML in December, providing another place to incorporate Azure ML analytics. When an Azure ML experiment is complete, it is published as a web service so that the experiment can be called by any program which chooses to call it. Using Azure ML experiments directly from within Data Factory decreases the need to write custom code, while allowing the logic to be incorporated into routine data collection processes.

http://azure.microsoft.com/blog/2014/12/16/azure-data-factory-updates-integration-with-azure-machine-learning-2/
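For programs other than Data Factory, calling a published experiment is an ordinary authenticated HTTP POST. The sketch below shows the general shape of such a call from Python; the URL, API key and column names are placeholders copied from a hypothetical service's dashboard, and the input schema must match what your experiment expects.

```python
import json
import urllib.request

# Placeholders: copy the real values from your web service's API help page.
url = "https://ussouthcentral.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute?api-version=2.0"
api_key = "<your-api-key>"

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["age", "amount"],  # placeholder schema
            "Values": [["34", "120.50"]],
        }
    },
    "GlobalParameters": {},
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    },
)
print(urllib.request.urlopen(req).read().decode("utf-8"))
```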

If you have more questions about Azure ML, or would like to see me present on the topic live and you live in Southern California, I hope you can attend SQL Saturday #389 – Huntington Beach, where I will be presenting on Azure ML and my top ten SSIS tips. I hope to see you there.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Math and Machine Learning

I had an interesting conversation with someone at SQL Saturday Phoenix, an event that I am happy I was able to attend, regarding knowing math and getting started in Machine Learning. As someone who had majored in math in college, he was sure that you had to know a lot of math to do Machine Learning. While I know that having really good math skills is always helpful when creating statistical models based on probability, a big part of Machine Learning, I do not believe that you need to know a lot of math to use Azure Machine Learning [ML].

Azure Machine Learning and Throwing Spaghetti Against the Wall

For those of you who cook, you may have heard of an old school way of testing whether spaghetti is done: you throw the spaghetti against the wall, and if it sticks, the pasta is done. If it falls right off, keep the spaghetti in the pot a while longer. Testing machine learning models is similar, but instead of throwing the computer against the wall, you keep testing against the large number of models available in Azure ML. Once you have determined the classification of your data, there are a number of different models for that classification which you can try without knowing the statistical formulas behind each one. I have listed all of the models from Azure ML here so that you can take a look at how many are available. By taking a representative sample of your data and testing all of the related models, determining which one provides the best result is not terribly difficult. The reason it is not very hard is that you do not have to understand the underlying math needed to run the model. Instead, you need to learn how to read a ROC curve, which I included in my last blog post. While you can pick the appropriate model by having a deep understanding of the formula behind each one, you can achieve similar results by running all of the models and selecting the best one based on the data.
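Here is the spaghetti approach expressed in scikit-learn terms, since the idea translates directly: fit several candidate classifiers on the same split and keep whichever scores best. The dataset and the three candidate models are arbitrary choices for the sketch; in Azure ML you would do the equivalent by dragging multiple models onto the canvas.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Throw several models at the wall and see which one sticks.
candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```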

Advanced Statistical Analysis and Azure ML

While Azure ML contains a lot of good tools to get started if you do not have a data scientist background, which recruiters lament not enough people do, why would you use Azure ML if you have already coded a bunch of R modules to analyze your data? Because Azure ML can call those modules as well, and it provides a framework to raise their visibility and share them with people within your organization, or the world if you prefer.

How to Pick the Right Model

I am going to demonstrate how to pick the right model in an upcoming webinar, as it is probably easier to explain in that format than in a blog post. If you want to see how to determine which model to use without knowing a lot of math, I hope you take the time to attend. Azure ML offers the ability to integrate analysis into your data environment without your having to be a data scientist, while providing advanced features to accommodate those who are really good at math, which I will be talking about in an upcoming preconvention event for SQL Saturday Huntington Beach. If you happen to be in Southern California on April 10th, I hope you will be able to attend that event.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Getting Started with Machine Learning – Result Analysis

Recently I’ve started working with Azure Machine Learning and looking at what I consider the most challenging part, picking the right analysis. For those people who haven’t ventured into Azure Machine Learning, it looks a lot like a data flow in SSIS. After that you need to train or more to the point evaluate which model works best. The answer to that question takes a while. What kind of data do you have? Are you looking to find errors? Determine whether data classified in a certain way can predict a result? Perform a regression analysis of data over time? Group data together to identify trends?

Is your Model better than a Monkey throwing Darts?

While you can analyze your variables and rank them to determine the chance that the variables indicate a result, there is another method that is also used to determine an outcome: the coin toss. This lowly method of analysis is right half the time. If you have more than two outcomes, or, to speak the language of Machine Learning, if the outcome is not binary, there is another benchmark used to judge the accuracy of predictions: monkeys. I have read about the various skills of monkeys in both literature and financial analysis. Think about it for a minute and you may remember reading or hearing about monkeys typing on keyboards who have been able to write Shakespeare, or a blog post. This is known as the infinite monkey theorem. Another thing monkeys have been known to do is throw darts, and various financial publications have been measuring the success of mutual funds against monkeys throwing darts at stocks since the last century. The goal, of course, is to create a model that has better success than a monkey throwing darts or a coin toss. The question is how?

Probability of Picking the Right Model

ROC [Receiver Operating Characteristic] curves are used to ensure the machine learning model you generated is better than a monkey throwing darts. Your goal is a perfect game of golf; chances are your ROC curve will fall somewhere between the two. In the ROC curve generated here, you can see three lines: a light grey one, a red one and a blue one.

ROC Curve

The diagonal line represents a coin toss. If you were able to get one of your scored datasets to a true positive rate of 1, meaning you got a true positive every time, you would have played a perfect game of golf. Chances are you will have two lines like I do here, and one of them, in this example the blue line, has a higher true positive rate than the red line, so the results generated by that model are more accurate.
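For anyone who wants to reproduce that picture, the following short scikit-learn sketch plots a ROC curve for one model against the coin-toss diagonal. The dataset and model are arbitrary stand-ins; the point is the reading of the plot, where the closer a curve hugs the top-left corner, the better the model.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])

plt.plot(fpr, tpr, label=f"model (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="coin toss")  # the diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```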

More ML More of the Time

I find myself spending more time with Azure ML, which means I will be devoting a lot more future blog posts to this topic. I am also speaking on Azure ML, both as part of a pre-convention event on the Modern Data Warehouse and at SQL Saturday in Phoenix. If you happen to be in Phoenix, I would love to meet you. SQL Saturdays are great learning events, and I am happy that I was selected to participate in this one.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Tips on SSIS at SQL Saturday Albuquerque

On February 7, I was fortunate enough to be selected to speak at SQL Saturday in Albuquerque, New Mexico on my Top 10 SSIS Tuning Tricks. Having worked with SSIS for a number of years, I've needed to research the best methods to ensure my SSIS ETL runs optimally. I've compiled the most valuable items, with examples of course, into this presentation. I'm assuming that everyone attending has already been using SSIS for a while, so I will skip straight into the more in-depth ways of tuning SSIS. One of the questions that I know I have heard most often is "When should I do X in SQL versus SSIS?" If you are able to attend this session, you will have the answer to that question.

I really enjoy the opportunity to speak on data-related topics and to meet people who may have come upon my blog in the past. Having spoken at this event last year, I know what a good job Keith, Chris, Meredith and friends do organizing it. I want to take the time to say thank you for all of your hard work, as I really appreciate it. These events are a great place to learn and to keep up with a lot of the changes going on in the industry, and I anticipate there will be many lively discussions both before and after the event. That reminds me: if you get a chance, there are two great precons scheduled for Friday, February 6th, PowerShell Basics with Mike Fal and Query Tuning, Troubleshooting and Execution Plans with Jason Kassay. Having been fortunate enough to meet both of them, I know they are both extremely knowledgeable in their respective topics, and if you are in Albuquerque I encourage you to sign up for either of them, as I am sure both will be excellent.

I hope that you will be able to attend as I know I will enjoy seeing you there.


Yours Always
Ginger Grant
Data aficionado et SQL Raconteur

Upcoming and Up and Coming Topics

It’s funny the different meanings words have when you put them in different order, a point which anyone who has imitated the dialectic of Yoda can tell you. I find words fascinating as they are not static but have meanings which change over time. For example the Iron Maiden meant something totally different before there were electric guitars. Thinking of works and things changing, as one year closes and another year begins, I start to evaluate past and future topics. Earlier this year, I held an informal poll on twitter to find out how long people tend to talk on the same topic. The answers were quite varied. Some people keep on talking about the same topic as long as there seems to be interest in hearing about it. That way you can get to be a really good speaker on that topic. Another feels obligated to create a new topic each time out to provide him a challenge. The answer that personally I related to, was keep on talking about the topic until you are tired of hearing about it, which takes about a year.

SQL Saturday Albuquerque

My first upcoming engagement for 2015 will be at SQL Saturday Albuquerque, where I will be talking about SSIS. I generally talk about things I am interested in or presently working on, and having worked on a lot of ETL recently, I thought it would be an interesting topic which most people would find helpful. As a consultant, I see a lot of code and wonder why parts of it were written the way they were. One big reason is that someone thought the design was a good one. Since that is a subjective decision, I thought it might be helpful to ground design decisions in facts, so that people can employ good logic for their design choices.

Technology changes and their Impact on Data Development

Another topic which really interests me is the change that new technologies are bringing to the database world. With the increased implementation of Hadoop and the cloud, things are really changing in the way data is both stored and used. Predictive analytics, machine learning, cloud implementations and interactive data visualizations are changing what people expect from the way their data is stored and used. Expectations for data professionals are increasing as the business looks away from the HiPPO, the highest paid person's opinion, and toward the knowledge they have gathered or integrated from public data sources.

Modern Data Warehouse

I have the pleasure of assisting in a day-long session on Architecting the Modern Data Warehouse. During this one-day session we will show how to use new technology, such as HDInsight and Machine Learning, to implement a modern data warehouse. Instead of just talking about new technologies, we will put them to use to show how they can be used today. I'm really looking forward to it.

If you are able to attend any of these or any upcoming sessions, please stop by and introduce yourself as I would love to meet readers of my blog in person.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur