Introduction to Databricks

As I have been doing some work on Databricks, I thought it would make sense to start writing about it. Databricks is a scalable environment for running R, Python and Scala code in the cloud. It currently can be run on either AWS or Microsoft's Azure cloud. For those of you who are budget-minded when it comes to learning new tools, there is also a free tier, available at Community.cloud.databricks.com. It has somewhat limited compute capacity, but if you are just starting out you might find it helpful.

Backstory

Databricks is a commercial platform built on Apache Spark, the engine from the broader Hadoop ecosystem that was created as a faster replacement for MapReduce. Many of the people who worked on that open source project were students at Berkeley, where Apache Spark was created; it became an Apache project in 2013. Like many development projects, after it was completed they had some ideas on how to improve the code. This time they decided not to make it open source but to build a commercial product, so they could make some money from their development efforts. In April of 2017 Databricks was released on AWS, and in March 2018 it was released on Azure.

Creating an Azure Databricks Service

Creating a Databricks service is very straightforward; there are only a few things you need to complete when creating a new instance. The location becomes important if you are looking at higher-performing instance types, which may not be available in all regions. Security considerations also matter if sensitive information will be stored and accessed. If you are just getting started, don't worry about high-end hosting options, as you most likely will not need them and most of the compute options are available in most data centers. As always in Azure, you want to make sure that you are hosting your Databricks service in the same region as your data so you will not need to pay to transfer data between data centers.

The Pricing Tier contains three options: Standard, Premium and Trial (Premium, 14 Days). The trial is pretty self-explanatory and is a great way to get started using Databricks. There are of course a few differences between Standard and Premium. Premium has extra features needed for teams, including role-based access rights for the components of Databricks, and if you want ODBC authentication and audit logs you will need Premium. For more information on the cost of the Databricks pricing tiers, check out Microsoft's pricing page.

Once you have an instance created, you can start using Databricks. The application is contained within a managed instance, so once you launch Databricks you will be in their environment, which looks the same as the free edition.

 

Clusters, Notebooks and Data

These three components are the most important parts of Databricks, as they provide the compute power, the place where you write code, and the information you work with, respectively. These components are all separated in Databricks to improve scaling and to provide a familiar environment for creating and running code.

Cluster

The cluster is the most important Databricks element, as it contains the compute. This is also the part of Databricks which will most increase your bill: the more resources you use to run code, the more you pay. One nice thing is that clusters by default will terminate after 120 minutes of inactivity. I generally drop this to 20 minutes. If I am using the cluster it naturally will not terminate, but if I am not using it, I want the charges to stop. You can also automatically spin up clusters to run jobs, so that they are only in use when the job needs them. More about that in another post.

Notebooks

Databricks Notebook Import

There are 3 supported languages in Databricks (R, Scala and Python), and within Databricks all of these languages are written in notebooks. You don't have to write your code in the environment; you can write it locally and then import it. However, if you want to export your notebook and then run it locally, it gets trickier. Natively, all of the notebooks in Databricks are saved as .dbc files, which can't be read anywhere else. Fortunately there is a workaround to format the notebook files as .ipynb files, which can be read by any Jupyter-compatible notebook environment. Dave Wentzel from Microsoft has an elegant solution to convert .dbc to .ipynb, which he includes in his blog here.

Data

You have a lot of options with data. You can import a dataset into your environment to play with, or you can connect to just about anything you can think of. When you start doing data connections is when you stop using the community edition, as you will want to use the Azure version to connect to various data resources like Azure SQL and blob storage. More on how to do that in an upcoming post.
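To give a rough idea of what such a connection can look like once you are on the Azure version, here is a minimal Spark SQL sketch that reads an Azure SQL table over JDBC. The server, database, table and credential names are all hypothetical, and in practice the password would come from a secret store rather than being typed into a notebook.

```sql
-- Hypothetical Azure SQL source; swap in your own server, database, table and credentials
CREATE TEMPORARY VIEW sales_orders
USING org.apache.spark.sql.jdbc
OPTIONS (
  url      "jdbc:sqlserver://myserver.database.windows.net:1433;database=SalesDB",
  dbtable  "dbo.Orders",
  user     "databricks_reader",
  password "use-a-secret-scope-here"
);

-- Once the view exists it can be queried like any other table in the notebook
SELECT *
FROM sales_orders
LIMIT 10;
```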

If you are interested in hearing more about Databricks and are in Chicago, I am teaching an all-day class as part of SQL Saturday Chicago and would love to have you attend. More information on that class is here.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

5 Tips for Writing Better SQL

More people with a programming background are starting to write SQL, as more and more people are analyzing data and data science related jobs are exploding.  SQL is a little different from other languages, and there are some common patterns which should be avoided.  Here are 5 things, in no particular order, you can do to improve your SQL.

1. Use the correct join for the job

There are a number of different joins you can use in SQL, but there are 3 common ones: left join, right join and inner join.  When joining two tables with an inner join, you will get only the records which match; any records which cannot be joined are excluded.  For this reason, many people only write left or right joins, which automatically return all of the rows on the left- or right-hand side, respectively.  These joins are generally slower than inner joins, and may not be necessary.  Evaluate your joins to see if the record count changes when you do an inner join before settling on a left or right join.
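As a quick illustration of the comparison, here are the two versions side by side using hypothetical Customer and Orders tables. If both queries return the same number of rows, the inner join is the one to keep.

```sql
-- Inner join: only customers that have at least one matching order are returned
SELECT c.CustomerID, c.CustomerName, o.OrderID
FROM dbo.Customer AS c
INNER JOIN dbo.Orders AS o
    ON o.CustomerID = c.CustomerID;

-- Left join: every customer is returned, with NULLs where no order exists
SELECT c.CustomerID, c.CustomerName, o.OrderID
FROM dbo.Customer AS c
LEFT JOIN dbo.Orders AS o
    ON o.CustomerID = c.CustomerID;
```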

2. Do not Loop through rows of data

SQL is designed to process data in sets, not one record at a time.  In fact, it does not work well processing one row at a time.  Looping through records one at a time in a cursor or a while loop is to be avoided.  Think about how you might write the code so that it processes all of the records in a single set-based statement.
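For example, a cursor that updates one row per loop iteration (the table and column names here are made up for illustration) can almost always be collapsed into a single set-based statement:

```sql
-- Row-by-row version to avoid: a cursor that updates one record per iteration
DECLARE @OrderID INT;
DECLARE order_cursor CURSOR FOR
    SELECT OrderID FROM dbo.Orders WHERE ShippedDate IS NULL;
OPEN order_cursor;
FETCH NEXT FROM order_cursor INTO @OrderID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Orders SET OrderStatus = 'Pending' WHERE OrderID = @OrderID;
    FETCH NEXT FROM order_cursor INTO @OrderID;
END;
CLOSE order_cursor;
DEALLOCATE order_cursor;

-- Set-based version: one statement that handles all of the rows at once
UPDATE dbo.Orders
SET OrderStatus = 'Pending'
WHERE ShippedDate IS NULL;
```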

3. Avoid Nesting SQL code

SQL is not an object-oriented language.  This means performance and readability do not improve when views, stored procedures, or cursors are called from other objects. Try to do all of the work in one place.  For more information on nested views, check out my post on the topic.
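As a small sketch of the idea (again with made-up table and view names), a view stacked on top of another view can usually be flattened into a single query that does all of the work in one place:

```sql
-- Nested pattern to avoid: a view defined on top of another view
CREATE VIEW dbo.vw_ActiveCustomers AS
    SELECT CustomerID, CustomerName, Region
    FROM dbo.Customer
    WHERE IsActive = 1;
GO

CREATE VIEW dbo.vw_ActiveWestCustomers AS
    SELECT CustomerID, CustomerName
    FROM dbo.vw_ActiveCustomers   -- a view calling a view
    WHERE Region = 'West';
GO

-- Flattened alternative: one query against the base table
SELECT CustomerID, CustomerName
FROM dbo.Customer
WHERE IsActive = 1
  AND Region = 'West';
```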

4. Do not use column names which include spaces or reserved words

While you can get away with using spaces or reserved words by putting the name of the column in square brackets, most people are not used to writing code this way, and it is inevitable that someone will leave off the brackets and the code will not work. Have pity on those who will be supporting your code and avoid spaces and reserved words in column names.  If you are unsure which words are reserved, see if they change color in the editor.
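A quick before-and-after, using hypothetical tables, shows why the clean names are easier to live with:

```sql
-- Names that force square brackets everywhere (avoid)
CREATE TABLE dbo.SalesBad
(
    [Order Date] DATE,   -- space in the column name
    [Order]      INT     -- ORDER is a reserved word
);

-- Names that need no brackets and read cleanly in every query
CREATE TABLE dbo.SalesGood
(
    OrderDate   DATE,
    OrderNumber INT
);
```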

5. Use Sets in SQL and Avoid Temp Tables

Someone who attended a class I taught recently told me that they didn't think it was possible to write a stored procedure without a temp table, because all of the stored procedures at her work had them.  A common pattern I see is processing a bunch of records into a temp table, then processing them again. This pattern should be avoided. You can process the records in one query. Look at writing more complex queries which handle all of the data in a single statement.
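As a sketch of what that rewrite looks like (hypothetical Orders and Customer tables again), the two-step temp table pattern can usually be collapsed into one statement with a common table expression:

```sql
-- Two-step pattern to avoid: stage rows in a temp table, then process them again
SELECT CustomerID, SUM(OrderTotal) AS TotalSales
INTO #CustomerTotals
FROM dbo.Orders
GROUP BY CustomerID;

SELECT c.CustomerName, t.TotalSales
FROM #CustomerTotals AS t
INNER JOIN dbo.Customer AS c
    ON c.CustomerID = t.CustomerID
WHERE t.TotalSales > 10000;

-- Single set-based query that does the same work in one pass
WITH CustomerTotals AS
(
    SELECT CustomerID, SUM(OrderTotal) AS TotalSales
    FROM dbo.Orders
    GROUP BY CustomerID
)
SELECT c.CustomerName, t.TotalSales
FROM CustomerTotals AS t
INNER JOIN dbo.Customer AS c
    ON c.CustomerID = t.CustomerID
WHERE t.TotalSales > 10000;
```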

Resolving Common SQL Pitfalls for People who Develop Lots of Code

Sometimes all it takes to figure out how to write SQL using sets is to have someone show you how.  If you are in Phoenix, I can show you how on February 5th when I am doing a talk at Galvanize on this topic. I would be happy to show you some tips and tricks which would help gather and analyze data for an experiment.  Not in the area or can’t make it?  Check out my class at DataCamp.  You can try the class for free to see what you think. I look forward to hearing your feedback.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Gaining Better DAX Skills

If you work with Excel Pivot Tables, Power BI, or Analysis Services Tabular, more than likely you are working to improve your DAX skills. I know I am. While I do a lot of work with DAX, I am always looking to get better at writing it and better at explaining how to write it for clients or class attendees.  As I find things which I think others may find helpful, I write about them here to help spread the knowledge, especially to people who have a background in SQL.

DAX History

Microsoft has been working on the DAX language for about 10 years and released it as part of the Power Pivot add-in for Excel 2010, back when Microsoft was incorporating Power BI functionality into Excel. Microsoft designed DAX to be like Excel so that people who are familiar with Excel would have an easier time writing it. While DAX is easier than MDX, it is different enough from Excel that it can get quite complex. One of the founders of the DAX language is Jeffrey Wang, who is still actively working on enhancing the language further.  He works not only with developers but also with people in the technical community who work extensively with DAX, to get their input. I also found out recently that he took time out to do a technical review of a book on the DAX language by Phil Seamark b | t.

Leveraging SQL Knowledge to Learn DAX

If, like me, your skills lie not in Excel but in SQL, reading a book which builds upon SQL knowledge to understand DAX is really helpful.  Beginning DAX with Power BI provides examples of how to write something in SQL and then shows how the same functionality would be written in DAX, sort of like a SQL-to-DAX language library.  Showing how to get the answer in both SQL and DAX is a great technique for providing a better understanding of how to write DAX. I worked with a client recently who validated all of the DAX in the tabular model by writing SQL to ensure they were getting the correct answers. I think the validation process would have been improved with this book, which takes DAX out of the black-magic category and explains the way it works.
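To give a flavor of that side-by-side approach, here is a simple example of my own (not one taken from the book), with hypothetical Sales and Product tables. The SQL groups and sums explicitly, while the equivalent DAX measure, shown in the comment, gets its grouping from whatever visual or pivot it is dropped into:

```sql
-- SQL: total sales amount by product category
SELECT p.Category, SUM(s.SalesAmount) AS TotalSales
FROM dbo.Sales AS s
INNER JOIN dbo.Product AS p
    ON p.ProductKey = s.ProductKey
GROUP BY p.Category;

-- Roughly equivalent DAX measure; the grouping by Product[Category] comes
-- from the visual or pivot table the measure is placed in:
--     Total Sales = SUM ( Sales[SalesAmount] )
```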

Performance Tuning DAX

Once you start writing a lot of DAX, you will at some point write a query which performs badly.  What do you do to avoid poorly performing DAX? Phil provides some “old-school debugging” techniques to understand what is being generated and where it is going wrong. He also shows how you can use DAX Studio and SSMS to help figure out how and why your DAX is working the way it is. You will definitely want to reference Chapter 8 to learn more about this important topic.

Other Methods for Learning DAX and Improving Power BI skills

If you are looking for a hands-on environment to better understand how DAX works and how to improve your Power BI skills, I have two opportunities for you in the US. I will be giving Advanced Power BI classes in Columbus, Ohio on July 27 and Boston, Massachusetts on September 20.  Of course we will be covering more advanced DAX development and tuning, along with applied visualization theory to improve the interactivity of your Power BI reports, and hands-on labs to implement the new drill-through features and other recent enhancements in Power BI.  I hope to meet you there or at some of the other conferences I will be attending this year.  Check out my Engagement page and hopefully find a place where we can meet up, as I would love to meet you.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Data Science with Python

KDnuggets Data Science / Machine Learning Poll

For those of you who might have missed it, the website KDnuggets released their latest internet survey on data science tools, and Python came out ahead, again. Python has continued to gain as a tool people are using for data science.  The article accompanying the graphic is very interesting, as it brings up two data-related points. The first is that the survey only had "over 2300 votes" and "…one vendor – RapidMiner – had a very active campaign to vote in KDnuggets poll".  This points to the fallacy of relying completely on an insufficiently sized data set, as it is possible to skew the results, which is true both for surveys and for data science projects.  If you look at the remaining results, one other thing strikes me as interesting. Anaconda and scikit-learn are Python tools, and TensorFlow can be used from either R or Python.  This tends to strengthen the argument for more use of R or Python over RapidMiner.  The survey also made me want to check out RapidMiner.

Thoughts around RapidMiner for Machine Learning

While I have not had enough time to fully analyze RapidMiner, I thought I would give my initial analysis here and do a more detailed review in another post.  RapidMiner scored well in the Kaggle survey, and it also ranked highly in the 2018 Gartner Magic Quadrant for Data Science Platforms.  RapidMiner is trying to be a tool not only for data scientists but also for business analysts.  The UI is pretty intuitive, which is good because the help is not what it should be. I also was less than impressed with its data visualization capabilities, as R and Python both provide much better visuals. Of course, I used the free version of the software, which works but is limiting.  It looks like a lot of the new features are going to be available only in the paid version, which decreases my desire to really learn this tool.

Machine Learning Tools

Recently I have done a number of talks on Python in SQL Server, literally all around the world, including Brisbane, Australia tomorrow and Saturday, June 2, as well as in Christchurch, New Zealand. As R was written in New Zealand, I thought it would be the last place where people would be looking to use Python for data science, but several of the attendees of my precon on Machine Learning for SQL Server told me that where they worked, Python was being used to solve data science problems. Now of course this is an anecdotal sample, as we are not talking about a statistically significant sample set, but that doesn't keep it from being interesting.   The demand for Python training continues to increase now that Microsoft has incorporated Machine Learning Services into SQL Server, and I am working on a blog series on the topic with SQL Server Central.  The first two posts have been released. Let me know what you think of them.

Upcoming Events

I am looking forward to talking about Machine Learning with SQL Server in Brisbane, both at an intense day-long session and at a one-hour session on Implementing Python in SQL Server 2017 at SQL Saturday #713 – Brisbane, Australia. I look forward to seeing you there. For those who can't make it, well, hopefully our paths will cross at a future event.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Applying Data Science to SQL Server

Data has been getting a lot of attention in the business world for a while now. First there was big data, which was another way to store data so that it could be analyzed later. Recently the talk has been all about analyzing the data with newer tools such as R and Python. The reality is that people who have been working with databases and doing business intelligence work have been analyzing data for a while. Learning a different toolset for analyzing data is not such a big leap, but an expansion of what they already know.

As the field is rapidly expanding now, and demand is huge, now is a great time to learn the tools.

Traditional Data Science Development

Data scientists have created analysis solutions with data for a number of years. The data is analyzed, cleaned, processed with various algorithms, and results are created. When the process is complete, code has been created to provide meaning from a portion of the data and is ready to be migrated to production. Traditionally there has been a big gap between creating a solution and implementing the solution to run against data on a regular basis. Data scientists traditionally are not part of the IT organization; they are actuaries or analysts, not the people who have anything to do with system processing. Recently I did some work for a company where, after the data scientists were done creating a solution, they turned over all of their code to the Java team. Six weeks later the code was released into production. This solution made no one happy. Management thought it took too long, the data scientists didn't believe that the code they created was what was implemented into production, and the Java developers were tired of people blaming them for wrong code which took a long time to implement.

SQL Server Implementation of Data Science

Since SQL Server 2016 incorporates R and SQL Server 2017 adds the ability to run Python code inside SQL Server, data science solutions can be incorporated as part of a scheduled process with SQL Server. There is now a DevOps path for incorporating R and Python into SQL Server. One way of learning about the technology is through blogs and other online training, which can help you get up to speed. Many times, though, there is no substitute for hands-on learning. If you are attending PASS Summit 2017 and want to learn not only about data science but also how to incorporate it into SQL Server, I hope you can sign up for my all-day training session on Applied Data Science for the SQL Server Professional. I hope to see you there.
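For anyone who has not yet seen the integration, here is a minimal sketch of running Python inside SQL Server 2017 with sp_execute_external_script. The FactSales table and its columns are made up for the example; Machine Learning Services must be installed and the 'external scripts enabled' option turned on for this to run.

```sql
-- Requires Machine Learning Services (Python) and:
--   EXEC sp_configure 'external scripts enabled', 1; RECONFIGURE;
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
# InputDataSet arrives as a pandas data frame; OutputDataSet is returned to SQL Server
OutputDataSet = InputDataSet.groupby("Region", as_index=False)["SalesAmount"].mean()
',
    @input_data_1 = N'SELECT Region, SalesAmount FROM dbo.FactSales'
WITH RESULT SETS ((Region NVARCHAR(50), AvgSalesAmount FLOAT));
```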

I have recently created a YouTube channel where I plan on sharing more data-related content, and I have included my first video, which is about this conference.

If you are at PASS Summit, please introduce yourself, as I would love to meet the people who read my blog in person.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Power BI – Beyond the Basics

When helping clients recently with their Power BI implementations, I have noticed that when talking to people about Power BI there seem to be some areas where there continues to be a lot of questions.  While it is easy to find a plethora of information about getting started with Power BI, when it comes to implementing a solution, the information is scarce.  How do you handle releases? Should an implementation contain only one data model? Is Power BI's data secured in the cloud? Is Office 365 required to use Power BI? Do you have to have Power BI Premium to run Power BI locally?

Advanced Power BI Techniques in Norway

Norway Parliament Building in Oslo

While I have discussed some best practice techniques on my blog, as usual new features released in Power BI have a tendency to change some of the available options.  For example, App Workspaces, the updated take on Content Packs released a few months ago, now offer a new method for releasing not only dashboards but also the reports behind them, along with the ability to easily migrate sources. I am excited that I will have the opportunity to discuss the answers to the questions I have received by doing a full day of training at SQL Saturday Oslo. I am looking forward to visiting Oslo, which is home to the best-preserved Viking ships, an opera house designed to be walked on, and the guy who painted The Scream.  If you happen to reside somewhere where it is possible to make the journey to Norway, please register to attend this full day of interactive training.  We will cover all of these items and go into detail about Power BI administration, security, new features, and design techniques which will improve your Power BI implementation.

For those of you who are unable to attend, I feel obliged to answer some of the questions I posed earlier.  Implementations generally require more than one data model.  Power BI data is encrypted both in transit and at rest. You do not need to have Office 365 to run Power BI.  Power BI can be run locally with Power BI Report Server, which is part of SQL Server 2016 Enterprise with Software Assurance, and you do not need to sign up for Power BI Premium to install it.

I hope to see you in Norway.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

I’m on TV!

I'm on TV! Not just any TV: Excel.TV, which, unless you stream the internet on your TV set like I do, is probably on a computer monitor, but in my world it still counts. Please check out Episode 40 if you want to see a recording of the show where I talk about Power BI.

There is kind of an interesting story as to how I ended up being asked to be on the show. I was teaching a Power BI class, and Jordan Goldmeier b | t was taking it. Jordan started following me on Twitter and tweeted that he was in my class. I was really surprised and intimidated that an Excel MVP who has written a number of books on Excel was taking a class from me. My class was in the pre-Power BI Desktop days, so I was going over the 4 Powers in Excel: Power Pivot, Power Query, Power View and Power Map. Jordan later told me that he learned something in my class. I was relieved. We've kept in contact since then, mainly via Twitter, which is how I got asked to be on the show.

I was really impressed by Excel.TV and everything that Rick Grantham b | t, Szilvia Juhasz b | t and Jordan do to make the show very professional, with graphics and sound effects. They asked me on to talk about Power BI. While on the Excel.TV show, Rick asked me about the variety of things I have on my blog, and I got to thinking about it. Whether I use Power BI, Excel, Machine Learning, SSIS, SSAS or R, I am trying to do the same thing: make sense of the data and use it to provide answers. You can call that data science or business analysis or business intelligence, but whatever the label or the tool, I think that really covers what I like to do.

I really enjoyed being interviewed, and I look forward to catching up with Rick and Jordan at the PASS Business Analytics Conference, where we will all be speaking May 3-4. Maybe I'll see you there too?

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

2015: Year End Wrap up for Releases and More

As 2015 draws to a close, I started thinking back about everything that has happened this year. Technically this has been a big year, as many new applications were released. Here are just some of them, with links included to provide more detail.

This short list could be a lot longer as it doesn’t count the number of updates released to Power BI, which occur several times a month, the CTP releases for SQL Server 2016, the new web version of BIML, or PowerShell. It’s really hard to keep up with everything that is changing. It’s a good thing that so many people are willing to help others learn how through speaking and blogs which make learning new things easier.

Community Involvement in 2015

Keeping up with all of these events is difficult, especially given the pace of releases.  I spend a lot of time reading various blogs, watching videos and going to hear people speak. I have also been able to talk about topics of particular interest, mainly Power BI and Machine Learning. This year I spoke at a number of different events, including Speaker Idol, two different user groups, seven webinars, five SQL Saturdays and other tech events. I've got a number of engagements on the books for next year, including PASS BA Con and SQL Saturday #461 – Austin. 2016 is shaping up to be busy too, and hopefully our paths will cross.  I list all of my speaking events on my Engagement page, and I hope that you might take a look at it from time to time if you are interested in catching up in person sometime. Next year I am hoping my list of speaking engagements changes somewhat, as I plan on trying harder to get accepted to speak at events where I submitted and was turned down in 2015. On a more positive note, views of my blog are up 1000%, and the number of website subscribers has more than doubled. Thank you very much for continuing to read this site, and I hope you find my thoughts helpful. I posted once a week this year, which I thought was pretty good until I talked to Ken Fischer b | t, who blogs twice a week. I'll have to try harder next year. If you think of a topic that would make a good blog post, let me know, as I am always interested in feedback.

Keeping Up the Pace in 2016

Next year there will be no slowdown in the things to learn, as SQL Server 2016 is going to be released. Although the exact date has not been announced, my sources tell me to look for it around May-June. The next release of SQL Server is going to be huge, as it will include new tools Microsoft added to integrate Big Data and open source platforms into SQL Server. PolyBase, JSON and R are all going to be part of SQL Server. Personally, I find the R integration the most interesting. Datazen and SSRS are also going to be integrated in the next release, which should really increase the implementation of mobile reporting visualizations.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Non-technical Issues Impacting Data Based Decision Making

Having worked with a number of clients to implement Power BI in their respective environments, I noticed that one factor appeared to be common to all: the success of the project depended greatly upon the relationship between the business analysts and the database team. Since this seems to be an issue which greatly impacts the ability to implement data-based decision making, I decided to talk about it in my recent webinar for the PASS BA Marathon. Too often I see companies which decide to join data together in an analytics platform, such as Power BI, and fail to take advantage of the separate skillsets in the organization. The data team has spent a considerable amount of effort and energy determining the best ways to combine datasets. Logically one would assume that this expertise would be leveraged to help the business team analyze data. Instead, the business teams are tasked with joining data together. While this approach can work, it will take longer to train the business in areas with which they may not be familiar, and the results will be mixed, especially when considering scalability and maintenance needs over time. To leverage the capabilities of a self-service business tool (which tool doesn't really matter, as the same issues exist in, for example, Tableau as well as Power BI), the data team needs to be engaged. The skills they have gathered over time allow them to design and plan a data model which can be refreshed automatically without causing issues.

Using Areas of Expertise

Business analysts' time is best spent using the unique skills they have gathered over time, too. Their familiarity with the data values allows them to determine at a glance how the business is doing. Codifying this knowledge into meaningful reports which disseminate this information throughout the organization provides the basis for data-based decision making. To make them successful, they need a data model which has all of the information they need and is well documented, so that they can find the values they need to provide meaningful data visualizations. Too often the report generation is left to the data team, and many times there is a reporting backlog because there are not enough resources to provide all of the information a business needs.

Team Collaboration

Data-based decision making should be an organizational goal, as it has been shown to be a major tool for business success. When the data team and business analysts work collaboratively, using their specialized skills to create and implement a solution, that solution will be successful. The result will be a model which provides a path for the business analyst to continue to use the data to answer everything from routine questions, such as "How successful was the business last month?", to more obscure questions, such as "What happened to sales volumes after a bad story in the press?". These and many other questions are answered using the model and tools like Power BI to implement an enterprise-wide solution.

Implementing Successful Data Analytics Management Practices

There is more to implementing a self-service BI tool such as Power BI than merely knowing how to make the tool work. A process and a commitment to collaboration among teams are required as well. I enjoyed the opportunity to talk about integrating the tools with company data management policies at the BA Marathon. If you would like to know more about this topic, please come join me at the PASS Business Analytics Conference in San Jose May 2-4, as I will be going into more depth than was possible in the webinar.

 

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Lessons Learned About Speaking

As an attendee at PASS Summit, I had the opportunity to learn about a wide variety of topics, including public speaking. I'll be devoting other blogs to the great technical things I learned, but I thought I would start by talking about the sessions in general. I saw a number of presentations, some of which went well, others of which were beset by technical difficulties. By far the best talk I saw was the keynote with Rimma Nehme and David DeWitt of Microsoft. The presentation was well rehearsed without sounding canned, and the slides were absolutely amazing. You can check out the slides here as they are publicly available. I am going to remember what made this talk work, and hope to incorporate what I saw here when I speak next. If you are interested in where that will be, check out my Engagement page as maybe we can meet sometime.

Speaking Techniques

I saw a number of different speaking techniques employed at Speaker Idol. People were really creative. Todd Kleinhans navigated through a game interface. Wes Springbob did an homage to Hitchhiker's Guide to the Galaxy. By the way, if you haven't read the series, I think you should, as they are great books. I was surprised that all of the judges hadn't read the books, but even those who didn't thought he gave a great talk. I demonstrated that I had never used a microphone before, which was not positive. Bill Wolf worked to engage the audience throughout his talk. Ed Watson videotaped his demo. Recording your demo is something I have often heard you should do in case your demo crashes, but this was the first time I have seen anyone actually do it. William Durkin brought great stage presence, which I noticed was a common theme among all of the talks I liked. Effective presenters know their topic so well that the talk appears effortless and fun, without seeming like a memorized script they are working to run through. Also, remember to give the presentation a clear point to follow, so that I can still remember what the talk is about midway through. Everyone who did this, I thought, did a great job.

Speaker Idol Results

The finalists for Speaker Idol were William Durkin b | t, Theresa Iserman t, and David Maxwell b | t. My name was not there, due to my issues with the microphone, which put me off my game. Also, despite my goal of not adding useless words, I threw many an "um" and "so" into my talk. In my round, William did the best job, so it was logical that he went forward. I talked to David and Theresa about their respective talks, and I know they put a lot of work and practice into making them really good. David was the winner, so I look forward to seeing him at PASS Summit 2016 giving the talk of his choice. As for me, I hope to follow the pattern of fellow Speaker Idol 2014 non-winner Reeves Smith b | t, who spoke at PASS Summit for the first time this year the old-fashioned way: by picking a good topic and writing a good abstract for it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur