Introduction to Databricks

As I have been doing some work on Databricks, I thought it would make sense to start writing about it. Databricks is a scalable environment for running R, Python, and Scala code in the cloud. It currently runs on either AWS or Microsoft's Azure cloud. For those of you who are budget-minded when it comes to learning new tools, there is also a free tier, available at community.cloud.databricks.com. It has somewhat limited compute capacity, but if you are just starting out you might find it helpful.

Backstory

Databricks is a commercial implementation of Apache Spark, which grew out of the Hadoop ecosystem as a replacement for MapReduce. Many of the people who worked on that open source project were students at Berkeley, where Apache Spark was created; it was added to Apache in 2013. Like many development projects, after it was completed they had some ideas on how to improve the code. This time they decided not to make it open source, but to make it a commercial product so they could make some money for their development efforts. In April of 2017 Databricks was released on AWS, and in March 2018 it was released on Azure.

Creating an Azure Databricks Service

Creating a Databricks service is very straightforward; there are only a few things that you need to complete when creating a new Databricks instance. The location becomes important if you are looking at higher-performing instance types, which may not be available in all locations. Security considerations also matter if sensitive information will be stored and accessed, though that really only comes into the picture if you are working on a company project rather than learning on your own. If you are just getting started, don't worry about high-end hosting options, as you most likely will not need them, and most of the compute options are available in most data centers. As always in Azure, you want to make sure that you are hosting your Databricks service in the same location as your data so you will not need to pay to transfer data between data centers.

The Pricing Tier contains three options: Standard, Premium, and Trial (Premium, 14 days). The trial is pretty self-explanatory and is a great way to get started using Databricks. There are, of course, a few differences between Standard and Premium. Premium has extra features needed for teams, including role-based rights for the components of Databricks, and if you want ODBC authentication and audit logs you will need Premium. For more information on the cost of the Databricks pricing tiers, check out Microsoft's pricing page.

Once you have an instance created, you can start using Databricks. The application is contained within a managed instance, so once you launch Databricks you will be in their environment, which looks the same as the free edition.

 

Clusters, Notebooks and Data

These three components are the most important parts of Databricks, as they provide the compute power, the place where you write code, and the information you work with, respectively. These components are all separated in Databricks to improve scaling and to provide a familiar environment for creating and running code.

Cluster

The cluster is the most important Databricks element, as it contains the compute. This is also the part of Databricks which will most increase your bill, as the more resources you use to run code, the more money it costs. One nice thing is that clusters by default will terminate after 120 minutes of inactivity. I generally drop this to 20 minutes. If I am using the cluster it naturally will not terminate, but if I am not using it, I want the charges to stop. You can also automatically spin up clusters to run jobs, so that they will only be in use when the job needs them. More about that in another post.

Notebooks

Databricks Notebook Import

There are three supported languages in Databricks: R, Scala, and Python, and within Databricks all of these languages are written in notebooks. You don't have to write your code in the environment; you can write it locally and then import it. However, if you want to export your notebook and then run it locally, it gets trickier. Natively, all of the notebooks in Databricks are saved as .dbc files, which cannot be read anywhere else. Fortunately, there is a workaround to format the notebook files as .ipynb files, which can be read by any notebook environment, such as Jupyter. Dave Wentzel from Microsoft has an elegant solution for converting .dbc to .ipynb, which he includes in his blog here.

Data

You have a lot of options with data. You can import a dataset into your environment to play with, or you can connect to just about anything you can think of. When you start making data connections is when you stop using the Community Edition, as you will want to use the Azure version to connect to various data resources like Azure SQL Database and blob storage. More on how to do that in an upcoming post.

If you are interested in hearing more about Databricks and are in Chicago, I am teaching an all-day class as part of SQL Saturday Chicago and would love to have you attend. More information on that class is here.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

5 Tips for Writing Better SQL

More people who have a background in programming are starting to write SQL, as data science related jobs explode and more and more people analyze data. SQL is a little different from other languages, and there are some common patterns which should be avoided. Here are 5 things, in no particular order, you can do to improve your SQL.

1. Use the correct join for the job

There are a number of different joins you can use in SQL, but there are 3 common ones: Left Join, Right Join, and Inner Join. When joining two tables, an inner join returns only the records which match; any records which cannot be joined are excluded. For this reason, many people only write left or right joins, which automatically return all of the rows on the left or right hand side, respectively. These joins are generally slower than inner joins and may not be necessary. Evaluate your joins to see if the record count changes when you do an inner join before you settle on a left or right join.
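
To make this concrete, here is a minimal sketch using two hypothetical tables, Customer and Orders. The first query returns only matched rows; the second returns every customer whether or not they have orders.

-- Inner join: returns only the customers which have at least one order
SELECT c.CustomerID, o.OrderID
FROM dbo.Customer AS c
INNER JOIN dbo.Orders AS o
    ON o.CustomerID = c.CustomerID;

-- Left join: returns every customer, with NULLs in the order columns
-- for customers which have no matching orders
SELECT c.CustomerID, o.OrderID
FROM dbo.Customer AS c
LEFT JOIN dbo.Orders AS o
    ON o.CustomerID = c.CustomerID;

If both queries return the same number of rows, the left join is not buying you anything and the inner join will do.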

2. Do not Loop through rows of data

SQL is designed to process data in sets, not one record at a time.  In fact, it does not work well processing one row at a time.  Looping through records one at a time in a cursor or a while loop is to be avoided.  Think about how you might write the code so that it processes all of the records at one time.
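
Here is a small sketch of the pattern to avoid and its set-based replacement, using a hypothetical dbo.Orders table:

-- Row-by-row (avoid): a cursor which updates one record at a time
DECLARE @OrderID int;
DECLARE order_cursor CURSOR FOR
    SELECT OrderID FROM dbo.Orders WHERE ShippedDate IS NULL;
OPEN order_cursor;
FETCH NEXT FROM order_cursor INTO @OrderID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Orders SET OrderStatus = 'Late' WHERE OrderID = @OrderID;
    FETCH NEXT FROM order_cursor INTO @OrderID;
END;
CLOSE order_cursor;
DEALLOCATE order_cursor;

-- Set-based (preferred): one statement processes all of the rows
UPDATE dbo.Orders
SET OrderStatus = 'Late'
WHERE ShippedDate IS NULL;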

3. Avoid Nesting SQL code

SQL is not an object-oriented language.  This means performance and readability do not improve when views, stored procedures, or cursors are called from other objects. Try to do all of the work in one place.  For more information on nested views, check out my post on the topic. A quick sketch of what I mean appears below.
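
With hypothetical objects, compare a view stacked on a view with a single query which does all of the work in one place:

-- Nested views (avoid): each view hides another layer of work
CREATE VIEW dbo.vSales AS
    SELECT OrderID, CustomerID, Amount FROM dbo.Orders;
GO
CREATE VIEW dbo.vBigSales AS
    SELECT OrderID, CustomerID, Amount FROM dbo.vSales WHERE Amount > 1000;
GO

-- One query (preferred): everything is visible in one place
SELECT OrderID, CustomerID, Amount
FROM dbo.Orders
WHERE Amount > 1000;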

4. Do not use column names which include spaces or reserved words

While you can get away with using spaces or reserved words by putting the name of the column in square brackets, most people are not used to writing code this way, and it is inevitable that someone will leave off the brackets and the code will not work. Have pity on those who will be supporting your code and do not include spaces.  If you are unsure of what the reserved words are, see if they change color in the editor.
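
A small example with hypothetical names shows the difference:

-- Spaces and reserved words force brackets on everyone who follows
SELECT [Order Date], [Desc]
FROM dbo.[Order Details];

-- Names without spaces or reserved words need no brackets at all
SELECT OrderDate, OrderDescription
FROM dbo.OrderDetails;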

5. Use Sets in SQL and Avoid Temp Tables

Someone who attended a class I taught recently told me that she didn't think it was possible to write a stored procedure without a temp table, because all of the stored procedures at her work had them.  A common pattern I see is processing a bunch of records, putting them in a temp table, and then processing them again. This pattern should be avoided. You can process records in one query. Look at writing more complex queries which look at all of the data in one query.
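
Here is a sketch of the two-step temp table pattern and the single query which replaces it, again with hypothetical names:

-- Two-step pattern (avoid): stage rows in a temp table, then process again
SELECT CustomerID, Amount
INTO #RecentOrders
FROM dbo.Orders
WHERE OrderDate >= '2019-01-01';

SELECT CustomerID, SUM(Amount) AS TotalAmount
FROM #RecentOrders
GROUP BY CustomerID;

-- One query (preferred): filter and aggregate in a single statement
SELECT CustomerID, SUM(Amount) AS TotalAmount
FROM dbo.Orders
WHERE OrderDate >= '2019-01-01'
GROUP BY CustomerID;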

Resolving Common SQL Pitfalls for People who Develop Lots of Code

Sometimes all it takes to figure out how to write SQL using sets is to have someone show you how.  If you are in Phoenix, I can show you how on February 5th when I am doing a talk at Galvanize on this topic. I would be happy to show you some tips and tricks which would help gather and analyze data for an experiment.  Not in the area or can’t make it?  Check out my class at DataCamp.  You can try the class for free to see what you think. I look forward to hearing your feedback.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Using Power BI without a Power BI Account or Active Directory Entries

Believe it or not, there is a way to use Power BI without signing up or having an account.  With an embedded application, you can use one Power BI ID and have all users access reports through that account. Now, before you judge this concept as a horrible security idea, keep on reading, as you may change your mind. In this embedded application, all users log in with a unique id and password, and the data seen in Power BI is restricted based upon that id. How do you accomplish this seemingly incongruent task, and why would you ever do such a thing? Well, it turns out there is a logical reason for wanting to implement a Power BI application this way.

Creating a way to Securely Access Power BI Data for Customers

There are many companies which would like to provide Power BI reports that allow customers to interactively work with their data, but they don't want to create Power BI accounts for customers, as that can be a lot of work from an administrative standpoint.  For the same reason, these customers are not added to the corporate network, which means they are not added to Active Directory.  For example, if Desert Isle SQL contracts with Acme Corporation to create a custom conference display, Acme might want to show me a report of when the components were purchased, when they were modified, and when the order is in process and completed.  How do I show a Power BI report containing only my information? From an application design perspective, data from all of the customers should be stored in the same place, and Desert Isle SQL should only see their orders when logging in to Acme's site.  Here is the workflow that I want to implement.

 

Passing Login information to Power BI

When creating an embedded application, connecting requires a connection string.  It is possible to pass additional information in the connection string by modifying the gateway to use an effective identity and then passing the role information you want to use.  There are two configuration steps you need to complete to make this work.  The Power BI gateway needs to be configured to use CustomData through the Map User Names screen, and SSAS needs to be configured to use roles, as the role will restrict the data that users can access.  The CustomData string can contain a comma-delimited list of values, which can include the data I need to access the role.  In the DAX for the role, the CUSTOMDATA() function performs as if it were a field you can filter on, so the role can use an expression along the lines of DimTerritory[TerritoryName] = CUSTOMDATA(), or parse the territory out of the comma-delimited string.  This will restrict the data that a customer can see based on the territory they have been assigned.  The id can then be used to implement row-level security in Power BI with either the embedded data model or with Analysis Services Tabular. By using this method, you have the ability to restrict the data for each user and use one Power BI account, all at the same time.

 

Costs for Implementing Multi-User Power BI Systems

Unfortunately, this solution means that you are going to be purchasing the embedded version of Power BI as this functionality is not covered with a Pro License. Embedded applications require that you purchase an embedded license or have a premium account.  The pricing for embedded has changed quite a bit from 2017 when it was introduced.  Pricing is all about capacity, unless you use a Premium account.

Power BI can be implemented in a number of different ways, and this implementation is one that you may see more of in the future.  There are a lot of different ways that you can implement Power BI, and it is hard to keep up with all of the changes. If you are interested in learning more about advanced Power BI topics, join me in person in Boston for a full day of Advanced Power BI Training on Friday, September 21.  I look forward to meeting you there or anywhere else we might meet up.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Applying Data Science to SQL Server

Data has been getting a lot of attention in the business world for a while now. First there was big data, which was another way to store data so that it could be analyzed later. Recently the talk has been all about analyzing the data with new tools such as R and Python. The reality is that people who have been working with databases and doing business intelligence work have been analyzing data for a while. Learning a different toolset for analyzing data is not such a big leap, but an expansion of what they know.

As the field is rapidly expanding and demand is huge, now is a great time to learn the tools.

Traditional Data Science Development

Data scientists have created analysis solutions with data for a number of years. The data is analyzed, cleaned, and processed with various algorithms, and results are created. When the process is complete, code has been created to provide meaning from a portion of the data and is ready to be migrated to production. Traditionally there has been a big gap between creating a solution and implementing the solution to be run against data on a regular basis. Data scientists traditionally are not part of the IT organization; they are actuaries or analysts, not the people who have anything to do with system processing. Recently I did some work for a company, and after the data scientists were done creating a solution, they turned over all of their code to the Java team. Six weeks later the code was released into production. This solution made no one happy. Management thought it took too long. The data scientists didn't believe that the code they created was what was implemented into production, and the Java developers were tired of people blaming them for wrong code which took a long time to implement.

SQL Server Implementation of Data Science

Since SQL Server 2016 incorporates R, and SQL Server 2017 has added the ability to include Python code, data science solutions can be incorporated as part of a scheduled process within SQL Server. There is now a DevOps path for incorporating R and Python into SQL Server. One way of learning about the technology is through blogs and other online training, which can help you get up to speed. Many times, though, there is no substitute for hands-on learning. If you are attending PASS Summit 2017 and want to learn not only about data science but how to incorporate it into SQL Server, I hope you can sign up for my all-day training session on Applied Data Science for the SQL Server Professional. I hope to see you there.
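
To give a feel for what this looks like, here is a minimal sketch which calls R from T-SQL. It assumes R Services is installed, external scripts are enabled via sp_configure, and a hypothetical dbo.Orders table exists:

-- Run an R script inside SQL Server and return the result as a rowset
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(AvgAmount = mean(InputDataSet$Amount))',
    @input_data_1 = N'SELECT Amount FROM dbo.Orders'
WITH RESULT SETS ((AvgAmount float));

In SQL Server 2017 the same pattern works with @language = N'Python'. Because it is just a stored procedure call, it can be scheduled like any other SQL Server job step.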

I have recently created a YouTube channel where I plan on sharing more data related content, and I have included my first video there about this conference.

If you are at PASS Summit, please introduce yourself as I would love to meet people who read my blog personally.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Database Table Design

There are a number of different ways that you could decide to organize your data in a database. If you are creating a database to be used in a transactional system, your table design should follow a normalized design as much as possible.  Data should be grouped in logical groups, such as customers, products, sales, orders, quotes, tickets, etc.  Redundantly repeating data in multiple places will cause problems in the future.  Your design may include hundreds of tables, and that is perfectly fine.

If the purpose of the database is for Power BI or for a data warehouse, dimensional modeling techniques should be deployed.  In this type of database design, generally speaking, there are a number of tables containing descriptive data, such as product and customer, and a few fact tables which contain the actions which happened.  The actions include things like sales.  The database design will look much like a star, with the fact tables in the center and the dimension tables connected to them like satellites.  If you have one dimension connected to another dimension, that design is called a snowflake, and some applications, like Analysis Services Multidimensional, will not process it well.  Power BI and Analysis Services Tabular work very well with snowflake dimensions.

Snowflake Data Model from Power BI
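
To make the star shape concrete, here is a minimal sketch of one fact table and two dimension tables; the table and column names are hypothetical:

-- Dimension tables hold the descriptive data
CREATE TABLE dbo.DimCustomer (
    CustomerKey  int IDENTITY PRIMARY KEY,
    CustomerName varchar(100)
);
CREATE TABLE dbo.DimProduct (
    ProductKey  int IDENTITY PRIMARY KEY,
    ProductName varchar(100)
);

-- The fact table in the center records the actions, such as sales
CREATE TABLE dbo.FactSales (
    SalesKey    int IDENTITY PRIMARY KEY,
    CustomerKey int REFERENCES dbo.DimCustomer (CustomerKey),
    ProductKey  int REFERENCES dbo.DimProduct (ProductKey),
    SalesAmount decimal(18, 2)
);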

 

Table Design Gone Wrong

To paraphrase Ron White, the reason that I described database modeling is so that even people who know nothing about database design can appreciate my interview story. When I was working at a previous location, I assisted in providing technical reviews for database developer jobs.  We asked a number of typical questions about indexing and stored procedures, but I always tried to come up with at least one question which the candidate could not readily answer by cramming interview questions found on the internet.  I decided to ask one candidate, who had correctly answered the previous stock questions, something that would let us know what kind of work he had really done.  I asked him, “What do you do to determine how to design a table?” I was interested to find out what his thought process was, and to see if he would mention normal form or describe something he had done in the past.  I was completely surprised by his answer.

“Well, you can only have 256 columns in a table. After that you have to create a new one.”  This answer was a complete surprise.  I was really curious to find out where he had developed this completely warped view of how to determine which fields should go in a table.  It turned out that he had learned all of his database skills from a co-worker who had recently retired.  His co-worker had worked at the same location for a very long time, starting on mainframes without any databases.  He had migrated some of the applications to databases, and they wrote them this way because it “made sense”.   After that, the interview was over, and we hired someone else.

I challenge anyone who is learning databases to look up what people tell you to do on the internet.  This is useful for two reasons. The first is that it will probably help you learn the concept better to read about it another way.   The other reason is that you can find out whether the person teaching you really knows what they are doing, so you will learn the correct way to do something.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Power BI – Beyond the Basics

When helping clients recently with their Power BI implementations, I have noticed that when talking to people about Power BI, there seem to be some areas where there continues to be a lot of questions.  While it is easy to find a plethora of information about getting started with Power BI, when it comes to implementing a solution, the information is scarce.  How do you handle releases? Should an implementation contain only one data model? Is Power BI's data secured on the cloud? Is Office 365 required to use Power BI? Do you have to have Power BI Premium to run Power BI locally?

Advanced Power BI Techniques in Norway

While I have discussed some best practice techniques on my blog, as usual, new features released in Power BI have a tendency to change some of the available options.  For example, App Workspaces, the updated take on Content Packs released a few months ago, now offer a new method for releasing not only dashboards but the reports behind them, and the ability to easily migrate sources. I am excited that I will have the opportunity to discuss the answers to the questions I have received by doing a full day of training at SQL Saturday Oslo. I am looking forward to visiting Oslo, which is home to the best preserved Viking ship, an opera house designed to be walked on, and the home of the guy who painted The Scream.  If you happen to reside somewhere where it is possible to make the journey to Norway, please register to attend this full day of interactive training.  We will cover all of these items and go into detail about Power BI administration, security, new features, and design techniques which will improve your Power BI implementation.

Norway Parliament Building in Oslo

For those of you who are unable to attend, I feel obliged to answer some of the questions I posed earlier.  Implementations generally require more than one data model.  Power BI data is encrypted both in transit and at rest. You do not need to have Office 365 to run Power BI.  Power BI can be run locally with Power BI Report Server, which is part of SQL Server 2016 Enterprise with Software Assurance, and you do not need to sign up for Power BI Premium to install it.

I hope to see you in Norway.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Data Platform MVP

I am very excited to be able to announce that Microsoft has made me a Data Platform MVP. This is a big thrill. The right words escape me, so I will have to make do with these.

If I Only had a Brain

I love this song from The Wizard of Oz.  Unfortunately, the scarecrow never gets a brain; instead he gets an honorary degree.  I wish having an MVP award would make me smarter, but unfortunately, it does not do that.  Frankly, it means I am in very intimidating mental company, as when I wrote this there were only 370 Data Platform MVPs. Most likely I need to learn a lot more, and maybe write a book, so I can keep up.


One thing I do try to do is share what I know by blogging and speaking, if for no other reason than I don't want to be a hypocrite.  When I was learning SSIS, the person leading the project was tuning SSIS and would not show me how.  He obfuscated, and made SSIS tuning out to be wizardry. I thought to myself at the time that he should tell me what he knew, as that is what I would have done.  Later I found out the rules and gave a few talks about SSIS, including one for the PASS Data Warehousing and Business Intelligence Virtual Chapter which was recorded here.  If I learn something, I want to tell other people, which is why I blog and speak.  I think this is the greatest profession in the world, and I feel bad for people who have chosen to do something else, as they are missing out on the data platform stuff.

Keeping Up

There are a ton of new technology things to learn coming up all the time. I keep up as much as I can and when I do learn something, I tend to blog or speak about it.  If you subscribe to this blog or follow me on twitter, hopefully keeping up will be easier.  I don’t want Microsoft to think that they made a mistake, so I plan on trying to increase the number of blog posts and speak when I am afforded the chance.

SQL Saturday Phoenix

I wanted to make sure to talk about the next place I will be speaking, SQL Saturday Phoenix, the largest data-related technology event in the state of Arizona.  I know it is going to be a great event thanks to Joe Barth and the rest of us on the organizing committee who have volunteered to make it so.  The Arizona SQL Server Users Group was where I learned about the SQL Server community and where I started to really get motivated to start learning, and I am happy to be a part of it. I hope to see you there.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

2016 Releases of SQL Server and Excel

It's been 2016 for over a month now, so hopefully you are no longer changing your 5 into a 6. Chances are you haven't changed your software to reflect the new year yet either. Office 2016 came out in September of 2015, and SQL Server 2016 hasn't been released yet. It's hard to keep up with all of the version changes that have come out, especially when you throw in Power BI, which has something new every month. If you know you are going to be upgrading to one or both of these versions, or want to learn more about SQL Server or Excel so you can decide if the upgrade is worth the effort, this week I might be able to help out. I am going to be talking about the 2016 version of SQL Server on Wednesday, February 10 at the Arizona SQL Server User Group meeting, and then about the 2016 version of Excel on Thursday at the Excel BI SQL PASS Virtual Chapter. If you are not in Arizona right now, you are missing out, as we are having Department of Tourism weather of 80 degrees. You can get back to me in August when I am melting in the 115 degree heat.

Polybase in SQL Server 2016

Since there are many new features to talk about in SQL Server 2016, I picked PolyBase. As big data matures, many places are looking to keep their structured data right where it is and create an HDFS cluster to store their other data. PolyBase allows SQL Server 2016 users to query both in one place.
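
As a hedged sketch of the idea (the data source address, file layout, and names here are hypothetical), PolyBase lets you define an external table over HDFS files and then query it with ordinary T-SQL:

-- Point SQL Server 2016 at a Hadoop cluster
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.10.10.10:8020');

-- Describe how the files are laid out
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- The table's data lives in HDFS, but it queries like any other table
CREATE EXTERNAL TABLE dbo.WebLogs (
    LogDate datetime2,
    Url     varchar(500)
)
WITH (LOCATION = '/weblogs/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = CsvFormat);

SELECT COUNT(*) FROM dbo.WebLogs;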

Excel 2016

It's been a while since September 22, 2015, the date Office 2016 was released, but I still know very few people who have upgraded. I've been to a few clients that hope to upgrade to Excel 2016 this year. In this session, I will show where things got moved and renamed, what's new, and what is on the deprecated list. If you don't have 2016 installed yet, or if you do and wonder where Power Query went, please join me to hear all about it. Generally speaking, the Virtual Chapters are posted on YouTube, and when this one is, I will have a link available. Unfortunately for those who attended my last Excel BI Virtual Chapter meeting, due to technical difficulties that recording is not available, but hopefully this time everything will work.  When the recording is available, I will make sure a link is available on my blog for those who can't make it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Lessons Learned About Speaking

As an attendee at PASS Summit, I had the opportunity to learn about a wide variety of topics, including public speaking. I'll be devoting other blogs to the great technical things I learned, but I thought I would start by talking about the sessions in general. I saw a number of presentations, some of which went well, others of which were beset by technical difficulties. By far the best talk I saw was the keynote with Rimma Nehme and David DeWitt of Microsoft. The presentation was well rehearsed without sounding canned, and the slides were absolutely amazing. You can check out the slides here, as they are publicly available. I am going to remember what made this talk work, and hope to incorporate what I saw when I speak next. If you are interested in where that will be, check out my Engagement page, as maybe we can meet sometime.

Speaking Techniques

I saw a number of different speaking techniques employed at Speaker Idol. People were really creative. Todd Kleinhans navigated through a game interface. Wes Springbob did an homage to The Hitchhiker's Guide to the Galaxy. By the way, if you haven't read the series, I think you should, as they are great books. I was surprised that not all of the judges had read the books, but even those who hadn't thought he gave a great talk. I demonstrated that I had never used a microphone before, which was not positive. Bill Wolf worked to engage the audience throughout his talk. Ed Watson videotaped his demo. Recording your demo is something I have often heard you should do in case your demo crashes, but this was the first time I have seen anyone actually do it. William Durkin brought great stage presence, which I noticed was a common theme among all of the talks I liked. Effective presenters know their topic so well that the talk appears effortless and fun, without seeming like a memorized script they are working to run through. Also, remember to give the audience a point to follow during the presentation, so they remember what the talk is about midway through. Everyone who did this I thought did a great job.

Speaker Idol Results

The finalists for Speaker Idol were William Durkin b | t, Theresa Iserman t, and David Maxwell b | t. My name was not there, due to my issues with the microphone, which put me off my game. Also, despite my goal of not adding useless words, I threw many an "um" and "so" into my talk. In my round, William did the best job, so it was logical that he went forward. I talked to David and Theresa about their respective talks, and I know they put in a lot of work and practice to make them really good. David was the winner, so I look forward to seeing him at PASS Summit 2016 giving the talk of his choice. As for me, I hope to follow the pattern of fellow Speaker Idol 2014 non-winner Reeves Smith b | t, who spoke at PASS Summit for the first time this year the old-fashioned way: by picking a good topic and writing a good abstract for it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Talking about the latest Power BI Update

On August 20, Microsoft released the latest update to Power BI Desktop. There are some great new features, some of them little but handy, like the ability to resize columns in the data view, and others a bigger deal, like the ability to import Power Pivot models. The data modeling feature which has me most intrigued is the ability to move columns from one table to another. That is not a feature found in most other data mashup tools. Moving columns, I think, provides a more forgiving environment, which may encourage people new to the topic to play around with it.

When to use Power BI Desktop or Excel

I've been writing a lot about Power BI recently and have received some questions on my blog, but I am thinking there may be some more, especially about items in the latest release. There are a lot of interesting tools in Power BI and Excel. Having two tools means there are reasons, based on your unique environment, you may wish to use one or the other. If you have questions about this decision process, or other Power BI questions, I hope you can make time to ask them on August 31 at 10:00 EDT, when I will be speaking at the next SQL PASS Excel Business Intelligence Virtual Chapter meeting, giving a talk titled Power Pivot – the Gateway to Power BI. The SQL PASS organization, which I am a member of and I hope you are too, has a number of virtual chapters around all things data, including Excel Business Intelligence, which provide the SQL community with great information and training resources on a number of different topics. Registration for this talk is free, and I hope you will be able to attend and ask any Power BI related questions you may have.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur