Articles for the Month of December 2015

Dec 31 2015

2015: Year End Wrap up for Releases and More

Author: Ginger Grant • Discussion: No Comments

As 2015 draws to a close, I started thinking back about everything that has happened this year. Technically this has been a big year, as many new applications were released. Here are just some of them, with links included to provide more detail.

Azure Machine Learning – February 18
Power BI – July 10
Cortana Analytics – July 13
Data Factory – August 6
Office 2016 – September 22

This short list could be a lot longer, as it doesn’t count the updates released to Power BI, which occur several times a month, the CTP releases for SQL Server 2016, the new web version of BIML, or PowerShell. It’s really hard to keep up with everything that is changing. It’s a good thing that so many people are willing to help others learn through speaking and blogs, which make learning new things easier.

Community Involvement in 2015

Keeping up with all of these changes is difficult, especially given the pace of releases. I spend a lot of time reading various blogs, watching videos and going to hear people speak. I also have been able to talk about topics of particular interest, mainly Power BI and Machine Learning. This year I spoke at a number of different events, including Speaker Idol, two different user groups, seven webinars, five SQL Saturdays and other tech events. I’ve got a number of engagements on the books for next year, including PASS BA Con and SQL Saturday #461 – Austin. 2016 is shaping up to be busy too and hopefully our paths will cross. I list all of my speaking events on my Engagement Page and I hope that you might take a look at it from time to time if you are interested in catching up in person sometime. Next year I am hoping my list of speaking engagements changes somewhat, as I plan on trying harder to get accepted to speak at events where I submitted and was turned down in 2015.

On a more positive note, views of my blog are up 1000%, and the number of website subscribers has more than doubled. Thank you very much for continuing to read this site and I hope you find my thoughts helpful. I posted once a week this year, which I thought was pretty good until I talked to Ken Fischer (b | t), who blogs twice a week. I’ll have to try harder next year. If you think of a topic that would make a good blog post, let me know, as I am always interested in feedback.

Keeping Up the Pace in 2016

Next year there will be no slowdown in the things to learn, as SQL Server 2016 is going to be released. Although the exact date has not been announced, my sources tell me to look for it around May or June. The next release of SQL Server is going to be huge, as it will include new tools Microsoft added to integrate Big Data and open source platforms with SQL Server. PolyBase, JSON and R are all going to be part of SQL Server. Personally, I find the R integration most interesting. Datazen and SSRS are going to be integrated in the next release too, which should really increase the implementation of mobile reporting visualizations.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Dec 29 2015

Power BI Content Packs and Workspaces

Author: Ginger Grant • Discussion: 5 Comments

Recently I was talking with a client who had questions regarding how to integrate Power BI Organizational Content Packs and Workspaces. Working on the principle that he was not alone in wondering how these Power BI features work together, I have provided that information here, along with some links which you may find useful.

Organizational Content Packs in Power BI

Content Packs were released this summer for Power BI, and if you are thinking about using them, remember that you will need a Pro license for every user who needs to access them. If you are interested in providing users with a read only dashboard, containing data and pre-built links to access it, Content Packs can provide that capability. Consuming the data source provided in a content pack does not allow users to modify anything about the data source, such as the refresh schedule or the id used to access the data. Any user can create a content pack and publish it to the entire organization or to a list of emails, both of which can be somewhat unwieldy. A better way to distribute content packs would be to expose them via workgroups.

Workspaces in Power BI

I talked about workspaces in a previous post, but I wanted to include a shortened version here. If you do not have an Office 365 Exchange client, you will not be able to create workspaces, also known as distribution groups, as these are created as part of Outlook and used by Power BI, but are not a native Power BI feature. Workspaces make sense when you have groups of users who work on the same reports or look at the same data. For example, the accounting department, HR, and Operations may all want to look at different kinds of data, so it might make sense to create a Workspace for each of them and store the specific reports for each group in its Workspace.

Releasing Read Only Dashboards and Data Security

When releasing a dashboard via a content pack, people who have been granted access to the content pack are able to view a read only copy of the dashboard. If they wish to modify the dashboard, they will have to make a copy of it, then modify the copied version. Everyone who looks at the dashboard, since they are using the data connection in the content pack, will have the ability to look at all of the data in that connection. If instead you want to use a security model defined at the source for each user or group of users, the data source must be an on-site SSAS tabular model. Only by using an SSAS tabular model can you provide different data access to people viewing a dashboard with content packs. (Update: Power BI Enterprise Gateway will use local security. For more information see this link.)

Only the person who created the content pack can delete the content pack. No other person can delete it, but they can break the content pack by removing or altering the data connection the content pack uses. If the person who created the content pack doesn’t work at your company anymore, it is possible to have orphaned content packs that no one can delete. I hope at some point this is fixed, as I think an administrator or someone else should be able to delete the content pack, but Power BI does not work that way now.

If you have any other questions regarding these features or what I’ve posted here, please feel free to respond to this blog as I would be happy to hear from you.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Dec 23 2015

R – Why Learn it?

Author: Ginger Grant • Discussion: No Comments

I was talking to someone who is looking to change careers to pursue a technical field. She has been talking to people from various tech companies and has signed up for an intensive training bootcamp designed to teach enough programming to get a job after the class. She has been hearing all about various types of open source languages, but one which had never come up in previous conversations was R. That surprised me, since R is on the list of the most popular languages (SQL didn’t make the list) as well as the list of programming languages in high demand. She asked me why I thought R would be a good language to learn. Since I had the same thought myself when I started to learn R, I thought I’d answer it. R helps bring meaning to data through its ability to combine data analysis and visualization. Data is important because nearly every application, from FitBit to various flashlight apps, is for better or worse all about data.

Combining Analysis with Visualization

To get started learning R, I took a MOOC class on it. While this appeared to be a good idea, after being in the class for five weeks, I had no idea why the language was considered useful. All we did was load arrays of data into memory and then write some code which approximated aggregation and selection which could be done in SQL. After five weeks, I dropped the class as I had other time commitments and it didn’t seem worth it. I was still interested in finding out what the big deal was. After playing around with R and watching other people play around with it who could make it do a lot more than I could, I got it.

A Picture Is Worth a Thousand Words

The cliché about a picture is often repeated because it holds true. I have taught a number of Power BI classes, and during the class, we review a giant spreadsheet of data, and I ask the class to tell me what the impact is over time. With a giant spreadsheet, you can’t readily determine what the answer is. Once we create a visualization of data, it’s easy to see the answer to that question. R has been working on providing graphical answers to data questions for years. A number of different companies are realizing the value of R as well. Microsoft bought Revolution Analytics, a leading R provider, in 2015 and is rapidly incorporating R into other tools, like Power BI and SQL Server 2016. I’m convinced now and will be talking more about R in my blog in the future.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Dec 17 2015

I’m Not Good at Math

Author: Ginger Grant • Discussion: 3 Comments

How many times have you heard someone say, “I’m not good at Math”? Often times this statement is used as a reason why something technical cannot possibly be pursued. It’s a self-inflicted limitation; a reason that entire areas of study cannot be pursued. If you have ever said this, stop it. Don’t repeat it even if you believe you are not good at math. Why? Because while you may not be good at math now, there is no reason why that should stop you from learning it.

Math, Music and Programming

Years ago, back in the days before PCs and, more importantly, before computer science degrees were offered by major universities, IBM was developing mainframe computers and needed people to help build them. Since there were no computer science degrees being offered at that time, they hired people with degrees in Math and Music. Music? Why Music? Music uses the same part of the brain as math does. This is one of the reasons educators think that music should be taught to small children, as it has been shown to improve math scores. Personally I have found it interesting to ask technical people if they play or have played an instrument. Ask around yourself and you may be surprised at the large number of people in technical fields who play or have played a musical instrument. Musicians have the brain training needed to be good technical people, regardless of their math skills.

Learning Limits

There are no limits to what you can learn, other than the limits you put on yourself. The brain is very complex and there are infinite ways to train it to do something. Generally speaking, one is not good at math because they haven’t learned it. Oddly enough, discouraging one’s ability to learn often starts in school. If this sounds familiar, remember life isn’t school. Often times a school setting isn’t the best way to learn anything. Performance in class is not indicative of one’s ability to learn. It may have been the ability of the instructor to teach or your willingness to focus at that time. I am willing to bet you don’t view the world the way you did when you were sixteen, so why would you judge your ability to learn with that same filter?

Machine Learning is a Skill Which Can Be Learned

I know a very smart developer who told me recently that he wasn’t good at math, so he couldn’t possibly do machine learning. Really. PowerShell, Networking, TSQL, C#, SSIS, MDX and DAX you could learn but you can’t teach yourself Machine Learning? I am not going to say it is easy, but I wouldn’t say that about any of the other development and IT tasks either. If you can learn one of those, you can learn Machine Learning too, despite your belief in your math skills. There is no reason why not. I think Yoda said it best “Do or do not. There is no Try”. There is nothing really stopping you.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Dec 15 2015

Azure Stream Analytics Hopping – Part 3

Author: Ginger Grant • Discussion: No Comments

When incorporating streaming data from Azure Stream Analytics, it is important to select the data to accomplish the goals of the tasks at hand. The sample company Eosh is streaming a lot of sensor data and has a variety of questions which the data will answer. Eosh is a concrete delivery company which is streaming the data back from their vehicles to the central dispatch system using Microsoft’s Stream Analytics. There’s a much more detailed description of what Eosh is streaming and their data needs in the first post in this series. After reviewing when Tumbling Windows and Sliding Windows are used, in this post we are going to discuss another option for streaming data, Hopping Windows.

When to Use Hopping Windows

Eosh wants to use Hopping Windows to determine the previous action when water is added to the concrete mix. There is a flow meter sensor in the water tank which detects when the driver flips the switch to add more water. There are a number of different reasons for adding water, one being that the pouring is complete and the driver is washing out the remaining concrete. Another reason could be that the driver is stuck in traffic and the water is added to keep the concrete from setting up within the mixer. Depending on the type of concrete in the mixer, if too much water is added, the concrete will no longer have the required strength and can’t be used to create a load bearing structure. It is very important that structural concrete be mixed according to specification, as concrete mixed incorrectly will crumble over time, something commonly seen in Detroit. If too much water is added, the vehicle may be routed to a different location so the concrete can be used for a non-load bearing purpose, like creating sidewalks.

Overlapping Hops

By design, all hops contain an overlapping previous time slice. The picture provides a good visualization for how the data slices are created. Eosh wants to look at the events which happened 5 minutes prior so that the adding water event can be appropriately categorized. The following Stream Analytics query can provide that data:

SELECT System.TimeStamp AS OutTime, VehicleID, COUNT(*)
FROM Input TIMESTAMP BY WaterStartPour
GROUP BY VehicleID, HoppingWindow(minute, 10, 5)

This query creates 10-minute windows that advance every 5 minutes. Each window therefore covers the most recent 5 minutes plus the 5 minutes before that, which the previous window already reported. By slicing the data in this way, the context around adding water can be evaluated to determine what kind of water add event took place. Eosh can then use this data to determine if the concrete can be delivered to the original location or if it needs to be rerouted. This later processing will be accomplished via machine learning, which I will talk about in a later post.
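To make the overlap concrete, here is a minimal Python sketch of how a 10-minute window with a 5-minute hop assigns events to windows. It is only an illustration, not how Stream Analytics is implemented: the function names and the sample pour times are made up, and it assumes each window ending at minute E covers events later than E - 10 and up to and including E.

```python
from collections import Counter


def hopping_window_ends(event_minute, size=10, hop=5):
    """Return the end minute of every hopping window that contains the event.

    Assumption: a window ending at minute E covers (E - size, E], and
    windows end on multiples of `hop`.
    """
    # Round the event time up to the next multiple of the hop size.
    end = -(-event_minute // hop) * hop
    ends = []
    # Keep hopping forward while the window still starts before the event.
    while end - size < event_minute:
        ends.append(end)
        end += hop
    return ends


def count_per_window(event_minutes, size=10, hop=5):
    """Count events per window, like COUNT(*) grouped by a hopping window."""
    counts = Counter()
    for minute in event_minutes:
        for end in hopping_window_ends(minute, size, hop):
            counts[end] += 1
    return dict(sorted(counts.items()))


# Hypothetical water-pour events for one vehicle, in minutes since start.
pours = [2, 7, 8, 14]
print(count_per_window(pours))  # {5: 1, 10: 3, 15: 3, 20: 1}
```

Because the window is twice as long as the hop, every event is counted in two consecutive windows, which is what lets each window see the 5 minutes that came before.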

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Dec 13 2015

Non-technical Issues Impacting Data Based Decision Making

Author: Ginger Grant • Discussion: No Comments

Having worked with a number of clients to implement Power BI in their respective environments, I noticed that one factor appeared to be common to all. The success of the project depended greatly upon the relationship between the business analyst and the database team. Since this seems to be an issue which greatly impacts the ability to implement Data Based Decision Making, I decided to talk about it in my recent webinar for the PASS BA Marathon.

Too often I see companies which decide to join data together in an analytics platform, such as Power BI, and fail to take advantage of the separate skillsets in the organization. The data team has spent a considerable amount of effort and energy determining the best ways to combine datasets. Logically one would assume that this expertise would be leveraged to help the business team analyze data. Instead the business teams are tasked with joining data together. While this approach can work, it will take longer to train the business in areas with which they may not be familiar, and the results will be mixed, especially when considering scalability and maintenance needs over time. To leverage the capabilities of the self-service business tool, the data team needs to be engaged. Which tool doesn’t really matter, as the same issues will exist in, for example, Tableau as well as Power BI. The skills the data team has gathered over time allow them to design and plan a data model which can be refreshed automatically without causing issues.

Using Areas of Expertise

Business Analysts’ time is best spent using the unique skills they have gathered over time too. Their familiarity with the data values allows them to determine at a glance how the business is doing. Codifying this knowledge into meaningful reports which can disseminate this information throughout the organization provides the basis for data based decision making. To make them successful, they need a well documented data model which has all of the information they need, so that they can find the values required to provide meaningful data visualizations. Too often the report generation is left to the data team, and many times there is a reporting backlog, as there are not enough resources to provide all of the information a business needs.

Team Collaboration

Data Based Decision Making should be an organizational goal, as it has been shown to be a major tool for business success. When the Data Team and Business Analysts work collaboratively by using their specialized skills to create and implement a solution, this solution will be successful. The result will be a model which provides a path for the Business Analyst to continue to use the data to answer everything from routine questions, such as “How successful was the business last month?”, to more obscure questions, such as “What happened to sales volumes after a bad story in the press?”. These and many other questions can be answered by using the model and tools like Power BI to implement an enterprise wide solution.

Implementing Successful Data Analytics Management Practices

There is more to implementing a self-service BI tool such as Power BI than merely knowing how to make the tool work. A process and a commitment to work among teams is required as well. I enjoyed the opportunity to talk about integrating the tools with the company data management policies at the BA Marathon. If you would like to know more about this topic, please come join me at the PASS Business Analytics Conference in San Jose, May 2-4, as I will be going into more depth than was possible in the webinar.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Dec 09 2015

Limiting the Number of Results with TABLESAMPLE

Author: Ginger Grant • Discussion: 1 Comment

In the world of exploding piles of data, there are times you just need less, not more. If you are working on a proof of concept, just want to test out some code, or want to pass a selection of data to a machine learning experiment, you don’t need several million records. Having that much data will just keep you from getting to the answer in a reasonable period of time. There are a number of standard ways people limit the data, the most common being some kind of a date filter. Using a date range, though, often does not provide the variability needed. This is a particular problem with data used in a machine learning experiment, as it is designed to create an algorithm based on data patterns extrapolated over time. For example, if you are doing any kind of regression analysis on a retail client and you either exclude or include only the Christmas shopping season, the algorithm created will not be correct. The goal is to have less data for analysis, pulled from the entire set of data. Fortunately, SQL Server has had several methods for selecting random data since 2005.

Using TABLESAMPLE

Until recently, I hadn’t used the Transact SQL TABLESAMPLE clause, but I ran into a situation where I needed to test some things and not wait all year for a result. TABLESAMPLE to the rescue. There are a couple of things where it won’t work. If you have derived tables, tables from a linked server (bad idea), or are writing a view and you want to return random data you can’t use TABLESAMPLE. If those conditions do not apply, you can use it.

The random sample that TABLESAMPLE provides is based on the number of data pages, not the number of records. If you want the number of rows to be specifically limited, you will need to use TOP (n) as well. I’ve written all my samples based upon AdventureWorksDW so you can run them for yourself later. I’ve listed the variety of ways to call TABLESAMPLE and shown the number of records returned.

SELECT *
FROM [dbo].[FactInternetSales] -- 60398 rows returned prior to TABLESAMPLE
TABLESAMPLE(10 PERCENT)

(6073 row(s) affected)
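The row count is close to, but not exactly, 10 percent because whole pages are included or skipped rather than individual rows. The following Python sketch simulates that idea outside of SQL Server. It is only an illustration under stated assumptions: the function name is hypothetical, and the figure of 50 rows per page is invented, since the real number depends on the table’s row size.

```python
import random


def sample_by_page(total_rows, rows_per_page, percent, seed=None):
    """Simulate page-level sampling and return the number of rows selected."""
    rng = random.Random(seed)  # a fixed seed makes the result repeatable
    full_pages, leftover = divmod(total_rows, rows_per_page)
    page_sizes = [rows_per_page] * full_pages + ([leftover] if leftover else [])
    # Each page is either included whole or skipped whole.
    return sum(size for size in page_sizes if rng.random() < percent / 100)


# 60398 rows as in FactInternetSales; 50 rows per page is a made-up figure.
for _ in range(3):
    print(sample_by_page(60398, 50, 10))       # varies from run to run
print(sample_by_page(60398, 50, 10, seed=11))  # same count each time
print(sample_by_page(60398, 50, 10, seed=11))
```

Because rows arrive in page-sized blocks, the total moves in jumps around the requested percentage rather than landing on it exactly.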

Let’s say you want to return the same sample set multiple times. For that you will need to supply a seed value with REPEATABLE. I picked 11, but you could pick any other you like.

SELECT *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]
TABLESAMPLE(10 PERCENT)
REPEATABLE (11)

(6489 row(s) affected)

When looking at the number of records returned, the values are not 10 percent exactly or particularly consistent in the number of rows returned. If you only want 6039 records returned, you can try the following code, but it doesn’t really do what it says.

SELECT *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]  
TABLESAMPLE(6039 ROWS)

(5640 row(s) affected)

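The following code combines TOP with TABLESAMPLE to return the exact number of records desired, as long as the sample contains at least that many rows. Notice I upped the number of rows sampled to 6200 in order to get 6039 rows. If the sample size were 6039, you could not guarantee that enough rows would be returned.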

SELECT top 6039 *
FROM [AdventureWorksDW2014].[dbo].[FactInternetSales] --60398
TABLESAMPLE(6200 ROWS)

(6039 row(s) affected)

Really Random Sample

TABLESAMPLE has some really useful functionality, but for machine learning I need a truly random sample, which TABLESAMPLE does not provide. For that I need NEWID(). The following sample returns approximately 1% (.01) of the 60398 rows.

Select * from [AdventureWorksDW2014].[dbo].[FactInternetSales]
Where 0.01>= Cast(checksum(newid(), [ProductKey]) & 0x7fffffff as float) / Cast(0x7fffffff as int)

Just for fun I ran the same code 3 times and got a variety of rows returned.

(600 row(s) affected)
(607 row(s) affected)
(622 row(s) affected)

The ProductKey is added so that the NEWID() function will calculate a sample for each row. The WHERE clause calculates a random float between 0 and 1 for each row and keeps the row when that value is at most 0.01. This will truly give me the random sample I would need for a machine learning experiment.
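The same row-by-row idea can be sketched in Python to show why the count hovers around 1% of 60398 (about 604) without ever being fixed. This is a rough analogue rather than what SQL Server does internally: the function names are hypothetical, and a random 31-bit integer stands in for the masked CHECKSUM(NEWID(), ...) value.

```python
import random


def keep_row(rng, fraction):
    """Decide independently for one row whether it joins the sample."""
    # A random 31-bit integer stands in for CHECKSUM(NEWID(), ...) & 0x7fffffff.
    value = rng.getrandbits(31)
    return fraction >= value / 0x7FFFFFFF


def random_sample(rows, fraction, seed=None):
    """Return the rows that pass an independent random test per row."""
    rng = random.Random(seed)
    return [row for row in rows if keep_row(rng, fraction)]


rows = range(60398)  # placeholder row ids, one per FactInternetSales row
for _ in range(3):
    print(len(random_sample(rows, 0.01)))  # near 604, different on each run
```

Since every row gets its own random draw, the sample is spread across the whole table instead of being clustered on whichever pages happened to be chosen.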

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Desert Isle SQL

Just a three hour tour…
