Getting Started with R

About two years ago, because I kept reading that R was the language for analyzing data, I signed up for an online class in R. I took the class for four weeks, and then I got busy and couldn’t finish it. Maybe they would have gotten around to the good parts if I had stayed longer, but in those four weeks I had no idea why you would want to use R. The classes had various projects to load data into memory and analyze it, which I did while thinking “I can do this in Excel, so why is this cool?”. I didn’t see anything that showed me why R was the tool for analyzing data. A little while later, I heard someone talking about visualizing data in R, and then I saw why people were so excited about it. This video from Revolution Analytics, which Microsoft bought last year, shows some of the cool visualizations you can do with R. Since Microsoft is including R support in SQL Server 2016, now might be a good time to start learning it.

Starting R from a SQL Perspective

R support in SQL Server 2016 will arrive with the CTP3 preview. When I wrote this post, CTP3 was not yet available, which is why all of my samples are created in R Studio, a free open source tool (go ahead, click on the link and download it, then come back). The application generally uses four quadrants. The top left contains the code editor; below that is the console, where you see the results of the code you run. Like SSMS, you can run what you highlight, which may or may not be everything in the code editor. R Studio reminds me a bit of PowerShell, as R has good help but doesn’t always tell you whether your command contains an error. The top right contains the workspace, including data which has been loaded, and the bottom right contains tabs for showing your graphics or the help files. R includes data for you to play with to get started; just type data() to see the data sets, so you don’t have to go fishing for data sets on the internet.

Visualizations in R

If this had been the first example I saw with R, I would have better understood why people are paying more for R developers than C# developers, as it doesn’t take much to get started charting your data. You do have to load the libraries for the functions you want to use. I chose a selection of BMI data from gapminder for my sample: a simple text file which contains Country, Year and BMI. First I reference the libraries beeswarm and ggplot2, then load the data and assign it to a variable. After that, I call the beeswarm function and have it plot the data, then provide a legend to see which country is which. Once the data is plotted, I call the boxplot function to overlay the data. Not much code, and a lot more satisfying than Hello World.

 # Install the packages once; comment these two lines out on later runs
 install.packages("beeswarm")
 install.packages("ggplot2")
 library(beeswarm)
 library(ggplot2)
 # Point the working directory at the folder holding the data file
 setwd("D:/files/code/r/WorkingDir")
 BMIFemale <- read.csv("d:/Files/code/r/BMIfemaleCountriesSelected.csv")
 #View(BMIFemale)
 # Plot BMI by year, one point per observation, colored by country
 beeswarm(BMI ~ Year
 , data = BMIFemale, pch = 20
 , ylab = 'BMI', xlab = 'Year'
 , pwcol = Country
 , labels = c(BMIFemale$Year)
 )
 legend('topleft', legend = levels(BMIFemale$Country), title = 'Country',
 pch = 20, col = 1:50
 )
 # Overlay a semi-transparent boxplot on top of the beeswarm points
 boxplot(BMI ~ Year, data = BMIFemale, add = TRUE
 , col = "#0000ff18"
 )

Why Should I Use R?

I hope this post provided a quick way of getting started with R. I will be devoting more space on this blog to other reasons you might want to use R, especially when it comes to Machine Learning, and to explaining R in relation to what you can do with SQL or other tools like Excel, so you can better understand its place in the data environment. If you are interested in reading my upcoming posts on this topic, feel free to subscribe to my blog to get the updates.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Datazen – The On-site BI Option

I have talked to a number of people who looked at Power BI Preview, read that it will be replacing Power BI for Office 365 as a cloud solution for mobile self-service Business Intelligence [BI], and thought: what are we going to do? Some organizations are not going to go to the cloud. Since they have made that business decision, they were wondering what Microsoft was planning to offer in the non-cloud space. I was curious too, so I asked my local Microsoft Technical Sales Representative what the plan was for people who wanted on-premises BI. The answer to that question is: use Datazen. Datazen is positioned to be the application for people who don’t want the cloud but do want to create a rich environment for doing self-service BI, which of course means that you can send the reports to tablets and phones.

Datazen vs Power BI Preview

When comparing Datazen and Power BI Preview, there are a number of features which will sway people one direction or another. For example, Datazen doesn’t integrate with Power Pivot models; you have to flatten the model to connect. On the plus side, Datazen has a very robust security and distribution model, which Power BI Preview does not. There are more visualization types in Datazen, and the snap-to-grid feature makes using them a breeze. Power BI has the ability to connect with QuickBooks and other data sources such as Salesforce, which Datazen cannot do. Datazen is included with a current SQL Server maintenance agreement, while Power BI will have an ongoing per-user monthly fee. Both allow users to pick their own colors, rather than limiting the selection to a series of color palettes, and both deploy to tablets and phones in addition to a web page. All of these factors, plus a whole lot more than I have space for here, weigh into which product you may select. If the foremost criterion for mobile self-service BI is no cloud, Microsoft’s solution in the self-service BI space is: use Datazen.

Future is Cloudy

Recently most of Microsoft’s new products have been very cloud-based. Azure Machine Learning is a great example of this, as nowhere on Microsoft’s site can you find any documentation about plans to move it out of the cloud. The purchase of Datazen shows Microsoft is still accommodating people who just want to stay in-house, a point I missed in their announcement. I’ll be reading the upcoming releases a little more carefully from now on, as I am sure there will be many more.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Designing Dashboards with Datazen

Dashboard creation in Datazen uses the Publisher application, which you can find here. My previous post walks through installing it in detail. When you open the Datazen Publisher application, you will notice that it has a mobile interface, not a Windows interface, which means using it can be different from what you may expect. To create a dashboard, you will need to click the menu on the bottom of the screen. If there is no menu bar on the screen, right click anywhere and the menu bar will appear. There are two options available when you get started: either an empty dashboard or an existing dashboard, which provides a design template as shown in the picture of the screen.

Setting Up Excel as a Data Source in Datazen

Your data can come locally from Excel or from the Datazen server. I picked Excel. The navigation is not the Windows-like navigation I am used to, as I had to click on the libraries to get a drop-down menu to get to my data. The first spreadsheet I selected contained a Power Pivot data model. Datazen doesn’t appear to read Power Pivot data models, as it was unable to access the data. First I needed to modify my Excel spreadsheet to create a flat pivot, and then Datazen could use that spreadsheet as a source. Then I tried to create some KPIs, and found out that Excel cannot be used as a data source for KPIs in Datazen unless you configure Datazen Server to use Excel like a SQL Server data source. Good to know, and hopefully you will read this before you decide to use Excel as a data source, which will save you time.

About Datazen Visualizations

After all of your data sources are defined, it’s pretty easy to set up visualizations from the selections on the left. One thing that I really like about the Designer is the snap-to-grid function, which makes laying out content much more organized. I wish other dashboard design tools I use had a grid. When a visualization is selected, the settings for configuring it are on the bottom of the screen. The output can be designed for a phone or a tablet from the master view. This feature makes a lot of sense, as the screen size changes the way the display looks and your ability to click on it. The maps are pretty nice out of the box. Here’s a time-saving tip when creating maps in Datazen: make sure that you spell out the state names when you want to use them. The visualizations you create will allow you to drill into other reports, similar to other reporting applications. It’s not really meant to be self-service BI, but a tool to present dashboards to business users and provide them the level of detail desired by creating deeper-dive reports. As it was designed for a mobile environment, touching the screen is how users interact with the visualizations, which means thinking about finger size when creating phone applications so that interaction is possible for all users.

Try Datazen Yourself

I hope you found this post useful and take the time to explore the product yourself. I am sure Microsoft has a lot of plans for Datazen, one of which is backwards compatibility. Datazen now has a preview of the Publisher for Windows 7, along with the released product for Windows 8, which will allow more people to use it. If you are exploring dashboarding tools, especially for a mobile deployment, this could be the tool you end up selecting.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


DIY Machine Learning – Supervised Learning

When I first heard about supervised learning, I had a picture in my head of a kindergarten class with a teacher trying to get the small humans to read. Perhaps that isn’t a bad analogy for Machine Learning in general, as it is based on the same principles as school: repetition and trial. The analogy falls apart, though, when you get to the specific criteria needed for supervised learning. Machine learning falls into two broad, binary categories: Supervised and Unsupervised. This means you only have to know the one set of criteria for supervised learning to determine which type you need.

Training Data

A problem solved with supervised learning will have a well-defined set of variables for its sample data and a known outcome choice. Unsupervised learning has an undefined set of variables, as the task is to find structure in data where neither the structure nor the type of outcome is apparent. An example of supervised learning would be determining whether email is spam or not. You have a set of emails which you can evaluate by examining a set of training data, using elements of the email such as recipient, sender, IP, topic, number of recipients, field masking and other criteria to determine whether or not the email should be placed in the spam folder. Supervised learning is very dependent upon the training data, since that is what it uses to determine the results. Too much training, and your experiment starts to memorize the answers rather than developing a technique to derive solutions from them.

When Supervised Learning Should Be Employed in a Machine Learning Experiment

As the field of data science continues to proliferate, more people are becoming interested in Machine Learning. Having the ability to learn with a free tool like Azure Machine Learning helps too. Like many tools, there are many things you can do, so knowing when you should do something is a big step in the right direction. While unsupervised learning provides a wide canvas for making a decision, creating a successful experiment can take more time, as there are so many concepts to explore. If you have a good set of training data and a limited amount of time to come up with an answer, the better solution is to create a supervised learning experiment. The next step in the plan is to figure out which category the problem uses, a topic I plan to explore in depth in a later post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Creating a Date Dimension with a CTE

I am looking forward to SQL Server 2016, which includes an update to DAX that will generate dates for you, but since it hasn’t been released yet, the need still exists to create a date table. Since I am creating a Tabular Analysis Services instance, an identified date dimension is needed for time functions. There are a number of ways you can create a date dimension; here I am going to describe a neat way of creating one using a recursive CTE, which decreases the SQL code required to generate it.

Date Dimension Fields

There are a number of fields which are pretty standard for date tables. The fields can vary if you need to add things like fiscal years or month-end dates, which have business-specific rules like the last Friday of the month. The most common fields needed for a date dimension are listed here.

  • DateKey – Integer key representing the date, for example 20150524
  • CalendarDate – Date time field
  • CalendarYear – Four digit year
  • QuarterNumber – Number from 1 to 4
  • QuarterName – First Quarter, Second Quarter, 1 Qtr 2015, First Quarter 2015 are some of the ways the name is listed. Pick whichever format people would like to see.
  • MonthNumber – 1 for January, 2 for February, 3 for March etc.
  • NameMonth – January, February, March etc.
  • MonthDayNumber – Day of the month. July 15 would have the value of 15 here
  • CalendarWeekOfMonth – Counts the weeks within each month
  • NameDay – Monday, Tuesday etc.
  • DayNumber – Whether you choose 1 for Sunday or 1 for Monday is a business rule you need to find out first. This example shows 1 for Sunday, 2 for Monday etc.
  • YearDay – Sometimes referred to as the Julian number, this counts the days from 1–365 and can be useful in some calculations
  • YearWeek – Counts the weeks from 1–52

As a best practice, remember that reserved words such as DATE or INDEX, or any other word which shows up as blue or pink in SQL Server Management Studio, should never be used as column names. Always select a name other than a reserved word when creating tables. Also, friendly names containing spaces are great for exposing to users, but they are annoying in SQL Server, so leave the spaces out of the column names. Evaluate what dates you are going to need in the table so that you don’t have to go back and redo it. I am creating a smaller table here, but that is just because it is an example. Look at the dates you will be storing in your data warehouse when determining your starting date, and set the end date to probably about five years longer than you think the data warehouse will still be in use.
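
To see why the reserved word rule matters, here is a quick sketch (the table and column names are made up for illustration) showing how a reserved word as a column name forces brackets on every query that touches it:

/*A reserved word as a column name must be bracketed everywhere it is used*/
CREATE TABLE [dbo].[SalesExample]([Date] smalldatetime NULL)
SELECT [Date] FROM [dbo].[SalesExample]

/*A descriptive, non-reserved name needs no special handling*/
CREATE TABLE [dbo].[SalesExampleBetter]([SaleDate] smalldatetime NULL)
SELECT SaleDate FROM [dbo].[SalesExampleBetter]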

Using a CTE to Generate a Date Dimension

CTEs, which Microsoft added in SQL Server 2005, are a great way to generate a date table by harnessing the power of the database engine to spin through the rows for you, decreasing the amount of code you need to write. I am using a recursive CTE method first published by Itzik Ben-Gan to generate the numbers table used in the code below.

CREATE TABLE [dbo].[DimDate](
[DateKey] int NULL,
[CalendarDate] [smalldatetime] NULL,
[CalendarYear] [int] NULL,
[QuarterNumber] [int] NULL,
[QuarterName] [varchar](14) NULL,
[MonthNumber] [int] NULL,
[NameMonth] [nvarchar](30) NULL,
[MonthDayNumber] [int] NULL,
[CalendarWeekOfMonth] [int] NULL,
[NameDay] [nvarchar](30) NULL,
[DayNumber] [int] NULL,
[YearDay] [int] NULL,
[YearWeek] [int] NULL
) ON [PRIMARY]

GO

/*Make sure you change the start and end dates listed here to the dates you wish to use in your table*/
DECLARE @StartDate smalldatetime = '01/01/2014'
DECLARE @EndDate smalldatetime = '12/31/2016' ; /* don't forget the semi-colon or you will get an error*/

/*This CTE creates a list of numbers used to generate the calendar. Each level
cross joins the previous level to itself, squaring the row count (2, 4, 16, 256,
65536, ~4.3 billion); the WHERE clause below limits how many rows are generated*/
WITH
A00(N) AS (SELECT 1 UNION ALL SELECT 1),
A02(N) AS (SELECT 1 FROM A00 a, A00 b),
A04(N) AS (SELECT 1 FROM A02 a, A02 b),
A08(N) AS (SELECT 1 FROM A04 a, A04 b),
A16(N) AS (SELECT 1 FROM A08 a, A08 b),
A32(N) AS (SELECT 1 FROM A16 a, A16 b),
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM A32),
/*Calendar dates are created here*/
CalendarBase as (
SELECT
DateKey = n
, CalendarDate = DATEADD(day, n - 1, @StartDate )
FROM cteTally
WHERE N <= DATEDIFF(day, @StartDate , @EndDate +1)
)

/*Using the list of dates created above will populate your date table here*/
insert into dbo.DimDate(DateKey
,CalendarDate
,CalendarYear
,QuarterNumber
,QuarterName
,MonthNumber
,NameMonth
,MonthDayNumber
,CalendarWeekOfMonth
,NameDay
,DayNumber
,YearDay
,YearWeek)

SELECT
DateKey       = CONVERT(char(8), CalendarDate, 112)
, CalendarDate
, CalendarYear  = YEAR(CalendarDate)
, QuarterNumber =  (DATEPART(QUARTER,CalendarDate) )
, QuarterName = 'Quarter ' + cast((DATEPART(QUARTER,CalendarDate) ) as char(1)) +' ' + cast(YEAR(CalendarDate) as char(4))
, MonthNumber = MONTH(CalendarDate)
, NameMonth     = DATENAME(Month, CalendarDate)
, MonthDayNumber = DATEPART(DAY, CalendarDate)
, CalendarWeekOfMonth = DATEDIFF(week, DATEADD(day,1, CalendarDate - DAY(CalendarDate) + 1) -1, CalendarDate) +1 /*Weeks between the first day of the month and this date, plus one*/
, NameDay       = DATENAME (Weekday,CalendarDate )
, DayNumber       = DATEPART(Dw, CalendarDate)
, YearDay       = DATEPART(DAYOFYEAR, CalendarDate)
, YearWeek      = DATEPART(WEEK, CalendarDate)
FROM CalendarBase

After running this code you will have a date table created and loaded.
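
To confirm the table loaded as expected, a quick sanity check like this sketch (written against the dbo.DimDate table created above) verifies the row count and spot-checks a few columns:

/*The count should equal the number of days between @StartDate and @EndDate, inclusive*/
SELECT COUNT(*) AS DayCount FROM dbo.DimDate

/*Spot-check a few rows to verify the keys and names line up*/
SELECT TOP 5 DateKey, CalendarDate, QuarterName, NameMonth, NameDay
FROM dbo.DimDate
ORDER BY DateKey
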
Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Configuring Power BI Data Refresh on a Local SQL Server Database

Recently I have been tasked with setting up, and assisting others in setting up, Office 365 Power BI automatic data refresh. If you scroll through this post, you’ll see that there are a lot of detailed steps required to accomplish a successful Power BI data refresh with an on-site version of SQL Server. There are a lot of people who are very frustrated trying to configure automated data refresh, and considering what is involved, I understand why. Follow these steps, and automated data refresh for Power BI Office 365 with a locally hosted SQL Server will work.

TL;DR Steps for Power BI Automated Data Refresh

This is a long document, because clearing up all of the confusion about how to make this work takes a lot of words. If you don’t want to read them all, here are the steps needed to complete a successful data refresh. Some people may be able to figure it out from this summary alone, but I have a feeling that for most, this level of detail will not be enough.

  • Install the Data Management Gateway application on your SQL Server box.
  • Add a gateway in the Power BI Admin Center and configure the Data Management Gateway.
  • Create a connection string and a Power BI workbook which uses that connection string.
  • Create a Data Source in the Power BI Admin Center using the connection string created in the previous step.
  • Schedule the data refresh for the Power BI workbook, using a matching connection string.


Power BI Office 365 Automated Data Refresh Steps

Here, in detail, is how I get automated data refresh to work every time. If you have figured out another way to make an on-premises SQL Server data refresh work without using these steps, please let me know how you did it. I know these steps work every time, but I am hoping that there is an easier way.

  1. Install the Microsoft Data Management Gateway on your SQL Server database server. Here is more information on the Data Management Gateway. Technically, you don’t have to install it here, as Microsoft says it is supported installed in other places, but if you want to make sure it works, install it on your database server, which must have internet access available. Here is the link for the software, and here are the instructions from Microsoft on installing it.


  2. The next step is to configure the gateway for the Power BI site and for your local server. Go to your Power BI for Office 365 site and navigate to the Power BI Admin Center. On the left menu, you will see an option for Gateway. Click on the plus and add a gateway. You will generate a key as part of this process, which you will need to copy and provide to the application installed on the database server. If you have more than one database server, you will need more than one key. Here’s a link to the instructions for creating a gateway in the Power BI Admin Center, which provide more detail. When you are done, look on the Power BI Admin screen; your gateway name will show a green check and the status Running next to it. If you don’t see the green check, wait for about 10 minutes to see if the status changes. If the status does not show as Running, there are two things you might want to do: review the instructions, or look at the Office 365 Admin Center. The Office 365 Admin Center can be accessed via the grid button on the top left-hand side of the screen under the Admin icon. Select the Service Health option on the left-hand side of the screen and open it. I have included a copy of the screen here, which shows that there are no issues today. If you are experiencing issues, this screen may not reflect a problem until some time later, so continue to review it if the steps just don’t work for some reason.
  3. Create a local data connection within Excel to be used to create a Power BI data source. Start a new worksheet in Excel. Click on the Data tab (not the Power Pivot tab) and click on the left-most icon on the ribbon: Get External Data->From Other Sources->From SQL Server. Enter the server name. Either Windows Authentication or SQL Server Authentication works, so pick either one and enter the id and password. Click on the Next button. Select the database and then the tables you want. There is no need to select more than one table, as you can always go back and add tables once this is complete, and it will save loading time. Click on the Next button.
      This is the most important screen of the entire process. The Friendly Name listed here will be used on the Power BI site to create a Data Source. The Friendly Name must exactly match the name of the Data Source in Power BI, which may mean that you will want to edit the name from the default listed here. Do not check “Always attempt to use this file to refresh data”. After you are satisfied with the Friendly Name, click on the Browse… button. In a later step, the connection string listed in this file will be used to create the connection string in Power BI. As we will be opening the file listed here in Notepad in a later step, it is important to remember this file path. Click on the Save button in the File Save window, then click on the Finish button in the Data Connection Wizard. Generally speaking, you will want your database information to be loaded to a Pivot Table Report, so click on the OK button. If you do this, you have the added benefit of having a report to test. When your pivot table is completed, it might be a good idea to go to the Power Pivot tab and take a look at the model. Feel free to modify it as much as you like, but it is not necessary to do so to test the data refresh. If you need to add additional tables, make sure that you do so by using the existing data connection created earlier. Save and close your new Excel workbook.


  4. Using the information from the previous step, we can now create a data source in Power BI. Go back to the Power BI Admin Center on the web, and from the left-hand menu select Data Sources. On that screen, click on the + (plus sign) to add a new data source, and select SQL Server. Enable Cloud Access must be checked. If you want to make items from the database searchable from Power Query, you will need to check Searchable in Power Query and Enable OData Feed. If you select Enable OData Feed, you will be prompted later to select a series of tables to expose for viewing. Click Next to get to the connection info screen. Completing the connection info screen correctly is what will allow you to do a data refresh. The information needed can be found in the *.odc file you created earlier: in Notepad, open up the *.odc file using the path you remembered from the previous step. Look for the text <meta name=Catalog content=. The name listed on the other side of the equal sign MUST be the name you enter in the Name field of the Power BI Connection Info screen. Put a description in if you wish. The dropdown box for the Gateway will list the gateway you created in the previous step, as well as any others you have created previously. Select the appropriate gateway. By default, the Connection Properties option will be selected. Unfortunately, if you use this option you will never be able to perform an automatic data refresh in Power BI. Select Connection String instead. When you do, the connection provider will change to .NET Framework Data Provider OLE DB. This is supposed to happen. For the connection string information, go back to your *.odc file and look for the text <odc:ConnectionString>. Copy everything between that text and the closing </odc:ConnectionString> into the connection string window (a sample of what that string looks like appears after these steps). Wait until the Set Credentials button turns blue. If it does not turn blue, there is something wrong with Power BI, and you will want to look at the Office 365 Admin Service Health site referenced earlier. Click on the Set Credentials button. A little window will show up indicating it is loading, then a pop-up window will appear prompting you for your user id and password. Make sure you test the connection before clicking on the OK button. The last step here is to click the Save button.


  5. Upon the successful completion of the previous steps, the scheduled data refresh can now be configured. Upload the Excel workbook created earlier in step 3. Once it is loaded, click on … (the ellipsis) and select Scheduled Data Refresh. When the screen loads, the first thing you will need to do is click on the button at the top of the screen to turn the feature on, as it defaults to off. All of the connection information included inside your Power BI report will be listed. The connection name(s) listed in the report should exactly match the connection(s) in the Power BI Admin Data Sources. If there is not a match, you will not be able to schedule an automatic data refresh. It is not possible to change the names from this screen, as it lists the connections within the Power BI document; you may need to fix the connections within your Power BI workbook. If the connections are not valid, you will see a status icon and a message about the problem.
      The Configure Refresh Schedule defaults to a daily frequency. Select the time and the time zone for the data refresh. Unfortunately, you can only schedule refreshes for 3 months, so the schedule will need to be updated every 3 months, as there is no way to make it continue in perpetuity. Send Notifications defaults to the email address of the admin. Note that you will be notified of errors, not of completions of the data refresh. Click on the Save and Refresh button to test the data refresh process, which refreshes the report immediately. Once the data refresh schedule is completed, the top of the screen will have two menu items, History and Settings, which you can review at any time.
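
For reference, the connection string copied out of the *.odc file will look something like the line below. The server and database names here are made up for illustration; copy the real values from your own file rather than typing them in:

      Provider=SQLOLEDB.1;Integrated Security=SSPI;Persist Security Info=True;Initial Catalog=AdventureWorksDW;Data Source=MySqlServer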


Data Refresh Conclusion

Looking online, there are a lot of people struggling with this topic, so I hope you find this post helpful. If you think there should be a better way, or you want to tell me there is an easier one, I would love to hear about it. I look forward to hearing what you have to say on this topic.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Upcoming and Recent Events

The PASS organization is a professional organization which sponsors a number of different events in the technical community, and provides a lot of great resources for improving knowledge of all things SQL Server and related technologies. Recently, I was honored to be selected to speak at not one but two events hosted by PASS. The PASS Business Intelligence Virtual Chapter provides training on all things related to Business Intelligence via the web, and I was selected to talk at the last meeting in May. Thank you to all of the people who were able to attend my talk on Top 10 SSIS Tuning Tricks live. If you had to work, no problem: all of the talks hosted by the PASS Business Intelligence Virtual Chapter are recorded and available on www.Youtube.com. The recording of my Top 10 SSIS Tuning Tricks session is available here.

24 Hours of PASS

Periodically PASS provides a 24-hour training session on SQL-related topics, broadcast live to every time zone in the world. As this event is watched by people around the world, it is a real honor to be selected for it. This time the speakers were selected from people who had not yet spoken at the PASS Summit convention, as the theme was Growing Our Community. The theme is just another way the PASS organization is working to improve people’s skills: not only do they provide the opportunity to learn all things data, but they also provide professional development by offering many avenues to practice speaking skills.

Data Analytics with Azure Machine Learning

My abstract on Improving Data Analytics with Azure Machine Learning was selected for the 24 Hours of PASS. As readers of my blog are aware, I have been working with Azure Machine Learning [ML] this year, and I look forward to discussing how to integrate Azure ML into current environments. Data analytics with ML is yet another way to derive meaning from data being collected and stored. I find the application of data analytics fascinating, and hope to show you why if you are able to attend. There are a number of wonderful talks scheduled at this event, so I encourage you to check out the schedule and attend as many as you can. To be sure, I’ll be signing up for a number of sessions as well.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Getting Started with Datazen

Since Microsoft announced they were buying Datazen, I wanted to check out the application. I am interested in Datazen’s feature set, and how it compares to the functionality in Excel with the Power tools and in Power BI Preview. Datazen is comprised of two applications, the Publisher and the Viewer. To create reports, one must install Datazen Publisher, and that application will need to use the Datazen Server to make data connections and provide security. I have included a diagram from Datazen’s website to describe how it is meant to be configured.

The Publisher is meant to use the Datazen Server, which is where the connection to your data is made. All of the client communication with the server is performed using a web app running on IIS. If you have an Enterprise copy of SQL Server with active Software Assurance coverage, you are in luck: you automatically have the ability to install Datazen Server at no additional cost, which you can do here. What if you just want Microsoft to host the Datazen Server for you? Do you still have to install a server? For now, yes, because the Datazen Hosted Service is listed as Coming Soon (scroll down to the bottom of the page if you click the link). You can of course install Datazen on Azure, which will make it cloud-based.

Quick Tips for Installing Datazen

To design a dashboard, you will need to install the Datazen Publisher application. While Datazen visualizations can be viewed on Apple and Android devices, to design those visualizations you will need to use the Windows-only client. And by Windows, they mean Windows 8. While I have a machine with Windows 8.1, it’s not my favorite OS, and I work a lot on my Windows 7 PC, so I won’t be using that PC to create Datazen reports. Datazen now offers a preview version of the Publisher for Windows 7, so I assume market demand pushed them to support it; this version is not a released version but a preview, so you may have some issues with it. Datazen is designed to be an application that runs on phones and tablets, and the software is installed in the same manner phone apps are: you need to go to the Windows Store to download it, rather than clicking on a link to a site, which is why you need Windows 8. If you can’t find the Store, just go to the start screen and search for Store, which is what I had to do. Once in the store, search for Datazen Publisher, which is free.

Running Datazen

After Datazen Publisher has been installed, you will be prompted to connect to a server when you run it. If you don’t have a server, no worries: Datazen provides a demo server for you. The data which is entered by default the first time you run Datazen, as shown below, provides a connection to this server. To connect, just click on the Connect button on the screen. You will be asked if you want to let Publisher run in the background. Generally speaking, click on the Allow button; if you do, you will get updates which have been made to the data sources. I received 31 updates after I installed it.

Datazen Visualizations

Initially you will see KPIs, Top Dashboards and other sample dashboards, sorted by sector, which are provided to show the breadth of what is available. Datazen has two basic categories of visualizations: KPIs and Dashboards.

KPIs are designed to provide tiles of a single metric, optionally compared to a target or range. You have the ability to comment on the results too, which is a novel capability that I have not seen in other products. Dashboards provide typical visualizations which allow for contextual selection interactivity, letting you see different levels of detail on the same screen.

Creating Datazen Dashboards

To create your own, right click anywhere on the screen and a toolbar will appear at the bottom of the screen. Click on the Create button, and you can start either from an empty dashboard or from an existing dashboard. I will be exploring both of these options in an upcoming post to my blog. If you are interested in knowing exactly when that will occur, I recommend subscribing to my blog so that you will be notified.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


SQL Server 2016 and PolyBase

The next release of SQL Server, SQL Server 2016, continues a convention which was employed in previous releases, which I call the Cadillac release system. At General Motors, in the past, new features were first offered on their most luxurious brand, Cadillac, and if those features proved successful, they were rolled out to Buick and the rest of the product lines. Microsoft does the same thing. Their ‘Cadillac’ is the PDW [Parallel Data Warehouse], Microsoft’s data appliance, now known as the APS. One notable example of this release model was the addition of column store indexes to SQL Server. Column store indexes were first available on the PDW, and Microsoft later added them to SQL Server 2012. That same year, at SQL PASS Summit 2012, I heard about a really neat feature available in the PDW: PolyBase. The recording I heard is available here, where Dr. David DeWitt of Microsoft explains PolyBase in great detail. I have been waiting to hear that PolyBase would be released to SQL Server ever since. On May the Fourth, 2015, Microsoft announced the preview release of SQL Server 2016, and listed in the release announcement was the feature I’d been waiting for: PolyBase.

Sqoop Limitations

PolyBase provides the ability to integrate a Hadoop cluster with SQL Server, which will allow you to query the data in a Hadoop cluster from SQL Server. While the Apache environment provides the Sqoop application to integrate Hadoop with other relational databases, it wasn’t really enough. With Sqoop, the data is actually moved from the Hadoop cluster into SQL Server, or the relational database of your choice. This is problematic because you need to know, before you run Sqoop, that you have enough room within your database to hold all the data. I learned this the hard way when I ran out of space playing with Sqoop and SQL Server. From a performance perspective, this kind of data transfer is also, shall we say, far from optimal. Another way to look at Sqoop is that it provides the Hadoop answer to SSIS: after all, Sqoop is performing a data move, just like SSIS code. The caveat is that SSIS is generally faster than Sqoop, and provides a greater feature set too.

PolyBase – Hadoop Integration with SQL Server

Unlike Sqoop, PolyBase does not load data into SQL Server. Instead it provides SQL Server with the ability to query Hadoop while leaving the data in the HDFS clusters. Since Hadoop is schema-on-read, within SQL Server you generate the schema to apply to your data stored in Hadoop. After the table schema is known, PolyBase provides the ability to query data outside of SQL Server from within SQL Server. Using PolyBase it is possible to integrate data from two completely different file systems, providing the freedom to store the data in either place. No longer will people automatically equate retrieving data in Hadoop with MapReduce. With PolyBase, all of the SQL knowledge accumulated by millions of people becomes a useful tool for retrieving valuable information from Hadoop with SQL. This is a very exciting development which I think will encourage more Hadoop adoption and, better yet, integration with existing data. I am really looking forward to SQL Server 2016.
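
Based on what has been shown for the preview, the PolyBase workflow in T-SQL looks something like the sketch below. The cluster address, file layout and table definition are all made-up examples, but the pattern of defining an external data source, a file format and an external table, and then querying it like any other table, is the heart of how PolyBase applies a schema to data left sitting in Hadoop:

/*Point SQL Server at the Hadoop cluster; the address here is hypothetical*/
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.10.10.10:8020')

/*Describe how the files in HDFS are laid out*/
CREATE EXTERNAL FILE FORMAT CommaDelimitedFile
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','))

/*Apply a schema to the files; the data itself stays in Hadoop*/
CREATE EXTERNAL TABLE dbo.SensorReadings (
SensorId int,
ReadingDate datetime2,
Reading decimal(10,2)
)
WITH (LOCATION = '/data/sensors/', DATA_SOURCE = HadoopCluster, FILE_FORMAT = CommaDelimitedFile)

/*Query the Hadoop data with plain T-SQL, just like a local table*/
SELECT SensorId, AVG(Reading) AS AverageReading
FROM dbo.SensorReadings
GROUP BY SensorId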

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Azure Data Lake: Why you might want one

On April 29, 2015, Microsoft announced they were offering a new product, Azure Data Lake. For those of us who know what a data lake is, one might have thought that a new data lake product was perhaps redundant, because Microsoft already supported data lakes with HDInsight and Hadoop. To understand why you might want a separate product, let’s look at what a data lake is. I think the best definition of a data lake that I read recently was here. Here’s the TL;DR version: “A ‘data lake’ is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed.” So here’s the question: one can already spin up an HDInsight Hadoop cluster on Azure and put all of your data there, which means you can already create a data lake. Since you can already create a data lake, why did Microsoft go and create a new product?

Hardware Optimization and the Data Lake

If you look at Microsoft’s most recent Azure releases, you’ll see they are releasing products designed to operate together. Service Bus, Event Hubs, Stream Analytics, Machine Learning and Data Factory are designed to process lots of data, especially a lot of short pieces of data, like vehicle GPS messages or other types of real-time status messages. In the product release for Azure Data Lake, Microsoft highlights its ability to store and, more importantly, retrieve this kind of data. The difference between the HDInsight already on Azure and the Data Lake product is the hardware dedicated to the storage and the integration designed to improve access to the data. Data Factory is designed to move your data in the cloud to anywhere, including a data lake, and if you look at the graphic Microsoft provides to illustrate what Data Factory is designed to integrate, the rest of the outputs listed have products associated with them. Now there is a product associated with the data lake too. Data lakes are designed to store all data, but unlike a database operational data store, data lakes are designed to have the schema applied when the data is read, not when the data is written. This allows for faster writing of the data, but it does tend to make accessing the data slower. The Azure Data Lake hardware, according to the release, is designed to address this issue by providing computing power built for massively parallel processing, so the heavy lifting happens when the data is read and analyzed, not when it is written. This sort of targeted computing power differs from the HDInsight Hadoop offering, which uses a standard hardware model for storage and access. By tailoring the hardware to the needs of the specific type of data stored, in theory this will greatly improve performance, which will increase the adoption of not only the Azure Data Lake, but also the tools to analyze and collect the data. It’s going to be interesting to see how the marketplace responds, as this could really push massive amounts of data to the Azure cloud. Time will tell.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur