Articles for the Month of June 2015

DIY Machine Learning – Supervised Learning

When I first heard about supervised learning I had a picture in my head of a kindergarten class with a teacher trying to get the small humans to read. And perhaps that isn’t a bad analogy when talking about Machine Learning in general as it is based on the same principles as school, repetition and trial. After that the analogy falls apart though when you get to the specific criteria needed for Supervised Learning. There are two broad categories for types of machine learning which have the binary descriptions of supervised learning, which fall into the binary categories of Supervised and Unsupervised. This means you only have to know the one set of criteria for supervised learning, to determine which type you need.

Training Data

A problem solved with supervised learning will have a well-defined set of variables for its sample data and a known outcome choice. Unsupervised learning has an undefined set of variables as the task is to find the structure from data where it is not apparent nor is the type of outcome known. An example of Supervised learning would be determining if email was spam or not. You have a set of emails, which you can evaluate by examining a set of training data and you can determine using the elements of the email such as recipient, sender, IP, topic, number of recipient, field masking and other criteria to determine whether or not the email should be placed in the spam folder. Supervised learning is very dependent upon the training data to determine a result, as it uses training data to determine the results. Too much training and your experiment starts to memorize the answers, rather than developing a technique to derive solutions from them.

When Supervised Learning Should be employed in a Machine Learning Experiment

As the field of data science continues to proliferate, more people start are becoming interested in Machine Learning. Having the ability to learn with a free tool like Azure Machine Learning helps too. Like many tools while there are many things you can do, so knowing when you should do something is a big step in the right direction. While unsupervised learning provides a wide canvas for making a decision, creating a successful experiment can take more time as there are so many concepts to explore. If you have a good set of test data and a limited amount of time to come up with an answer, the better solution is to create a supervised learning experiment. The next step in the plan is to figure out what category the problem uses, a topic I plan to explore in depth in a later post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Creating a Date Dimension with a CTE

I am looking forward to SQL Server 2016, which includes an update to DAX which will generate dates for you, but since it hasn’t been released yet, the need still exists to create a date Table. Since I am creating a Tabular Analysis Service instance, an identified date dimension is needed for time functions. There are a number of different ways to create a date dimension, here I am going to take advantage of recu There are a number of ways you can create a date dimension, here I am going to describe a neat way of creating one using a recursive CTE, which decreases the SQL code required to generate one.

Date Dimension Fields

There are a number of fields which are pretty standard for date tables. The fields can vary if you need to add things like Fiscal years or Month End dates which have business specific rules like last Friday of the month. The most common fields needed for a Date Dimension are listed here.

  • DateKey – Integer key representing the date, for example 20150524
  • CalendarDate – Date time field
  • CalendarYear – Four digit year
  • QuarterNumber – Number from 1 to 4
  • QuarterName – First Quarter, Second Quarter, 1 Qtr 2015, First Quarter 2015 are some of the ways the name is listed. Pick whichever format people would like to see.
  • MonthNumber – 1 for January, 2 for February, 3 for March etc.
  • NameMonth –   January, February, March etc.
  • MonthDayNumber – July 15 would have the value of 15 here
  • WeekDayNumber – Date of the Month. July 4 would have a 4 here.
  • CalendarWeekofMonth – Counts the weeks within each month
  • NameDay – Monday, Tuesday etc.
  • DayNumber – Whether you choose 1 for Sunday or 1 for Monday is a business rule you need to find out first. This example shows 1 for Sunday, 2 for Monday etc.
  • YearDay – Sometimes referred to as the Julian number this counts the days from 1- 365 and can be useful in some calculations
  • YearWeek – Counts the weeks from 1 -52

As a best practice, remember reserved words such as DATE or Index or any other word which shows up as blue or pink in SQL Server Management Studio, should never be the names of any columns. Always select a name other than a reserved word when creating tables. Also friendly names containing spaces are great for exposing to users, but they are annoying in SQL Server, so leave the spaces out of the column names. Evaluate what dates you are going to be needing in the table so that you don’t have to go back and redo it. I am creating a smaller table here, but that is just because it is an example. Look at the dates you will be storing in your data warehouse when determining your starting dates, and set the end dates for probably about five years longer than you think the data warehouse will still be in use.

Using a CTE to Generate a Date Dimension

CTEs, which Microsoft added in 2005, is a great way to generate a date table by harnessing the power of computer to spin through your code, decreasing the need to write a lot of code. I am using a recursive CTE method here first published by Itzik Ben-Gan to generate the number table being used in the code below.

CREATE TABLE [dbo].[DimDate](
[DateKey] int NULL,
[CalendarDate] [smalldatetime] NULL,
[CalendarYear] [int] NULL,
[QuarterNumber] [int] NULL,
[QuarterName] [varchar](14) NULL,
[MonthNumber] [int] NULL,
[NameMonth] [nvarchar](30) NULL,
[MonthDayNumber] [int] NULL,
[CalendarWeekOfMonth] [int] NULL,
[NameDay] [nvarchar](30) NULL,
[DayNumber] [int] NULL,
[YearDay] [int] NULL,
[YearWeek] [int] NULL
) ON [PRIMARY]

GO

/*Make sure you change the start and end dates listed here to the dates you wish to use in your table*/
DECLARE @StartDate smalldatetime = '01/01/2014'
DECLARE @EndDate smalldatetime = '12/31/2016' ; /* don't forget the semi-colon or you will get an error*/

/*This CTE is used to create a list of numbers used to generate the calendar*/
WITH
A00(N) AS (SELECT 1 UNION ALL SELECT 1),
A02(N) AS (SELECT 1 FROM A00 a, A00 b),
A04(N) AS (SELECT 1 FROM A02 a, A02 b),
A08(N) AS (SELECT 1 FROM A04 a, A04 b),
A16(N) AS (SELECT 1 FROM A08 a, A08 b),
A32(N) AS (SELECT 1 FROM A16 a, A16 b),
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM A32),
/*Calendar dates are created here*/
CalendarBase as (
SELECT
DateKey = n
, CalendarDate = DATEADD(day, n - 1, @StartDate )
FROM cteTally
WHERE N <= DATEDIFF(day, @StartDate , @EndDate +1)
)

/*Using the list of dates created above will populate your date table here*/
insert into dbo.DimDate(DateKey
,CalendarDate
,CalendarYear
,QuarterNumber
,QuarterName
,MonthNumber
,NameMonth
,MonthDayNumber
,CalendarWeekOfMonth
,NameDay
,DayNumber
,YearDay
,YearWeek)

SELECT
DateKey       = CONVERT(char(8), CalendarDate, 112)
, CalendarDate
, CalendarYear  = YEAR(CalendarDate)
, QuarterNumber =  (DATEPART(QUARTER,CalendarDate) )
, QuarterName = 'Quarter ' + cast((DATEPART(QUARTER,CalendarDate) ) as char(1)) +' ' + cast(YEAR(CalendarDate) as char(4))
, MonthNumber = MONTH(CalendarDate)
, NameMonth     = DATENAME(Month, CalendarDate)
, WeekDayNumber   = DATEPART(DAY, CalendarDate)
, CalendarWeekOfMonth = DATEDIFF(week, DATEADD(day,1, CalendarDate - DAY(CalendarDate) + 1) -1, CalendarDate) +1
, NameDay       = DATENAME (Weekday,CalendarDate )
, DayNumber       = DATEPART(Dw, CalendarDate)
, YearDay       = DATEPART(DAYOFYEAR, CalendarDate)
, YearWeek      = DATEPART(WEEK, CalendarDate)
FROM CalendarBase

After running this code you will have a date table created and loaded.
Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Configuring Power BI Data Refresh on a Local SQL Server Database

Recently I have been tasked with setting up and assisting other in setting up Office 365 Power BI Automatic Data Refresh. If you scroll through this post, you’ll see that there are detailed steps required to accomplish a successful Power BI Data Refresh with an on site version of SQL Server. There are a lot of people who are very frustrated trying to configure Automated Data Refresh. Considering what is involved, I understand why. Follow these steps, and Automated Data Refresh for Power BI Office 365 with a SQL Server located locally will work.

TL;DR Steps for Power BI Automated Data Refresh

This is a long document because I wanted to make sure that all confusion regarding how to make this work took a lot of words. If you don’t want to read them all, here are the steps needed to complete a successful data refresh. I have a feeling this level of detail may not be enough, some people may figure it out after reading this far.

  • Install the Data Management Gateway application on your SQL Server box.
  • Add a gateway Power BI Admin Center and configure the Data Management Gateway.
  • Create a connection string and a Power BI workbook which uses the connection string
  • Create a Data Source in Power BI Admin Center using the connection string created in the previous step
  • Schedule Data Refresh using the Power BI Workbook using a matching connection string.

 

Power BI Office 365 Automated Data Refresh Steps

Here in detail is how I can get Automated Data Refresh to work every time. If you have figured out another way to make it work not using these steps for an on-premise SQL Server data refresh, please let me know how you did it. I know these steps work every time, but I am hoping that there is an easier way.

  1. Install the Microsoft Data Management Gateway on your SQL Server Database Server. Here is more information on the Data Management Gateway.Technically, you don’t have to have it installed here as Microsoft says it is supported installed in other places. If you want to make sure it does work, you will install it on your database server, which must have internet access available on the server. Here is the link for the software, and here are the instructions from Microsoft on installing it.

 

  1. The next step is to configure the gateway for the Power BI site and for your local server. Go to your Power BI for Office 365 site, and navigate to the Power BI Admin Center. On the Left Menu, you will see an option for Gateway. Click on the plus and add a gateway.  You will generate a key as part of this process and you will need to copy this key and provide the key to the application installed on the database server. If you have more than one database server, you will need more than one key. Here’s a link to the instructions for creating a gateway in the Power BI Admin Center which provide more detail. When you are done, look on Power BI Admin screen your gateway name will have this Running next to the gateway name. If you don’t see the green check, wait for about 10 minutes to see if the status changes.If the status does not show as Running, there are two things you might want to do: Review the instructions or Look on the Office 365 Admin Center. The Office 365 Admin Center can be accessed via the grid button on the top left hand side of the screen under the Admin icon. Select menu option on the left hand side of the screen for Service Health, open it and select service health. I have included a copy of the screen here, which shows that there are no issues today. If you are experiencing issues, this screen may not reflect a problem until sometime later, so continue to review this if the steps just don’t work for some reason.Office365Admin
    1. Create a local data connection within Excel to be used to create a Power BI data source. Start a new worksheet in Excel. Click on the data tab (not the Power Pivot tab) and click on the left most icon on the ribbon Get External Data->From Other Sources->From SQL Server. Enter the server name. Either Windows Authentication or SQL Server Authentication work, so pick either one and enter the Id and password. Click on the Next Select the database and then the tables you want. There is no need to select more than one table, as you can always go back and add tables once this is complete, and it will save loading time. Click on the Next button.
      DataConnectionWizardThis is the most important screen of the entire process. The Friendly Name listed here will be used on the Power BI site to create a Data Source. The Friendly Name must exactly match the name of the Data Source in Power BI, which may mean that you will want to edit the name from the default listed here. Do not check always attempt to use this file to refresh data. After you are satisfied with the Friendly Name click on the Browse… button.In a later step, the connection string listed in this file will be used to create the connection string in Power BI. As we will be opening the file listed here in Notepad in a later step, it is important to remember this file path.DataConnectionWizardFileSaveClick on the Save button in the File Save window.Click on the Finish button in the Data Connection Wizard Menu.ImportDataGenerally speaking, you will want your database information to be loaded to a Pivot Table Report, so click on the OK button. If you do this, you have the added benefit of having a report to test. When your pivot table is completed, it might be a good idea to now go to the Power Pivot tab and take a look at the model. Feel free to modify it as much as you like, but it is not necessary to do so to test the data refresh. If you need to add additional tables, make sure that you do so by using the existing data connection created earlier. Save and close your new excel workbook.

     

    1. Using the information from the previous step, we can now create a data source in Power BI. Go back to Power BI Admin Center on the Web, and from the left hand menu select Data Sources. On that screen, click on the +(plus sign) to add a new data source, and select SQL Server. Enable Cloud Access must be completed. If you want to make items from the database searchable from Power Query, you will need to check Searchable in Power Query and Enable Odata Feed. If you select Enable Odata Feed you will be prompted later to select a series of tables to expose for viewing Click next to get to the connection info screen.PowerBIConnectionInfoCompleting the connection info screen correctly will allow you to do a data refresh. The information needed to complete the screen can be found in the *.odc file you created earlier. In notepad, open up the *.odc file using path you remembered from the previous step. Look for the text <meta name=Catalog content= The name listed on the other side of the equal sign MUST be the name you list in the name in the Power BI Connection Info Name. Put a description in if you wish. The dropdown box for the Gateway will list the gateway you created in the previous step as well as any others you have created previously. Select the appropriate gateway.By default the Connection Properties option will be selected.ConnectUsing Unfortunately if you use this option you will never be able to perform an Automatic Data Refresh in Power BI. Select connection String. When you do, the connection provider will change to .NET Framework Data Provider OLE DB. This is supposed to happen. For the connection String information, go back to your *.odc file and look for the text <odc:ConnectionString> Copy everything between that text and this </odc:ConnectionString> into the connection string window. Wait until the set credentials button turns blue. If it does not turn blue, there is something wrong with the Power BI and you will want to look at the Office 365 Admin Service Health site referenced earlier.Click on the Set Credentials button. A little window will show up indicating it is loading. A pop up window will appear prompting you to add your user id and password. Make sure you test the connection before clicking on the OK button.The last step here is to click the Save button.

     

    1. Upon the successful completion of the previous steps, Scheduling Data Refresh can now be configured. Upload the excel workbook created earlier in step 3. Once it is loaded, click on … (the ellipse) and select Scheduled Data Refresh. When the AddToFeaturedReportsscreen loads, the first thing you will need to do is click on the button at the top of the screen to turn the feature on as it defaults to off. All of the connection information included inside your Power BI report will be listed. The connection name(s) listed in the report should exactly match the connection(s) in the Power BI Admin Data Connections. If there is not a match, you will not be able to Schedule an automatic data refresh. It is not possible to change the names from this screen, as it lists the connections within the Power BI document. You may need to fix the connections within your Power BI worksheet. If the connections are not valid, you will see this status iconStatusIcon and a message about this status.
      The Configure Refresh Schedule is default selected to a daily frequency.   Select the time and the zone for the data refresh. Unfortunately, you can only schedule refreshes for 3 months, so the schedule will need to be updated every 3 months as there is no way to make it continue in perpetuity. Send Notifications is defaulted to the email address of the admin. Note you will be notified of errors, not of completions of the data refresh. Click on the Save and Refresh button to test the data refresh process, which refreshes the report immediately. Once the data refresh schedule is completed, the top of the screen will have two menu items History and Settings, which you can review at any time.

     

    Data Refresh Conclusion

    If you think that there should be a better way, I would love to hear about it. Looking online, there are a lot of people struggling with this topic. If in any way you find it helpful, or you want to tell me there is an easier way, I would love to hear about it. I look forward to hearing what you have to say about this topic.

    Yours Always

    Ginger Grant

    Data aficionado et SQL Raconteur