Articles for the Month of March 2016

SQL Server 2016 Polybase Setup

After reading my last post you should be all ready to install SQL Server 2016 to be able to use Hadoop.  I went through all of these steps in my webinar, but I thought it might also be a good idea to include them here so you don’t have to watch the video.

Polybase Install Feature

When installing SQL Server in the Feature Selection List shown below,  PolyBase Query Service for External Data must be selected. SQL Server 2016 Polybase

To check to see if Polybase has been successfully installed, go to Control Panel->Administrative Tools->Services.  There are two services added for polybase, SQL Server PolyBase Data Movement and SQL Server Polybase Engine, as well as  a bunch of other new ones for SQL Server 2016. The polybase services and SQL Server will need to be restarted in a later step.  When starting SQL Server Management Studio for SQL Server 2016, it is hard to spot the differences between it and previous versions of SQL Server as even the icon looks exactly the same. Once you create a database for SQL Server, you will notice a slight difference under the tables folder. When polybase is installed there are two new folders, External Tables and External Resources. Data which is accessed via polybase will be stored under the External Files folder and the External Resources will contain the references to the external resources, the HDFS cluster where the data is stored, as well as the file formats of the underlying data.

Configuring Polybase on SQL Server

SQL Server needs to be configured to use polybase by using the sp_configure command

EXEC sp_configure ‚'hadoop connectivity', 5;
GO
RECONFIGURE; 

The number 5 indicates the kind of Hadoop connectivity desired.  If you were using HDInsight’s Azure Blob storage or Hortonworks on Windows, the number listed would be 4.  The number 5 indicates connectivity to Hortonworks Data Platform on Linux.  Go ahead and exit out of SQL Server as you will be restarting it soon.

Configuration File Modification

Open up your favorite text editor to modify the Hadoop.config file.  The default location is

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config

Keep in mind, if you installed SQL Server with a named instance the path will include the name of that instance.  The config file contains a default password for Hadoop.  The password, which is initially set to pdw_user is probably a holdover from pre-2016 days when polybase was only available on Microsoft’s Big Data Appliance, the Analytics Platform System [APS] which was previously called Parallel Data Warehouse [PDW].  Hortonworks’ default password is hue, so you will want to modify the file so that the HadoopUserName has the right password, hue. I circled it below in a clip of the Hortonworks.config file.

Once the changes to the Hadoop.config file are saved, to get polybase to work, the last thing which must be done is to restart the two new polybase services and SQL Server 2016.

Setting up the External Data Source and External Data Files

At this point, we can now tell SQL Server where the Hadoop files are loaded. Open SQL Server Management Studio [SSMS] and create a new query to create the new an external data source, which is going to appear in the new folder, External Data Source.

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
TYPE = HADOOP,
LOCATION = 'hdfs://sandbox.hortonworks.com:8020'
)

After running this command and refreshing, the source HDP2 will appear in the folder External Data Source ->Data Sources

SQL Server needs to know what the underlying data file formats of the data stored in Hadoop. Assuming the data is stored in a tab delimited format, this command will tell SQL Server how to read the tab delimited data and the format of the dates.  This will allow polybase to read a sample file which comes with Hortonworks, the file SAMPLE_07

CREATE EXTERNAL FILE FORMAT TSV
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '\t',
DATE_FORMAT = 'MM/dd/yyyy'
)
)

SQL Server needs a schema to read the data, as it doesn’t know where the fields are without it. The following code creates a table which will appear under the External Tables folder in SSMS as well as load the data by telling it where the data lives and to use the file format which we just added.

CREATE EXTERNAL TABLE Sample_07
(
code nvarchar(255),
description nvarchar(255),
total_emp int,
salary nvarchar(255)
)
WITH
(
LOCATION = '/apps/hive/warehouse/sample_07',
DATA_SOURCE = HDP2,
FILE_FORMAT = TSV,
REJECT_TYPE = value,
REJECT_VALUE=0
)

After this step is complete, you can use the new table to join to data on SQL Server from inside a HDFS cluster.

I hope this helps you get started with using polybase with SQL Server 2016.  Let me know what you think by posting a comment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

DIY – Getting started with Hadoop and Virtual Machines

In my recent presentation on Polybase in SQL Server 2016, I demonstrated how to get started using Hadoop with SQL Server 2016. In this post I will show you everything you can do to run the demonstration yourself. The first step is to create a virtual machine so that you can run a Linux instance for Hadoop. As I know that installing a virtual machine can be intimidating, this post explains what you need to do, and how to fix a problem you may run into when running a virtual machine.

Creating Your Own Virtual Machine

Previously, spinning up a virtual machine meant purchasing software. No more, as there is now an open source application. In the example shown here, the Linux operating system will be installed, you can put any operating system you want on your virtual machine, provided of course you have a license for it. If you don’t feel comfortable installing non-released versions of code like SQL Server 2016, on your pc, a virtual VirtualBoxmachine is a great way to test it out. You will need to provide your own operating system, but there are trial versions you can use for limited periods of time as well. The open source virtual machine Oracle VM Virtual Box is the only open source version of a virtual machine software. You can download it here. This software is needed prior to installing the Hortonworks Sandbox. Obviously Hortonworks is not the only version of Hadoop available, Cloudera has a Hadoop VM too, which you can download as well. Personally I am not a use fan of the Cloudera Manager, which is why I prefer Hortonworks, but either will work with polybase.

Troubleshooting Why the Virtual Machine Won’t Run

If you are using a Windows operating system, chances are it may be running Hyper-V. Hyper-V is Microsoft’s Virtual Machine. For you to be able to use Hyper-V, you will need to have a virtual machine file which is saved in the VHD format so you can load the image in Hyper-V. Hortonworks provides two versions of their sandbox, one in the VM Ware version and one in the Virtual Box format. For this reason, Hyper-V is not an option as a virtual machine for Hortonworks. If you are running Hyper-V, you can’t run another VM.

The boot configuration data store could not be opened. Access is denied

The error message received when Hyper-V is running is pretty cryptic. What does “The boot configuration data store could not be opened. Access is denied” supposed to mean anyway? When wandering around the internets trying to find an answer this question, there is a lot of <sarcasm>wonderful advice</sarcasm> which states that the only thing to do is to replace your computer entirely as the bios won’t support Virtual Machines, even after you check the bios and find out that yes, your bios does support virtualization. Don’t be deterred. The error can be fixed without new hardware. To resolve this error,  Hyper-V needs to be turned completely off. In case you were wondering, stopping the Hyper-V services won’t fix this. Instead Hyper-V must be disabled at the command prompt, then a reboot is required. Make sure you run the command prompt as administrator, because the command won’t run if you don’t. To run the command prompt as administrator, search for CMD. When the result comes back, right click on the cmd selection and select run as administrator. To turn off Hyper-V, run this command

bcdedit /set hypervisorlaunchtype off

Exit out of the command prompt. Remember the error won’t be resolved until after rebooting. After that the reboot, the boot configuration message won’t appear when running the VM, and you can successfully get started running the Hortonworks Sandbox.

TL;DR – Links

If you just want to get started running a VM and polybase and here are the links needed to make that happen.

Oracle Virtual Box

Hortonworks Sandbox

Cloudera Quickstart

SQL Server 2016 RC1

Microsoft Server 2016 Preview

Once the VM running is running with Hadoop, install SQL Server 2016 so that you can follow along on my next post where I talk about how to use polybase.  If you want to be notified when that post will be available, please subscribe to my blog and you will find out automatically.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Introduction to R – The Follow Up

gdiPhxI really enjoyed the opportunity to share R with GDI Phoenix. Since very few people in the room were familiar with the language, it was great to be able to show something new. The audience was great, and provided some wonderful feedback. With Microsoft’s purchase of R in 2015, many of the disadvantages of the open source version of R, including not being able to process huge data sets, speed of running the code and deployment have been resolved with R Server, which will be released with SQL Server 2016. The integration of R in SQL Server will only increase the demand for R skills, providing a great incentive for people to get started to learn the language now.

As promised I have provided links to the items I covered in the talk. Thanks so much for inviting me and I hope to have the opportunity to present again.

All the Links Need to Get Started Learning R

R Language

R Studio (UI)

Visual Studio Community R Tools

Microsoft R Open

Swirl – R Tutorial

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Is a Data Model Needed in Power BI?

Power BI does of course need a data model, but often times there is no reason to create a new one for Power BI. After all the data model could always be contained within the data PowerBIDataModelsource Power BI is using. Figuring out if a data model is required is directly related to the data source(s) being used. In turn the selection of the data source also determines when to us Power BI online client and/or Power BI Desktop. Although they look the same, there are a couple of key differences which govern which to use.

When Should I Use the Power BI Online Client?

When comparing the features of the Online Client with the Desktop version of Power BI, there is one very obvious difference, there is no way to create a data model in Power BI online. It is not possible to create a data model using the online client. The online client is designed to connect to an existing online source such as Sales Force or Azure DB. If you are using an existing model, there is no need to create one. When using the enterprise gateway, which uses an on-premises database such as a SQL Server, SSAS or Hana, the data model is contained within the database exposed via the enterprise gateway, so again no reason exists to create a data model. Report creation can occur either using the online client or desktop as there is compelling technical reason that I am aware of which would determine where the report is created.

When Should I use the Power BI Desktop?

If the reports need to use data mashups, a data model is required to join the disparate data together. For example, if I need to create a report consisting of tables from two different SQL Server databases and an Excel Spreadsheet, relationships linking those tables need to be created. I need a data model, and since it is not possible to create a data model in Power BI Online, I will create my data model in Power BI Desktop. If later on the report data source needs to be changed, this is possible if you create it in Power BI Desktop. Again you have your choice of either creating the reports in Power BI Desktop or in the online client. Creating a data model in Power BI Desktop does not mean that you must to create the reports there too.

Refreshing the Power BI Report Data for the Online Client

The data model used in Power BI dictates how the report is refreshed. If you are using only a cloud based source, you don’t need a gateway of any kind. The data refresh can be automatic–meaning Power BI will do it for you, scheduled, or use a live dataset depending upon the source used. SQL Azure Database, SQL Azure Data Warehouse and Spark on HDInsight connect live. Changes in made in the source will be reflected in Power BI. The data source connected with an enterprise gateway operates the same way. When the source data is updated, Power BI will use the new data in the reports. Please note that the Analysis Services Connector has been deprecated. If you want to directly connect to a Tabular model located on-premises, use the enterprise gateway to make the connection. As I discussed in a previous post, the enterprise gateway uses local Active Directory information to connect, so users will need to have an AD Account with database access to be able to access data in an enterprise gateway. Unlike many web applications, you must set up each user to have access. All users cannot use one account to connect with the enterprise gateway, but you can do that with a personal gateway.

Personal Gateway – How to Refresh the data from Power BI Desktop

If you have created a data model in Power BI Desktop and want to refresh the data, install the personal gateway on a machine within your network, and set the schedule to perform the data refresh for each source within the model. The personal gateway uses the credentials which are entered when it is configured, which means that users who do not have database permissions can look at the data within Power BI updated by the personal gateway.

I hope this clears up some of the confusion surrounding the data models, the Power BI online client and Power BI Desktop. If you have any questions or are interested in other Power BI topics, please ping me on twitter @DesertIsleSql or post a comment as I am always interested in your feedback.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SSIS – Resolving “Failed to Deploy the Project” Messages

Have you seen this screen when trying to deploy?
SSISFailDeploy

Needless to say, this indicates the SSIS package didn’t load to the server. This is especially frustrating as this particular package worked fine in Visual Studio. Believe it or not this screen actually contains information which can be used to resolve the error.

Finding a Useful Error Message

How do you go about determining what to do? Click on the word Failed. This will bring up some useful information, although it may not appear that way at first. Here’s the error message I received

SSISErrMessage

The message Failed to deploy project isn’t very useful, but the rest of the message is. The operation_messages view lives in SSISDB, and the operation identifier number is how to determine what the error is. Run this query, using the number provided in the error message, which in this case is 173

Select * from catalog.operation_messages where operation_id = 173

Here are the results from that query.

operation_message_id operation_id message_time message_type message_source_type message extended_info_id
50719 173 2016-02-29 15:02:08.2478928 -07:00 120 20 Failed to deploy the project. Fix the problems and try again later.:SqlDateTime overflow. Must be between 1/1/1753 12:00:00 AM and 12/31/9999 11:59:59 PM. NULL

 

Now this message is quite useful as it provides information that I can use to fix the issue. This SSIS Project contains a date parameter BackDate, which I had not set. Here’s the parameter.

SSISVSParameters

I set this parameter to a date between 1/1/1753 and 12/31/9999 and deployed the project again. This time, no error.

I hope that you have found this post helpful, especially if you haven’t deployed a package to SSIS in SQL Server 2012 or later.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Help! The Power BI Enterprise Gateway is not Working

Given some of the questions I’ve been getting recently there seems to be a lot of confusion regarding how the Power BI Enterprise Gateway works. The most common scenario seems to be that one person, probably the one who created the gateway, can use it and create reports, but when the reports are shared to others, there is no data, just an error message. Here’s the TL;DR verison. When using the Power BI Enterprise Gateway the Power BI user needs to use the security model of the on premises data source. Sounds simple enough, but in practice things can and do go wrong. Before you bother to read any further, if you can’t for a number of reasons change any Active Directory or Power BI User information, you can stop reading now and just use the Personal Gateway. All of the data access issues described here are specific to the Enterprise Gateway.

What is Required to get a Power BI user to use data from an Enterprise Gateway?

Still reading? Ok, so here’s another description on how the Enterprise Gateway works. When the Enterprise KnockingonDoorPowerBIGateway is installed, it puts in a secure door into the firewall. Power BI is the little man knocking on the door. The gateway asks two questions: Who are you? And What is your Password? The answers to those questions come from Power BI, in the form of your email address the and the password used to login to the online application. If those don’t match what is found in Active Directory, entry is denied. The reports will blank. In other words, AD is saying, I Hear You Knocking, but You can’t Come In, which is also a great song lyric.

Resolving Enterprise Gateway Access Issues

The first thing that you need to do is ensure that the user has an ID on the network containing the data source which has permissions to access the database in the Enterprise Gateway. Second, the password on both the network account and the Power BI account must be the same. If they are not, change one. Lastly you need to check the UPN [User Principal Name] to make sure that the Power BI email address matches what the UPN has in Active Directory. A UPN is sort of a replacement for an account name and does not need to be the same as the account name used to log into the network. As a side note, while you can run a SQL which will show that you are getting an error, it won’t show what the userid and password being sent by Power BI are. The email address in the UPN does not have to match the email set up for the account, and chances are if it is not working, they are not the same. To review the settings, you are going to need access to Active Direct Directory and check the UPN for the user who cannot access the data exposed by Enterprise Gateway. Once these issues have been resolved, the firewall will be opened and users will have the same access to data on Power BI on the internet as they have on the On premises server.

Let me know if you’ve found this post helpful or would like to hear more on this topic.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur