Articles for the Month of July 2016

Data Factory – Getting Started

For those unfamiliar with Azure Data Factory, it is the Azure service which moves data and calls processes to work on that data. In a lot of ways this makes Data Factory a lot like SSIS. In addition to moving data around, Data Factory also contains the kind of monitoring tools you would find in a SQL Server Agent job. Add those things together and you get a pretty good picture of what Data Factory does.

Setting up Data Factory in Azure

Data Factory is a resource which lives in a Resource Group in Azure. To add it to the list of available resources, click on the plus button in the Azure portal and search for Data Factory. Select the icon and a window will appear; click on the Create button at the bottom of the screen to enter the setup information. First pick a name for the service, then select the appropriate subscription from the dropdown box. Like everything else in Azure, the service lives within a resource group; either pick an existing one to use or create a new one. Generally speaking I create a new one, so that if for any reason I later want to destroy what I have done, I can just delete the resource group.

Pick the location based on two factors. Azure Data Factory is not available everywhere, so you are limited to the regions where it is offered; if you pick one where it isn't available, you will get an error message letting you know why the resource cannot be created. Also, whenever possible pick the same region where your data lives. Azure charges you to move data across regions and nothing to keep it within one, so you may want to check where the data you plan to use in Data Factory lives before deciding where to put it.

I always check the Pin to Dashboard option so that I can find the resource later, but it is not required and can be done afterwards. Click on the Create button to create the Data Factory resource. If you selected Pin to Dashboard you will see a little window which says Deploying Data Factory. This window goes away once the deployment is complete, and Data Factory will appear in your list of resources.

Data Factory Tiles

Clicking on the Data Factory resource icon will bring up the following window. Initially the only button you will care about is the one on the far left, Author and Deploy. This is where you will create the Data Factory components. The authoring is all done in JSON. It is not the most intuitive language in the world and is very picky about brackets. There is some IntelliSense, and when you click on a bracket it will highlight the bracket it lines up with, which is somewhat helpful. In my experience JSON can be frustrating, as it is rather picky, and I don't think my experience is unique, because Microsoft has created two tiles which write the JSON components for you: Copy Data and Sample Pipelines. The Copy Data function has a wizard which walks through gathering the information required to write the JSON. I have had mixed success getting it to work; sometimes it did, and sometimes I got an error. It is in preview, so I imagine over time it will improve. Sample Pipelines has a few samples of tasks you might want to complete which can be deployed to your environment.

Diagram view is very important, as it opens a window on the right-hand side of the screen showing you a picture of what has been created in JSON. Monitor and Manage brings up an entirely new tab, which is welcomed by those of us who are tired of sliding to the right to see new windows; this screen shows a diagram and what has run. Metrics and Operations opens a window showing what has run.
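
To give a flavor of what the Author and Deploy blade expects, here is a rough sketch of a linked service definition written in JSON. The names and connection details are invented for illustration, and the exact shape may differ slightly from what the wizard generates for you, so treat it as an example of the format rather than something to deploy as-is.

{
    "name": "ExampleAzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:yourserver.database.windows.net,1433;Database=yourdatabase;User ID=youruser;Password=yourpassword;Encrypt=True"
        }
    }
}

Every dataset, pipeline, and linked service you author is a document roughly like this one, which is why the bracket matching in the editor matters so much.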

I hope you found this post handy for creating an Azure Data Factory resource. In a later post I will discuss what you can do with Azure Data Factory and provide more detail around the Contents section of this screen and discuss Datasets, Pipelines and Linked Services. Please subscribe to my blog to be notified of my latest posts.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


On-Premises Data Gateway – the Replacement for Power BI Enterprise and Personal Gateway

With the announcement on July 6, Microsoft has consolidated the gateways used in Power BI. The new On-Premises Data Gateway not only replaces both of those applications, but also replaces the data gateway used to access on-premises data for other applications, such as Azure Machine Learning. This reminds me of Lord of the Rings: there is now One Gateway to bring them all, and One Gateway to bind them.

Replacing the Old Power BI Gateways with the New On-Premises Data Gateway

When installing the new gateway, the place to look for the installer is not some Microsoft download page, but Power BI itself. The new gateway needs to be installed from Power BI rather than a generic location; since the gateway is providing data for your account, installing it this way records the gateway being used on that account. To download it, log in to the Power BI Service, click on the download menu, and select the Power BI Gateway. This installer truly is the one gateway to rule them all, as it contains the ability to install either the On-Premises Data Gateway, which is the direct replacement for the Enterprise Gateway, or the Power BI Personal Gateway. You can install both gateways, and both can run on the same device, but you have to pick one to install first. Once the software is installed, you can convert your existing gateways to the new gateway, so there is no need to enter the configuration twice.

New Gateway, New Security Model

While the Personal Gateway works in a similar fashion as before, using a single user's credentials, the new gateway uses a different security model than was employed in the past. Instead of using the credentials of the ID which installed the software, it uses a new service account, NT SERVICE\PBIEgwService. This account is for the gateway only; it does not need to be granted any data access. The application creates a connection to the Azure Service Bus, which acts as the clearing house through which all data requests, whether from Power BI or other applications, pass. The gateway is where the credentials for accessing the data are applied. This security model is used not only for Power BI, but also for other applications like Azure Machine Learning where there is a need to get data from an on-premises source and load it to the cloud. Previously it was not possible to install the gateway used for Power BI on the same machine as the other data gateway used for Azure Machine Learning. Now that they use the same gateway, this restriction is removed.
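
Because NT SERVICE\PBIEgwService never touches the data itself, whatever credentials you store for a data source are what actually need rights. As a rough, hedged sketch with made-up names, a dedicated read-only SQL login to use as a gateway data source credential might be set up something like this:

-- Hypothetical example: a dedicated login to store as the gateway data source credential.
-- The NT SERVICE\PBIEgwService service account itself is granted nothing in the database.
CREATE LOGIN PowerBIReader WITH PASSWORD = 'ReplaceWithAStrongPassword!1';
GO
USE SalesReporting;   -- assumed database name
GO
CREATE USER PowerBIReader FOR LOGIN PowerBIReader;
ALTER ROLE db_datareader ADD MEMBER PowerBIReader;
GO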

Functional Differences in the New Gateway and the Previous Gateways

The underlying security model and application are different, but operationally the new gateway does not behave differently from the Enterprise and Personal Gateways it replaces; there is just one new application to do what they did. If you want to use the data access rights of a single specific user to refresh and update data, the Personal Gateway will do that. The Enterprise model provides the ability to inherit the security of the user accessing the data: if you have a tabular model stored locally which is used as a data source for Power BI, users will see only the data they were granted access to within the tabular model. The scheduled refresh features have not changed and are still accessed from the same place in the Power BI Service. Unfortunately, if you are interested in connecting to Azure Machine Learning or Data Factory, you will still need a different gateway, and that gateway cannot be installed on the same PC as a Power BI gateway. I will have more information on that gateway in a later post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Why Developers Should Not Deploy Their Own Code

Code is a very expensive business asset, and it needs to be treated that way. The code needs to be stored in a source control application which is in a secured, well-known location, and the process to release it to production needs to be documented and understood. Like backups, it's important to ensure that the code in source control can be modified and installed before a crisis arises and there is a time crunch to fix a huge production issue. To ensure that the code stored in source control really is the expensive business asset that is counted on to make the business operate every day, the code needs to be deployed by someone other than the person who wrote it.

Save Money by Validating Code in Source Control

When I first worked in a location where another team deployed the code, I thought it was pointless bureaucracy which did nothing but slow down progress. Watching the problems caused when simple processes went wrong changed my mind. Checking code in and out of source control is a simple process, whether you are using an open source application like Subversion or a full-blown TFS server. But if no one checks that the code in source control is the code which is deployed, all sorts of bad things can and do happen. As the poor slob who came in when everything was a mess, getting stuck figuring out some old code was made even worse when I found out that the code in source control was not the code in production, which lived in an area I didn't have access to view. Unfortunately for me, I did not discover this until after I had changed what I thought was the released code. Writing the code twice and/or going on a hunt for the right version became a necessary part of the process, adding needless hours to an already complicated task. If only the code in production had been deployed from source control, this mess would have been avoided.

Improving Code Quality

All sorts of things can happen when one person both writes and deploys the code. I know someone who worked in the IT department of a large cell phone company. At the time, working there meant free phone service. One of the devs was a heavy user of the free phone service, and so was his large extended family. His job was to maintain the billing code. After several questionable incidents at work, HR got involved and he was perp-walked out of the building. Due to the circumstances surrounding his departure, his cell phone accounts were checked to ensure that from this point on he would get a bill. Although his account showed a number of active phones, his balance was always zero. The code in source control was checked, and there was nothing in it which provided a reason why his bill was zero. Upon further investigation, my friend noticed the version number in production did not match the version number in source control. The code in source control was compiled and deployed, and a huge balance appeared for the former employee. If someone else had deployed the code from source control, this chicanery would not have been possible.

Code Deployment Needs to be a Well-Understood Process

Today in many companies, the code may exist a lot longer than the employment of the person who wrote it. Given the life of the code, there need to be well-established, obvious processes to deploy it. Recently I heard from someone about a SQL Server 2012 SSIS project which used package deployment instead of project deployment because only some of the SSIS packages are deployed to production. The packages are installed in many different locations, and they all exist in one project. This project organization turns a simple one-button deployment task into an involved process requiring copiously maintained documentation to ensure that everyone involved knows what to do and where to deploy which code. Most ETL code runs at night, and often that means a person on call is woken up to fix it. That tired person's complicated job is compounded when code deployment moves from a straightforward, one-button deploy process to a byzantine one determined by copious documentation. I can see many potential errors which would all be avoided if the organization was changed from one SSIS project containing everything to projects containing logically grouped packages, each created and deployed as a project to a folder in an Integration Services Catalog, as sketched below. If the person who developed this project had to explain and document the process they were using to another person who was doing the deployment, chances are this kind of project organization would be exposed like a Sooky Non-Sparkly Vampire to sunlight, and would be burned to ash.
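
For a sense of what a scripted, repeatable project deployment can look like, here is a rough sketch using the Integration Services Catalog stored procedures. The folder, project, and file names are made up for illustration, so adjust them to your own release process.

-- Hypothetical sketch: deploy an .ispac file into the Integration Services Catalog
-- so the deployment step is scripted and repeatable rather than tribal knowledge.
DECLARE @ProjectBinary varbinary(max) =
    (SELECT BulkColumn FROM OPENROWSET(BULK 'C:\Releases\NightlyETL.ispac', SINGLE_BLOB) AS ispac);
DECLARE @OperationId bigint;

EXEC SSISDB.catalog.create_folder @folder_name = N'NightlyETL';   -- run once; skip if the folder already exists

EXEC SSISDB.catalog.deploy_project
     @folder_name    = N'NightlyETL',
     @project_name   = N'NightlyETL',
     @project_stream = @ProjectBinary,
     @operation_id   = @OperationId OUTPUT;

A script like this can be run by the person doing the deployment without any special knowledge of which package goes where, which is exactly the point.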

Ensuring that the code is in source control and can be modified and moved to production is an important part of maintaining it. That code can be a stored proc or a web service; what it is isn't important, securing it is. Having someone other than the developer deploy the code to production ensures that this valuable asset is truly protected and can live on as long as the company needs it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

T-SQL Tuesday #80 – SSIS Projects, Packages, and Deployments

http://michaeljswart.com/2016/06/t-sql-tuesday-079-its-2016/comment-page-1/#comment-186750

Recently I was talking to someone who is just getting started with SSIS and wanted to know about the package deployment model. I hadn't thought about package deployment in a while. TL;DR: do not use package deployment for SSIS. Why? Because SSIS code deployed in a project model is easier to maintain and migrate. Maintaining a large number of individual packages can be painful, something I know all too well.

Making Maintenance Difficult One Package at a Time

Prior to SQL Server 2012, there was no project deployment. SSIS code was all deployed as packages, which could be stored within msdb or stored in and run from the file system. In disorganized places like the one where I worked, they were deployed both ways. Assuming nothing much has changed since I left, they have every version of SQL Server released prior to the day the new IT Director started in 2012: DTS on SQL Server 7.0 and 2000, and SSIS on 2005, 2008, and 2008 R2. No reason to upgrade anything which still worked was their motto. When space was a problem, one could always go build another server. I think the LAN administrator was happiest when he could justify building a new server, as he could spend hours shopping for parts on the internet and assembling the latest machine.

I was given the task of supporting all of the SSIS code, which of course broke periodically. There were 300 packages on a myriad of different servers, all named package1, and deployed every way possible. As a bonus, one could not trust the open source version control application to have the latest code unless the package was one I had worked on previously. The hunt for the code and the config file used to drive it, and getting access to wherever it was stored, was just part of the maintenance process. If one package called another package, then both packages needed to be found and reviewed. Each of those packages would have different ways of connecting to the same database too. My favorite was when one package called another package and they each used different IDs to access the exact same database. Ah, the joys of troubleshooting SSIS packages. It was on the top-five list of reasons I was very happy when I quit that job.

SSIS Projects

If you write SSIS code and use a project deployment model, you can create one data connection for all the packages which need it. The code is deployed to one place, the Integration Services Catalog, and all the related code is deployed to one folder. If you need to change a connection which all the packages use, you can do it in one place. You want to pass some parameters for all of the packages to use? No problem. SSIS project deployment offers some great advantages, and I cannot think of any reason to use package deployment for SSIS 2012 and beyond. If you are writing SSIS code, project deployment is how you want to deploy it.
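
To illustrate the change-it-in-one-place point, here is a hedged sketch of updating a shared connection string through a project-level parameter in the catalog. The folder, project, and parameter names are invented for the example.

-- Hypothetical sketch: change a project-level connection string once in the catalog
-- and every package in the project picks it up, instead of editing each package.
EXEC SSISDB.catalog.set_object_parameter_value
     @object_type     = 20,      -- 20 = project-level parameter
     @folder_name     = N'ETL',
     @project_name    = N'WarehouseLoad',
     @parameter_name  = N'CM.SalesDB.ConnectionString',    -- assumed connection manager parameter name
     @parameter_value = N'Data Source=PRODSQL01;Initial Catalog=Sales;Integrated Security=SSPI;';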


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Views Calling Views – A Bad Idea

Recently I talked to someone about a view calling another view, a practice known as nesting views. TL;DR: do not write nested views. To give this person a little credit, he had heard it was a bad idea, but he had a really good reason. There is no such thing. There is also no reason to take my word for it; here are a number of other very good posts which detail why nested views are bad from a performance perspective.

7 sins against TSQL Performance

Nested Views Causing Query Performance Woes

Poor Little Misunderstood Views

Proving Views Can Hurt Performance

T-SQL Anti-Patterns: Nested Views

I understand that on the internet, no matter how cockamamie your idea is, you can find someone or some post which will agree with you. That is why I rely on sources I know and trust rather than the random chatter of places like Stack Overflow.

Non-Performance Reasons Not to Use Nested Views

In addition to the performance reasons, there are other reasons not to use nested views. Supporting and maintaining them can be a nightmare. If there is an issue with the accuracy of the data, finding the problem is just that much harder, and when someone needs to investigate or fix a problem, they have to test all of the places the view is called. I worked with some code recently where a view was created on a set of data just to add three new values based upon three CASE statements. That view was nested many levels below where the fields were actually used. I know people from a development background look at nested views as modular development. Don't. T-SQL is not the same as object-oriented code.
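
To make the anti-pattern concrete, here is a hedged, made-up illustration of the sort of thing I mean; the object and column names are invented.

-- Hypothetical illustration of the anti-pattern: a view whose only job is to add
-- some CASE logic, built on top of another view rather than on the table itself.
CREATE VIEW dbo.vOrdersWithBuckets
AS
SELECT SalesOrderID,
       TotalDue,
       CASE WHEN TotalDue >= 10000 THEN 'Large'
            WHEN TotalDue >= 1000  THEN 'Medium'
            ELSE 'Small'
       END AS OrderSizeBucket
FROM   dbo.vOrderTotals;      -- assumed to be yet another view, which is the problem
GO
-- Other views and reports then stack on top of this one, and the chain keeps growing.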

Good Use Cases for Creating Views

Views are not always a bad idea. If you are creating an Analysis Services or Power BI project, it is a good idea to create views of the tables you are going to be using, even if there is no difference between the view and the table. Why? That way, if the table is changed, your project will still work. Exposing a single view to users who are going to run self-service BI against it can also be a good idea. None of these is a reason to nest views or to base a view on anything other than a table.
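
As a quick, hedged example of that kind of thin abstraction view, with made-up table and column names:

-- Hypothetical example: a view that simply mirrors the table so an Analysis Services
-- or Power BI model keeps working if the underlying table is altered later.
CREATE VIEW dbo.vSales
AS
SELECT SalesOrderID,
       OrderDate,
       CustomerID,
       TotalDue
FROM   dbo.Sales;    -- assumed table and column names; note the view reads a table, not another view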

If All You Have is a Hammer, Everything Looks like a Nail

It takes a while to learn how to write good SQL, and I have seen people rely on a view simply because writing the query properly takes longer. Relying on a view that way is like having one tool in the toolbox: if you don't have a screwdriver, you may be tempted to use a hammer instead. Too often nested views are written because writing SQL is sometimes hard, and it is easier to just write a query and save it as a view than to use a CTE or a derived table. Do not succumb to the temptation. There is a lot of code out there written by someone who was in a hurry and planned to go back and fix it someday; someday appears with about the same frequency as a lottery win. Make it easier on the person who has to figure out or maintain the code left in your wake. Don't write nested views. Your replacement will thank you for it.
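
If you need the intermediate logic for only one query, a CTE keeps it local to that query instead of burying it in a stack of views. A rough sketch, with invented table and column names:

-- Hypothetical sketch: the CASE logic lives in a CTE inside the one query that needs it,
-- rather than in a view that other views are then stacked on top of.
WITH SalesWithBuckets AS
(
    SELECT SalesOrderID,
           TotalDue,
           CASE WHEN TotalDue >= 10000 THEN 'Large'
                WHEN TotalDue >= 1000  THEN 'Medium'
                ELSE 'Small'
           END AS OrderSizeBucket
    FROM   dbo.Sales               -- assumed table
)
SELECT OrderSizeBucket,
       COUNT(*)      AS Orders,
       SUM(TotalDue) AS Revenue
FROM   SalesWithBuckets
GROUP  BY OrderSizeBucket;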

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur