Exploring Azure Synapse

In the past few months, I have been examining Azure Synapse and what it can do.  When it was first released in November of 2019, the first functionality that was rolled out was an update of Azure SQL DW.  For this reason, many people think that Synapse is just an improved version of a cloud data warehouse.  Microsoft did improve SQL DW when it moved it to Synapse.  The biggest architectural design change is the separation of the code from the compute, a theme with many web projects, which allows the compute power to be increased when need dictates and scaled down when computing needs change.  Within Synapse, resources are allocated as Pools and you can define a sql pools to run data warehouse and later change the compute to a different resource.  You will still need to partition your DW as large datasets require partitioning to perform well.  Subsequently Microsoft Released the Azure Synapse Studio to be a container for a larger environment of tools and notebooks to interact with them.

Non-Data Warehouse Elements of Azure Synapse

To me the more interesting parts about Azure Synapse have nothing to do with data warehouses.  Azure Synapse also contains the ability to query files stored in Azure Data Lake Gen 2 as if they were SQL files. This is a great way to analyze large data without first cleaning it up and putting it into a relational environment. Within Synapse you can formulate a query using syntax for selecting parts of files, providing the ability to look at many files as if they were one. You can also create processes which bring data into your synapse environment using Orchestration, a process that people who are familiar with Azure Data Factory will find very familiar. Synapse also contains the ability to analyze data in Cosmos DB without doing ETL or moving the data at all using a scalable architecture which will not impact the transactions being processed simultaneously on the same Cosmos DB.

Azure Synapse and Spark

Check out the promotional video here

By far the most interesting component of Azure Synapse is the Spark connection. Microsoft has added the ability to create Spark Pools into Azure Synapse.  To be honest I was somewhat surprised that this functionality is included here first and not in Azure Machine Learning, where to use Spark you need to access clusters created them in Databricks.  Spark provides the ability to dynamically scale resources when running processes.  This is very handy when writing machine learning code which can really use the performance improvements Spark brings.  Because this is Microsoft’s Spark, you can also write your code to access it in .Net if you like, in addition to the more common Spark Languages, Scala, R or Python.  You can also incorporate the AutoML API created for Azure Machine learning in R and Python so that you can use the power of Azure to select your algorithm and hyperparameters instead of spending time doing it yourself.

Getting up to Speed with Synapse

There is a lot to learn when it comes to Synapse as it combines a lot of different components into one environment. As more and more data is being migrated to the cloud, it is uniquely designed to handle both big data components containing raw data, managed data lakes as well as more traditional data warehouse needs.  It can also be the location where all of the data is processed, secured, cleaned and analyzed using Machine Learning. There is a lot to cover and since it is new, there is not a lot of places yet where you can learn more about it.  If you are interested in a deep dive on Azure Synapse and how to use it in a Modern Data Warehouse, sign up for my precon at PASS Summit 2020 where I will cover the topic in depth.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Configuring Databricks for Koalas

koala getting hlep from a Firefighter

The Apache Spark open source organization maintains all of the documentation for Apache Spark, which is a set of APIs which are used in Databricks and other big data processing applications.  The documentation provides the detailed information about the libraries, but the instructions for loading libraries in Databricks are not exactly the same as are used in Databricks, so if you follow the Spark installation instructions, you will get nowhere. If you follow the steps listed you will be up and running in no time.

Installing Options – Cluster or Notebook ?

If you are not using a ML workspace you can add in using dbutils like this.
dbutils.library.installPyPI("koalas")
dbutils.library.restartPython()

Unfortunately if you are using an ML workspace, this will not work and you will get the error message org.apache.spark.SparkException: Library utilities are not available on Databricks Runtime for Machine Learning. The Koalas github documentation  says “In the future, we will package Koalas out-of-the-box in both the regular Databricks Runtime and Databricks Runtime for Machine Learning”.  What this means is if you want to use it now

Most of the time I want to install on the whole cluster as I segment libraries by cluster.  This way if I want those libraries I just connect to the cluster that has them. Now the easiest way to install a library is to open up a running Databricks cluster (start it if it is not running) then go to the Libraries tab at the top of the screen. My cluster is called Yucca, and you can see that it is running because the circle next to the name is green.

After you are on the Libraries table you will see two buttons.  Click on the one labeled Install New.  A window will appear.  Select the library source of PYPI and in the Package text box enter the word koalas.  Then click on the install button.

Install Databricks LibraryThe installation may take a few minutes.  When it is complete you will see a green  status circle and the word installed.

After this you are ready to use the new library, once you import it as shown here.

 

Why do I want to install Koalas in Databricks?

If you have written Python code for Machine Learning, chances are you are using Pandas. Pandas dataframes are practically the standard for manipulating the data in Python.  They are not however part of the Spark API.  While you can move your Python code over to Databricks without making any changes to it, that is not advisable.  Databricks is able to scale pandas, so adding more resources to your code may not improve the performance.  When writing Python code for Databricks you need to use the Spark APIs in order to ensure that your code can scale and will perform optimally.   Prior to April of 2019, that meant that you had to use Spark dataframes and not pandas dataframes, which could involve a bit of rework when porting code as much code was written in pandas.  In April of last year Koalas was added to Spark, meaning that changing code to use a pandas dataframe to a koalas dataframe means that you only have to change one word. Koalas contains all of the functionalities of a pandas dataframe, so if you are familiar with one you can use the other.

More About Koalas

koala getting hlep from a Firefighter

Photo Credit:New York Fox 5

It is impossible for me to load the library without thinking about the Australian Bush Fires which are burning the homes of people and Koalas.  If your finances allow it, please consider donating to the firefighters as I am sure they can use help to save the homes of people and animals in Australia.

 

Regards,

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Introduction to Databricks

As I have been doing some work on Databricks, I thought that it would make sense that I start writing about it. Databricks is a scalable environment used to run R, Python and Scala code in the cloud. It currently can be run in either AWS or Microsoft’s Azure Cloud. For those of you who are budget-minded when it comes to learning new tools, there is also a free tier, which is available here Community.cloud.databricks.com . It has somewhat limited compute capacity, but if you are just starting out you might find it helpful.

Backstory

Databricks is an implementation of Apache Spark, part of the Hadoop collective and was a replacement for Map Reduce.  Many of the people who worked on that open source project were students at Berkeley, where Apache Spark was created and added to Apache in 2013.  Like many development projects, after it was completed, they had some ideas on how to improve the code.  This time they decided to not make it open source but make it a commercial product so they could make some money for their development efforts. In April of 2017 Databricks was released on AWS and in March 2018 it was released in Azure.

Creating an Azure Databricks Service

Creating a Databricks Service is very straight-forward.  There are only a few things that you need to complete when creating a new Databricks instance. The location becomes very important if you are looking at higher level performing instances which may not be available in all locations. If you are just getting started don’t worry about high level hosting services as you most likely will not need them and most of the compute options are available in most data centers.  As always in Azure you want to make sure that you are hosting your Databricks service in the same location as your data so you will not need to pay to transfer data between data centers.

The Pricing Tier contains three options: Standard, Premium and Trial(Premium 14 Days).  The trial is pretty self-explanatory and is a great way to get started using Databricks. They are of course a few differences between Standard and Premium. Premium has extra features needed for teams including Role-Based rights for the components of Databricks.  And if you want ODBC authentication and Audit logs you will need to use Premium.  For more information on the cost of Databricks pricing tiers, check out Microsoft’s pricing link for more information.

Once you have an instance created, you can start using Databricks.  The application is contained within a managed instance, so once you launch Databricks you will be in their environment, which looks the same as the free edition.

 

Clusters, Notebooks and Data

These three components are the most important parts of Databricks as they include the compute power, where you write code and the information you work with respectively. These components are all separated in Databricks to improve scaling and provide a familiar environment to create and run code.

Cluster

The most important Databrick element, as it contains the compute.  This is also the part of Databricks which will greatly increase your bill as the more resources you use to run code the more money you need to run it.  One nice thing is clusters by default will terminate in 120 minutes of inactivity.  I generally drop this to 20 minutes.  If I am using it naturally it will not terminate, but if I am not using it, I want the charges to stop. You can also automatically spin up clusters to run jobs, so that they will only be in use when the job needs them. More about that in another post.

Notebooks

Databricks Notebook Import

Databricks Notebook Import

There are 3 supported languages in Databricks, R, Scala and Python, and within Databricks all of these languages are written in Notebooks. You don’t have to write your code in the environment.  You can write it locally and then import it. However, if you want to export your Notebook then run it locally, it gets trickier. Natively all of the Notebooks in Databricks are saved as .dbc files.  You can’t read them from anywhere else.  Fortunately there is a workaround to format the Notebook files as .ipynb files which can be read by any notebook.  Dave Wentzel from Microsoft has an elegant solution to convert .dbc to .ipynb which he includes in his blog here.

Data

You have a lot of options with data.  You can import a dataset into your environment to play with or you can connect to just about anything you can think of.  When you start doing data connections is when you stop using the community edition as you will want to use the Azure version to this to connect to various data resources like Azure SQL and blob storage.  More on how to that later in an upcoming post.

If you are interested in hearing more about Databricks and are in Chicago, I am teaching an all day class as part of SQL Saturday Chicago and would love to have you attend. More information on that class is here.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Reviewing Azure Databricks and Data Lake Analytics

Databricks is a recent addition to Azure that is greatly influencing the technology choices that people are making when determining how to process data.  Prior to the introduction of Databricks to Azure in March of 2018, if you had a lot of unstructured data which was stored in HDFS clusters, and wanted to analyze it in a scalable fashion, the choice was Data Lake and using USQL with Data Lake Analytics.  With the introduction of Databricks, there is now a choice for analysis between Data Lake Analytics and Databricks for analyzing data.

Analyzing Data with Data Lake Analytics

Data Lake Analytics offers many of the same features as Databricks.  You can write code to analyze data and the analysis can be automatically parallelized to scale.  Microsoft has released a new version of Data Lake, which they are calling Data Lake Storage Gen2 to improve the performance of analysis performed with Data Lakes.  The difference, between the old version and the new one, is the hierarchical namespace to Azure Blob Storage which provides an indexing capability which means that operations can be performed on a directory rather than enumerating through all of the data.  Data stored within a Data Lake can be accessed just like HDFS and Microsoft has provided a new driver for accessing data in a Data Lake which can be used with SQL Data Warehouse, HDinsight and Databricks.  With Data Lake Analytics, the data analysis is designed to be performed in U-SQL. While it supports R and Python libraries, users of the technology will need to get up to speed on U-SQL which is a lot like C#.  This knowledge needs to be learned. Since U-SQL is so new, only a few years old, there is not a large number of people who are familiar with it.

Analyzing Data with Databricks

When analyzing data with Databricks, there are three different languages which you can use: R, Scala, and Python.  Data can be read in from a variety of different Azure Storage options, including Blob Storage, Data Lake, and by using a JDBC connection. You can also connect to Azure SQL DB, as well as Azure SQL Data Warehouse. Since there are three different languages which can be used, there is no reason to learn a new language as most people are already very familiar with at least one of the three supported languages.

In addition to the ability to develop code, Databricks offers some other features which are not found in Data Lake Analytics.  Many projects anticipate that people are going to be working in teams and will need to have an environment to share code and version it.  This capability is baked into Azure Databricks as it provides an environment for sharing data with others and natively saving the data to a GitHub repository.  The development environment is Jupyter Notebooks which provides a great way to document the code and include data samples, all at the same time.  Databricks also includes a job schedule component so that work created in Databricks can use a native scheduler which has the ability to retry and send configurable messages on error or completion.  These additional features, plus the ability to code in a language which is already widely used in the industry, give Databricks the edge in determining which technology to use going forward.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Preparing for Exam 70-774 – Perform Cloud Data Science with Azure Machine Learning

There are a number of reasons why you might want to take a Microsoft cert exam. Maybe you want to focus your studies on a tangible thing, or you think it will help further your career, or you work for a Microsoft Partner and they required a certain number of people to pass the exam to maintain their current partner status.  I am not going to get into the long argument regarding whether or not a cert will help you in your career, or not, I can tell you why you might want to take the 70-774 exam. Machine Learning, or Data Science if you prefer, is an important analytic skill to have to analyze data.  I believe that it will only become more useful overtime. Azure Machine Learning is a good tool for learning the analysis process.  Once you have the concepts down, then should you need to use other tools to perform analysis it is just a matter of learning a new tool.  I talk to a number of people who are trying to learn new things, and the study them in their spare time.  It’s very easy to spend time vaguely studying something, but you may find that having a target set of items to study will focus your time, and as a bonus you get a neat badge and some measure of proof that you were spending time on the computer learning new things and not just watching cat videos.

Exam 70-774 Preparation Tips


While you could always buy the book for the exam (shameless plug as I was one of the authors), the book will not be enough and you will still need to write some code, and do some additional studying. This exam one of two needed for the MCSA in Data Science and you an take the exams in any order. The best place to start is by first looking at the 70-774 exam reference page from Microsoft.  There are four different sections in the exam, and I have created some links for each section which will help you prepare for the exam. In studying for exams in the past, the best way I have found to prepare is to look at everything on the outline and make sure that I know it.

Prepare Data for Analysis in Azure Machine Learning and Export from Azure Machine Learning

Normalizing Data
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data

TanH
https://reference.wolfram.com/language/ref/Tanh.html

ZScore
http://stattrek.com/statistics/dictionary.aspx?definition=z-score
http://howto.commetrics.com/methodology/statistics/normalization/

Min Max
https://www.quora.com/What-is-the-meaning-of-min-max-normalization

PCA
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/principal-component-analysis
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/principal-component-analysis
https://stackoverflow.com/questions/9590114/importance-of-pca-or-svd-in-machine-learning

SVD
http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/

Canonical-correlation analysis (CCA)
https://en.wikipedia.org/wiki/Canonical_correlation

Singular Value Decomposition (SVD)
http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/

Develop Machine Learning Models

Team Data Science
https://docs.microsoft.com/fi-fi/azure/machine-learning/team-data-science-process/python-data-access

K-Means
https://www.datascience.com/blog/k-means-clustering

Confusion Matrix
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
https://en.wikipedia.org/wiki/Confusion_matrix
https://en.wikipedia.org/wiki/F1_score

Ordinal Regression
https://en.wikipedia.org/wiki/Ordinal_regression

Poisson regression
https://en.wikipedia.org/wiki/Poisson_regression

Mean Absolute Error and Root Mean Squared Error
http://www.eumetrain.org/data/4/451/english/msg/ver_cont_var/uos3/uos3_ko1.htm

Cross Validation
https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f

Operationalize and Manage Azure Machine Learning Services

Connect to a published Machine Learning web service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/publish-a-machine-learning-web-service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/consume-web-service-with-web-app-template
https://docs.microsoft.com/en-us/azure/machine-learning/studio/manage-new-webservice

Use Other Services for Machine Learning

Microsoft Cognitive Toolkit
https://www.microsoft.com/en-us/cognitive-toolkit/

BrainScript
https://docs.microsoft.com/en-us/cognitive-toolkit/brainscript-basic-concepts

Streamline development by using existing resources
https://docs.microsoft.com/en-us/azure/machine-learning/studio/gallery-how-to-use-contribute-publish
Perform database analytics by using SQL Server R Services on Azure
https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/provision-vm
https://docs.microsoft.com/en-us/machine-learning-server/install/r-server-vm-data-science
https://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf
http://blog.revolutionanalytics.com/2017/07/xgboost-support-added-to-rattle.html
https://github.com/JohnLangford/vowpal_wabbit/wiki

I hope you have found this test preparation material helpful.  If you passed the exam, let me know by sending me a comment.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Azure Machine Learning Workbench

Microsoft released Azure Machine Learning Workbench at the Ignite conference on September 25, 2017 as a public preview.  This tool is a new tool which they are adding to their Azure ecosystem, which includes the machine learning tool they introduced three years ago, Azure Machine Learning Studio. Microsoft has said they plan on keeping both products. When asked about the two products, they said that the earlier tool, Azure Machine Learning Studio, is targeted to developers who wanted to add machine learning to their current applications, as it is an easy to use tool that doesn’t require a person to be a trained data scientist.  Azure Machine Learning Workbench is targeted to data scientists who want to bring in other libraries, like TensorFlow for Python, and delve deep into the data.

Microsoft Moves into Machine Learning Management

Microsoft is looking for Azure Machine Learning Workbench for more than a tool to use for Machine Learning analysis. It is part of a system to manage and monitor the deployment of machine learning solutions with Azure Machine Learning Model Management. The management aspects are part of the application installation.  To install the Azure Machine Learning Workbench, the application download is available only by creating an account in Microsoft’s Azure environment, where a Machine Learning Model Management resource will be created as part of the install. Within this resource, you will be directed to create a virtual environment in Azure where you will be deploying and managing Machine Learning models.

This migration into management of machine learning components is part of a pattern first seen on the on-premises version of data science functionality.  First Microsoft helped companies manage the deployment of R code with SQL Server 2016 which includes the ability to move R code into SQL Server.  Providing this capability decreased the time it took to implement a data science solution by providing a means for the code can be deployed easily without the need for the R code to be re-written or included in another application. SQL Server 2017 expanded on this idea by allowing Python code to be deployed into SQL Server as well.  With the cloud service Model Management, Microsoft is hoping to centralize the implementation so that all Machine Learning services created can be managed in one place.

Hybrid Cloud, Desktop, and Python

While you must have an Azure account to use the Machine Learning Workbench, the application is designed to run on a locally on either a Mac or Windows computer.  There is a developer edition of the tool so that one can learn the tool and not incur a bill, which is the case with the previous product, Azure Machine Learning.  The download of Machine Learning Workbench must be accessed within an Azure account and is installed to your local computer.  When running the application from your computer, the application will prompt to log into your Azure account to load Azure Machine Learning Workbench.

The application is designed to use and create Python code.  Azure Machine Learning Workbench does not contain any accommodation to incorporate machine learning components written in R, just Python.  If you have created machine learning components using R, they can be incorporated into the Azure Machine Learning Model Management if you create webservices which encapsulate the R code. The R code does not interface into Workbench, but can be made to be a part of the managed projectes in Azure. While it is possible to create a webservice for R with the earlier product with Azure Machine Learning, there is no direct way to include R with Azure Machine Learning Workbench.  There are a number of sample templates to get started using Python templates including the ubiquitous Iris dataset, Linear regression and several others.   Once the project is created, you can use your favorite IDE, it creates python code which can be read anywhere.

Staying within Machine Learning Workbench application allows you access to arguably one of the neatest parts of the Machine Learning Workbench, the data parser. This tool which was originally code-named project Pendleton and designed to be an intuitive way to modify the contents of data even better than the previous leader in parsing data, Power BI’s Power Query.

You can select the option “Derive column by example” or “Split Column by Example” and then start typing in a new column.  For example, if you want to separate a column which contains the date and the time, if you right click on that date column and select “Split Column by Example” then type the date in the new column provided, the application will immediately determine that you want two columns and crate them. The date column and a time column be created for you after typing in one date.  After the sample columns have been created, you can approve the change or reject it if does not work how you want to.

Like Power Query, each change made to the data is included in the window called Steps on the right side of the application window. When you are done modifying the data, right click on the Data Preparations source icon, which in my example is called UFO Clean, to and the UI changes made to the data are used to create Python code to perform the changes. The generated Python code can be used to the source data programmatically.

The next step in the process is to write the python code needed to evaluate the data and create a model which would in my case determine where and when you are most likely to see the next UFO based on the dataset I have included in my project.  Unlike it’s counterpart Azure Machine Learning, you will need to know how to write the necessary code needed to create a machine learning analysis in Python for Azure Machine Learning Workbench. One could write the Python code to create a machine learning analysis in any Python editor.  If you chose to use Azure Machine Learning, the Python library scikit-learn is installed as part of the application.  Other libraries which you may want to use, such as the common library matplot, you will need to load within Azure Machine Learning Workbench.

Web Service: How Azure Machine Learning Workbench Solutions are Deployed

To deploy a package, you will need to export the completed model serialized Python object, with the Python Module, Pickle. This will create a file with the suffix of pkl, which is the file that you will be deploying. Azure Machine Learning Workbench expects that you will be deploying via Docker containers or creating an Azure cluster.  You will need to register the Docker container in the Machine Learning Container for it to be deployed.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Running Power BI Locally with the Power BI Report Server

Power BI Now Available on your Local Server

Power BI: Now available without being on the cloud

Microsoft had an lot of announcements about Power BI this week, so many that it was easy to miss some of the finer details, including those which are going to be important in making decisions going forward.  Since the announcements are changes which will be effective soon, in the case of the free tier of Power BI on June 1, and released “… generally available late in the second quarter of 2017” this will give Power BI users time to adjust to the changes. In a nutshell, Microsoft has announced they are adding a cloud service called Power BI Premium which will allow people to create capacity instead of per-user licenses, the free edition will no longer to be able to share files, Power BI Embedded is going to be migrated to the Power BI Service from Azure, and finally, at long last, it will be possible to run Power BI reports locally and without needing anything in the cloud.

Running Power BI without a Cloud

It is not possible to run Power BI reports locally right now, but sometime before the 1st of July 2017,  users who have SQL Server 2016 Enterprise Edition per-core and active Software Assurance [SA] can deploy Power BI Report Server.  This means that no one is going to have to wait for SQL Server 2017 for Power BI on premise as it will be available sometime in June.  The functionality in SQL Server 2017 SQL Server Reporting Server [SSRS]. Community Technology Preview edition is going to be available in Power BI Report Server, with the addition of the ability to include custom visuals and many data sources, which the CTP version did not do. The Power BI Server includes all of the functionality of SSRS This means that users will not need an SSRS Server and a Power BI Server, as the Power BI Server will be able to do both.  If you want to migrate all of the reports created in SSRS from 2008 R2, and SSRS Mobile Reports, you can migrate these reports to the new Power BI Report Server. You can use Power BI Reporting Server for reports created on earlier versions, as long as you have a version of SQL Server 2016 Enterprise per-core edition with SA. The Power BI Report Server will be a separate install with separate release schedules, which currently are planned about once a quarter. Power BI Report Server will also be able to publish reports to mobile devices as well. If the reports uses data in the cloud, you can employ a Data Gateway as the Power BI Reporting Server can use the gateway to access cloud data. Of course if all of the data in the report is located on-premises, no gateway will be required.

Power BI Pro Licenses for On-Premise Reporting

While there is going to be no additional cost for running reports locally, or looking at them, creating and sharing reports for the Power BI Report will require a Power BI Pro License.  The Power BI Desktop is going to be free, and there is still going to be a free version of Power BI. There will also be a  new desktop version of Power BI for Reporting Services which will be on the same version as the Server, which will have fewer updates. This means if you support Power BI Service Reports and Power BI Report Server Reports you will have two versions of the Desktop, the Reporting Services Power BI Desktop and the Power BI Service Desktop.  Both are designed to run on the same machine. So far I have not had any problems having both other than remembering which is which as the icons are the same.  You have to load the software to see that the top line has (Report Server).

Starting June 1, free Power BI license holders will no longer be able to share reports.  Reports created with a free license can be viewed only by the person with the free account.

Power BI Desktop does not have Dashboards, and neither will Power BI

When it is released, Power BI Report Server will be displaying reports created from the Power BI Desktop.  Dashboards are not created in the Power BI Desktop application, meaning that there will be no Power BI Dashboards in the Power BI Report Server.  While this may change in a later release, it is not available in the first release, which also does not support R or custom visuals either.  To display and distribute dashboards, use the Power BI service.

I am sure there will be more announcements about this and other upcoming Power BI features. Many will most likely be announced at Microsoft’s Data Summit Conference in June, which I will fortunately have the opportunity to attend.  If you are going to be there as well, drop me a line or ping me on twitter at @desertislesql and perhaps we can meet in person.
 ***Update I have a post which covers the released version of Power BI Report Server.  Click here to find what was changed since this post was written.
Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Security Updates to Power BI

Office 365 Admin Screen for granting Power BI Admin rights

Office 365 Admin Screen for granting Power BI Admin rights

In the past month, Microsoft has made a number of security changes to Power BI. The first one, is not really a feature update, but a PowerShell replacement. No longer do you need to use PowerShell to become a Power BI Admin. Any Office 365 Admin can grant Power BI Admin permissions via this screen in the Admin Center. The Power BI Admin role was first created in October, but the screen was not complete, which was just fixed in February.

Power BI Security Changed from Tenant Only

People who have been granted Power BI administrator rights will also notice a modification to the Admin screen. The March 2017 update to Power BI provides a major change to the security model in Power BI. Previously all the security settings were set at the Tenant Level, meaning that all the privileges were granted to all users. If I wanted to allow one group within the organization to be able to publish reports to the web, but I did not want to allow everyone to publish reports to the web, there was no way that this could be accomplished. All that has changed. It is now possible to include or exclude groups of users from having rights in Power BI. Users can be classified into security groups in Azure Active Directory, either through the Office 365 Admin Center or via the Azure AD Admin Center. Once created the security groups can be used in Power BI. Security Groups are not the same thing as the groups created in Power BI when a new work group is created.

Using Security Groups in Power BI Admin

PowerBINonTenantAdmin

Power BI Admin Portal

The new Power BI Screen looks different. It now lists which rights can be specified to different groups of users. Share content to external users, Export Data, Export reports as PowerPoint presentations, Printing dashboards and reports, Content pack publishing, and Use Analyze in Excel with on-premises datasets now have the ability to be assigned to security groups so that the rights do not have to be the same throughout the entire tenant.

Unfortunately, some of the permissions are still tenant based. For example, the setting Publish to web, which is one permission I would definitely like to turn on only for some users, is still only available as a tenant level option.  These security changes are a welcome improvement to the product as they provide more options for administrators to grant rights to Power BI.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

PolyBase – Another Method for Creating a Stretch Database in SQL Server 2016

PolyBase, which was released with SQL Server 2016, provides another method to access live data either locally or in the cloud, very similar to the SQL Server Stretch database feature. Polybase can also provide the ability to provide a more cost-effective availability for cold data, streamlines on-premises data maintenance, and keeps data secure even during migration. Polybase differs from Stretch database in a few ways, as the SQL must be different, the speed is noticeably slower, and it is a lot less expensive. The cost is significantly less because storing data in a Azure blob store starts at 1 cent a month and Stretch database starts at $2.50 an hour. In this post,I will show how to take data which was archived due to the age of the data, which was created in 2012 and store it in an Azure Blob Storage file which will be available via Polybase when I needed.

Implementing a PolyBase Stretch Database

PolyBase is an optional feature of SQL Server 2016, and the Instance Feature PolyBase Query Service for External Data needs to be selected as part of the installation process. Two services are ssmsexternalinstalled with the feature, SQL Server PolyBase Engine and SQL Server PolyBase Data Movement. Both of these services must be running and TCP/IP must be enabled for PolyBase to work. Either check for those services or run the query SELECT SERVERPROPERTY (‘IsPolybaseInstalled’) AS IsPolybaseInstalled; which will return a 1 when PolyBase has been installed. The next step is to tell SQL Server what the external source is by configuring which Hadoop Connectivity will be used.


EXEC sp_configure ‘hadoop connectivity’, 7;
GO
RECONFIGURE;

To use Azure Blob Storage, there are 3 different options, 1,4 and 7, which include the ability to access blob storage. These options also allow for various other Hadoop engines to be accessed as well. I have selected 7, which also allows me to use Hortonworks on Linux. For more information on the connectivity options, check out this link. A secure database ke is required to make the connection a well. The next step is to let SQL Server know where exactly the blob storage data is stored, which creates an entry in the External Data Sources. PolyBase needs to know how the data is formatted, which will be stored in an External File Format. By definition, Hadoop data has no schema, so a schema is going to have to be created and stored in an external table. The location for each of these items is stored within SQL Server Management studio as shown here.

This code will create a secure key using a password I made up

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'P@ssW0rdPolybase'

A PolyBase will need the Azure Blob Storage key to be able to make a connection.

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCred
WITH IDENTITY = 'user', Secret = 'lEGL66LiK2KE2U0WEb435PH15BwFOInrHqQ1AJigjoRVfEOrOge+TLbBNu861cqbC+a26io92o5sw/b+OTD/C4w=='; --Note: Replace with your Blob Storage key


CREATE EXTERNAL DATA SOURCE AzureBlobStorage with (
TYPE = HADOOP,
LOCATION ='wasbs://gingeriottest@gingertestiotblobstore.blob.core.windows.net', --This is the location of the folder inside blob storage containing my data
CREDENTIAL = AzureStorageCred
);

The data is stored in blob storage as a CSV, so I will need to create that file Format

CREATE EXTERNAL FILE FORMAT CSV WITH (
FORMAT_TYPE = DELIMITEDTEXT
,FORMAT_OPTIONS (FIELD_TERMINATOR =',', DATE_FORMAT='MM/dd/yyyy') )

Lastly, a table definition must be created for the file so that it can be accessed as a table

CREATE External TABLE OntimePerformance2012sampleall(
[DimAirlineKey] int,
[DimOriginAirportKey] int,
[DimArrivalAirportKey] int,
[DimCancellationReasonKey] int,
[DimDelayLengthKey] int,
[DimDepartureBlockKey] int,
[DimArrivalBlockKey] int,
[DimDistanceGroupKey] int,
[FlightDateKey] int,
[FlightNumber] varchar(50),
[ScheduleDepartureTime] int,
[ActualDepartureTime] int,
[DepartureDelayInMinutes] int,
[TaxiOutTime] int,
[TaxiInTime] int,
[ScheduleArrivalTime] int,
[ActualArrivalTime] int,
[ArrivalDelayInMinutes] int,
[ScheduleElapsedTime] int,
[ActualElapsedTime] int ,
[DistanceInMiles] int,
[CarrierDelayInMinutes] int,
[WeatherDelayInMinutes] int,
[NASDelayInMinutes] int,
[SecurityDelayInMinutes] int
)

WITH
(   LOCATION = '/OntimePerformance2012.csv',
DATA_SOURCE = AzureBlobStorage,
FILE_FORMAT = CSV
)

 

All of the external tables created can be found in the DMV sys.external_tables or by looking at sys.tables where the new column is_external is set to 1.  Once the setup of PolyBase is complete, I can query the external and internal tables all at the same time, like I do here.

 

SELECT count(*)
, r.cancellationReason
, d.CalendarYear
FROM dbo.CancellationReason r
JOIN dbo.OntimePerformance2012sampleall c
ON c.dimcancellationReasonKey = r.CancellationReasonKey
JOIN dbo.DimDate d
ON c.flightDateKey = d.DateKey
GROUP BY r.cancellationReason, d.CalendarYear

PolyBase Table Considerations

It is not possible to edit the data in PolyBase, as the data is stored read only. Also since the data is being pointed to externally, it cannot be indexed and querying it is not terribly quick. All of the data will be brought down and then filtered, unless PolyBase is configured to improve performance, which I will discuss in a future blog post.

PolyBase and Azure Blob Storage provides the ability to access data stored offsite in a secure location in a cost effective manner. If you saw my presentation on this topic at PASS Summit or have questions how to make it work in your environment, please contact me on twitter or write a comment on my blog.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Analyzing JSON in U-SQL

In USQL there are built-in extractors for parsing text, comma delimited or tab delimined files. Once again, parsing JSON becomes problematic. There is a solution built into USQL, write some C# code to extend it or use someone else’s C# code to extend USQL. Since I wanted to parse JSON, fortunately there are libraries available on github containing the information required to do it. Download the github package and open up the Microsoft.Analytics.Samples project in Visual Studio. When I did this the first time, there was a problem loading the Newtonsoft.Json reference, so I right clicked on the references and downloaded the missing parts again. Build the solution and check out the code in the directory …Examples\DataFormats\Microsoft.Analytics.Samples.Formats\bin\Debug\ . There will be two DLLs, Microsoft.Analytics.Samples.Formats.dll and Newtonsoft.Json.dll. These dlls then need to be registered in Data Lake Analytics and locally if you chose to run your USQL locally. As at some point the goal is to run from within Data Lake analytics, you will need to copy both of these dlls to the data lake. I created a folder for the dlls called Assemblies, and ran this command


USE DATABASE [master];
CREATE ASSEMBLY [Newtonsoft.Json] FROM @"/Assemblies/Newtonsoft.Json.dll";
CREATE ASSEMBLY [Microsoft.Analytics.Samples.Formats] FROM @"Assemblies/Microsoft.Analytics.Samples.Formats.dll";

Notice I told the USQL where to find the dlls, in the Assemblies folder. This step only needs to be completed once per data lake. After this job successfully runs, then the dlls which allow the JSON to be parsed, can be referenced.

Here is my sample JSON, which I have copied to the folder Samples/Data/TestNew.Json, in the Data Lake

{
"appInstanceId": "357ced1e-cf05-459c-9317-794bq24f61c2",
"firmwareVersion": "1.0.2.4",
"serialNumber": "254542-694967",
"Side": "0",
"Latitude": "33.8848744",
"Longitude": "-128.403276",
"GeneratedDate": "2016-10-04T21:18:19Z"
}

Now that I have added the JSON to the Data Lake and the assemblies have been added, I can write some USQL to Parse the JSON. First I will need to reference the libraries, then create a schema, as there is no schema for a Data Lake. After those steps are completed, it’s possible to write SQL to query a JSON file. There is no UI to look at the results, so the results will be writing to a file. I am going to output the data to a csv file called JSONOutput.csv. Here’s the code to do that.

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @infile string="/Samples/Data/TestNew.json";

@logSchema =
EXTRACT name string
, appInstanceId string
, firmwareVersion string
, serialNumber string
, Side string
, Latitude float
, Longitude float
FROM @infile
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

@testthis = SELECT appInstanceId
, COUNT(*) AS LocationCount
FROM @logSchema
GROUP BY appInstanceId;

OUTPUT @testthis
TO "/Samples/Data/JSONoutput.csv"
USING Outputters.Csv();

vsrunjson

Using Visual Studio, I am running the USQL Job. There isn’t much data to parse, and you can see in the summary widows that it took 21 seconds to prepare, and 33 seconds to run.

When go to the web and look at the Data Lake Analytics page, I can also see that the job completed. I have noticed that this appears pretty close to the same time on the web and on visual studio.

azuredlscreen

Clicking on the bar graph represented by today will allow me to select the job which ran, showing the same screen as appears in Visual Studio.

Thanks to Erik Zwiefel and Mark Vaillancourt b | t both of Microsoft for helping me figure out the process to use JSON in Data Lake Analytics, as I didn’t understand the steps which are required to parse JSON. I hope this blog makes it possible for you to figure out how to make it work.

 

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur