Articles

Using Koalas in Spark


The Apache Spark open source project maintains all of the documentation for Apache Spark, which is a set of APIs used in Databricks and other big data processing applications. The documentation provides detailed information about the libraries, but the instructions for loading libraries are not exactly the same as the ones used in Databricks, so if you follow the Spark installation instructions you will get nowhere. If you follow the steps listed here, you will be up and running in no time.

Installation Options – Cluster or Notebook?

If you are not using an ML workspace, you can install it in a notebook using dbutils like this; the restartPython call restarts the Python process so that the newly installed library is picked up.
dbutils.library.installPyPI("koalas")
dbutils.library.restartPython()

Unfortunately, if you are using an ML workspace, this will not work and you will get the error message org.apache.spark.SparkException: Library utilities are not available on Databricks Runtime for Machine Learning. The Koalas GitHub documentation says “In the future, we will package Koalas out-of-the-box in both the regular Databricks Runtime and Databricks Runtime for Machine Learning”. What this means is that if you want to use it now on an ML cluster, you need to install it as a cluster library, as described below.

Most of the time I want to install on the whole cluster, as I segment libraries by cluster. This way, if I want those libraries I just connect to the cluster that has them. The easiest way to install a library is to open up a running Databricks cluster (start it if it is not running) and then go to the Libraries tab at the top of the screen. My cluster is called Yucca, and you can see that it is running because the circle next to the name is green.

Once you are on the Libraries tab you will see two buttons. Click on the one labeled Install New. A window will appear. Select the library source of PyPI and in the Package text box enter the word koalas. Then click on the Install button.

Install Databricks Library

The installation may take a few minutes. When it is complete you will see a green status circle and the word installed.

After this you are ready to use the new library, once you import it as shown here.
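A minimal sketch of that import, with a small throwaway dataframe just to confirm the library loaded:

import databricks.koalas as ks

# Build a tiny Koalas dataframe to check that the import worked.
kdf = ks.DataFrame({"language": ["Python", "Scala", "R"], "rank": [1, 2, 3]})
print(kdf.head())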

 

Why do I want to install Koalas in Databricks?

If you have written Python code for machine learning, chances are you are using pandas. Pandas dataframes are practically the standard for manipulating data in Python. They are not, however, part of the Spark API. While you can move your Python code over to Databricks without making any changes to it, that is not advisable. Databricks is not able to scale pandas, so adding more resources to your cluster may not improve the performance of pandas code. When writing Python code for Databricks you need to use the Spark APIs in order to ensure that your code can scale and will perform optimally. Prior to April of 2019, that meant you had to use Spark dataframes rather than pandas dataframes, which could involve a fair bit of rework when porting code, as so much code was written in pandas. In April of 2019 Koalas was released, meaning that changing code from a pandas dataframe to a Koalas dataframe only requires changing one word. Koalas implements the pandas dataframe API, so if you are familiar with one you can use the other.
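To make that one-word change concrete, here is a minimal sketch; the file path and column names are made up for illustration:

# pandas version: runs single-node, entirely in memory
import pandas as pd
pdf = pd.read_csv("/dbfs/FileStore/tables/sales.csv")      # hypothetical file
totals_pd = pdf.groupby("region")["amount"].sum()

# Koalas version: same dataframe API, but backed by Spark and distributed
import databricks.koalas as ks
kdf = ks.read_csv("/FileStore/tables/sales.csv")           # hypothetical file
totals_ks = kdf.groupby("region")["amount"].sum()

The paths differ slightly because pandas reads through the local /dbfs mount while Koalas reads through Spark; otherwise only the import changes.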

More About Koalas

koala getting help from a Firefighter

Photo Credit: New York Fox 5

It is impossible for me to load the library without thinking about the Australian Bush Fires which are burning the homes of people and Koalas.  If your finances allow it, please consider donating to the firefighters as I am sure they can use help to save the homes of people and animals in Australia.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Introduction to Databricks

As I have been doing some work on Databricks, I thought that it would make sense to start writing about it. Databricks is a scalable environment used to run R, Python and Scala code in the cloud. It currently can be run in either AWS or Microsoft’s Azure cloud. For those of you who are budget-minded when it comes to learning new tools, there is also a free tier, available at Community.cloud.databricks.com. It has somewhat limited compute capacity, but if you are just starting out you might find it helpful.

Backstory

Databricks is a commercial implementation of Apache Spark, which is part of the Hadoop ecosystem and was created as a replacement for MapReduce. Many of the people who worked on that open source project were students at Berkeley, where Apache Spark was created; it was added to Apache in 2013. Like many development projects, after it was completed they had some ideas on how to improve the code. This time they decided not to make it open source but to make it a commercial product, so they could make some money for their development efforts. In April of 2017 Databricks was released on AWS, and in March 2018 it was released in Azure.

Creating an Azure Databricks Service

Creating a Databricks service is very straightforward. There are only a few things that you need to complete when creating a new Databricks instance. The location becomes very important if you are looking at higher-performing instance types, which may not be available in all locations. Security considerations also matter if sensitive information will be stored and accessed, though that mostly comes into the picture when you are working on a company project rather than just learning. If you are just getting started, don’t worry about high-end hosting options, as you most likely will not need them and most of the compute options are available in most data centers. As always in Azure, you want to make sure that you are hosting your Databricks service in the same location as your data so you will not need to pay to transfer data between data centers.

The Pricing Tier contains three options: Standard, Premium and Trial (Premium, 14 days). The trial is pretty self-explanatory and is a great way to get started using Databricks. There are of course a few differences between Standard and Premium. Premium has extra features needed for teams, including role-based access rights for the components of Databricks, and if you want ODBC authentication and audit logs you will need to use Premium. For more information on the cost of the Databricks pricing tiers, check out Microsoft’s pricing page.

Once you have an instance created, you can start using Databricks. The application is contained within a managed instance, so once you launch Databricks you will be in their environment, which looks the same as the free edition.

 

Clusters, Notebooks and Data

These three components are the most important parts of Databricks, as they provide the compute power, the place where you write code, and the information you work with, respectively. These components are all separated in Databricks to improve scaling and to provide a familiar environment in which to create and run code.

Cluster

The cluster is the most important Databricks element, as it contains the compute. This is also the part of Databricks which will most increase your bill, as the more resources you use to run code, the more it costs. One nice thing is that clusters by default will terminate after 120 minutes of inactivity. I generally drop this to 20 minutes. If I am using the cluster it will not terminate, but if I am not using it, I want the charges to stop. You can also automatically spin up clusters to run jobs, so that they will only be in use when the job needs them. More about that in another post.
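For what it is worth, the auto-termination setting can also be set when a cluster is created through the Databricks REST API. This is only a rough sketch; the workspace URL, token, runtime version and node type below are placeholders to replace with values from your own workspace:

import requests

host = "https://adb-0000000000000000.0.azuredatabricks.net"   # placeholder workspace URL
token = "dapi-your-token-here"                                 # placeholder personal access token

cluster_spec = {
    "cluster_name": "yucca",
    "spark_version": "7.3.x-scala2.12",       # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",        # placeholder node type
    "num_workers": 2,
    "autotermination_minutes": 20,            # stop billing after 20 idle minutes instead of 120
}

response = requests.post(
    host + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + token},
    json=cluster_spec,
)
print(response.json())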

Notebooks

Databricks Notebook Import

Databricks supports three languages, R, Scala and Python, and within Databricks all of your code is written in notebooks. You don’t have to write your code in the environment; you can write it locally and then import it. However, if you want to export your notebook and then run it locally, it gets trickier. Natively, all of the notebooks in Databricks are saved as .dbc files, which can’t be read by anything else. Fortunately there is a workaround to format the notebook files as .ipynb files, which can be read by any Jupyter-compatible notebook tool. Dave Wentzel from Microsoft has an elegant solution to convert .dbc to .ipynb, which he includes in his blog here.
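As an alternative to converting .dbc files after the fact, I believe the Databricks Workspace REST API can export a Python notebook directly in Jupyter format. This is only a sketch; the workspace URL, token and notebook path are placeholders:

import base64
import requests

host = "https://adb-0000000000000000.0.azuredatabricks.net"   # placeholder workspace URL
token = "dapi-your-token-here"                                 # placeholder personal access token

response = requests.get(
    host + "/api/2.0/workspace/export",
    headers={"Authorization": "Bearer " + token},
    params={"path": "/Users/someone@example.com/MyNotebook", "format": "JUPYTER"},
)

# The export endpoint returns the notebook body base64 encoded.
with open("MyNotebook.ipynb", "wb") as ipynb:
    ipynb.write(base64.b64decode(response.json()["content"]))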

Data

You have a lot of options with data. You can import a dataset into your environment to play with, or you can connect to just about anything you can think of. When you start doing data connections is when you stop using the Community Edition, as you will want to use the Azure version to connect to various data resources like Azure SQL and blob storage. More on how to do that in an upcoming post.
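As a small taste of what working with imported data looks like, here is a sketch of reading an uploaded CSV into a Spark dataframe inside a notebook; the path is a placeholder for whatever file you upload:

# 'spark' and 'display' are provided automatically in a Databricks notebook.
df = (
    spark.read
    .option("header", "true")        # first row holds the column names
    .option("inferSchema", "true")   # let Spark guess the column types
    .csv("/FileStore/tables/my_uploaded_file.csv")   # placeholder DBFS path
)
display(df)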

If you are interested in hearing more about Databricks and are in Chicago, I am teaching an all-day class as part of SQL Saturday Chicago and would love to have you attend. More information on that class is here.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

IoT Security Concerns

When looking at IoT implementations, the topic of security always comes up. Many people remember October 21, 2016 as the day IoT devices broke the internet. After the investigation was complete, it turned out that the outages were not exactly caused by IoT devices. A majority of the denial of service attacks came from things like home routers, which most people would not classify as IoT devices. When looking at all of the different ways that IoT devices can be modified to do bad things, a few different ideas come to mind in terms of risk. How easy is it for a non-authorized user to gain access to a given device, and what kind of device is it? If the device is a network router, that is a big problem. If the device is a water sensor and you need a lot of networking equipment to modify it, then the risk can be classified as low. How an IoT device is modified is also a concern. If the device is hacked in such a way that it becomes unusable, because the malicious code ran down the battery, that is a bigger deal than a device which can be fixed by power cycling it, which returns it to the factory configuration. Many times the code used to take over the device prevents any remote access, which means a person needs to physically go power cycle the infected device. This can be a problem, as some of these devices are inaccessible and are designed to be replaced, not maintained. Power cycling stops the immediate problem, but it does not prevent the same exploitation from happening again. New firmware or patches need to be applied to prevent the problem, requiring maintenance of the device again.

Knowing something about the people involved and the process used to gain access can help in assessing risk. Unfortunately, a number of devices have been released with no security at all; they leave ports open which can be readily attacked instead of, say, implementing SSL. Telnet, HTTP, DNS, port 80, port 22, and RDP are all ready targets for attack. Broadly speaking, there are also three classifications of users who work on accessing things to which they don’t have access. Knowing about these three kinds of users, and how they go about gaining access, helps put the risk in perspective.

Terminology changes over time, and to understand the risk, one needs to understand the language used by the people who access networks outside of the way the software was designed. For starters, the word hacker is not used much. The word is overused and has come to mean people who have ill intent, instead of its original meaning of people who were looking for flaws, not exploiting them. The people who access networks are known as penetration testers, commonly abbreviated as pen testers. They not only figure out how things are broken but, more importantly, how to fix the problems they have found before someone takes advantage of the flaw.

Nation State Attacks

Many governments employ legions of developers to access or destroy systems. With nearly unlimited resources at their disposal, they are very successful when they want to target software for attack, as was seen when the Iranian centrifuges were destroyed. They also have the ability to completely mask where the unauthorized code came from, as that is standard operating procedure. If code is ascribed to be from the area near I-95 exit 41, you can pretty much guarantee that it came from somewhere else. It is virtually impossible to trace the origins of nation state attacks, as they ensure nothing is ever what it seems. If there are clues as to the origins of the software, they were added as intentional misdirection to throw people off the trail or to affix blame on someone else. They have dedicated hardware designed to break access codes in a matter of seconds. There is no stopping this type of attack. Fortunately, they have little interest in most targets, so the risk is low.

White Hats

These penetration testers generally make their living hacking by permission. Firms hire them to see how vulnerable their networks or IoT devices are. The white hats then contact the vendors and provide warnings about things like the ability to access all of the memory on a VM server if one knows how to overload a specific buffer, which is getting to be more common as servers are virtualized both on-site and in the cloud, which is after all someone else’s server. These people are trying to help make the code better. Of course, not all pen testers are white hats; there are grey and black hats too. These people know enough to cause, or prevent, real damage.

Script Kiddies

Like everything else, there are varying levels of skill involved in pen testing. The lowest skill level is known as a Script Kiddie. People who penetrate networks are not the evil geniuses portrayed in Hollywood; they just know where to download Kali Linux, which comes with a tool called Metasploit. This tool contains a database of exploits which can be run against networks. Knowing how to run a program is not a great skill, but then again, given how poorly so much software is written, you don’t have to be an evil genius to, say, spy on a TrendNet webcam, as they did not fix their software until the FTC made them do it. The IoT hack that broke the internet? That capability was readily available as a canned exploit. Fortunately, these common types of unauthorized access are the easiest to defeat. IoT systems need to be designed to avoid simple and well-known intrusions, which is something that I will be talking about in my presentation for 24 Hours of PASS. Hopefully you will get a chance to attend live or watch the recording afterwards.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

How 2016 Releases of SharePoint and Excel Impact Power BI

There has been a lot of news recently from Microsoft on the upcoming changes to Power BI, Excel and SharePoint. Some of the changes may impact your decisions regarding report distribution.

SharePoint 2016 has no Power Pivot Gallery

Within SharePoint 2016, Excel Services have been removed from the SharePoint Server. Excel Services didn’t go away; they were moved to the Office Online Server Preview, which means that if you want to continue rendering Pivot Tables and Power View via the Power Pivot Gallery, you won’t be doing it from SharePoint. If you are wondering how you are going to be able to view Pivot Tables and Power View reports in Office Online Server Preview, you have a lot of company, because Office Online Server Preview is not out yet. What Microsoft is saying right now with the preview release of SharePoint 2016 is that Excel Services are not there, and if you want to use Excel Services, you will be waiting until there is a preview of Office Online Server. If you are thinking of upgrading next year to the latest version of SharePoint, you will need to wait to see how to do that and still support Excel report distribution.

Decreased Excel Power

Excel 2016 changes

Microsoft is eliminating the rampant use of the word “Power” when it comes to Excel tools. I want to stress that none of the tools have been removed, just renamed. Power Query has been renamed “Get and Transform”. Power Map is now called 3D Map. Only two Power names remain, Power View and Power Pivot, and the use of the word Power has diminished with both of them. Power View is still there and still called Power View, but you have to add it to the ribbon to see it. The Data tab now contains the familiar Manage Data Model icon to open up the Power Pivot window, though you will have to enable the add-in to make it work. The add-in is found in the Data Analysis add-in group with no reference to Power. A number of new features have been added, especially when it comes to visualizations, which now include Sunburst, Waterfall, Histogram, Pareto, Box & Whisker and TreeMap charts.

Forecasting is Back

Forecasting in Excel 2016

I really liked the forecasting feature in Power BI for Office 365, which I wrote about when it was disabled in December of 2014. Now it is back in Excel 2016. It works like it did before, by looking at historical information and using those trends to predict future values, with variables to help you adjust the results for different conditions. This feature provides a simple way to view possible outcomes, and I think it is going to be very popular.

Power BI Decision Process

Next year’s releases are going to bring a lot of changes which may impact what your organization elects to do in the future. A lot of these changes were rather predictable, such as Excel 2016 containing the new visualizations found in Power BI, but others, such as the changes to SharePoint, may be a surprise. Since I do not work for Microsoft, I have to guess like everyone else what the changes will be, but I don’t think it is a big stretch to guess that Office Online Server may be cloud based. The SharePoint 2016 product announcement describes the product as a cloud hybrid, and I am guessing the hybrid part may see your Excel documents on Office Online Server in the cloud. If this is a big deal to your organization, you may want to follow the information about these changes really closely. I know I will, and I will be writing about them here. Please feel free to subscribe to my blog to get the latest updates.

***For updates on this topic please see my more recent post Update on SharePoint 2016 and Excel Services

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

What is a Modern Data Warehouse?

As I was honored enough to be selected to give a precon on the internals of the modern data warehouse, I thought that I would take the time to explain why I felt drawn to the topic. There are a lot of places that haven’t given much thought to the changes in technology which have happened over the last few years. To them, the major feature upgrades to SQL Server in 2012 and 2014 meant that they could use columnstore indexes, which make things faster, and maybe get better high availability. While those are certainly valuable improvements, there is a lot more that you can do to derive value from your data, and companies want more than just a well-organized, running data warehouse.

Data is a Valuable Asset

In 2010, Borders Group Inc. was allowed by the Federal Trade Commission to sell their customer information to Barnes and Noble as part of their bankruptcy sale of their assets. In 2015, RadioShack is doing the same thing. Businesses understand that data is valuable and they are interested in using it to drive decision making. Amazon, Netflix and Target are well known for their use of customer information to drive sales, but they are far from the only ones doing this. This is one of the bigger trends identified recently in the business press. The heads of companies are now looking for their data teams to do more with their data so that they too can have the dream information systems they are reading about.

Total Destruction of the Existing DW is Not Required

Excavator working with earth and sand in sandpit

While a lot of the time it might be nice to level everything and start over, that is not always an option. The major reason for this is that the data warehouse environment already in place has a lot of value. You want to add to the value already there, not destroy what you have. It would also take a long time to recreate the environment, and no one is patient enough to wait for that. Alternatively, you could expand into areas of new technology as your data grows. Perhaps this means you archive some of your data from your database to a Hadoop cluster instead of backing up the data to some far-off location. This would allow you to use Sqoop to bring the data back when you need it, providing ready access to the data. Perhaps you want to provide the users more self-service BI capabilities, moving the data analysis into the hands of the people who are most familiar with the data? You could add the capabilities of Power View in Excel, Power Designer or Tableau to your environment.

Incorporating Social Media Information

The business world no longer operates only on a batch cycle. More and more companies want to know what is being said about them so they can respond appropriately. With tools like Azure Event Hubs, Data Factory, Stream Analytics, and Machine Learning, this isn’t as hard to do as it might sound. We’ll review these products so that attendees will understand how these tools can provide greater insight not only into their own data, but also into the data building up about them outside of the company firewall.

For More Information

I really hope you can join me in Huntington Beach on April 10 for a full day of exploring these concepts. I always look forward to events like the precon and of course SQL Saturday #389 – Huntington Beach which is the following day.

 

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

From the clouds, data – DIY Guide to using SSIS with Azure

My apologies to Jack Johnson for lifting his song title for this blog, but I couldn’t help it. I thought that it might be useful to discuss the how-tos of data migration to the cloud, which reminded me of all the cloud songs I know. I always thought the cloud metaphor was sort of an odd name for a remote data center, but that is probably why I am not in the business of naming things. But I digress. Back to the topic at hand. Moving data from SSIS to databases hosted on Microsoft’s Azure cloud requires some different steps than just moving data around on premises. Since things in the cloud change quite quickly, I thought I would demonstrate what this currently looks like using SSDT in Visual Studio 2012.

Breaking through the Wall

Before getting started with SSDT, there are some configuration steps you need to complete on Azure first. The Windows Azure SQL Database firewall must be set up to allow connections from your IP address, or your data will never get to the cloud. There is an option on the database screen to “Connect to your database”. If the firewall is not opened for the IP and port in use, you will get this message.

AzureDBScreenConfigFirewall-message

Selecting Yes is needed to be able to connect to the Azure database via SSIS.

You might also want to validate that the right drivers are loaded for SSDT. If you are running Visual Studio 2012 or 2013, no worries, as the drivers are already there, but for earlier versions new drivers may be required. If you are planning on loading data to an Azure SQL database, ODBC or ADO.NET are the connection types needed for Azure. The old data connection standby, OLE DB, is going to be left in the toolbox like last year’s dress, as it won’t work for the cloud. Much like fashion, everything old is new again, so ODBC is once again the “It” connection. You can use ADO.NET too, but I won’t be using it here.

The next step in the process is getting the connection information needed to connect to the Azure database. Microsoft made this step quite easy. Look on the Azure Database screen where I’ve pasted a pink arrow. Click there.

 AzureDBScreenFull-EnabledCropped

 

This makes it easy: a screen pops up with all the connection information you need. You just need to copy the ODBC section, and remember what your password is, as you will need to enter it.

AzureODBCConnectionInfo
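Incidentally, the same ODBC information Azure hands you works anywhere ODBC does, so you can sanity-check it outside of SSIS. Here is a rough Python sketch using pyodbc; the server, database and credentials are made up, and the driver name depends on what is installed on your machine (the string Azure shows you includes the right one):

import pyodbc

# Everything below is a placeholder - paste in the values from the Azure connection screen.
connection_string = (
    "DRIVER={ODBC Driver 17 for SQL Server};"     # use the driver named in the Azure ODBC string
    "SERVER=tcp:myazureserver.database.windows.net,1433;"
    "DATABASE=mydatabase;"
    "UID=myuser@myazureserver;"
    "PWD=mypassword;"
    "Encrypt=yes;"
)

connection = pyodbc.connect(connection_string)
print(connection.cursor().execute("SELECT @@VERSION;").fetchone()[0])
connection.close()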

While we are still in Azure, I thought it would be a good idea to display where the SSIS package we will be creating will be putting the data. Here’s the table structure where the data will be placed.

AzureDBScreenTable

 

Here’s the query screen showing that right now the table is empty.

AzureDBScreenTable-NoData

SSIS Package Transferring Data to Azure

After you have all of the information you need from Azure, it is a relatively simple thing to create an SSIS package with an OLE DB connection for the on-premises database and an ODBC connection to Azure, using the information copied from the Azure database connection screen, to transfer data to the Azure database.

AzureDBSSISRunning

Going back to Azure, you can see 19,972 rows were added.

AzureDBScreenTable-Data

One word of caution, as you see here in the progress log, adding data can be a very slow process.

AzureDBSSISProgress

I highlighted the Elapsed time in red so that it would be easy to see that a simple file transfer took over two minutes.

Location, Location

One thing which is important to consider is where you are going to be moving your data. I demonstrated what I think may be the more common scenario, where the data is not in the cloud and you want to move it to the cloud. Microsoft refers to this as Hybrid Data Movement. Of course this may not be your case. If you are running SQL Server on a virtual machine in the cloud, it may make a lot more sense to run SSIS on that virtual machine. If that is the case, for optimal performance, locate SSIS on a VM in the same data center as the database, because otherwise, due to bandwidth and network latency, it will be slower. When transmitting data around the cloud, whether it be from on premises to the cloud or from one server to another in the cloud, you might want to consider compressing the data prior to sending it, if at all possible, to decrease the size of the data being transmitted. It may be faster to extract the data you want on premises and transmit a compressed file to be applied on the cloud server. This can get more complicated, as it requires setting up a Secure FTP server to transmit the files, which then have to be applied. Natively SSIS doesn’t have a compression tool, but there are third-party products, such as Task Factory, which will allow you to not only compress the output but also send it to your VM via Secure FTP from within the SSIS package.
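To illustrate the compression idea outside of SSIS, here is a minimal Python sketch that gzips an extract before it is shipped over Secure FTP; the file names are made up:

import gzip
import shutil

# Compress the extract so less data has to cross the wire to the cloud server.
with open("daily_extract.csv", "rb") as source, gzip.open("daily_extract.csv.gz", "wb") as target:
    shutil.copyfileobj(source, target)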

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur