The Apache Spark open source project maintains all of the documentation for Apache Spark, the set of APIs used in Databricks and other big data processing applications. The documentation provides detailed information about the libraries, but the instructions for loading libraries in Databricks are not the same as the standard Spark instructions, so if you follow the Spark installation instructions you will get nowhere. Follow the steps listed here instead and you will be up and running in no time.
Installation Options – Cluster or Notebook?
If you are not using an ML workspace, you can add the library in a notebook using dbutils like this.
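A minimal sketch of the notebook-scoped install, assuming a standard (non-ML) Databricks Runtime; `dbutils` only exists inside a Databricks notebook session, so this will not run elsewhere:

```python
# Notebook-scoped install of Koalas from PyPI on a standard Databricks Runtime.
# dbutils is provided by the Databricks notebook environment.
dbutils.library.installPyPI("koalas")

# Restart the Python interpreter so the newly installed library can be imported.
dbutils.library.restartPython()
```

The install is scoped to the current notebook session, so other notebooks attached to the same cluster are not affected.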
Unfortunately, if you are using an ML workspace this will not work, and you will get the error message org.apache.spark.SparkException: Library utilities are not available on Databricks Runtime for Machine Learning. The Koalas GitHub documentation says "In the future, we will package Koalas out-of-the-box in both the regular Databricks Runtime and Databricks Runtime for Machine Learning". What this means is that if you want to use it now, you need to install it on the cluster instead.
Most of the time I want to install on the whole cluster as I segment libraries by cluster. This way if I want those libraries I just connect to the cluster that has them. Now the easiest way to install a library is to open up a running Databricks cluster (start it if it is not running) then go to the Libraries tab at the top of the screen. My cluster is called Yucca, and you can see that it is running because the circle next to the name is green.
Once you are on the Libraries tab you will see two buttons. Click on the one labeled Install New. A window will appear. Select the library source PyPI and, in the Package text box, enter the word koalas. Then click the Install button.
After this you are ready to use the new library, once you import it as shown here.
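The import below follows the convention used in the Koalas documentation; the alias `ks` is just a common convention, not a requirement. This only runs on a cluster where the library is installed:

```python
import databricks.koalas as ks

# A Koalas dataframe is created the same way a pandas one would be,
# but it is backed by Spark and distributed across the cluster.
kdf = ks.DataFrame({"animal": ["koala", "wombat"], "count": [2, 3]})
```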
Why do I want to install Koalas in Databricks?
If you have written Python code for machine learning, chances are you are using pandas. Pandas dataframes are practically the standard for manipulating data in Python. They are not, however, part of the Spark API. While you can move your Python code over to Databricks without making any changes, that is not advisable: Databricks is not able to scale pandas code, so adding more resources to your cluster may not improve performance. When writing Python code for Databricks you need to use the Spark APIs to ensure that your code can scale and will perform optimally. Prior to April of 2019, that meant you had to use Spark dataframes rather than pandas dataframes, which could involve a fair amount of rework when porting code, since so much existing code was written with pandas. In April of last year Databricks released Koalas, which means that changing code from a pandas dataframe to a Koalas dataframe often requires changing only one word. Koalas implements most of the functionality of a pandas dataframe, so if you are familiar with one you can use the other.
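A small illustration of the "change one word" claim. The pandas code below is runnable anywhere; the Koalas equivalent (shown in comments, since it needs a Databricks cluster) differs only in the import and the `pd`/`ks` prefix. The data here is made up for illustration:

```python
import pandas as pd

# Ordinary pandas code
pdf = pd.DataFrame({"animal": ["koala", "wombat", "koala"],
                    "count": [2, 3, 5]})
totals = pdf.groupby("animal")["count"].sum()
print(totals["koala"])  # 7

# The Koalas version changes only the prefix:
#   import databricks.koalas as ks
#   kdf = ks.DataFrame({"animal": [...], "count": [...]})
#   totals = kdf.groupby("animal")["count"].sum()
# The rest of the dataframe code stays the same, but runs distributed on Spark.
```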
More About Koalas
It is impossible for me to load the library without thinking about the Australian bushfires, which are burning the homes of both people and koalas. If your finances allow it, please consider donating to the firefighters, as I am sure they can use help to save the homes of people and animals in Australia.