Fabric Features not found in Azure Synapse or Power BI – Data Wrangler
The second element in my series on new Fabric features is Data Wrangler. Data Wrangler is an entirely new feature found inside the Data Engineering and Machine Learning experiences of Fabric. It was created to help analyze data in a lakehouse using Spark and generated code. You may find that there's a lot of data in the data lake that you need to evaluate to determine how you might incorporate it into a data model. It's important to examine the data to see what it contains. Is anything missing? Are any columns incorrectly typed? Is there bad data? Data Wrangler provides an easy way to answer these questions, using techniques commonly used by data scientists. It is used inside notebooks in the Data Engineering or Machine Learning experiences, as the functionality does not exist within the Power BI experience.
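To make those questions concrete, here is a small sketch of the kinds of checks involved, using a tiny made-up pandas DataFrame (the column names are illustrative, not the actual public holidays schema):

```python
import pandas as pd

# A tiny made-up frame standing in for lakehouse data;
# the columns here are illustrative only.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-12-25", None],
    "holidayName": ["New Year's Day", "Christmas", "Christmas"],
})

print(df.isna().sum())    # how many values are missing per column?
print(df.dtypes)          # are any columns incorrectly typed?
print(df.nunique())       # how many distinct values does each column have?
```

Data Wrangler surfaces exactly this kind of information in its grid view, without you having to write the code yourself.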
Step-by-step Data Wrangling
The example I am going to walk through here uses the public holiday sample dataset, which you can load by clicking on the Use a sample button within the Data Engineering experience and selecting Public Holidays. Once you create a lakehouse with the sample data, you can follow along with the steps in this blog for doing your own data wrangling.
From within the Data Engineering experience, select Notebook from the + New dropdown menu to create a new notebook. You will need to add some data to the lakehouse inside of the new notebook. To do this, click on the green Add button in the lakehouse section of the screen to add your Public Holiday lakehouse. After the lakehouse is loaded, you will have two file folders. If you click on the one labelled Files, you will see two more folders. Click on the folder called sample_datasets, which makes it turn green, and a list of files opens up in the sample_datasets pane. If you click on the ellipsis menu next to public_holidays.parquet, you will see a popup window. In that window, select Load data and you will see two options, Spark and Pandas. The screen image shows what it looks like.
When Fabric was first released, the only option available for data wrangling was pandas, which is not Spark compliant. For more information regarding pandas, check out my blog on Koalas, where I cover why you don't want to use pandas. I am very happy to see the Spark DataFrames option, as it can take advantage of Spark's superior memory and data management. Note, however, that Data Wrangler itself works in pandas no matter which option you select; if you choose Spark, the generated pandas code is converted to Spark for you. We now need to run the code which was generated for us so the data is loaded into a Spark DataFrame, by going to the Run menu and selecting Run all. This will start up the Spark cluster and run the code. You can look in the bottom left corner to see when it is complete.
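For reference, the generated load cell looks roughly like the following. The exact path and variable name come from the file you clicked, and the code relies on the `spark` session Fabric provides inside the notebook, so treat this as a sketch rather than the literal output:

```python
# Roughly what the Load data menu generates for the Spark option;
# the path assumes the default lakehouse file layout, and spark is
# the session Fabric provides inside the notebook.
df = spark.read.parquet("Files/sample_datasets/public_holidays.parquet")
display(df)
```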
Once you see the data you can start wrangling it. Go to the Data tab and select Transform DataFrame in Data Wrangler, then select the dataframe which was created for you, df.
Using Data Wrangler
In the Data Wrangler screen, which is pictured here, you will see several interesting items. The first is at the top of the screen, where a banner lets you know that everything here is pandas, but the code will be converted to PySpark when you leave Data Wrangler. This is important because it means the generated code you take back to your notebook will be Spark compliant. The top of the screen looks a lot like something you might see within Power Query if you have all the view items checked. Data Wrangler also shows missing and distinct counts, as well as the top three values for each column. On the left side of the screen there is a list of operations which will generate code for you. I am going to select the first two of them, starting with Drop duplicate rows, which you can see in the screenshot. This immediately generates a line of code to drop duplicate rows across all columns, but nothing happens until the Apply button is clicked, which I will do so the code is run. The applied steps are called cleaning steps, and they appear on the bottom left side of the screen. Even after a step is applied it is not removed from the operations list, which seems odd. Next I select the option to drop missing values, and I am prompted with a dropdown box to select which columns to check for missing values. I pick all of them, then apply the change. At this point, since the first two steps removed the duplicate and missing rows, the next two operations would do nothing, as there is nothing left for them to clean. Clicking the option Preview code for all steps allows me to see the generated code.
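The two cleaning steps applied above boil down to very little pandas code. Here is a self-contained sketch using a made-up frame (the column names are illustrative, not the exact sample schema):

```python
import pandas as pd

# Hypothetical sample standing in for the public holidays data;
# row 1 duplicates row 0, and row 3 has a missing country.
df = pd.DataFrame({
    "countryOrRegion": ["Norway", "Norway", "Sweden", None],
    "holidayName": ["Søndag", "Søndag", "Söndag", "Unknown"],
})

# Step 1: drop duplicate rows across all columns,
# matching the Drop duplicate rows operation.
df = df.drop_duplicates()

# Step 2: drop rows with a missing value in any column,
# matching the drop missing values operation with all columns selected.
df = df.dropna()

print(len(df))  # prints 2
```

Seeing how short this is explains why Data Wrangler can generate it instantly; the value is in the visual preview of what each step does before you apply it.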
If I want to run this code every time I get this dataset, I want to add it to a notebook, so I will select Add code to notebook. In my notebook I want PySpark code, which is different from the pandas code that was generated. Optionally I could add the pandas code instead, but I see no reason to do that. A popup window will appear showing me the code, at which point I can just click on the Add button. The generated code is added to my notebook, saving me the tedium of having to write common code like this.
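The PySpark version of the two cleaning steps looks roughly like the sketch below. The function name and structure are typical of what the conversion produces, not guaranteed output, and running it requires a Spark session such as the one a Fabric notebook provides:

```python
# A sketch of the kind of PySpark code added to the notebook after
# Data Wrangler converts the pandas steps; clean_data and df are
# illustrative names, not guaranteed generated output.
from pyspark.sql import DataFrame

def clean_data(df: DataFrame) -> DataFrame:
    # Drop duplicate rows across all columns
    df = df.dropDuplicates()
    # Drop rows with a missing value in any column
    df = df.dropna()
    return df

df_clean = clean_data(df)
```

Note that the Spark DataFrame methods (`dropDuplicates`, `dropna`) differ slightly in name and behavior from their pandas counterparts, which is exactly why having the conversion done for you is convenient.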
Data Wrangler is another automated tool which will write code for me, similar to what I might expect if I were using Copilot. Data Wrangler presents decision-making options, and the code it generates is based on my choices about which option is best in a given scenario. I can always return to Data Wrangler to add other steps or make different selections from the operations I chose. This tool can be really useful, and I plan on using it to clean up my data in notebooks.
Yours Always
Ginger Grant
Data aficionado et SQL Raconteur