Articles for the Month of January 2024

Fabric Features not found in Azure Synapse or Power BI – Data Activator

With the GA release of Fabric in November 2023, I am dedicating several posts to new features which you will not find in Power BI or Azure Synapse, and the latest one I want to talk about is Data Activator. Data Activator is an interesting tool to include inside of Fabric because it is not reporting or ETL; rather, it is a way to manage actions when the data hits defined targets. It is a management system for data stored in Fabric or streamed in Azure using IoT or Event Hubs. You can use Data Activator to monitor the current state of your data or to define actions which occur when certain conditions are met. Data Activator is still in preview, but you can evaluate it now.

Using Data Activator

Getting started with Data Activator requires you to have access to the Tenant Administration features of Fabric. I have included an image of the Admin Portal tenant settings so that you can see the Data Activator option which must be enabled. After this change has been made, you will see the option for Data Activator in the icon menu in the bottom right corner.

Fabric Admin Portal Tenant Settings for Data Activator

Monitoring Data with Data Activator

Data Activator is designed to process data either in real time or in batch and then do something with it. For example, let's say that you are responsible for ensuring that a set of refrigerators in a group of retail stores is operating within tolerance. Each refrigerator contains an IoT device which periodically sends the ID of the device and a temperature reading. If a temperature over 40 degrees Fahrenheit is recorded for 30 minutes, a work ticket should be created and a technician dispatched to solve the problem. If your data arrives in batches, you may decide that when your KPI numbers fall below thresholds you set, someone should investigate why. For example, if the number of yards of concrete poured yesterday was down significantly, perhaps there is something you could do to keep the issue from happening today. Reports will show you that these sorts of events occurred, and the person monitoring the values in the report is then supposed to take action. With Data Activator, you create triggers to initiate the action which should happen in response to the changes in data.
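Data Activator lets you define this kind of rule without writing any code, but to make the condition concrete, here is a minimal Python sketch of the logic such a trigger represents. The event fields and the over_threshold_for helper are hypothetical illustrations, not part of the Data Activator API.

```python
from datetime import datetime, timedelta

# Hypothetical event shape: each IoT reading carries a device ID,
# a timestamp, and a temperature in degrees Fahrenheit.
readings = [
    {"device_id": "fridge-12", "ts": datetime(2024, 1, 5, 9, 0), "temp_f": 41.2},
    {"device_id": "fridge-12", "ts": datetime(2024, 1, 5, 9, 15), "temp_f": 42.0},
    {"device_id": "fridge-12", "ts": datetime(2024, 1, 5, 9, 30), "temp_f": 43.5},
]

def over_threshold_for(readings, threshold_f=40.0, window=timedelta(minutes=30)):
    """Return True if every reading in the trailing window exceeds the threshold."""
    if not readings:
        return False
    latest = max(r["ts"] for r in readings)
    window_readings = [r for r in readings if r["ts"] >= latest - window]
    return all(r["temp_f"] > threshold_f for r in window_readings)

if over_threshold_for(readings):
    # In Data Activator this is where a trigger action (email, Teams message,
    # or Power Automate flow) would fire instead of a print statement.
    print("Create work ticket and dispatch a technician for fridge-12")
```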

Triggering Action

The first step for creating actions in Data Activator is to connect to your data source; then you can create a trigger. Trigger alerts are similar to the alerts you may have created on a Power BI dashboard, where you provide a measure, a time interval, a rule, and the workspace in which to save the trigger. Before you create the trigger, you need to create the elements which support it. In our example, where we are monitoring the freezer temperatures, we have a unique event ID which we are going to use to create an object, which we will call Fridge Alert. Once we create the object with the event ID, we will then add the properties temperature and location to it. Lastly, we will need a measure for the average temperature over 40 degrees for 15 minutes. We will call that new measure Chill Alert. We can see how many times that condition has occurred over the last 24 hours in the data pane. Now that we have an alert, we will create a trigger action which sends an email when it occurs. We fill out the email as shown here.

Data Activator Trigger Action Pane

We can also trigger a Power Automate action to create a dispatch event as well. Selecting the Start button will start the trigger's data evaluation. As you can see, Data Activator can help you easily monitor your data by completing some very simple steps. I hope that you check out this Fabric feature, as it really takes monitoring data to the next level.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Fabric Features not found in Azure Synapse or Power BI – Data Wrangler

The second element in my series on new Fabric features is Data Wrangler. Data Wrangler is an entirely new feature found inside of the Data Engineering and Machine Learning experiences of Fabric. It was created to help analyze data in a lakehouse using Spark and generated code. You may find that there is a lot of data in the data lake that you need to evaluate to determine how you might incorporate it into a data model. It is important to examine the data to evaluate what it contains. Is there anything missing? Is anything incorrectly typed? Is there bad data? There is an easy method to discover these issues with your data which uses some techniques commonly used by data scientists. Data Wrangler is used inside of notebooks in the Data Engineering or Machine Learning experiences, as the functionality does not exist within the Power BI experience.
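If you wanted to answer those questions by hand in a notebook, a quick profiling pass with pandas might look like the sketch below; the file path is just a placeholder, and Data Wrangler automates exactly these kinds of checks for you.

```python
import pandas as pd

# Placeholder path; in Fabric this would point at a file in your lakehouse.
df = pd.read_parquet("sample_datasets/public_holidays.parquet")

print(df.dtypes)              # are any columns incorrectly typed?
print(df.isna().sum())        # how many values are missing per column?
print(df.nunique())           # how many distinct values per column?
print(df.duplicated().sum())  # how many fully duplicated rows?
```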

Step-by-step Data Wrangling

The example I am going to walk through here uses the public holiday sample dataset, which you can load by clicking on the Use a sample button within the Data Engineering experience and selecting Public Holidays. Once you create a lakehouse with the sample data, you can follow along with the steps in this blog for doing your own data wrangling.

From within the Data Engineering experience, select Notebook from the + New dropdown menu to create a new notebook. You will need to add some data to the lakehouse inside of the new notebook. To do this, click on the green Add button in the lakehouse section of the screen to add your Public Holiday lakehouse. After the lakehouse is loaded, you will have two file folders. If you click on the one labelled Files, you will see two more folders. Click on the folder called sample_datasets, which makes it turn green, and you will see a list of files in the sample_datasets pane which opens up. If you click on the ellipsis menu next to public_holidays.parquet, you will see a popup window. In that window, select Load data and you will see two options, Spark and Pandas. The screen image shows what it looks like.

When Fabric was first released, the only option available for data wrangling was pandas, which is not Spark compliant. For more information regarding pandas, check out my blog on Koalas where I cover why you don't want to use pandas. I am very happy to see the Spark DataFrames option, as it is Spark compliant and can take advantage of Spark's superior memory and data management. However, all of the generated code is pandas, which is then converted to Spark; Data Wrangler uses pandas internally no matter which option you select. We now need to run the code which was generated for us so the data is loaded into a Spark DataFrame, by going to the Run menu and selecting Run all. This will start up the Spark cluster and run the code. You can look in the bottom left corner to see when it is complete.
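The exact cell Fabric generates for the Spark option will vary with your lakehouse, but it will be roughly equivalent to the sketch below. The path shown here is illustrative, and the spark session and display function are already defined for you in a Fabric notebook.

```python
# Illustrative version of the generated load cell; the actual path in your
# notebook will reference the parquet file as it is mounted from your lakehouse.
df = spark.read.parquet("Files/sample_datasets/public_holidays.parquet")
display(df)
```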

Once you see the data, you can start wrangling it. Go to the Data tab and select Transform DataFrame in Data Wrangler, then select the dataframe which was created for you, df.

Using Data Wrangler


In the Data Wrangler screen, which is pictured here, you will see several interesting items. The first is at the top of the screen, where it lets you know that the code shown is pandas, but that it will be converted to PySpark when you leave Data Wrangler. This is important because if you end up with code you want to use, you can take the generated code as-is and it will be Spark compliant. The top of the screen looks a lot like something you might see within Power Query if you have all the view items checked. Data Wrangler also shows missing and distinct counts as well as the top three values for each column. On the left side of the screen there is a list of tasks which will generate code for you. I am going to select the first two of them, starting with Drop duplicate rows, which you can see in the screenshot. This immediately generates a line of code to drop duplicate rows across all columns, but nothing happens until the Apply button is clicked, which I will do so the code is run. The applied steps are called cleaning steps, and they appear on the bottom left side of the screen. Even after a step is applied, it is not removed from the task list, which seems odd. I am going to select the next option, which is to drop missing values, and I am then prompted with a dropdown box to select which columns to check for missing values. I am going to pick all, then apply the change. At this point, since I completed the first two items, there is nothing missing because I dropped every row containing missing values, so the next two steps would do nothing. Clicking the option Preview code for all steps allows me to see the generated code.
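Your generated code will differ in its details, and Data Wrangler wraps the steps in its own function, but the two cleaning steps I applied correspond to pandas operations along these lines:

```python
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop duplicate rows across all columns
    df = df.drop_duplicates()
    # Drop rows with missing values in any column
    df = df.dropna()
    return df

# df_clean = clean_data(df)
```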
If I want to run the generated code every time I get this dataset, I want to add it to a notebook, so I will select Add code to notebook. In my notebook I want PySpark code, which is different from the pandas code that was generated; optionally I can add the pandas code instead, but I see no reason to do that. A popup window will appear showing me the code, at which point I can just click on the Add button. The generated code is added to my notebook, saving me the tedium of having to write common code like this.
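As a rough sketch of what the PySpark version of those same two steps looks like on a Spark DataFrame (again, the exact code Fabric inserts may differ):

```python
from pyspark.sql import DataFrame

def clean_data(df: DataFrame) -> DataFrame:
    # Drop duplicate rows across all columns
    df = df.dropDuplicates()
    # Drop rows with missing values in any column
    df = df.dropna()
    return df

# df_clean = clean_data(df)
```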

Data Wrangler is another automated tool which will write code for me, similar to what I might expect if I were using Copilot. Data Wrangler presents decision-making options, and the code is generated based on my choices about which option is best in a given scenario. I can always return to Data Wrangler to add other steps or make different selections from the operations I chose. This tool can be really useful, and I plan on using it to clean up my data in notebooks.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Fabric Features not found in Azure Synapse or Power BI – Direct Lake

With the general availability release of Fabric in November 2023, I am dedicating several posts to the features that are only in Fabric and not anywhere else. The first feature is Direct Lake. Direct Lake was created to address problems with Power BI Direct Query. Anyone who has used Direct Query knows what I am talking about. If you have implemented Direct Query, I am guessing you have run into one or all of these problems: managing the constant hits to the source database, which increase as you add users; user complaints about slow visuals; or the need to put Apply buttons on all of your visuals to help with speed. Direct Query is a great idea. Who wants to import a bunch of data into Power BI? Directly connecting to the database sounds like a better idea, until you learn that the data goes from Power BI to the database and back for each user, one query at a time, which means that Power BI must send more queries the more people access the reports. Users want to be able to access data quickly, have it scale well, and have access to the latest data.

Direct Lake Architecture

Direct Lake was designed to address the performance issues encountered by people who use Direct Query. With Direct Lake you still connect to the existing data source live, but unlike Direct Query, you are connecting not to a database but to the data lake in OneLake. The diagram below from Microsoft shows the Power BI architecture for Direct Query, Import, and of course Direct Lake.

Power BI Data Architecture for Semantic Data Modeling

Direct Lake Performance

Direct Lake uses the same underlying technology that Fabric does: an implementation of Delta Lake on top of parquet formatted files. Delta Lake is an open-source storage framework which works with Spark, and it allows parquet files to act more like databases, as it supports ACID transactions and DML operations which allow you to use common SQL statements to modify and access the data. Microsoft has taken the underlying technology it implemented in its data warehousing engine, Polaris, to improve the way querying works and make accessing data in OneLake faster.
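To make the ACID and DML point concrete, here is a minimal sketch, assuming a Spark notebook against a lakehouse with a Delta table named sales (the table name is only an example), of the kind of SQL statements Delta Lake supports directly on parquet-backed tables:

```python
# Minimal sketch of DML against a Delta table from a Spark notebook.
# The table name "sales" is hypothetical; in Fabric, lakehouse tables
# are already registered as Delta tables, and the spark session is predefined.
spark.sql("""
    UPDATE sales
    SET amount = amount * 1.1
    WHERE region = 'West'
""")

spark.sql("""
    DELETE FROM sales
    WHERE order_date < '2020-01-01'
""")

spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```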

Implementing Direct Lake

You need to have a premium SKU, either P or F, to be able to implement Direct Lake. Note that Premium Per User and the A SKUs commonly used for embedding do not have the ability to use Direct Lake. You also need to access the Admin Portal to ensure that the XMLA Endpoint is set to Read Write and that the integration setting Allow XMLA Endpoints and Analyze in Excel is set to on. You can then use a lakehouse in a Fabric premium workspace as your data source in Power BI by using it as the semantic model for a report you create in the Power BI service. Now you may look at this and think, well this is all well and good, but I don't update the data lake as fast as I do my database, so how am I going to make this work?

Mirroring

Mirroring provides the ability to create a read-only copy of your database in OneLake, which you can then use as a source for your Power BI reports. This was announced November 15 and should be releasing soon. The way this will work is that you will make a connection to the data source within Fabric, and a read-only copy of the database will be created in OneLake. There is no ETL required. Just make the connection and the data is available. I look forward to writing more about this feature when it is released, as it will make it possible to have real-time data access in Power BI.

Current Limitations of Direct Lake

As this is a new feature, there are some capabilities that are not there yet. Before you decide to use Direct Lake, you might want to understand some of the current limitations. The first is that there is no way to create a Direct Lake connection from within Power BI Desktop; you must create the reports that use it in the service. You cannot mix the lakehouse with existing models, as the semantic model must be sourced from a single lakehouse or data warehouse. You cannot use views created in the lakehouse as tables in your Power BI semantic model. Direct Lake cannot be used in embedded scenarios. You also cannot use calculated columns or tables. If you implement row level security (RLS), the model is switched to Direct Query. I imagine many of these limitations will be addressed over time, and when they are, I will be writing an update to this post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur