DQS – DIY Guide to Getting Started with Data Quality Services

SQLServerDBList

Power BI and SQL Server 2012 (and beyond) both have components Microsoft developed to shift the focus of Business Intelligence onto business users. Since there are more people who use Excel than have eaten at McDonald’s, you may know about Power BI and the data features in Excel, but have you heard of Data Quality Services [DQS]? If not, well now you have. To get started, if you don’t have Microsoft’s developer version of SQL Server 2012 or later, or access to an enterprise or BI server somewhere, you will need $59.95 to get a copy of SQL Server 2014 Developer Edition. Interestingly, Microsoft won’t let you buy the SQL Server 2012 Developer Edition, which I found out when I tried earlier in the year. Unfortunately, although Visual Studio is now free, you still have to pay for the Developer Edition of SQL Server.

After you have installed SQL Server and selected Data Quality Services during setup, one would assume DQS is installed. While that sounds like sterling logic, it is not correct. Here’s how to tell whether you have DQS. Open up SSMS on your computer and look at the list of databases, like I did here. Do you see any databases which start with DQS? No. That is because DQS hasn’t been installed yet. It sure looks like it has if you look at the SQL Server installer, which I have included below. I added the red boxes to highlight the fact that I really did select Data Quality Services when I installed.

If you don’t have Data Quality Services and the Data Quality Client installed in SQL Server like they are here, you will need to do that first, but that is only the first step. Once the install screen looks like the one pictured above, go to the Data Quality Services folder in SQL Server and select the SQL Server Data Quality Server Installer. After this package runs, which takes a while, you will finally get a screen letting you know the installation is complete.

DQSInstallSuccessfulScreen

After the DQS Server has installed, you will see that three databases have been added: DQS_Main, DQS_Projects and DQS_Staging_Data. Once these three databases are installed, you can start using the DQS Client.

DQSDBList
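If you would rather check from a query window than eyeball the database list in SSMS, this quick query does the same thing; no rows returned means the Data Quality Server installer has not been run yet:

 Select name
 from sys.databases
 where name like 'DQS%' --Should return DQS_Main, DQS_Projects and DQS_Staging_Data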

The DQS Client does not need to be installed on the server, and since I highly doubt most places will want their business users directly accessing the server, most of the time it won’t be.

Once your environment is set up, it’s time to start using it. For more information on how to use DQS, please listen to my presentation to the PASS BI virtual chapter on November 26. If you can’t make it, generally speaking it will be available on PASS BI’s YouTube channel after about a week. I sincerely hope you can make it. Let me know what you think of my presentation by posting feedback to my blog.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Balanced Data Distributor and Other Features in the Pack

BBDRunFull

Sometimes the basics aren’t enough. Car manufacturers know this; that’s why they have a base model and one with all the options. Let’s face it, most of the time you want all the options. Well, with SQL Server you can get the options and you don’t have to spend hours debating the cost of clear coat. But I digress. If you want all the options, you can have them right after you install SQL Server, and I am not talking about the cool CodePlex stuff, which I will have to bring up another time. The genuine Microsoft list of add-ons is available to you at no additional cost. When Microsoft releases SQL Server, just because you installed it doesn’t mean you have everything. Versions 2008 R2, 2012, 2012 SP1 and 2014 all have Feature Packs. Looking at the SQL Server 2014 Feature Pack list, there are a number of things which are about as useful as heated seats in the desert, like the DB2 or SAP drivers, which is probably why they are not automatically included as part of the release. When looking at the list, there is a very good chance many SSIS developers will be interested in some of them. For example, wouldn’t you want the option for the more powerful and fuel-efficient motor? I know I would. There is an item in the feature pack which provides that feature: the Balanced Data Distributor.

Threading the Memory since 2011

Balanced Data Distributor (or BDD for short) was first released as a new SSIS transform for 2008 and 2008 R2. It was designed to take advantage of multi-threaded hardware by spreading the data load, so that the data can be broken into chunks and processed at the same time rather than serially, starting at the beginning and loading a stream until it ends. This means you can go much faster on a single tank of gas, which is awesome. As you might imagine, processing more data at one time decreases the time needed to process it. This is really useful if you are using blocking transforms like a row-by-row Script Component or a Sort or Aggregate transform, which require SSIS to look at every single row before processing. Another situation where you might find it useful is if you have to write to a really slow output.

DIY of BDD –  Divide and Conquer

What the BDD does is divide the data stream into multiple threads. How many is determined by the developer: while you will not need to configure the component itself, you will need to copy the downstream tasks once per output to tell it how many ways to divide the work. Let me show you what I mean through a series of SSDT screen shots.

BBDfull

This screen shows that I have a Data Flow task where I am reading in 121,317 rows, then doing an Aggregate and a Sort, both of which are blocking transforms. This is just meant to be an example; with 121,317 records you probably won’t see that much of a performance improvement, but you get the idea.

In this screen shot you can see that I have added the BDD component, which I’ve highlighted in the SSIS toolbox so you will see where it shows up when the component is installed. You will also see that I copied the downstream tasks so that everything from the Aggregate on appears twice. What happens when this version is run?

BBDRun2

Check out the outputs underneath the BDD component. The number of records was split, but it isn’t an even split. There are 62,337 records on one side and 58,980 records on the other. The record counts in each output are determined by the component as the optimal number based on the available threads. You configure nothing; you just drag it onto the screen. Pretty cool, isn’t it?

This example shows how easy it is to speed up SSIS processing without a huge amount of effort, allowing you to drive laps of code at speeds you may not have thought possible, with the help of a free download. Have fun and let me know what kind of performance gains you see.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

When IsDate Does Not Work

IsDate is an often-used TSQL function for determining whether or not a field contains a date. I am sure there are a number of people who have used it successfully and may wonder what I could possibly be talking about. To maintain my technical bona fides I’ve included some code, which you can run inside SQL Server Management Studio.

 Declare @ValidDateTest as table ( Somefield varchar(15), Works bit)
 Insert into @validDateTest values ('Not a date' , 0)
 Insert into @validDateTest values ('4-1-2-14' , 0)
 Insert into @validDateTest values ('5-2-7' , 0)
 Insert into @validDateTest values ('2014.2.3' , 1)
 Insert into @validDateTest values ('08/02/10' , 0)
 Insert into @validDateTest values ('7/3/2015' , 1)
 Insert into @validDateTest values ('2014-3-14' , 1)
 Insert into @validDateTest values ('12-3-1' , 0)
 Insert into @validDateTest values ('14-3-4' , 0)
 Insert into @validDateTest values ('20140301' , 1)
 Insert into @validDateTest values ('201123' , 1)
 Insert into @validDateTest values ('2011204' , 0)
 Insert into @validDateTest values ('7/023/2015' , 0)
 Insert into @validDateTest values ('6/02/014' , 0)
 Insert into @validDateTest values ('003/02/014' , 0)
 Insert into @validDateTest values ('3/010/2014' , 0)
 Insert into @validDateTest values ('4/02/012' , 0)
Select case when isdate(somefield) = 0 then --False, not a date
 '1/1/1900'
 else Convert(datetime, somefield, 110)
 end as ConvertedDate , Somefield
 from @validdatetest
 where works = 1

It is obvious by looking at my example that some of the values inserted into the table variable @validDateTest are not dates. The code below the insert statements will Convert only the values which IsDate says are dates, not all of the values, as I have a where condition in the query. The values that will Convert without giving me an error have a works value of 1. If you run this query, you will see the difference between the values returned by IsDate and the values I have marked as working without returning an error.

 Select Somefield, isdate(somefield)as IsDateValue, works
 from @ValidDateTest

Here are the results

Covertresult

Now if you run the query listed above and take off the where condition

 Select case when isdate(somefield) = 0 then --False, not a date
 '1/1/1900'
 else Convert(datetime, somefield, 110)
 end as ConvertedDate , Somefield
 from @validdatetest

You’ll get this error
Msg 241, Level 16, State 1, Line 21
Conversion failed when converting date and/or time from character string.

Casting Around for a Solution to Bad IsDate

If you are fishing around for a solution, the simplicity of this one will probably make you smack your desk. Now ideally the source system should be fixed, as it really should be making sure that the dates are valid, but I can’t make that happen. I do, however, have a solution to resolve this problem: just Cast it. Cast and Convert are on the same page in Microsoft’s online help, and a lot of the time you might think they are synonymous, but they don’t work the same way.

 Select case when isdate(somefield) = 0 then --False, not a date
 '1/1/1900'
 else Cast (somefield as datetime)
 end as ConvertedDate , Somefield
 from @validdatetest

This works with no errors whatsoever. Simple. Easy. If you ever run into this, I hope this saves you some time.
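As an aside, if you are on SQL Server 2012 or later, Try_Convert will sidestep the problem entirely, since it returns NULL instead of throwing an error when a value won’t convert. Here is a minimal sketch, run in the same batch as the table variable above:

 Select Somefield,
 IsNull(Try_Convert(datetime, Somefield), '1/1/1900') as ConvertedDate --Null means it would not convert
 from @validdatetest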

Yours Always,
Ginger Grant
Data aficionado et SQL Raconteur

Introduction to Hadoop Presentation Follow-up

Thank you so much to everyone who was able to attend my webinar http://pragmaticworks.com/Training/FreeTraining/ViewWebinar/WebinarID/676 . (If you weren’t able to attend, you can always click on the link for a recording.)

It’s always hard to talk about Hadoop as the subject is so broad that there were a lot of things that I had to leave out, so it is fortunate that I have this blog to discuss the topics I wasn’t able to cover. I thought that I would take this time to respond to the questions I received.

Presentation Q & A

Do you need to Learn Java in order to develop with Hadoop?

No. If you wish to develop for Hadoop in the cloud with HDInsight, you have the option of developing with .NET. If you are working in the Linux environments, which is where a lot of Hadoop development happens, you will need to learn Java.

Do you know of any courses or sessions available where you can learn about Big Data or Hadoop?

My friend Josh Luedeman is going to be teaching an online class on Big Data next year. If you don’t want to wait that long, I recommend checking out a code camp in your area, such as Desert Code Camp, where they are offering courses in Azure, or SQL Saturday, especially the BI editions.

How do you recommend a person with a BI background in SQL get started in learning Hadoop and where can I get the VMs?

The two ways I recommend for a person with a BI background to get involved with Hadoop are either through a Hortonworks VM or in Microsoft’s Azure cloud with HDInsight. Hortonworks provides a VM, and Microsoft’s environment is hosted on their cloud. As the company Microsoft partnered with to develop their Hadoop offerings, Hortonworks has very good documentation targeted at people who have more of a Microsoft BI stack background. If you choose to go with HDInsight, there is a lot of really good documentation and video training available as well.

How do you compare Hadoop with the PDW?

While Hadoop and Microsoft’s PDW, which they now call APS, were both designed to handle big data, the approaches are wildly different. Microsoft built the APS to handle the larger data requirements of people who have structured data, mostly housed in SQL Server. Hadoop was developed in an open source environment to handle unstructured data.

How can I transfer data into HD Insight?

This is a great question, which I promise to devote an entire blog post to very soon. I’ll give you the Reader’s Digest version here. There are a number of ways you can transfer data into HDInsight. The first step is to transfer the data into the Azure cloud, which you can do via SSIS with a minor modification of the process I blogged about earlier here. The other methods you could use to transfer data are via secured FTP or by using PowerShell, in which case you will need to call the same REST API used to provision an HDInsight cluster. There is also a UI you can use within HDInsight to transfer data as well.

I really appreciate the interest in the Webinar.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

From the clouds, data – DIY Guide to using SSIS with Azure

My apologies to Jack Johnson for lifting his song title for this blog, but I couldn’t help it. I thought that it might be useful to discuss the How-Tos of data migration to the cloud, which reminded me of all the cloud songs I know. I always thought the cloud metaphor was sort of an odd name for a remote data center, but that is probably why I am not in the business of naming things. But I digress. Back to the topic at hand. Moving data from SSIS to databases hosted on Microsoft’s Azure cloud requires some different steps than just moving data around on premises. Since things in the cloud change quite quickly, I thought I would demonstrate what this currently looks like using SSDT in Visual Studio 2012.

Breaking through the Wall

Before getting started with SSDT, there are some configuration steps one needs to complete on Azure first. The Windows Azure SQL Database firewall must be set up to allow connections from your IP, or your data will never get to the cloud. There is an option on the database screen to “Connect to your database”. If the firewall is not open for the IP and port in use, you will get this message.

AzureDBScreenConfigFirewall-message

Selecting Yes is needed to be able to connect to the Azure database via SSIS.
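As an aside, once you can connect from somewhere, firewall rules can also be managed with T-SQL against the master database. A quick sketch, where the rule name and IP range are made-up example values:

 --Run this against the master database of your Azure SQL server
 Exec sp_set_firewall_rule @name = N'MyLaptop',
 @start_ip_address = '203.0.113.5', @end_ip_address = '203.0.113.5'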

Also, one might want to validate that the right drivers are loaded for SSDT as well. If you are running Visual Studio 2012 or 2013, no worries, as the drivers are already there, but for earlier versions new drivers may be required. If one is planning on loading data to an Azure SQL database, ODBC or ADO.Net are the connections to use. The old data connection standby, OLE DB, is going to be left in the toolbox like last year’s dress, as it won’t work for the cloud. Much like fashion, everything old is new again, so ODBC is once again the “It” connection. You can use ADO.Net too, but I won’t be using it here.

The next step in the process is getting the connection information needed to connect to the Azure database. Microsoft made this step quite easy. Look on the Azure Database screen where I’ve pasted a pink arrow. Click there.

 AzureDBScreenFull-EnabledCropped

 

This makes it so easy, as a screen pops up with all the connection information you need. You just need to copy the ODBC section and remember what your password is, as you will need to enter it.

AzureODBCConnectionInfo
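For reference, the string you copy looks something like the one below, where the server, database and user names are placeholders for your own values and the password has to be filled in:

 Driver={SQL Server Native Client 11.0};Server=tcp:yourserver.database.windows.net,1433;Database=yourdatabase;Uid=youruser@yourserver;Pwd={your_password};Encrypt=yes;Connection Timeout=30;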

While we are still in Azure, I thought it would be a good idea to display where the SSIS package we will be creating will be putting the data. Here’s the table structure where the data will be placed.

AzureDBScreenTable

 

Here’s the query screen showing that right now the table is empty.

AzureDBScreenTable-NoData

SSIS Package Transferring Data to Azure

After you have all of the information you need from Azure, it is a relatively simple thing to create an SSIS package that transfers data to my Azure database, with an OLE DB connection for my on premises database and an ODBC connection to Azure using the information copied from the Azure database connection screen.

AzureDBSSISRunning

Going back to Azure, you can see 19,972 rows were added.

AzureDBScreenTable-Data
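If you would rather verify the load from the query window than the portal grid, a quick count works; the table name here is just a stand-in for whichever table you loaded:

 Select Count(*) as RowsLoaded
 from dbo.YourTableName --Expect 19,972 rows for this example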

One word of caution, as you see here in the progress log, adding data can be a very slow process.

AzureDBSSISProgress

I highlighted the Elapsed time in red so that it would be easy to see that a simple file transfer took over two minutes.

Location, Location

One thing which is important to consider is where you are going to be moving your data. I demonstrated what I think may be the more common scenario, where the data is not in the cloud and you want to move it to the cloud. Microsoft refers to this as Hybrid Data Movement. Of course this may not be your case. If you are running SQL Server on a virtual machine in the cloud, it may make a lot more sense to run SSIS on that virtual machine. If that is the case, for optimal performance locate the SSIS VM in the same data center as the database, because otherwise, due to bandwidth and network latency, it will be slower. When transmitting data around the cloud, whether from on premises to the cloud or from one server to another in the cloud, you might want to consider compressing the data prior to sending, if at all possible, to decrease the size of the data being transmitted. It may be faster to extract the data you want on premises and transmit a compressed file to be applied on the cloud server. This can get to be more complicated, as it requires setting up a Secure FTP server to transmit the files, which then have to be applied. Natively SSIS doesn’t have a compression tool, but there are third party products, such as Task Factory, which will allow you to not only compress the output but send it to your VM via Secured FTP from within the SSIS package.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

SSIS Tuning – What Size Is My Row? Your DIY Buffer Sizing Guide

When looking to improve the performance of an SSIS package, one of the common recommendations listed in many places is to modify your buffers so that you can load as much data into a single buffer as possible. According to Microsoft, for optimal performance SSIS should be configured for maximum memory utilization by having buffers with as many rows as possible without exceeding the internal 100 MB limit. Ok, now that you know what Microsoft says about the topic, how does one put it into practice? I thought it might be nice to post the How-To, as it is a little complicated.

Buffer Tuning Parameters

There are a couple of default settings that are used for loading data into memory and they all need to line up correctly to keep data from being sent from memory to disk. In order to set the size appropriately, it is best to apply the Price Is Right Axiom.

DefaultBufferSize – the default is 10 MB. The maximum is 100 MB, which SSIS stores internally as MaxBufferSize. This value can never be any bigger than 100 MB, so size it appropriately for its environment.

DefaultBufferMaxRows – the default is 10,000 rows. This setting needs to be set the same way you win on the Price Is Right: the max buffer rows times the row size needs to be as close to DefaultBufferSize as possible without going over.

Size Matters

To figure out what to put in DefaultBufferMaxRows, one needs to know how big the rows are. Remember, to win, the value must be as close to DefaultBufferSize as possible without going over. If the buffer is sized too large, the package will be slower, as rows will be cached to disk, so make sure you don’t miss by even a little. How big is the row size? There are a number of ways of figuring this out. You can look at each field in the query SSIS is loading and, based on the data type, add up the sizes of all the fields. Fortunately that isn’t the only way to figure out the row size. If your data source is SQL Server, you are in luck, as the system tables can help determine the size. Here is a sample query, assuming your table is Person.Address, which you can run on the AdventureWorks database.

Exec sp_SpaceUsed 'Person.Address'

The Results are

name       rows     reserved    data       index_size   unused
Address    19614    5960 KB     2784 KB    2688 KB      488 KB

To calculate the row size, take the data size in bytes divided by the number of rows: 2784 KB * 1024 = 2,850,816 bytes, and 2,850,816 / 19,614 = 145.35, which rounds up to 146 bytes per row. If you set DefaultBufferSize to 100 MB, which is the maximum and what I recommend in most cases, the buffer size is 104,857,600 bytes. Buffer size / row size = DefaultBufferMaxRows: 104,857,600 / 146 = 718,202.7, so set DefaultBufferMaxRows to 718,202, rounding down so you don’t go over. If you want to work at the column level, you can get the same information by looking at sys.columns. By adding up the column lengths, it is relatively easy to figure out what the appropriate size of your buffer should be. One word of caution: I do not wish to imply that because the information is available on a per-table basis one should pick Table or View in the SSIS source. Au contraire. Always access the data by using a Select statement, as it performs better.
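If you want to let SQL Server do the column-level math for you, here is a minimal sketch using sys.columns, assuming Person.Address is the source table. Keep in mind max_length is the worst-case bytes per column, and it returns -1 for (max) columns, which would need to be handled separately:

 Declare @RowBytes int
 Declare @MaxBufferBytes int = 104857600 --The 100 MB ceiling

 --Worst-case bytes per row, added up from the column definitions
 Select @RowBytes = Sum(max_length)
 from sys.columns
 where object_id = Object_Id('Person.Address')

 Select @RowBytes as EstimatedRowBytes,
 @MaxBufferBytes / @RowBytes as SuggestedDefaultBufferMaxRows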

Validate

To ensure that you have improved the buffer performance, check it. After you are done with the settings, enable logging on the Data Flow task and select the BufferSizeTuning event to see how many rows are contained in each buffer.

Please feel free to drop me a line if you find this helpful.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Power Map – No longer only BI

Microsoft did something interesting with the licensing of Power Map: they changed their mind. Power Map was originally released as a preview tool as part of Power BI. Shortly after Power BI went live on February 10, Microsoft made this announcement: “On May 30, 2014 if you have the Power Map Preview installed, it will no longer work in any non-Office 365 subscription version of Excel.” You won’t see that text on their site anymore, but I had it on a slide for a previous presentation I did on Power BI. Time to update the slide, as things have changed since SQL Saturday Detroit, and I have updated my Power BI presentation for SQL Saturday Denver.

Do you have a License for That?

The preview for Power Map was originally available for Excel 2013 and Excel 2013 for Office 365. Microsoft’s message that came out after Power BI was officially released translated to: unless you have Office 365 and pay for a Power BI license, no Power Map for you. Well, that folded like a bad poker hand. If you look now at the Power Map website, it says this (I’ll quote it here to make sure I have the text if it changes again):

“If you have any subscription for Microsoft Office 365, you have access to Power Map for Excel as part of the self-service business intelligence tools.” Yea! But what if you have Excel 2013 on prem? You can use the preview forever. What does that mean? No new features, but Power Map won’t stop working. Here’s what the website says exactly: “Although feature and performance enhancements for Power Map will continue to be released for the Office 365 subscription plans, there are currently no future plans to update the downloadable Power Map Preview.” Now you don’t have to have purchased the separate license for Power BI to get Power Map. You can be all Powerful with Office 365 or Office 2013. (Note to Microsoft: Does everything have to have Power in Excel? I am starting to channel the Wizard of Oz.)

New Power Map Features

On September 9th, 2014, Microsoft released some new features for Power Map, which they do about once a month. One of the cool new features is the ability to add your own maps for things like the inside of buildings. If you want that to work, you might need to check your configuration. I’ve included a copy of my COM Add-Ins screen. If you can’t remember how to pull this up, check my previous post https://www.desertislesql.com/wordpress1/?p=178

PowerMapAddIn

You will notice there are two checkboxes for Power Map on the COM Add-Ins window because I was using the preview, and Office 365 automatically gave me the non-preview version of Power Map in an automatic Office 365 update. To get the cool new stuff, I must have Microsoft Power Map for Excel checked, not the Preview one. If you didn’t pay for Power BI with Office 365, you won’t be able to use the new features Microsoft adds every month.

Now if you have Excel 2013 or Office 365 and haven’t downloaded the preview yet, you still can, right here: http://www.microsoft.com/en-us/download/details.aspx?id=38395. Microsoft now calls the Power Map Preview an unsupported add-in, but all of the original features still work.

Mapping in Excel

If you have Excel 2013 or Office 365 and you don’t have the Power Map Preview installed, you can still use mapping tools for Excel reports. How is this possible? By inserting maps into Power View. The maps inside of Power View are very interactive, and unlike Power Map, these reports can be encoded as HTML 5 and displayed on your phone or tablet.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Powerful Stuff at SQL Saturday 331 – Denver

SQLSaturday 

This will be the fourth SQL Saturday this year where I have been honored to be selected as a presenter. I know it is an honor because I’ve also been turned down, which makes me work harder and keeps me humble. If you decided today that hanging out at the University of Denver next Saturday is something you want to do, you are out of luck: there is now a waitlist to attend, as SQL Saturday Denver is filled to capacity. It is really awesome to think that they have so many people interested in learning about SQL Server that they are going to have to turn some away. It says something about the great job that Steve Wake, Windy Martin and friends are doing in publicizing this event, too.

Power BI

I am going to be presenting on Power BI. As a lot of people haven’t yet seen it in action, I will be demonstrating its features and explaining the rather complicated licensing issues around Power BI. Power BI is mostly Excel, and I’ll be explaining what you can do in Excel 2013 without having to buy anything, and what you get if you do decide to buy Power BI. Because there are some things which won’t really fit in my presentation, I’ll be posting a few things here on Power BI as well.

SQL Community

I don’t get out as much as I would like, and I certainly don’t get to visit the places where a lot of people who read my blog are, like Brazil. If you are attending SQL Saturday Denver please come by and introduce yourself.  I will give you 10,000 reasons to come to my presentation, none of which I am going to say here as it will ruin the surprise.  I am looking forward to hearing presentations from the other speakers too as there are some great topics being covered. I hope to see you there.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Where to go to be in the Know

At the last SQL Server user group meeting, I got into a long conversation with someone about where to go to find good online training materials for people at various skill levels. Contrary to popular belief, I do spend time on the internet not on Twitter or shoe shopping, and I have come up with a list of sites where I think the training material is quite good, and also free. I included a lot of video content, as a lot of people find that easy to use for learning new things.

This is not meant to be an exclusive list, just the places I’ve found helpful most recently or have found myself visiting a lot. If you have some suggestions of places you think I should add, please send them to me.

Big Data

MSBIAcademy – This is a really good way to get up to speed on Big Data and other topics.

Apache – If you are interested in Hadoop, you will make your way here to get the latest releases and see what wacky name the latest new tool has.

Hortonworks – I found the training information on Hortonworks’ site to be very good at explaining things.

Learning Map for HDInsight and Azure – SQL Server data is moving to the cloud too and this is a great place to get up to speed.

Mostly SQL Server

A lot of these sites have crossover information too.

SQL University – This site has information for those people just getting started with the Microsoft stack to more advanced topics.

Pragmatic Works Training on the Ts – Every week on Tuesdays and Thursdays, Pragmatic Works provides free webinars on SQL Server and Big Data topics with some of the people who wrote the book (ok, lots of SQL books) about all kinds of SQL and Big Data stuff. Flip through the archives if you are working during the day and don’t have time to view them live.

Microsoft Virtual Academy – This is Microsoft’s site where they offer free training and you get points. I didn’t know that I wanted points before, but I do now.

Microsoft’s SSIS Tech Net Videos – The audio on these is often very lousy, but the content is pretty good. I am not sure how often these are updated, but you can find good best practice material here.

Ola Hallengren’s Site – If anyone ever calls you a DBA, you should know about this site.

CBT Nuggets on YouTube – If you can find anything on YouTube from CBT Nuggets, it probably won’t be a waste of time. This link is for information on SSIS. Be wary of some of the things posted on YouTube, as not everything there is correct or best practices and the quality can be marginal.

Channel Nine – Microsoft has some random-ish videos out here, some of which are really helpful.

SQLServerCentral – This is a great resource. Go create an account here as it is free and there is a treasure trove of information.

SQLPass – Last but certainly not least, check out all of the information archived on SQL PASS. They have a lot of virtual groups on a wide variety of SQL related topics. If you can’t attend when they are being held, the videos are available for later viewing on the website. The previous PASS Summit information is awesome. They have a YouTube channel as well, where you can find interesting things to watch.

SQL Saturday

All of the other stuff I mentioned is archived and available when you have a chance, but there is nothing like being able to ask resident experts about various stuff and network with other SQL Server people. Check out SQL Saturday to see when and where there is going to be an event near you. These events have gone worldwide, so it is very likely there will be an event near you sometime this year.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Inquiring Minds Want to Know about Improving Data Quality

Lost in the sauce of all the Power things released as part of Excel 2013, Microsoft snuck in an app that is pretty obscure, so I thought I’d shine some light on it. What’s the app? Inquire. Ok, be honest: have you heard of this yet? Inquire was previously known as SpreadsheetIQ, a product of Prodiance, which Microsoft bought on June 6, 2011.

In order to see and use Inquire, you will need to add it from the list of COM components shipped but not turned on in Excel. This is the same process you need to follow to add PowerPivot. Check out my previous blog post and follow the same steps you needed to add PowerPivot to add Inquire, which will add yet another tab to Excel. Why would you want to add Inquire? Well, it contains some neat tools to help you analyze what is going on in Excel workbooks. In addition to helping you improve your speed on major data analytics projects, it may help you increase your accuracy as well.

How accurate is your data?

Since more and more data is being stored in spreadsheets, people have started to study how accurate that data is. A recent study at the University of Hawaii found that spreadsheets are wrong most of the time. It is not surprising that tools like Inquire are being created to help improve accuracy. One of the ways Inquire helps improve data quality is by exposing the internals of Excel so they are easier to review, which makes it easier to see everything that is included within a workbook. These tools make it easier to analyze the contents all in one place.

Workbook Analysis

When you have selected this feature, Inquire will go through your Excel workbook and review the contents. The summary contains a lot of useful information, including things like how many cells contain formulas and how many cells contain errors. The summary also shows Very Hidden Sheets, which is something I didn’t even know you could create via VBA. Data connections, validation criteria, dependents and duplicate formulas are some of the many items that Inquire includes in the summary. If you accidentally put a minus sign in front of a cell and didn’t notice, Inquire will list the item under Negative Formulas. All of the numeric items you have stored as text are listed too. If you run the detail report, you will see the workbook sheet name and every cell address and formula where these numeric text fields are located, which is extremely handy when trying to correct this problem. All the information is stored in a report, which naturally is generated in an Excel spreadsheet for later review.

Understanding the connections

Inquire’s Workbook Relationship report shows a picture of how workbooks are related together.

Relationship Diagram

The Cell Relationship report does a similar diagram at the individual cell level. If you have two spreadsheets and you want to know what the difference is between them on a field by field basis, Inquire will show the differences when you select the Compare Files button. If you just want to do some cleanup, there is a button for that too: removing excess formatting can help improve the performance of the spreadsheet by eliminating duplicate and extraneous formatting. Lastly, Inquire contains the ability to manage the passwords contained throughout the spreadsheet in one place.

With more and more data being stored in Excel, either to create Power BI analysis, Tabular prototyping, or just because Excel is your data store of necessity for now, it’s nice to see Microsoft adding tools to help ensure that you can beat the odds and produce an Excel spreadsheet where you can have greater confidence in the data stored within it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur