Data Quality, there is an App for that

On Wednesday, January 8, I will be giving a presentation of an interesting and potentially handy new feature of SQL Server 2012, Data Quality Services [DQS].  I discovered this interesting tool when I was studying for my 70-463 exam, which I did pretty well on and will be discussing in a post coming soon to a screen near you.  This tool is included as part of the standard release of SQL Server 2012.  It’s an application that given it’s usefulness will have a wide implementation in the future.

I will demonstrate how to use this tool in conjunction with third party reference data from the Azure Marketplace to validate some sample address information.  In order to assimilate ever increasing volumes of data, new applications and tools have been created to process it, but the underlying problem still remains.  The data needs to be complete and accurate.  While sometimes it is possible to improve your methods of gathering the data by say tweaking the input phone number field so that you only get valid number in a format you expect, such modifications are not always possible.  There are times where the data providers may benefit by sending more data, not more accurate data, as they are paid to deliver quantity and it is up to the recipient of the data to determine the quality. 

Microsoft has created two tools which they designed to improve the quality of collected data.  The first tool, which was released as part of SQL Server 2008 is Master Data Management.  I have yet to see or hear of anyone who is using this application, and outside of testing I have not used it either.  Master Data Management was designed to assist organizations by providing a framework which transfers ownership of data quality to subject matter experts who know the correct values.

In SQL Server 2012, Data Quality Services was added as a way to fix the data.  As part of the ongoing push to put Business Intelligence into the hands of users and away from IT, DQS has a very user friendly interface designed to develop a set of rules which are used to clean data.  Once the rules have been validated, you can take this knowledge and put it into an SSIS job to automate cleansing of larger quantities of data.  

As this is the first version of DQS, there are some things about it which are a little awkward.  An example of this is the two-step process required to install DQS.  When you install SQL Server 2012, and select DQS as an installation component it appears to be installed, but it isn’t.  After the SQL Server 2012 application is installed, then you have to run DQSInstaller.exe.  One of the other limitations of DQS is it is a bit on the slow side.  Microsoft doesn’t recommend it for millions of records, just a few thousand.  Even with these limitations, DQS can still provide a great deal of useful functionality which can be an important part of ensuring your production data is complete and correct.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Leave a Reply