Presentation Follow up to Data Quality Issues? There is a Service for that

PASSBIThank you to all of you who were able to attend my latest presentation. I received a number of great questions during the presentation and wanted to make sure that I was able to provide answers. I also wanted to say thank you for the people who provided feedback in the question and answer section as that was very thoughtful and most appreciated.

Questions and Answers

I assume DQS is included with the existing SQL Server license? Is DQS included with enterprise edition? Can you provide some idea of what the licensing costs are? What are the licensing cost for Master Data Services?

Because I don’t work for Microsoft, I hesitate to answer any but the most basic questions regarding licensing, but I can say DQS and MDS are included in the licensing costs for SQL Server. Neither are available in the standard edition either. For more information about what is included, see here http://technet.microsoft.com/en-us/library/cc645993#Other_Components or contact Microsoft directly.

Do you know of any testing of scalability that would provide some insight into the scalability of DQS?

The best place I can point you to for more information on DQS scalability is here http://www.microsoft.com/en-us/download/details.aspx?id=29075

It seems that DQS appears similar to Fuzzy Lookups and Fuzzy Grouping. Is that pretty much what DQS is using?

While the logic employed in the SSIS components Fuzzy Lookup and Fuzzy Grouping are similar, but not functionally equivalent. DQS uses logic much closer to that of Fuzzy Grouping, as both apply a rules to a set of data and set similarity thresholds to determine success. Where they differ is in DQS’ use and development of the knowledge base to continually improve the accuracy.

How do you see DQS and MDS working together? What gets done where? How does DQS integrate with MDS? Do you use MDS to fix the data?

One of Microsoft’s definition of Master Data Management is “we define Master Data Management (MDM) as the technology, tools, and processes required to create and maintain consistent and accurate lists of master data.” MDS contains a framework for identifying and implementing workflows, security, business rules and auditing to the business data. DQS is part of that process that can be used by the MDS product to accomplish these tasks. To best illustrate how they work together, I’d recommend downloading their DQS and MDS tutorial. At a high level MDS enforces rules for cleaning the data and DQS can be implemented as a part of that process. For example before adding data to MDS, you can use DQS to clean the data prior to evaluation, but it’s use isn’t required.

In your SSIS demo, what additional steps would you recommend to have the data steward process the invalid data so that it can be reprocessed under SSIS?

Ideally, when data fails a quality process, the data should be placed in a review table so that it can be reviewed by the data steward. Having a person resolve the issues and add them to the knowledgebase will continually improve the knowledgebase over time.

When building a knowledgebase within DQS, can you include multiple columns for evaluation within one table?

DQS is not limited to evaluating one column at a time. You can use combinations of columns within a table to determine what the correct values are.

 

If you missed my session, a recording is available here , so please feel free to review the video when you have a chance.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

DQS – DIY Guide to Getting Started with Data Quality Services

SQLServerDBListPower BI and SQL Server 2012 (and beyond) both have components Microsoft developed to shift the focus of Business Intelligence onto business users. Since there are more people who use Excel than have eaten at McDonald’s, you may know about Power BI and the data features in Excel, but have you heard of Data Quality Services [DQS]? If not, well now you have. To get started, if you don’t have Microsoft’s developer version of SQL Server 2012 or later, or access to an enterprise or BI server somewhere, you will need $59.95 to get a copy of the Developer Edition SQL Server 2014. Interestingly Microsoft won’t let you buy the Developer Edition SQL Server 2012, which I found out when tried earlier in the year. Unfortunately, although Visual Studio is now free, you still have to pay for the Developer Edition of SQL Server.

After you have installed SQL Server and selected that you wanted Data Quality Services, one would assume you had installed it. While that sounds like sterling logic, it is not correct. Here’s how to tell if you have DQS. Open up SSMS on your computer and look at the list of databases like I did here. Do you see any databases here which start with DQS? No. That is because it hasn’t been installed yet. It sure looks like it is if you look at the sql installer, which I have included below. I added the red boxes to highlight the fact that I really did select Data Quality Services when I installed.

If you don’t have the Data Quality Services and Data Quality Client installed in SQL Server like they are here, you will need to do that first, but this is only the first step. Once the install screen looks like the one pictured above, you need to go to the Data Quality Services folder in SQL Server and select the SQL Server Data Quality Server Installer. After this package is run, which takes a while, you will finally get a screen that lets you know the installation is finally completed.

DQSInstallSuccessfulScreenAfter DQS Server has installed, you will see that 3 databases have been added: DQS_Main, DQS_Projects and DQS_Staging_Data. Once these three databases are installed, you can then start using the DQS Client.

DQSDBList

The DQS client does not need to be installed on a server. Since I highly doubt most places will want their business users to be directly accessing their Server, most of the time it will not be installed on the server.

Once your environment is set up, it’s time to start using it. For more information on how to use DQS, please listen to my presentation on the PASS BI virtual chapter on November26. If you can’t make it, generally speaking it will be available on PASS BI’s You Tube Channel after about a week. I sincerely hope you can make it. Let me know what you think of my presentation by posting feeback to my blog.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Inquiring Minds Want to Know about Improving Data Quality

Lost in the sauce of all things Power released as part of Excel 2013, Microsoft snuck in an app that is pretty obscure, so I thought I’d shine some light on it. What’s the app? Inquire. Ok, be honest have you heard of this yet? Microsoft bought Inquire which was previously known as SpreadsheetIQ before Microsoft bought Prodiance and their product on June 6, 2011.

In order to see and use the Inquire, you will need to add it from the list of Com components shipped but not turned on in Excel. This is the same process that you need to follow to add PowerPivot. Check out my previous post blog and follow the same steps you needed to add PowerPivot to add Inquire, which will add yet another tab to Excel. Why would you want to add Inquire? Well, it contains some neat tools to help you analyze what is going on in Excel worksheets. If you are unaware of these tools or are not sure how to use them, you might want to consider taking an excel course online to gain a deeper understanding of the tools and functions available in Microsoft Excel. In addition to helping you improve your speed while doing major data analytics projects, it may help you increase your accuracy as well.

How accurate is your data?

Since more and more data is being stored in spreadsheets people have been starting to study how accurate the data is. In a recent study at the University of Hawaii they have found that spreadsheets are wrong most of the time. It is not surprising that there are tools, like Inquire, being created to help accuracy improve. One of the ways Inquire helps improve data quality is to expose the internals of Excel so they are easier to review, which will make it easier to see what all is included within in Excel. These tools make it easier to analyze the contents all in one place. However, if you are extracting data from a website in a JSON file, you may not be able to use the various data analysis tools as Excel may not support such file formats. In such cases, you may need to look for an online JSON to XML converter, which may help you use extracted website data in Excel.

Workbook Analysis

When you have selected this feature, Inquire will go through your excel spreadsheet and review the contents. The summary contains a lot of useful information. In addition to including things like how many cells create formulas, as well as see how many cells contain errors. The summary also shows Very Hidden Sheets, which is something I didn’t even know you could create via VBA. Data Connections, Validation Criteria, Dependents, Duplicate formulas etc are some of the many items that Inquire includes in the summary. If you accidently put a minus sign in front of a cell and didn’t see notice, Inquire will list the item under Negative Formulas. All of the numeric items you have stored as text are listed too. If you run the detail report, you will see the workbook sheet name and every cell address and formula where these numeric text fields are located, which will be extremely handy when trying to correct this problem. All the information is stored in a report, which naturally is generated in an excel spreadsheet for later review.

Understanding the connections

Inquire’s Workbook Relationship report shows a picture of how workbooks are related together.

Relationship Diagram

The cell relationship does a similar diagram on the individual cell level. If you have two spreadsheets and you want to know what the difference is between the two spreadsheets on a field by field basis, Inquire will show the differences when you select the Compare files button. If you just want to do some cleanup, there is a button for that too. Removing excess formatting can help improve the performance of the spreadsheet by eliminating duplicate and extraneous formatting. Lastly, inquire contains the ability for you to manage your passwords contained throughout the spreadsheet in one place.

With more and more data being stored in excel, either to create PowerBI analysis, Tabular Prototyping or just because excel is your data store of necessity for now, its nice to see Microsoft adding tools to help ensure that you can beat the odds and produce an excel spreadsheet where you can have greater confidence in the data stored within it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

H.I.P.P.O or How decisions are made without data

Hippo with glasses  cartoonOne of the reasons that I enjoy working with data, and especially the new analytical models is because when you have lots of data, you can analyze it to make decisions which may go against many misconceived preconceptions. One of the data projects I worked on in the past, gathered all of the call center data from the phone switches and matched it against the number of payments received from the people who were called. When the results were analyzed, the business chose to make different decisions than they had in the past. Previously the call centers were evaluated by the number of calls made, not what happened when someone was called. As a result in the change of the evaluation methodology, some call centers were closed, some managers were promoted, and other managers were fired. Absent data, decisions are made which can be called into question.

Decisions made without Data are really H.I.P.P.O.

In absence of having any data, decisions are often made using the H.I.P.P.O method, which stands for HIghest Paid Person’s Opinion. When data is gathered and displayed in a transparent manner, the managers knew they were underperforming and knew the consequences and were highly motivated to improve. HR felt confident that the reasons for letting people go were not going to be challenged, so they felt free to act as well.

Data Removes Ambiguity in the Decision Process

There are many examples where providing accurate clear data removes the questions people have regarding decisions. One other example which comes to mind was the question regarding the selection of speakers at the upcoming PASS Summit. As a disclaimer, I did not submit, so I was not surprised when I was not selected. PASS released the speaker feedback providing the data people needed to understand the criteria for acceptance. Gathering data and developing systems to accurately display it isn’t just a task undertaken because people like the technology, it is the method where transparency and decision making can be undertaken. So next time you are asked to create a report think about what you are really doing. It’s not an exercise in SSRS, you are providing tools which allow decisions to be made based on facts, not HIPPOs.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Quality, there is an App for that

On Wednesday, January 8, I will be giving a presentation of an interesting and potentially handy new feature of SQL Server 2012, Data Quality Services [DQS].  I discovered this interesting tool when I was studying for my 70-463 exam, which I did pretty well on and will be discussing in a post coming soon to a screen near you.  This tool is included as part of the standard release of SQL Server 2012.  It’s an application that given it’s usefulness will have a wide implementation in the future.

I will demonstrate how to use this tool in conjunction with third party reference data from the Azure Marketplace to validate some sample address information.  In order to assimilate ever increasing volumes of data, new applications and tools have been created to process it, but the underlying problem still remains.  The data needs to be complete and accurate.  While sometimes it is possible to improve your methods of gathering the data by say tweaking the input phone number field so that you only get valid number in a format you expect, such modifications are not always possible.  There are times where the data providers may benefit by sending more data, not more accurate data, as they are paid to deliver quantity and it is up to the recipient of the data to determine the quality. 

Microsoft has created two tools which they designed to improve the quality of collected data.  The first tool, which was released as part of SQL Server 2008 is Master Data Management.  I have yet to see or hear of anyone who is using this application, and outside of testing I have not used it either.  Master Data Management was designed to assist organizations by providing a framework which transfers ownership of data quality to subject matter experts who know the correct values.

In SQL Server 2012, Data Quality Services was added as a way to fix the data.  As part of the ongoing push to put Business Intelligence into the hands of users and away from IT, DQS has a very user friendly interface designed to develop a set of rules which are used to clean data.  Once the rules have been validated, you can take this knowledge and put it into an SSIS job to automate cleansing of larger quantities of data.  

As this is the first version of DQS, there are some things about it which are a little awkward.  An example of this is the two-step process required to install DQS.  When you install SQL Server 2012, and select DQS as an installation component it appears to be installed, but it isn’t.  After the SQL Server 2012 application is installed, then you have to run DQSInstaller.exe.  One of the other limitations of DQS is it is a bit on the slow side.  Microsoft doesn’t recommend it for millions of records, just a few thousand.  Even with these limitations, DQS can still provide a great deal of useful functionality which can be an important part of ensuring your production data is complete and correct.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur