How to solve common Data Quality Problems using Data Quality Services? (Part 1)

New Journal Article: First article of 2, where you will be able to see how you can use SQL Server 2012’s DQS to solve common data quality problems. http://bit.ly/172Kh5L

Topics covered:
– Data standardization
– Identifying and correcting unrealistic or invalid values
– Validation and correcting records using Regular Expressions

Read Here: How to solve common Data Quality Problems using Data Quality Services (Part 1)

Seven Demo’s for SQL Server 2012 Data Quality Services:

Here are the seven Demo’s that I had posted over the past few weeks, listing them here:

Detecting invalid values in Price domain or unrealistic values in Height domain

How to standardize data using DQS

How to clean records using third-party reference data-sets from Azure Data Market

How to use regular expressions to validate records?

cleaning records like “my company Inc.” and “my company incorporated” using Term Based Relations

How to use cross-domain rules using composite domains

Matching activity in Data Quality Services in action!

Matching activity in Data Quality Services in action!

In this post, we’ll see matching activity in action. For the demo purpose, I’ll be using Data-Sets that I’ve found via Movies Sample & EIM tutorial .

Situation 1:

we’ve a list of movies and we want to identify “matching” movie titles.

Solution 1:

Create a Matching Policy

1) Mapping the Domains:

2 Data Quality Services matching policy

2) Configuring the Matching Policy:

3 Data Quality Services matching policy

Note: You can have Matching Rules on more than one domain. I used one domain for demo purposes.

3) View Matching results:

4 Data Quality Services matching policy de duplication

4) Publish the KB (which stores the matching policy)

Once you have the matching policy, you can use this in a Data Quality Project:

5) See How I ran a Data Quality Project (w/ matching activity) in the image below.

5 Data Quality Project matching activity policy de duplication

Note: You can export the clean data-set via Data Quality Project.

Situation 2:

we’ve a list of Supplier Names and we want to identify “matching” supplier names.

Note that in this situation, you would see how to use more than one domain to create a matching rule.

Solution 2:

Most of the steps would remain same as situation 1, But I want to show you Matching Policy & Matching Results

Matching Policy:

6 supplier ID and name two domains in matching rule data quality services

Matching results:

7 supplier ID and name two domains in matching rule data quality services

Also, I want to show that, the matching policy (DQS KB) that we created earlier can be used in Master Data Services too! For more details check out the resource: DQS, MDS & SSIS for EIM

DQS MDS Data quality services and master data services

Conclusion:

In this post, we saw how DQS can be used to clean “matching” records. For step by step tutorial, please refer to Movies Sample & EIM tutorial .

Data Quality Service’s Composite Domains in action!

In this post, I’ll show you how composite domains can help you create cross domain rules in Data Quality Services.

Scenario:

You have a data set of employee name, employee category and yearly salary. you want to validate the value in the yearly salary column based on the employee category. Here are the business rules:

Note: for the purpose of the demo, every number is a dollar.

Now, the rule in the Table can be read as:

If employee category is A then yearly salary should be greater than 100000 and less than 200000.

1 composite domains data quality services

Note: I have kept it simple for demo purposes.

Now here is our Data-Set before we set out to validate it:

Employee Name Employee Category Yearly Salary
Jon V Yang A 127000
Eugene L Huang B 90000
Ruben  Torres C 83000
Christy  Zhu D 70000
Elizabeth  Johnson A 90000
Julio  Ruiz C 65000
Janet G Alvarez D 43000
Marco  Mehta B 81000

*Names are taken from Adventure works database. The values in the names and salary column are purely fictional.

Solution:

It’s just an overview, It’s not covered in step by step fashion:

1. Create a KB > created three domains: Employee Category, Employee Name and Yearly Salarly

2. created a composite domain:

2 created a composite domain data quality services

3. Under Composite Domain (CD) Rules Tab:

I started out with defining the rules for category A:

3 create composite domains rules data quality services

And I completed w/ specifying business rules for all four categories

4 create composite domains SQL server 2012

4.  Published KB

5. Created a New DQS project > Selected the KB created above

6.  Selected the data source > Mapped domains

7. I also selected from the list of selected composite domains:

5 view select composite domains data quality project

8. After seeing the cleaning statistics, I switched to the invalid tab to see the records that didn’t match the record:

6 composite domain invalid tab new tab corrected tab correct tab

9. So by now, we have identified records that do not match the rules. A data steward can now correct them if he/she wants to or leave them as it is. Notice the Approve/reject check boxes.

Note that: Not only can you update the yearly salary but you can also update the employee category. So if you think that the employee has been wrongly categorized, you can change that.

10. After this, you can export the data-set which has records that match the business rules and the data-set would be ready to be consumed!

Conclusion:

In this post, we saw how to create cross domain rules using composite domains w/ an example of Employee Category and Yearly Salary.

 

How to clean address records using third-party reference data-sets in Data Quality Services?

In this post, we’ll see how to clean address records using third party reference data-sets in Data Quality Services.

You need to perform following steps to do so

Note that they are not step by step procedures, a high level overview of how DQS works to clean address records using 3rd party reference data-sets:

1) Configure Data Quality Services to use reference data sets. Tutorial here: http://msdn.microsoft.com/en-us/library/hh213070.aspx

2) Go to datamarket.azure.com > and I subscribed to “Address check – verify, correct, Geocode US and canadian Addresses Data” ; we’ll see how to use this in next steps.

3) Note that as of now, we can only have refernece data-sets from Azure Data Market. However, the MSDN thread: http://social.msdn.microsoft.com/Forums/hu-HU/sqldataqualityservices/thread/750faef8-dd69-4a71-b0c1-18ca2f93d59d suggests that we’ll have an ability to provide our (private/self-provided) reference data-sets in next service updates. So for now we’ll have to connect to Azure data market for reference data-sets and for the purpose of the Demo, I connected to Melissa Data’s Address Check.

4) Now via DQS client, let’s create a Knowledge Base!

5) I’ve created following domains:

  • Address Line
  • City
  • State
  • Zip

And a composite domain: Full Address which contains domains: Address line, city, state and zip in the composite domains.

1 SQL server 2012 Data Quality Services Domain Management

6) And for the composite domain Full Address – this is how you configure reference data-sets:

2 1 SQL server 2012 Data Quality Services reference data services

7) After creating the Knowledge Base, start a new DQS project. Here the cleaning happens at the composite domain level and this is a demo of how DQS uses third party reference data-set to classify records as suggested, new, invalid, corrected or correct:

You can see that DQS corrected few records and mentions that the reason in this case was Melissa data’s address verify reference data-set:

SQL server 2012 Data Quality Services data quality project

That’s about it for this post. I hope it helps.

Difference between Term based relations and Domain values in SQL server 2012 Data Quality Services

In data quality services, a knowledge base (KB) consists of domains. And domains has: domain rules, term based relations, domain values and reference data.

So, what is the difference between Term Based Relations and Domain values – and when to use which? Here is the answer:

Domain Values Term Based Relations
It allows us to correct the entire value in a domain It allows us to correct a word/term within a value in a domain
E.g.

USA -> United States

US -> United States

United States of America -> United States

E.g.

John Senior -> John Sr.

John Junior -> John Jr.

George Sr -> George Sr.

Mister Smith -> Mr. Smith

Note that the entire value in the domain got changed. Note that only PART of the domain value got changed.

Further reading: Data Quality Services concepts: http://msdn.microsoft.com/en-us/library/hh213015.aspx

That’s about it for this post. I hope it helps.

 

How to detect unrealistic or invalid values using Data Quality Services?

When you see a price of an item as -10 (negative 10) – you know it’s not right! It’s BAD data!

When you see some person’s height as 120 inches (10ft!) – you know that’s just not true! Again, BAD data!

It’s not uncommon to see such values in REAL data sets. In this post, we’ll see how SQL Server 2012’s Data Quality Services would help us clean unrealistic a.k.a invalid values in our datasets and bring us a step closer to a CLEAN data!

Situation:

You’re an analyst for professional basketball team Dallas Mavericks and You need a Player Roster of your Basketball players having CLEAN data for analysis purposes. Someone recently passed you a player roster, great! you have data!

Problem:

Some of the Data about Height of players just doesn’t look right! A player with named Dominique Jones had height of just 5 inches.

dallas mavericks player names vs height for data cleaning project

That’s just not right. So for now, you may clean this data-set manually by looking up right values but wouldn’t it be great if you had a Data Cleaning solution that flags unrealistic values for your review every time you get such data-sets? So the analyst decided to build a solution in Data Quality Services – Here are the steps:

Steps to create a Knowledge Base in DQS and clean the invalid values:

Now, First let’s create the Knowledge Base:

You decide to create a rule that checks for player heights that’s less than 5ft and greater than 7ft 6inches for your review. So here’s what the Knowledge Base would look like:

domain management knowledge base domain rules data sqlNow after the Knowledge base (KB) has been published, it’s time to use this KB for a Data Quality Project.

Note: This KB can be re-used for creating KB’s for other teams too.

Here’s the result of running the results on our Data – Set:

3 data cleaning of invalid height values sql server data qualityNote that it says that height has 3 invalid records. Let’s correct them!

In the next tab, you can see the invalid records:

iteratively clean invalid data sql server unrealistic data quality

You can correct this invalid values one by one and then approve them:

specify the valid values approve data quality sql

After that in the next steps, you can see the results – I have highlighted records that we cleaned!

corrected data set clean sql server data quality

And here’s our clean data-set:

dallas mavericks player names vs height cleaned dataset dec 12 2012

Conclusion:

Yay! An analyst of a professional basketball team was able to quickly clean a data-set. And he can re-use the knowledge base he had built to clean other data-sets too in the future! Isn’t that nice?

In this post, we saw how to clean unrealistic or invalid records in a data set by using domain rules in SQL Server 2012 Data Quality Services.

How to standardize data using Data Quality Services?

I was playing with Data Quality Services when I though that it would be nice if I could implement what I had learned and create a quick demo that shows how to standardize data using the technology. So here it goes:

Note: This is not a step by step Guide – I am assuming that you are aware of different pieces of DQS. I am going to post a brief outline of how did I go about implementing a quick demo.

Technology used:

1) SQL Server 2012 Data Quality Services

2) SQL Server Integration services to show the DQS cleansing  component in action.

Problem:

We want to standardize to Gender to have only two values M & F – but now data coming from two files creates data inconsistency issues:

Two files that look like this:

File 1: (Excel)

1 sql server data quality services data source 1 excel file

File 2: (CSV)

2 sql server data quality services data source 2 csv file

Solution:

Let’s see how SQL server 2012’s data quality services would help us solve this problem:

1) I created a Knowledge base with a domain Gender and created domain values that looks like this:

3 data quality services domain management values sql server

2) Tested the Knowledge base quickly by creating a Data Quality Project. It worked!

3) Switched to SSIS and created a Data Flow which uses following components:

4 data quality services sql server integration services

What would be interesting to you to see mapping tab in the DQS cleansing transformation component, so here it is:

5 data quality SSIS data cleaning transformation

5) And let me show you the Target File – note that you can choose to NOT have Gender_Source, Gender_Status & Record Status. The column of interest is Gender_Output

6 standardized output SSIS gender data quality services

That’s about it for this post. I hope this helps!

what are the end – to – end processes for developers to build out a Master Data Services 2012 solution?

I recently completed an end to end tutorial that taught me basics of Master Data Services 2012. And I thought I would take what I’ve learned and create a diagram for reference that helps developers see end – to – end processes that are involved in building out a SQL Server Master Data services 2012 solution. So here it is:

SQL Server 2012 Master data services steps

Resource: Learn to build an Enterprise Information Management system using Data Quality Services, Master Data Services and SQL Server Integration Services

[UPDATE 29 Dec 2012: One more whitepaper that's on the same subject and should be read after the tutorial: Cleanse and Match Master Data by Using EIM ]

Here’s the resource: Tutorial: Enterprise Information Management using SSIS, MDS, and DQS Together

Here’s what you’ll learn:

- Learn to use Data Quality client to create DQS Knowledge Base, cleanse data in an excel data, remove duplicated data from the excel file

DQS client domain values clean data data quality services

- Learn Use MDS add-in for Excel, store the cleansed and matched data in MDS

Master data services add in for excel

- Learn how to use the MDS Web UI:

SQL Server 2012 master data services MDS Web UI

- Learn to Automate the process of receiving input data, cleaning and matching it and storing the master data into MDS via SSIS!

SSIS automate DQS MDS data quality component

That’s about it! check out the Tutorial to learn about all the DQS, MDS & SSIS goodness!