Seven Demos for SQL Server 2012 Data Quality Services:

Here are the seven demos that I posted over the past few weeks:

Detecting invalid values in Price domain or unrealistic values in Height domain

How to standardize data using DQS

How to clean records using third-party reference data-sets from Azure Data Market

How to use regular expressions to validate records?

Cleaning records like “my company Inc.” and “my company incorporated” using Term Based Relations

How to use cross-domain rules using composite domains

Matching activity in Data Quality Services in action!

Matching activity in Data Quality Services in action!

In this post, we’ll see the matching activity in action. For demo purposes, I’ll be using data sets that I found via the Movies Sample & EIM tutorial.

Situation 1:

We have a list of movies, and we want to identify “matching” movie titles.

Solution 1:

Create a Matching Policy

1) Mapping the Domains:

2 Data Quality Services matching policy

2) Configuring the Matching Policy:

3 Data Quality Services matching policy

Note: You can have Matching Rules on more than one domain. I used one domain for demo purposes.

3) View Matching results:

4 Data Quality Services matching policy de duplication

4) Publish the KB (which stores the matching policy)

Once you have the matching policy, you can use this in a Data Quality Project:

5) The image below shows how I ran a Data Quality Project (with the matching activity):

5 Data Quality Project matching activity policy de duplication

Note: You can export the clean data set via the Data Quality Project.
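To make the idea of “matching” titles concrete, here is a minimal Python sketch. It is not how DQS computes matches internally. It assumes a simple string similarity from Python’s standard `difflib` module, a greedy clustering where the first record of each cluster acts as the pivot, and a hypothetical minimum score of 0.8. The sample titles are made up and are not taken from the Movies Sample data set.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two titles (case-insensitive)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def cluster_titles(titles: list[str], min_score: float = 0.8) -> list[list[str]]:
    """Greedy clustering: a title joins the first cluster whose pivot (first
    member) scores at least min_score; otherwise it starts a new cluster."""
    clusters: list[list[str]] = []
    for title in titles:
        for cluster in clusters:
            if similarity(cluster[0], title) >= min_score:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters


# Hypothetical sample titles, for illustration only.
titles = ["The Matrix", "the matrix ", "The Matrx", "Casablanca"]
clusters = cluster_titles(titles)
for number, members in enumerate(clusters, start=1):
    print(number, members)
```

Clusters with more than one member correspond to the “matched” records that a data steward would review; single-member clusters are unmatched.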

Situation 2:

We have a list of supplier names, and we want to identify “matching” supplier names.

Note that this situation shows how to use more than one domain to create a matching rule.

Solution 2:

Most of the steps remain the same as in Situation 1, but I want to show you the Matching Policy and the Matching Results.

Matching Policy:

6 supplier ID and name two domains in matching rule data quality services

Matching results:

7 supplier ID and name two domains in matching rule data quality services
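A rule that spans two domains can be pictured as a weighted score. The sketch below is only an illustration under assumptions: the weights (30% for the supplier ID, 70% for the supplier name), the field names and the sample records are all hypothetical, and `difflib` stands in for whatever similarity measure DQS actually uses.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive 0..1 similarity between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rule_score(rec_a: dict, rec_b: dict, weights: dict[str, float]) -> float:
    """Weighted sum of per-domain similarities; weights are expected to sum to 1."""
    return sum(
        weight * text_similarity(str(rec_a[domain]), str(rec_b[domain]))
        for domain, weight in weights.items()
    )


# Hypothetical weights and records, for illustration only.
WEIGHTS = {"supplier_id": 0.3, "supplier_name": 0.7}
a = {"supplier_id": "S-100", "supplier_name": "Contoso Ltd"}
b = {"supplier_id": "S-100", "supplier_name": "Contoso Ltd."}
c = {"supplier_id": "S-200", "supplier_name": "Fabrikam Inc"}

print(round(rule_score(a, b, WEIGHTS), 3))  # high score: likely the same supplier
print(round(rule_score(a, c, WEIGHTS), 3))  # low score: different suppliers
```

Two records would be treated as a match when the combined score reaches the minimum matching score chosen for the rule.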

Also, I want to show that the matching policy (DQS KB) that we created earlier can be used in Master Data Services too! For more details, check out the resource: DQS, MDS & SSIS for EIM

DQS MDS Data quality services and master data services

Conclusion:

In this post, we saw how DQS can be used to clean “matching” records. For a step-by-step tutorial, please refer to the Movies Sample & EIM tutorial.

Data Profiling and SQL Server 2012 Data Quality Services

Data profiling in Data Quality Services happens at the following stages:

1) While performing Knowledge Discovery activity

1a. In the Discover step:

1 knowledge discovery profiling data quality services sql server

1b. Also in the Manage Domain Values step:

1b knowledge discovery profiling data quality services sql server

While profiling gives you statistics at the various stages of the data cleansing or matching process, it is important to understand what you can do with them. Here are the statistics that we can gather during the knowledge discovery activity:

  • Newness
  • Uniqueness
  • Validity
  • Completeness
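To give a feel for what these four numbers mean, here is a rough Python sketch. The formulas are simplified assumptions of mine, not the exact definitions that the DQS profiler uses: completeness is the share of non-empty values, uniqueness is the share of distinct values, newness is the share of values not already known to the knowledge base, and validity is the share of values that pass a domain rule. The sample column, the set of known values and the rule are hypothetical.

```python
def profile(values, known_values, is_valid):
    """Return simplified profiling ratios (0..1) for one column of values."""
    total = len(values)
    if total == 0:
        return {"completeness": 0.0, "uniqueness": 0.0, "newness": 0.0, "validity": 0.0}
    # Treat None and the empty string as missing values.
    filled = [v for v in values if v not in (None, "")]
    return {
        "completeness": len(filled) / total,
        "uniqueness": len(set(filled)) / total,
        "newness": sum(1 for v in filled if v not in known_values) / total,
        "validity": sum(1 for v in filled if is_valid(v)) / total,
    }


# Hypothetical column of employee categories with one missing and one unknown value.
sample = ["A", "B", "B", None, "E"]
stats = profile(
    sample,
    known_values={"A", "B"},
    is_valid=lambda v: v in {"A", "B", "C", "D"},
)
print(stats)
```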

2) While performing the Cleansing activity:

2a. On the Cleansing step:

2 cleansing profiling data quality services sql server

2b. Also on the Manage and View Results step:

2b cleansing profiling data quality services sql server

Here the profiler gives you the following statistics:

  • Corrected values
  • Suggested Values
  • Completeness
  • Accuracy

Note the invalid records under “Source Statistics” on the left side. In this case, 3 records didn’t pass the domain rule.

3) While performing Matching Policy activity (Knowledge Base Management)

3a. Matching policy step:

3a matching policy data quality services microsoft sql

3b. Matching Results step:

3b matching policy data quality services microsoft sql

Here the profiler gives the following statistics:

  • Newness
  • Uniqueness
  • Number of clusters
  • % of matched and unmatched records
  • Avg, min & max cluster size

4) While performing Matching activity (Data Quality Project)

4a. Matching step:

4a matching activity data quality services microsoft sql

4b. Export step:

4b matching activity data quality services microsoft sql export step

Here the profiler gives the following statistics:

  • Newness
  • Uniqueness
  • Completeness
  • Number of clusters
  • % of matched and unmatched records
  • Avg, min & max cluster size
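The cluster-related statistics are simple to reason about once you have the clusters. The following Python sketch is an assumption about how such numbers could be derived, not the profiler’s actual implementation: it treats every record that belongs to a cluster as matched and every other record as unmatched. The cluster contents and the record count are hypothetical.

```python
def cluster_stats(clusters, total_records):
    """Summarize matching results: cluster count, matched share and cluster sizes."""
    sizes = [len(cluster) for cluster in clusters]
    matched = sum(sizes)
    matched_pct = 100.0 * matched / total_records if total_records else 0.0
    return {
        "clusters": len(sizes),
        "matched_pct": matched_pct,
        "unmatched_pct": 100.0 - matched_pct if total_records else 0.0,
        "avg_size": matched / len(sizes) if sizes else 0.0,
        "min_size": min(sizes, default=0),
        "max_size": max(sizes, default=0),
    }


# Hypothetical result: 10 source records, of which 5 fell into two clusters.
stats = cluster_stats([[1, 2, 3], [4, 5]], total_records=10)
print(stats)
```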

Conclusion:

In this post, I listed the statistics provided by Profiler while performing Knowledge Discovery, cleansing, matching policy and matching activity in SQL Server 2012 Data Quality Services.

 

Data Quality Services Composite Domains in action!

In this post, I’ll show you how composite domains can help you create cross-domain rules in Data Quality Services.

Scenario:

You have a data set of employee name, employee category and yearly salary. You want to validate the value in the yearly salary column based on the employee category. Here are the business rules:

Note: for the purpose of the demo, every number is a dollar.

Now, the rule in the Table can be read as:

If employee category is A then yearly salary should be greater than 100000 and less than 200000.

1 composite domains data quality services

Note: I have kept it simple for demo purposes.

Now here is our data set before we set out to validate it:

Employee Name     | Employee Category | Yearly Salary
Jon V Yang        | A                 | 127000
Eugene L Huang    | B                 | 90000
Ruben Torres      | C                 | 83000
Christy Zhu       | D                 | 70000
Elizabeth Johnson | A                 | 90000
Julio Ruiz        | C                 | 65000
Janet G Alvarez   | D                 | 43000
Marco Mehta       | B                 | 81000

*Names are taken from the AdventureWorks database. The values in the name and salary columns are purely fictional.
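Before moving to the DQS steps, here is a small Python sketch of what a cross-domain rule does with this data set. It only encodes the category A rule that is spelled out in the text (salary greater than 100000 and less than 200000). The rules for categories B, C and D live in the screenshot and are deliberately not guessed here, so those records come back as “no rule”. The function and variable names are my own and have nothing to do with the DQS client.

```python
# Exclusive lower and upper salary bounds per employee category.
# Only category A is stated in the text; B, C and D are intentionally omitted.
SALARY_RULES = {"A": (100_000, 200_000)}


def check_record(category: str, salary: int) -> str:
    """Return 'valid', 'invalid', or 'no rule' for one (category, salary) pair."""
    rule = SALARY_RULES.get(category)
    if rule is None:
        return "no rule"
    low, high = rule
    return "valid" if low < salary < high else "invalid"


records = [
    ("Jon V Yang", "A", 127000),
    ("Eugene L Huang", "B", 90000),
    ("Ruben Torres", "C", 83000),
    ("Christy Zhu", "D", 70000),
    ("Elizabeth Johnson", "A", 90000),
    ("Julio Ruiz", "C", 65000),
    ("Janet G Alvarez", "D", 43000),
    ("Marco Mehta", "B", 81000),
]

results = {name: check_record(category, salary) for name, category, salary in records}
for name, status in results.items():
    print(f"{name}: {status}")
```

The point is that the salary is never judged on its own: the same value of 90000 can only be called invalid once you know the employee is in category A.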

Solution:

This is just an overview; it is not covered in a step-by-step fashion:

1. Created a KB > Created three domains: Employee Category, Employee Name and Yearly Salary

2. Created a composite domain:

2 created a composite domain data quality services

3. Under the Composite Domain (CD) Rules tab:

I started by defining the rule for category A:

3 create composite domains rules data quality services

And I finished by specifying the business rules for all four categories:

4 create composite domains SQL server 2012

4. Published the KB

5. Created a new DQS project > Selected the KB created above

6. Selected the data source > Mapped the domains

7. I also selected the composite domain from the list of available composite domains:

5 view select composite domains data quality project

8. After seeing the cleansing statistics, I switched to the Invalid tab to see the records that didn’t match the rules:

6 composite domain invalid tab new tab corrected tab correct tab

9. By now, we have identified the records that do not match the rules. A data steward can now correct them if he/she wants to, or leave them as they are. Notice the Approve/Reject check boxes.

Note that you can update not only the yearly salary but also the employee category. So if you think that an employee has been wrongly categorized, you can change that.

10. After this, you can export the data set, which contains the records that match the business rules, and it is ready to be consumed!

Conclusion:

In this post, we saw how to create cross-domain rules using composite domains, with an example of Employee Category and Yearly Salary.

 

How to clean address records using third-party reference data-sets in Data Quality Services?

In this post, we’ll see how to clean address records using third party reference data-sets in Data Quality Services.

You need to perform the following steps to do so.

Note that this is not a step-by-step procedure; it is a high-level overview of how DQS cleans address records using third-party reference data sets:

1) Configure Data Quality Services to use reference data sets. Tutorial here: http://msdn.microsoft.com/en-us/library/hh213070.aspx

2) Go to datamarket.azure.com. I subscribed to “Address check – verify, correct, Geocode US and Canadian Addresses Data”; we’ll see how to use it in the next steps.

3) Note that, as of now, reference data sets can only come from the Azure Data Market. However, the MSDN thread http://social.msdn.microsoft.com/Forums/hu-HU/sqldataqualityservices/thread/750faef8-dd69-4a71-b0c1-18ca2f93d59d suggests that a future service update will let us provide our own (private/self-provided) reference data sets. So for now we have to connect to the Azure Data Market for reference data sets, and for the purpose of the demo, I connected to Melissa Data’s Address Check.

4) Now via DQS client, let’s create a Knowledge Base!

5) I created the following domains:

  • Address Line
  • City
  • State
  • Zip

And a composite domain, Full Address, which contains the domains Address Line, City, State and Zip.

1 SQL server 2012 Data Quality Services Domain Management

6) And for the composite domain Full Address – this is how you configure reference data-sets:

2 1 SQL server 2012 Data Quality Services reference data services

7) After creating the Knowledge Base, start a new DQS project. Here the cleansing happens at the composite domain level, and this is a demo of how DQS uses a third-party reference data set to classify records as suggested, new, invalid, corrected or correct:

You can see that DQS corrected a few records and states that the reason in this case was Melissa Data’s address verify reference data set:

SQL server 2012 Data Quality Services data quality project

That’s about it for this post. I hope it helps.

Difference between Term Based Relations and Domain Values in SQL Server 2012 Data Quality Services

In Data Quality Services, a knowledge base (KB) consists of domains. And a domain has: domain rules, term based relations, domain values and reference data.

So, what is the difference between Term Based Relations and Domain values – and when to use which? Here is the answer:

Domain Values                                         | Term Based Relations
It allows us to correct the entire value in a domain  | It allows us to correct a word/term within a value in a domain
E.g.                                                  | E.g.
USA -> United States                                  | John Senior -> John Sr.
US -> United States                                   | John Junior -> John Jr.
United States of America -> United States             | George Sr -> George Sr.
                                                      | Mister Smith -> Mr. Smith
Note that the entire value in the domain got changed. | Note that only PART of the domain value got changed.
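The same distinction can be shown in a few lines of Python. This is only a sketch of the two behaviors, not of how DQS implements them: a domain value correction is modeled as a lookup on the whole value, and a term based relation is modeled as a lookup on each whitespace-separated term. The dictionaries reuse the examples above; the function names are my own.

```python
# Whole-value corrections (Domain Values).
DOMAIN_VALUES = {
    "USA": "United States",
    "US": "United States",
    "United States of America": "United States",
}

# Term-level corrections (Term Based Relations).
TERM_RELATIONS = {
    "Senior": "Sr.",
    "Junior": "Jr.",
    "Sr": "Sr.",
    "Mister": "Mr.",
}


def correct_value(value: str) -> str:
    """Replace the value only when the ENTIRE value matches a known entry."""
    return DOMAIN_VALUES.get(value, value)


def correct_terms(value: str) -> str:
    """Replace individual terms inside the value and leave the rest untouched."""
    return " ".join(TERM_RELATIONS.get(term, term) for term in value.split())


print(correct_value("USA"))           # the whole value is replaced
print(correct_value("USA Today"))     # no whole-value match, so it is unchanged
print(correct_terms("John Senior"))   # only the term "Senior" is replaced
print(correct_terms("Mister Smith"))  # only the term "Mister" is replaced
```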

Further reading: Data Quality Services concepts: http://msdn.microsoft.com/en-us/library/hh213015.aspx

That’s about it for this post. I hope it helps.

 

Guest Blog: How we use the Fuzzy Lookup add-in in our company to solve data inconsistency problems:

This is a Guest Blog from Mantresh Jain.

About Mantresh Jain:

Mantresh Jain is a C-level executive at an SMB in the manufacturing domain, based in India. He has a bachelor’s degree from a business school, and he holds a special interest in how businesses can leverage the newest information technology tools to optimize business processes. He is working on a company-wide ERP implementation and is the single point of contact for the implementation process. He spends his free time on computer games of all kinds! Link with him here: http://www.linkedin.com/pub/mantresh-jain/43/562/749

How did they discover the Fuzzy Lookup add-in for Excel? (A write-up by Paras)

Some months ago, Mantresh approached me to see if I knew of any tool that would help him deal with “messy” data. On further questioning, I learned that:

- Messy data = lots of duplicates

- They use SQL Server Express and do NOT plan to upgrade to SQL Server editions that include Data Quality Services and/or Master Data Services. Remember the context here: they are a small and medium-sized business.

- They do use Excel, a lot!

- They do not have folks with “SQL” knowledge

With these requirements, I asked him to see if an add-in for Excel called “Fuzzy Lookup” would meet their needs. After trying it out, here’s Mantresh’s experience of using the Fuzzy Lookup add-in for Excel in their organization:

Summary:

In my company, we are implementing ERP software. I faced the problem of migrating data from two FoxPro-based applications to SQL (for the ERP).

More Details:

The two FoxPro applications worked independently from each other, and as a result each of them had its own separate database.

Let’s call them FX1 and FX2.

Now I wanted to import Account Master data from them into SQL. Here are the fields in our Account Master data:

Name, Address, Bank Details, Phone Number among other fields

Problem

Both systems had issues of data duplication and data inconsistency.

To give you an example, I faced the following problems:

1) FX1 had around 3500 entries and FX2 had 2400 entries.

Of the 3500 entries in FX1, around 2000 were the same as entries in FX2.

Likewise, FX2 had around 2000 entries that were the same as entries in FX1.

I wanted to import into SQL only the unique Account Master records gathered by “combining” the two systems.

Example:

FX1 has “VMS Industires” while FX2 has “V.M.S Industries”

Solution

The Fuzzy Lookup add-in for Excel.

Step 1) Import data from both databases into Excel.

Step 2) Use Fuzzy Lookup to find entries that match each other, based on the conditions that we select.

Step 3) It reorganizes the data as:

FX1 entry | 1st matching FX2 entry
          | 2nd matching FX2 entry

This is how we find duplicate entries and then clean our data set.
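For readers who want to see the idea outside of Excel, here is a rough Python sketch of the same kind of lookup. It is not the algorithm the Fuzzy Lookup add-in uses; it assumes a simple normalization (lowercase, punctuation removed) followed by `difflib.get_close_matches` from the standard library, with a hypothetical cutoff of 0.8. Only the “VMS Industires” / “V.M.S Industries” pair comes from the example above; the other names are made up.

```python
import re
from difflib import get_close_matches


def normalize(name: str) -> str:
    """Lowercase and drop punctuation so 'V.M.S' and 'VMS' compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()


def match_lists(fx1: list[str], fx2: list[str], cutoff: float = 0.8) -> dict[str, list[str]]:
    """Map each FX1 name to the FX2 names whose normalized form is close to it."""
    by_norm: dict[str, list[str]] = {}
    for name in fx2:
        by_norm.setdefault(normalize(name), []).append(name)

    result: dict[str, list[str]] = {}
    for name in fx1:
        close = get_close_matches(normalize(name), list(by_norm), n=3, cutoff=cutoff)
        result[name] = [original for key in close for original in by_norm[key]]
    return result


# "Acme Traders" and "Zenith Metals" are hypothetical entries.
fx1_names = ["VMS Industires", "Acme Traders"]
fx2_names = ["V.M.S Industries", "Zenith Metals"]

matches = match_lists(fx1_names, fx2_names)
for fx1_name, fx2_matches in matches.items():
    print(fx1_name, "->", fx2_matches)
```

An FX1 entry with an empty list has no counterpart in FX2 and can be imported as-is; an entry with matches needs a human decision about which spelling to keep.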

Benefit

If not for Fuzzy Lookup, I would have had to manually match each entry against every other, which would have taken an estimated 60 to 100 man-hours. With Fuzzy Lookup, we did the job in only 24 man-hours.

———————-

Conclusion by Paras:

Thanks Mantresh for sharing your experience!

And here’s a related post:
How to clean similar textual data in Excel via Fuzzy lookup add-in?

Data Quality Services: What does a locked Knowledge Base indicate?

In this blog post, we’ll see what it means for a knowledge base to be locked in Data Quality Services. A lock on the knowledge base indicates that it contains unsaved changes made while you or someone else was working on it.

In the Data Quality Client, here’s how a lock on the knowledge base looks:

data quality services knowledge base lock

And here are a few points about a locked knowledge base:

1) If you did not lock the knowledge base, you can open it only as read-only.

2) If you locked the knowledge base, you can open and edit it. The knowledge base opens in the state in which it was closed.

3) A user working on the knowledge base can unlock it by publishing it or by explicitly unlocking it.

4) By positioning the cursor on the knowledge base, you can see who locked it:

who user lock the knowledge base data quality

Conclusion:

In this blog post, we saw what a lock on a knowledge base in Data Quality Services means.

SQL Server Data Quality Services: How to test Domain Rules in the DQS client.

In the domain management activity of Data Quality Services, you can test the domain rules on sample data while creating the knowledge base. Here are more details:

1. Create a KB > Select “Domain Management” as the activity > Create a domain > Go to the Domain Rules section of the domain.

2. For the purpose of this blog post, I am going to create a domain rule named “Email Validation” that requires values to end with @ParasDoshi.com

3. Here’s the domain rule. Now click on the “Run the selected Domain Rule on Test data” icon:

domain management activity of Data Quality Services, you can test the domain rules on sample data while creating the Knowledge base

4. Now add some test data and then click on “Test the domain rule on all items”.

data quality services sql server test the domain rule

This way, you can build and test the domain rule while creating the knowledge base.
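If you want to sanity-check your test values before typing them into the DQS client, the “Email Validation” rule is easy to mimic in Python. This sketch is my own approximation of an “ends with” rule. It assumes the comparison ignores case and surrounding whitespace, which may not be how DQS evaluates it, and the test values are hypothetical.

```python
REQUIRED_SUFFIX = "@ParasDoshi.com"


def passes_email_rule(value: str) -> bool:
    """True when the value ends with the required suffix.

    Assumption: case and surrounding whitespace are ignored.
    """
    return value.strip().lower().endswith(REQUIRED_SUFFIX.lower())


# Hypothetical test data, similar to what you would add in the test dialog.
test_values = [
    "paras@ParasDoshi.com",
    "someone@example.com",
    "ParasDoshi.com@mail.org",
]

results = {value: passes_email_rule(value) for value in test_values}
for value, ok in results.items():
    print(f"{value}: {'valid' if ok else 'invalid'}")
```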

Conclusion:

In this blog post, we saw how to test a domain rule in the Data Quality Services client of SQL Server 2012.