How to clean data using Regular Expressions in Data Quality Services?

In this blog – post, I’ll share a quick demo of how you can use Regular Expressions in Data Quality services to clean records. For the purpose of the demo, I’ll show how to clean perform a preliminary cleaning procedure on email records.

Before we begin, just a primer on Regular Expressions:

“a regular expression provides a concise and flexible means to match (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for “regular expression” include regex and regexp.” – source: WikiPedia Regular Expressions

Let’s take an example:

using Regular Expressions you can:

  • check if a text starts with A or a
  • check if a text contains the word “data”
  • check if a text follows the following pattern:  something@something.something.
  • among other things..

Now, Note the power of Regular Expressions to check for a pattern like something@something.com – now can we not use it to perform a basic cleaning on email records? I think, we can! and so, let’s try that:

for the purpose of the blog post, let me quickly show how you can create a Domain in a Knowledge Base that uses Regular Expressions to clean email records:

1) Open Data Quality Client > New Knowledge Base > Domain Management > create a domain

2) Switch to Domain Rules Tab

3) create a new Domain domain rule which uses the Regular Expression: [A-Za-z0-9._-]+@[A-Za-z0-9.-]+.[A-Za-z] to perform a basic cleaning on email records:

1 sql server data quality cleaning email regular expressions

4) You can use this Knowledge base to clean email records in Data Quality Records But for now,  let’s test our Regular Expression >  click Test Domain Rule > Enter few records > Test them!

2 sql server data quality cleaning email regular expressions test

Note: This is just a basic demo to show Regular Expressions in action. For cleaning Email records, there are other options like using third-party data sets or writing an advanced regular expression. The RegEx I shared was just meant for quick demo but if you want to read more about Regular Expressions used to clean emails then I will strongly recommend you to read email regular expressions that I got to know from via the discussion on a Beyond Relational Tip that I had posted.

Conclusion:

In this blog post, we saw how to do basic cleaning on email records using regular expressions in Data Quality Services

Related articles

Data Quality Servies: What does a locked Knowledge Base indicate?

In this blog-post, we would see what does it mean to lock a knowledge base in Data Quality Services? So the lock on the Knowledge Base indicates that there are unsaved changes in the Knowledge base when you or someone else was working on it.

In the Data Quality Client, Here’s how a lock on the Knowledge Base looks:data quality services knowledge base lock

And here are few points for a locked knowledge base:

1) If you did not lock the Knowledge Base then you can open it in read-only only

2) if you locked the Knowledge base, you can open and edit it. The Knowledge base would be opened in the state that it was closed in.

3) A user working on the Knowledge base can unlock it by publishing it or by unlocking the knowledge base.

4) By positioning the cursor on the knowledge base – you can see who locked it:

who user lock the knowledge base data quality

Conclusion:

In this blog post, we saw what does a lock on a knowledge base in Data Quality Services mean?

How to Delete a Knowledge Base in SQL Server 2012 Data Quality Services?

A quick post on step – by – step to delete a Knowledge Base (KB) in SQL Server 2012 Data Quality Services (DQS):

1. Open the Data Quality Client

2. Click on “Open Knowledge Base”

SQL Server 2012 Data Quality Services delete a Knowledge Base

3. Select the KB > Right Mouse Click > Delete.

Note that it also allows you to Rename, Open and see properties of a Knowledge Base via this Knowledge Base Management UI.

SQL Server 2012 Data Quality Services knowledge base management

That’s about it for this post. I took me five minutes to figure how to delete a KB and so I thought I would document it if someone is not able to find this option as quick as they might have thought. Happy Data Cleaning!

How I think of what is KnowledgeBase in SQL Server 2012 Data Quality Services

This Blog Post is meant to share how I think of what is KnowledgeBase in SQL Server 2012 Data Quality Services (DQS) – In my mind, Knowledge Base (KB) captures:

1. WHAT needs cleaning

2. HOW to clean what needs cleaning.

Let’s dive a little deeper, In DQS – a Knowledge Base let’s you do three things: Knowledge Discovery, Domain Management & Matching Policy.

1. Knowledge Discovery:

This activity helps you find “WHAT” needs cleaning. DQS has inbuilt algorithms that helps in analyzing errors, inconsistencies and data quality issues in the sample data-set.

2. Domain management:

This activity helps in defining the rules that will be applied to create “HOW” to clean the data.

3. Matching Policy:

This activity helps in identifying “WHAT” needs to be De-Duplicated (De-Dup) and then it goes about helping create the “HOW” to De-DUP the data.

Conclusion:

In this short blog post, I shared how I think of what is Knowledge Base in SQL Server 2012 Data Quality Services. And here’s the official resource if you want to continue learning: DQS Knowledge Bases and Domains