How to clean data using Regular Expressions in Data Quality Services?

In this blog – post, I’ll share a quick demo of how you can use Regular Expressions in Data Quality services to clean records. For the purpose of the demo, I’ll show how to clean perform a preliminary cleaning procedure on email records.

Before we begin, just a primer on Regular Expressions:

“a regular expression provides a concise and flexible means to match (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for “regular expression” include regex and regexp.” – source: WikiPedia Regular Expressions

Let’s take an example:

using Regular Expressions you can:

  • check if a text starts with A or a
  • check if a text contains the word “data”
  • check if a text follows the following pattern:  something@something.something.
  • among other things..

Now, Note the power of Regular Expressions to check for a pattern like something@something.com – now can we not use it to perform a basic cleaning on email records? I think, we can! and so, let’s try that:

for the purpose of the blog post, let me quickly show how you can create a Domain in a Knowledge Base that uses Regular Expressions to clean email records:

1) Open Data Quality Client > New Knowledge Base > Domain Management > create a domain

2) Switch to Domain Rules Tab

3) create a new Domain domain rule which uses the Regular Expression: [A-Za-z0-9._-]+@[A-Za-z0-9.-]+.[A-Za-z] to perform a basic cleaning on email records:

1 sql server data quality cleaning email regular expressions

4) You can use this Knowledge base to clean email records in Data Quality Records But for now,  let’s test our Regular Expression >  click Test Domain Rule > Enter few records > Test them!

2 sql server data quality cleaning email regular expressions test

Note: This is just a basic demo to show Regular Expressions in action. For cleaning Email records, there are other options like using third-party data sets or writing an advanced regular expression. The RegEx I shared was just meant for quick demo but if you want to read more about Regular Expressions used to clean emails then I will strongly recommend you to read email regular expressions that I got to know from via the discussion on a Beyond Relational Tip that I had posted.

Conclusion:

In this blog post, we saw how to do basic cleaning on email records using regular expressions in Data Quality Services

Related articles

5 thoughts on “How to clean data using Regular Expressions in Data Quality Services?

  1. Pingback: SQL Server 2012 Data Quality Services Term based Relation’s in action! | Paras Doshi – Blog

  2. Pingback: Data Quality Service’s Composite Domains in action! | Paras Doshi – Blog

  3. Pingback: Data Profiling and SQL Server 2012 Data Quality Services | Paras Doshi – Blog

  4. Pingback: Matching activity in Data Quality Services in action! | Paras Doshi – Blog

  5. Pingback: Seven Demo’s for SQL Server 2012 Data Quality Services: | Paras Doshi – Blog

Thank this author by sharing the article on social media. If you have any questions or comments, please leave a reply below:

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s