In this blog – post, I’ll share a quick demo of how you can use Regular Expressions in Data Quality services to clean records. For the purpose of the demo, I’ll show how to clean perform a preliminary cleaning procedure on email records.
Before we begin, just a primer on Regular Expressions:
“a regular expression provides a concise and flexible means to match (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for “regular expression” include regex and regexp.” – source: WikiPedia Regular Expressions
Let’s take an example:
using Regular Expressions you can:
- check if a text starts with A or a
- check if a text contains the word “data”
- check if a text follows the following pattern: email@example.com.
- among other things..
Now, Note the power of Regular Expressions to check for a pattern like firstname.lastname@example.org – now can we not use it to perform a basic cleaning on email records? I think, we can! and so, let’s try that:
for the purpose of the blog post, let me quickly show how you can create a Domain in a Knowledge Base that uses Regular Expressions to clean email records:
1) Open Data Quality Client > New Knowledge Base > Domain Management > create a domain
2) Switch to Domain Rules Tab
3) create a new Domain domain rule which uses the Regular Expression: [A-Za-z0-9._-]+@[A-Za-z0-9.-]+.[A-Za-z] to perform a basic cleaning on email records:
4) You can use this Knowledge base to clean email records in Data Quality Records But for now, let’s test our Regular Expression > click Test Domain Rule > Enter few records > Test them!
Note: This is just a basic demo to show Regular Expressions in action. For cleaning Email records, there are other options like using third-party data sets or writing an advanced regular expression. The RegEx I shared was just meant for quick demo but if you want to read more about Regular Expressions used to clean emails then I will strongly recommend you to read email regular expressions that I got to know from Erik Eckhardt via the discussion on a Beyond Relational Tip that I had posted.
In this blog post, we saw how to do basic cleaning on email records using regular expressions in Data Quality Services