How to solve common Data Quality Problems using Data Quality Services? (Part 1)

New Journal Article: First article of 2, where you will be able to see how you can use SQL Server 2012′s DQS to solve common data quality problems. http://bit.ly/172Kh5L

Topics covered:
- Data standardization
- Identifying and correcting unrealistic or invalid values
- Validation and correcting records using Regular Expressions

Read Here: How to solve common Data Quality Problems using Data Quality Services (Part 1)

How to view error(s) that occur during Master data Service’s Staging Process?

To view error(s) that occur during MDS’s staging process, we have two views: stg.viw_name_MemberErrorDetails & stg.viw_name_RelationshipErrorDetails. For the purpose of this blog post, let me show you how I (as an administrator) view information from stg.viw_name_MemberErrorDetails.

1) Name of the entity: supplier & Name of model is Suppliers

2) I imported data into staging table stg.supplier_leaf via SSIS

3) Now here’s how you can see the errors that occur during MDS’s staging process:

we can go to MDS web application to initiate the staging process > start the batch > after completion you can see the status as well as if it has any errors or not:

Master Data Services Errors4) Now if you see that there’s an error, then you can go to stg.viw_name_MemberErrorDetails to see what are the errors. In my case, I am going to run the query:

select * from stg.viw_supplier_MemberErrorDetails where Batch_ID=2

You can get the above query via MDS web application too:

Master Data Services SQL Server Staging Process

5) And as you can imagine, you can get access to this error data via SSIS (SQL Server Integration services) too. So if you have a workflow that a. Loads data to MDS and b. initiates the Batch process via Stored Procedure; then you can program it to get access to errors from the stg.viw_name_MemberErrorDetails & stg.viw_name_RelationshipErrorDetails tables.

SQL Server Integration Services and Master Data Services

That’s about it for this post! I hope this helps. Your comments are very welcome!

Microsoft® HDInsight Preview for Windows: How to use Sqoop to load data into HDFS from SQL Server?

In this post, we’ll see how to use Sqoop to load data into HDFS from SQL Server?

With that, here are the steps:

1. You have the Microsoft® HDInsight Preview for Windows Installed on your machine. Here’s a tutorial: Installing HDInsight (Microsoft’s Hadoop) on windows 7

2. Make sure that the Cluster is up & running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine

Did you get any “wait for cluster to start..” message? No? Great! Hopefully, all your services are working perfectly and you are good to go now!

3. Before we begin, decide on three things:

3a: Username and Password that Sqoop would use to login to the SQL Server database. If you create a new username and pasword, test it via SSMS before you proceed.

3b. select the table that you want to load into HDFS

In my case, it’s this table:

sql table to be loaded into hadoop hdfs from sql server3c: The target directory in HDFS. in my case I want it to be /user/data/sqoopstudent1

You can create by command: hadoop fs -mkdir /user/data/sqoopstudent1

[to learn about how to create directory, read: How to create a directory in Hadoop File System? ]

4. Now Let’s start the Hadoop Command Line (can you see the Icon on the Desktop? Yes? Great! Open that!)

5. Navigate to: c:\Hadoop\sqoop-1.4.2\bin>

*This path may change in future, but navigate to the bin folder under the SQOOP_HOME.

6. Run dir command to see various files under this directory.

sqoop list files under the HOMe directory import export

Also you can run sqoop help for more information on the command that we are about to run.

sqoop list of commands help

7. Now here’s the command to Load data from SQL Server to HDFS:

c:\Hadoop\sqoop-1.4.2\bin>sqoop import –connect “jdbc:sqlserver://localhost;dat
abase=UniversityDB;username=sqoop;password=**********” –table student –tar
get-dir /user/data/sqoopstudent1 -m 1

sqoop command to load data from sql server to hadoop file system

8. After successfully running the above command, let’s browse the file in HDFS!

sqoop see the content of the file

That’s about it for this post!

Thanks

Thanks Aviad Ezra who answered my question on this MSDN thread: An error while trying to use Sqoop on HDInsight to import data from SQL server to HDFS

Conclusion:

In this post, we saw how to load data into Hadoop from SQL Server using Sqoop (SQL Hadoop)

Related Articles:

SQL Server 2012 Data Quality Services Term based Relation’s in action!

In data quality services, a knowledge base (KB) consists of domains. And domains has: domain rules, term based relations, domain values and reference data. In this post, we’ll see a use case of Term based relations. But before we do that, you can consider reading: Difference between Term based relations and Domain values in SQL server 2012 Data Quality Services

Now, Here’s Term Based Relation’s in action

Scenario:

After data profiling you realize that there are records such as “my Company Inc.” and “my Company Incorporated” – so you set out to automatically find these mismatches in terms inside a value and correct them.

BEFORE cleaned data, the report showed that “my company Inc” revenue is less than that of Google:

Before data cleaning company names and revenues sql server

AFTER cleaned data, the report correctly shows that “my company Inc” revenue is more than that of Google:

After data cleaning company names and revenues sql server

Steps taken to clean data:

(Just an overview, not covered in a step by step fashion)

1. Created the Knowledge Base w/ Two domains Company Names & Revenue

2. Term Based Relations Tab of Company Names domain:

2 term based relations data quality sql server

3. Published the KB > Let’s create a DQS project

4. Mapped the domains:

3. DQS Project Mapping Domain Names

5. DQS cleaned following records:

4 term based relations inc incorporated corrected records

6. Exported the data and created a report out of clean data-set!

Business user is happy :)

conclusion:

In this post, we saw how to correct a word/term within a value in a domain. The example we used was Inc. , Inc and Incorporated . It can be used to correct terms like Jr. and Junior. Sr. and Senior. etc. Things like this are difficult to catch during data entry – But using Term Based Relations, a person who knows the Data can clean it so that it generates correct reports.

After all reports like this are of little to no use, are they? So Let’s NOT create confusing reports anymore.

Please note:

  • The revenue figures shown are just for demo purposes. I pulled up these numbers from Wikipedia. Please don’t make any financial decision based on these reports and if you do, I am not responsible for that.
  • The name “my Company Inc” is a fictional firm. It’s not any firm that I am/was associated with in past, future of present. It’s a fictional name!

And Writing disclaimers like this are NO fun – sucks the joy out of “Thinking out Loud” :)

How to clean data using Regular Expressions in Data Quality Services?

In this blog – post, I’ll share a quick demo of how you can use Regular Expressions in Data Quality services to clean records. For the purpose of the demo, I’ll show how to clean perform a preliminary cleaning procedure on email records.

Before we begin, just a primer on Regular Expressions:

“a regular expression provides a concise and flexible means to match (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for “regular expression” include regex and regexp.” – source: WikiPedia Regular Expressions

Let’s take an example:

using Regular Expressions you can:

  • check if a text starts with A or a
  • check if a text contains the word “data”
  • check if a text follows the following pattern:  something@something.something.
  • among other things..

Now, Note the power of Regular Expressions to check for a pattern like something@something.com – now can we not use it to perform a basic cleaning on email records? I think, we can! and so, let’s try that:

for the purpose of the blog post, let me quickly show how you can create a Domain in a Knowledge Base that uses Regular Expressions to clean email records:

1) Open Data Quality Client > New Knowledge Base > Domain Management > create a domain

2) Switch to Domain Rules Tab

3) create a new Domain domain rule which uses the Regular Expression: [A-Za-z0-9._-]+@[A-Za-z0-9.-]+.[A-Za-z] to perform a basic cleaning on email records:

1 sql server data quality cleaning email regular expressions

4) You can use this Knowledge base to clean email records in Data Quality Records But for now,  let’s test our Regular Expression >  click Test Domain Rule > Enter few records > Test them!

2 sql server data quality cleaning email regular expressions test

Note: This is just a basic demo to show Regular Expressions in action. For cleaning Email records, there are other options like using third-party data sets or writing an advanced regular expression. The RegEx I shared was just meant for quick demo but if you want to read more about Regular Expressions used to clean emails then I will strongly recommend you to read email regular expressions that I got to know from via the discussion on a Beyond Relational Tip that I had posted.

Conclusion:

In this blog post, we saw how to do basic cleaning on email records using regular expressions in Data Quality Services

Related articles

Difference between Term based relations and Domain values in SQL server 2012 Data Quality Services

In data quality services, a knowledge base (KB) consists of domains. And domains has: domain rules, term based relations, domain values and reference data.

So, what is the difference between Term Based Relations and Domain values – and when to use which? Here is the answer:

Domain Values Term Based Relations
It allows us to correct the entire value in a domain It allows us to correct a word/term within a value in a domain
E.g.

USA -> United States

US -> United States

United States of America -> United States

E.g.

John Senior -> John Sr.

John Junior -> John Jr.

George Sr -> George Sr.

Mister Smith -> Mr. Smith

Note that the entire value in the domain got changed. Note that only PART of the domain value got changed.

Further reading: Data Quality Services concepts: http://msdn.microsoft.com/en-us/library/hh213015.aspx

That’s about it for this post. I hope it helps.

 

How to standardize data using Data Quality Services?

I was playing with Data Quality Services when I though that it would be nice if I could implement what I had learned and create a quick demo that shows how to standardize data using the technology. So here it goes:

Note: This is not a step by step Guide – I am assuming that you are aware of different pieces of DQS. I am going to post a brief outline of how did I go about implementing a quick demo.

Technology used:

1) SQL Server 2012 Data Quality Services

2) SQL Server Integration services to show the DQS cleansing  component in action.

Problem:

We want to standardize to Gender to have only two values M & F – but now data coming from two files creates data inconsistency issues:

Two files that look like this:

File 1: (Excel)

1 sql server data quality services data source 1 excel file

File 2: (CSV)

2 sql server data quality services data source 2 csv file

Solution:

Let’s see how SQL server 2012′s data quality services would help us solve this problem:

1) I created a Knowledge base with a domain Gender and created domain values that looks like this:

3 data quality services domain management values sql server

2) Tested the Knowledge base quickly by creating a Data Quality Project. It worked!

3) Switched to SSIS and created a Data Flow which uses following components:

4 data quality services sql server integration services

What would be interesting to you to see mapping tab in the DQS cleansing transformation component, so here it is:

5 data quality SSIS data cleaning transformation

5) And let me show you the Target File – note that you can choose to NOT have Gender_Source, Gender_Status & Record Status. The column of interest is Gender_Output

6 standardized output SSIS gender data quality services

That’s about it for this post. I hope this helps!

Join #SQLPASS virtual chapters for free online #SQLServer learning!

SQLPASS virtual chapters (VC) provides free sqlserver training year-round:

If you are not signed up already, then consider signing up! With that, And here’s a Quick walk-through on how to Join a VC:

If you do not have a SQLPASS account:

a. Go to http://www.sqlpass.org/RegisterforSQLPASS.aspx

b. Fill up the required information and register

Now, After successful login/registration:

a. Go to http://www.sqlpass.org/MYPASS.aspx

b. switch to MyChapters section

c. Now under virtual chapters, you would see a list of virtual chapters. Join the one’s you are interested in!

my PASS my Chapter Azure VC

How to import data from Excel file to SQL Server:

In this post, we would see how to import data from Excel file to SQL Server using Import and Export Wizard, we would import data from four excel worksheets to SQL server. Here are the steps:

1) Right click the Target Database > Tasks > Import Data to open the SQL Server Import and export wizard

import data ssms sql server import export

2) select the Data Source. In this case, Excel file. You would select the worksheets to fetch data from later, for now – select the excel file as shown below:

import data ssms sql server import export

3) choose the Target Database:

choose the target database import and export systems

4) Select the tables (in our case- worksheets) or specify a query. For the purpose of this demo, we would go with “copy data from one or more table or views” option:

 specify table copy or query

5) Select the worksheets from the Excel file and also specify the tables in the target systems. If the tables in the target database are not already created, then would be created by this wizard:

sql server map source and targetsql server map source and target

6) You have the option of running this wizard immediately or you have the option of creating a SSIS package and then run it later. For the purpose of this post, we would select Run Immediately

7) Review the information and click Finish

sql server review the complete wizard

8) After successful execution:

8 sql server import export execution was successful

Conclusion:

In this blog post, we saw how to import data from Excel file to SQL server.