SSIS: Using Data Profiling Task to check the candidate key profile of unknown data source(s)

As a part of Business Intelligence projects, we spend a significant amount in extracting, transforming and loading data from source systems. So it’s always helpful to know as much as you can about the data sources like NULLS, keys, statistics among other things. One of the things that I like to do if the data is unknown is to make sure that I get the candidate keys correct to make sure the key used can uniquely identify the rows in the data. It’s really helpful if you do this upfront because it would avoid a lot of duplicate value errors in your projects.

So here’s a quick tutorial on how you can check the candidate key profile using data profiling task in SSIS, You need to perform two main tasks:
1. Generate the xml file using the Data profiling task in SSIS
2. View the content of the xml file using the Data Profile Viewer Tool or using the Open Profile Viewer option in the Data Profiling task editor in SSIS.

Here are the steps:
1a. Open SQL Server Data Tools (Visual Studio/BIDS) and the SSIS project type
1b. Bring in Data Profiling Task on Control Flow
1c. Open the Data Profiler Task editor and configure the destination folder that the tasks uses to create the XML file. You can either create a new connection or use an existing one. If you use an existing connection, make sure that you are setting the OverwriteDestination property to True if you want the file to be overwritten at the destination.

1 SSIS Data Profiling Task Data Cleaning Candidate Key

1d. Click on Quick Profile to configure the data source for the data profiler task

2 SSIS Data Profiling Task Data Cleaning Candidate Key

1e. In the quick profile form, you’ll need to select the connection, table/view and also specify what you to need to computer. For candidate key profile, make sure that the candidate key profile box is checked.

3 SSIS Data Profiling Task Data Cleaning Candidate Key

1f. Run the Task and a XML file should be placed at the destination you specified in step 1C.

Now, It’s time to view what profiler captured.

2a. you can open “Data Profile Viewer” by searching for its name in the start button.

4 SSIS Data Profiling Task Data Cleaning Candidate Key

2b. once it opens up, click on open and browse to the xml file generated by the data profiling task.

5 SSIS Data Profiling Task Data Cleaning Candidate Key

2c. once the file opens up, you can the candidate key profiles.

6 SSIS Data Profiling Task Data Cleaning Candidate Key

2d. Alternatively, You can also open the data profile viewer from the “Data Profiling Task” in SSIS. Go to the Editor > Open Profile Viewer:

7 SSIS Data Profiling Task Data Cleaning Candidate Key

Conclusion:
In this post, you saw how to profile data using the Data Profiling Task in SSIS.

Data Profiling and SQL Server 2012 Data Quality Services

Data Profiling in Data Quality Services happens at following stages:

1) While performing Knowledge Discovery activity

1A: In the Discover step:

1 knowledge discovery profiling data quality services sql server

1b. Also in the manage domain values step:

1b knowledge discovery profiling data quality services sql server

While profiling gives you statistics at the various stages in the Data Cleaning or Matching process, it is important to understand what you can do with it. With that, Here are the statistics that we can garner at the knowledge discovery activity:

  • Newness
  • Uniqueness
  • Validity
  • Completeness

2) While Performing  Cleansing activity:

2A: on the cleansing step:

2 cleansing profiling data quality services sql server

2b: Also on the mange and view results step:

2b cleansing profiling data quality services sql server

Here the profiler gives you following statistics:

  • Corrected values
  • Suggested Values
  • Completeness
  • Accuracy

Note the Invalid records under the “source statistics” on left side. In this case 3 records didn’t pass the domain rule.

3) While performing Matching Policy activity (Knowledge Base Management)

3a. Matching policy step:

3a matching policy data quality services microsoft sql

3b. Matching Results step:

3b matching policy data quality services microsoft sql

Here the profiler gives following statistics:

  • newness
  • uniqueness
  • number of clusters
  • % of matched and unmatched records
  • avg, min & max cluster size

4) While performing Matching activity (Data Quality Project)

4a. Matching step:

4a matching activity data quality services microsoft sql

4b. Export step:

4b matching activity data quality services microsoft sql export step

Here Profiler gives following statistics:

  • Newness
  • uniqueness
  • completeness
  • number of clusters
  • % of matched and unmatched records
  • avg, min & max cluster size

Conclusion:

In this post, I listed the statistics provided by Profiler while performing Knowledge Discovery, cleansing, matching policy and matching activity in SQL Server 2012 Data Quality Services.

 

How to detect unrealistic or invalid values using Data Quality Services?

When you see a price of an item as -10 (negative 10) – you know it’s not right! It’s BAD data!

When you see some person’s height as 120 inches (10ft!) – you know that’s just not true! Again, BAD data!

It’s not uncommon to see such values in REAL data sets. In this post, we’ll see how SQL Server 2012’s Data Quality Services would help us clean unrealistic a.k.a invalid values in our datasets and bring us a step closer to a CLEAN data!

Situation:

You’re an analyst for professional basketball team Dallas Mavericks and You need a Player Roster of your Basketball players having CLEAN data for analysis purposes. Someone recently passed you a player roster, great! you have data!

Problem:

Some of the Data about Height of players just doesn’t look right! A player with named Dominique Jones had height of just 5 inches.

dallas mavericks player names vs height for data cleaning project

That’s just not right. So for now, you may clean this data-set manually by looking up right values but wouldn’t it be great if you had a Data Cleaning solution that flags unrealistic values for your review every time you get such data-sets? So the analyst decided to build a solution in Data Quality Services – Here are the steps:

Steps to create a Knowledge Base in DQS and clean the invalid values:

Now, First let’s create the Knowledge Base:

You decide to create a rule that checks for player heights that’s less than 5ft and greater than 7ft 6inches for your review. So here’s what the Knowledge Base would look like:

domain management knowledge base domain rules data sqlNow after the Knowledge base (KB) has been published, it’s time to use this KB for a Data Quality Project.

Note: This KB can be re-used for creating KB’s for other teams too.

Here’s the result of running the results on our Data – Set:

3 data cleaning of invalid height values sql server data qualityNote that it says that height has 3 invalid records. Let’s correct them!

In the next tab, you can see the invalid records:

iteratively clean invalid data sql server unrealistic data quality

You can correct this invalid values one by one and then approve them:

specify the valid values approve data quality sql

After that in the next steps, you can see the results – I have highlighted records that we cleaned!

corrected data set clean sql server data quality

And here’s our clean data-set:

dallas mavericks player names vs height cleaned dataset dec 12 2012

Conclusion:

Yay! An analyst of a professional basketball team was able to quickly clean a data-set. And he can re-use the knowledge base he had built to clean other data-sets too in the future! Isn’t that nice?

In this post, we saw how to clean unrealistic or invalid records in a data set by using domain rules in SQL Server 2012 Data Quality Services.