Visual analytics is amazing: it helps data enthusiasts save time answering questions with data. Let's see one such example. For the purposes of this blog post, I am going to show how to do it in Excel 2010:
Here's the business question: What were the sales of Tea in the North region in 2012 Q1?
Here’s the data:
SALES DATA (2012 Q1)
$7,348.00
$7,238.00
$1,543.00
$9,837.00
$1,823.00
$9,572.00
$8,235.00
$3,057.00
$8,934.00
$13,814.00
$5,782.00
$8,941.00
$9,235.00
$392.00
$1,268.00
$9,012.00
$2,590.00
$4,289.00
$7,848.00
$340.00
So it's easy to give the answer using the data: $8,934.
But let me CHANGE the business question:
WHICH Products in WHAT regions are doing the best?
Now this question is not as easy as the previous one. WHY? Because you'll have to manually go through each number in a linear fashion to answer it. Now imagine a bigger data set; it would take even more time.
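Incidentally, that linear scan is exactly what a script does well. Here's a minimal Python sketch over a hypothetically labeled version of the table above; the product/region labels are my assumptions, since the original worksheet's headers aren't reproduced here:

```python
# Hypothetical labels: the original product/region headers aren't shown
# above, so these pairs are made up for illustration.
sales = {
    ("Tea", "North"): 8934, ("Tea", "South"): 13814,
    ("Coffee", "North"): 7348, ("Coffee", "South"): 9572,
}

best = max(sales, key=sales.get)  # linear scan over every (product, region)
print(best, sales[best])          # ('Tea', 'South') 13814
```

The point stands either way: code or eyeballs, somebody has to walk every number unless the report itself makes the answer pop out.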
What can Excel power users and data enthusiasts do to answer the new business question efficiently? Well, let's see what conditional formatting can do:
Now, with the data bars, it's easier to just glance at the report and see the best-performing products and regions. For instance, it's very easy to spot that Tea is performing best in the South among all products and regions.
So how do you create data bars?
1. Select the data
2. Home > Conditional Formatting > Data Bars
3. Done! You'll see this:
4. You can play with the other options here to see what suits your needs best. But I just wanted to point out that there is a way for you to highlight the data that helps you save time in answering business questions.
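If you ever need to script this instead of clicking through the ribbon, the openpyxl Python library can apply the same kind of data-bar rule. A minimal sketch, with the file name, range, color, and sample values as arbitrary examples:

```python
from openpyxl import Workbook
from openpyxl.formatting.rule import DataBarRule

wb = Workbook()
ws = wb.active
ws["A1"] = "SALES DATA (2012 Q1)"
for row, value in enumerate([7348, 7238, 1543, 9837], start=2):
    ws.cell(row=row, column=1, value=value)

# Equivalent of Home > Conditional Formatting > Data Bars in the UI
rule = DataBarRule(start_type="min", end_type="max", color="638EC6")
ws.conditional_formatting.add("A2:A5", rule)
wb.save("sales_databars.xlsx")
```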
Visual analytics is a great way to quickly analyze data. In most cases, the human brain is much faster at interpreting visual results than text and numbers, so why not use that to your advantage? And tools like Excel have built-in functionality to help you do just that!
Business people may not realize that a Business Analytics project can involve significant effort on under-the-hood technical tasks like data cleaning, data integration, building a data warehouse, creating ETL processes, and gathering business requirements, among others. And that explains the title of this blog post: a Business Analytics project is like an iceberg. A business person may see only the tool used to visualize the data, and may not realize the work that went into making it "analytics-ready". From a project management standpoint, before a project is initiated, these different aspects of the project need to be communicated to the business stakeholders so that they know the effort involved in building an analytics solution. With that, here's the summary of this discussion in the form of an image:
One of the key things I've learned is the importance of differentiating between the concepts of "Data Reporting" and "Data Analysis". So let's first see them visually:
Here's the logic for putting Data Reporting INSIDE Data Analysis: if you need to do "analysis", then you need reports. But you do not necessarily have to do data analysis if you want to do data reporting.
From a process standpoint, here's how you can visualize Data Reporting and Data Analysis:
Let's think about this for a moment: why do we need "analysis"?
We need it because TOOLS are really great at generating data reports, but it takes a HUMAN BRAIN to translate those data points and reports into business insights. This process of looking at data points and translating them into business insights is the core of Data Analysis. Here's how it looks visually:
Note that after performing data analysis, we have information that creates business value: trends and insights, action items or recommendations, and the estimated impact on the business.
Data Reporting ≠ Data Analysis
I recently researched machine-generated data in the context of Big Data. Here's the list I compiled:
- Data sent from satellites
- Temperature-sensing devices
- Flood detection/sensing devices
- Web logs
- Location data
- Data collected by toll sensors (context: road tolls)
- Phone call records
And a futuristic one:
Imagine sensors on human bodies that continuously monitor health. What if we used them to detect diabetes, cancer, and other diseases in their early phases? Possible? Maybe!
Machines can generate data faster than humans, and this characteristic makes it interesting to think about how to analyze machine-generated data, in some cases in real time or near real time.
Search for "machine-generated data" and you'll find much more; it's worth reading about in the context of Big Data.
In this post, we'll see the matching activity in action. For demo purposes, I'll be using data sets that I found via the Movies Sample & EIM tutorial.
Situation 1: We have a list of movies and we want to identify "matching" movie titles.
Create a Matching Policy
1) Mapping the Domains:
2) Configuring the Matching Policy:
Note: You can have Matching Rules on more than one domain. I used one domain for demo purposes.
3) View Matching results:
4) Publish the KB (which stores the matching policy)
Once you have the matching policy, you can use this in a Data Quality Project:
5) Here's how I ran a Data Quality Project (with the matching activity); see the image below.
Note: You can export the cleaned data set via the Data Quality Project.
Situation 2: We have a list of supplier names and we want to identify "matching" supplier names.
Note that in this situation, you'll see how to use more than one domain to create a matching rule.
Most of the steps remain the same as in Situation 1, but I want to show you the Matching Policy and the Matching Results:
Also, I want to show that the matching policy (DQS KB) we created earlier can be used in Master Data Services too! For more details, check out the resource: DQS, MDS & SSIS for EIM
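As an aside, DQS doesn't expose its matching algorithm as code, but the shape of a two-domain matching rule is easy to sketch. Below is a rough Python illustration using difflib; the weights and threshold are made-up values standing in for the weights you assign to each domain in a rule, not DQS internals:

```python
from difflib import SequenceMatcher

# Made-up weights: supplier name counts more than city, echoing how a
# DQS matching rule assigns a weight to each domain.
WEIGHTS = {"supplier": 0.7, "city": 0.3}
THRESHOLD = 0.8  # arbitrary illustration of a minimum match score

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(rec1: dict, rec2: dict) -> float:
    """Weighted similarity across the two domains."""
    return sum(w * similarity(rec1[d], rec2[d]) for d, w in WEIGHTS.items())

a = {"supplier": "Contoso Ltd.", "city": "Seattle"}
b = {"supplier": "Contoso Limited", "city": "Seattle"}
score = match_score(a, b)
print(score, score >= THRESHOLD)  # records clearing the threshold cluster together
```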
In this post, we saw how DQS can be used to clean "matching" records. For a step-by-step tutorial, please refer to the Movies Sample & EIM tutorial.
Data profiling in Data Quality Services happens at the following stages:
1) While performing the Knowledge Discovery activity
1a. In the Discover step:
1b. Also in the Manage Domain Values step:
Profiling gives you statistics at various stages of the data cleaning and matching process, and it is important to understand what you can do with them. With that, here are the statistics we can garner during the knowledge discovery activity:
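As a rough illustration (not DQS's actual implementation), the completeness and uniqueness figures the profiler shows at this stage can be derived from a column like the made-up one below:

```python
from collections import Counter

# Hypothetical column values discovered for a "State" domain
values = ["WA", "WA", "CA", "C.A.", "OR", None, "WA"]

complete = [v for v in values if v is not None]
freq = Counter(complete)

print("records:", len(values))
print("completeness %:", round(100 * len(complete) / len(values), 1))
print("unique values:", len(freq))
print("value frequencies:", dict(freq))
```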
2) While performing the Cleansing activity:
2a. In the Cleansing step:
2b. Also in the Manage and View Results step:
Here the profiler gives you the following statistics:
- Corrected values
- Suggested Values
Note the invalid records under "Source Statistics" on the left side. In this case, 3 records didn't pass the domain rule.
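To make those source statistics concrete, here's a small sketch that tallies hypothetical record statuses the way the profiler summarizes them (the statuses below are made up, but the categories are DQS's):

```python
from collections import Counter

# Made-up cleansing output: one status per record, using DQS's categories
statuses = ["Correct", "Corrected", "Suggested", "Invalid", "Correct",
            "Invalid", "Invalid"]

counts = Counter(statuses)
print(counts["Invalid"], "records didn't pass a domain rule")  # 3
print({s: f"{100 * n / len(statuses):.0f}%" for s, n in counts.items()})
```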
3) While performing the Matching Policy activity (Knowledge Base Management)
3a. Matching Policy step:
3b. Matching Results step:
Here the profiler gives the following statistics:
- number of clusters
- % of matched and unmatched records
- avg, min & max cluster size
4) While performing the Matching activity (Data Quality Project)
4a. Matching step:
4b. Export step:
Here the profiler gives the following statistics:
- number of clusters
- % of matched and unmatched records
- avg, min & max cluster size
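These cluster statistics are easy to derive from any matching output. Here's a sketch over a made-up record-to-cluster assignment (None marks an unmatched record):

```python
from collections import Counter

# Made-up matching output: record id -> cluster id (None = unmatched)
assignments = {1: "A", 2: "A", 3: None, 4: "B", 5: "B", 6: "B", 7: None}

sizes = Counter(c for c in assignments.values() if c is not None)
matched = sum(sizes.values())
total = len(assignments)

print("number of clusters:", len(sizes))                         # 2
print("% matched:", round(100 * matched / total, 1))              # 71.4
print("% unmatched:", round(100 * (total - matched) / total, 1))  # 28.6
print("avg/min/max cluster size:",
      matched / len(sizes), min(sizes.values()), max(sizes.values()))
```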
In this post, I listed the statistics provided by the profiler while performing the knowledge discovery, cleansing, matching policy, and matching activities in SQL Server 2012 Data Quality Services.
In this post, we'll see how to clean address records using third-party reference data sets in Data Quality Services.
You need to perform the following steps to do so. Note that this is not a step-by-step procedure but a high-level overview of how DQS cleans address records using third-party reference data sets:
1) Configure Data Quality Services to use reference data sets. Tutorial here: http://msdn.microsoft.com/en-us/library/hh213070.aspx
2) Go to datamarket.azure.com and subscribe to a reference data set. I subscribed to "Address Check – Verify, Correct, Geocode US and Canadian Addresses Data"; we'll see how to use this in the next steps.
3) Note that, as of now, we can only use reference data sets from the Azure Data Market. However, the MSDN thread http://social.msdn.microsoft.com/Forums/hu-HU/sqldataqualityservices/thread/750faef8-dd69-4a71-b0c1-18ca2f93d59d suggests that we'll be able to provide our own (private/self-provided) reference data sets in upcoming service updates. So for now, we'll have to connect to the Azure Data Market for reference data sets, and for the purposes of this demo, I connected to Melissa Data's Address Check.
4) Now via DQS client, let’s create a Knowledge Base!
5) I've created the following domains:
- Address Line
- City
- State
- Zip
And a composite domain, Full Address, which contains the Address Line, City, State, and Zip domains.
6) And for the composite domain Full Address, this is how you configure the reference data set:
7) After creating the knowledge base, start a new DQS project. Here the cleansing happens at the composite domain level, and this is a demo of how DQS uses the third-party reference data set to classify records as Suggested, New, Invalid, Corrected, or Correct:
You can see that DQS corrected a few records and notes that the reason in this case was Melissa Data's Address Verify reference data set:
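DQS wires up the reference data provider for you, but conceptually the provider behaves like an address-verification web service. Here's a hedged sketch against a hypothetical REST endpoint; the URL and response fields are made up for illustration and are not Melissa Data's actual API:

```python
import requests

# Hypothetical endpoint and schema, for illustration only
API_URL = "https://example.com/address/verify"  # placeholder, not a real service

record = {"address_line": "1 Microsoft Wy", "city": "Redmond",
          "state": "WA", "zip": "98052"}

resp = requests.post(API_URL, json=record, timeout=10)
result = resp.json()  # e.g. {"status": "Corrected", "address_line": "1 Microsoft Way", ...}

# The service returns a verdict per record, much like DQS classifies
# records as Correct / Corrected / Suggested / New / Invalid.
print(result.get("status"), result)
```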
That’s about it for this post. I hope it helps.
I like to keep an eye on technology trends. One of the ways I do that is by subscribing to leading magazines; I may not always read the full articles, but I definitely read the headlines to see what the industry is talking about. Over the last 12 months or so, I have seen a lot of buzz around Big Data, and I thought it would be nice to see a trend line for it. Taking it a step further, I was also interested in whether there is a correlation between the growing trends for "Hadoop" and "Big Data". I also wanted to see how they compare with terms like Business Intelligence and Data Science. So I turned to Google Trends to quickly create a trend report and see the results.
Here’s the report:
Here are some observations:
1) There's a correlation between the trends for Big Data and Hadoop. In fact, it looks like growing interest in Hadoop fueled interest in "Big Data".
2) The trend lines for Big Data and Hadoop overtook that of Business Intelligence in October 2012 and September 2012, respectively.
3) There's a decline in the trend line for Business Intelligence.
4) There seems to be a steady increase in the trend lines for Business Analytics and Data Science.
And Here’s the Google Trend report URL: http://www.google.com/trends/explore#q=Big%20Data%2C%20Hadoop%2C%20Business%20Intelligence%2C%20Business%20Analytics%2C%20Data%20Science&cmpt=q
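If you'd rather pull the same data programmatically instead of using the website, the unofficial pytrends Python library (pip install pytrends) can fetch it. A minimal sketch, assuming that library's API:

```python
from pytrends.request import TrendReq  # unofficial Google Trends client

pytrends = TrendReq(hl="en-US")
pytrends.build_payload(
    ["Big Data", "Hadoop", "Business Intelligence",
     "Business Analytics", "Data Science"]
)
df = pytrends.interest_over_time()  # pandas DataFrame of relative interest, 0-100
print(df.tail())
```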
What do you think about these trends?
In this blog post, we'll see the three V's of Big Data, with examples:
1) Volume: terabytes, petabytes, even zettabytes of data get created.
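To put those units in perspective, a quick bit of arithmetic (decimal units):

```python
TB, PB, ZB = 10**12, 10**15, 10**21  # bytes
print(f"1 PB = {PB // TB:,} TB")     # 1,000 TB
print(f"1 ZB = {ZB // TB:,} TB")     # 1,000,000,000 TB
```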
2) Velocity: the speed at which information flows.
Example: 50 million tweets per day!
(This was back in November 2010; the number must have increased!)
UPDATE 23 Nov 2012: Wikipedia says 340 million tweets per day!
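That daily figure translates into a striking per-second rate:

```python
tweets_per_day = 340_000_000  # the Wikipedia figure cited above
print(f"{tweets_per_day / 86_400:,.0f} tweets per second")  # ~3,935
```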
3) Variety: all types of data are now being captured, whether in a structured format or not.
Example: text from PDFs, emails, social network updates, voice calls, web traffic logs, sensor data, click streams, etc.
And these may be followed by other V's, like V for Value.
In this blog post, we saw the three V's of Big Data, with examples.
[UPDATE 29 Dec 2012: One more whitepaper on the same subject, to be read after the tutorial: Cleanse and Match Master Data by Using EIM]
Here’s the resource: Tutorial: Enterprise Information Management using SSIS, MDS, and DQS Together
Here’s what you’ll learn:
- Learn to use the Data Quality Client to create a DQS knowledge base, cleanse data in an Excel file, and remove duplicate data from that file
- Learn to use the MDS Add-in for Excel and store the cleansed and matched data in MDS
- Learn how to use the MDS Web UI
- Learn to automate the process of receiving input data, cleansing and matching it, and storing the master data in MDS via SSIS!
That's about it! Check out the tutorial to learn about all the DQS, MDS & SSIS goodness!