Notes from Atlanta SQL BI meeting: “Bus Matrix”

Bus matrix is a very important (if not the most important) topic for a Data Warehouse Developer/Architect. And it was great to re-visit and grow knowledge around this topic at Atlanta SQL BI meeting presented by Bill Anton (b|t).

Here are some of my notes:

  • Dimensional Modeling is more important than ever specially w/ the rise in self-service BI. it’s important to lay out data in an intuitive way to business users.
  • Dimensional Modeling is great! But at the same time it’s important to communicate this to business users. The solution doesn’t involve teaching ER diagrams to users, simplified designs, offloading mapping data to business processes work to business users. So, what’s the solution? That’s where Bus Matrix comes into picture!
  • Bus Matrix is a 2 dimensional intersection of Facts & Dimensions.
  • Bus matrix can not only help in communication w/ Business users, it can also be useful for IT in project planning and documentation purposes.
  • In the requirement gathering phase, a Bus Matrix could be a 2 dimensional intersection of Subject Areas (Sales, Inventory, etc) by Source Systems w/ priorities shown at the intersection. It’s great for road-mapping a BI/DW project.
  • Business Matrix is created during requirements gathering stage (subject area vs source systems), before development (Facts vs Dimensions) and needs to be kept updated over time. After the solution is live, it could be one of the important end-user documentation.

Also, Here’s a great resource page posted by the Presenter (includes his slides): http://byobi.com/blog/bus-matrix/ – check it out! it has some great resources in there.

Thank you Bill & Meeting organizers!

PASS Business Analytics VC’s Online Event: “Power BI Info Management and Data Stewardship”

Power BI is an exciting new technology in the business analytics space from Microsoft. I’ve played with its current preview version & attended couple of sessions on Power BI at PASS Summit 2013. Based on my first impression, I noted down Problems that Power BI solves. Note that as of today, it’s in preview & so information around cost is not availale yet but I try to learn and understand as much as I can Today about how Power BI is going to help business users & power users in the future. As a part of that, I’m attending Business Analytics VC’s session on “Power BI Info Management and Data Stewardship” by Matthew Roche & Ofer Ashkenazi on Nov 7th 12 PM EST.

Topic: Power BI Info Management and Data Stewardship

Date & Time: Nov 7th 12 PM EST

Here’s the Link to register: http://bit.ly/PASSBAVC

Topic Abstract:

“Business intelligence tools continue to improve, letting users shorten their time to insight and take that insight to more devices in more places. But this evolution of BI doesn’t change one fundamental fact of information management: You can’t gain insight from data you can’t access.

In this session, Matthew Roche and Ofer Ashkenazi will introduce the role of the data steward and the self-service information management capabilities included in Power Query and Power BI for Office 365, focusing on how Power BI empowers business users to add value to the organization.”

Closing note:

I recently volunteered at Business Analytics VC as VP of Marketing, so it’s in my interest to spread word about the event but I would not spread word about something unless it gets me personally excited about it! :)

I hope to see you at the session and for some reason if you can not make it, we usually record sessions & so you can check out the meeting archives section of the PASS BA VC site after the event.

Exploring, filtering and shaping web-based public data using Data Explorer Excel add-in:

Data Explorer let’s you “Explore” (search) for web-based public data. This is a great way to combine data that you may have in your data-sources with public data sources for data analysis purposes. Sometimes your data might not tell you the reason behind the observed trends, so when that happens – you can try to see if a public data-set might give you the much-needed context. Let me give you an Example before we start hands-on w/ data explorer so that you have better understanding of importance of public datasets. Here’s a sample that I found here. So, Here’s a demo:

An auto company is seeing sales trends of Hybrid cars and SUV’s from the sales data-sources. But what is the reason behind that? company data does not show that. Someone hypothesizes that it might be because of gas prices. So they test out the hypothesis by combining gas prices information available via public data. And turns out gas prices might be the driving force of sales trends! SEE:

if the gas prices increase, then the sale of SUV go down and the sale of Hybrids go up:

data analysis combine data with public datasets

You know that public data can be helpful! So how can you search for public data-sets? Well, You can manually search online, ask someone, browse through public data repositories like azure data market (and other data markets), there’s also a public data search engine! OR you can directly search for them via Data Explorer.

Here are the steps:

1) Excel 2010/2013 > Data Explorer Tab > Online Search > type “Tallest Buildings”

excel public data search data explorer2) I selected one of the data-sets that said “Tallest completed building…. “

excel data from internet

3) Now let’s do some filtering and shaping. Here are the requirements:

- Hide columns: Image, notes & key

- clean columns that has heights data

- Show only city name in location

OK, let’s get to this one by one!

4) Hiding Columns:

Click on Filter & Shape button from the Query Settings:

excel data shaping cleaning

Select Image Column > Right Click > Hide:

excel hide remove columns

Repeat the steps for notes & key column.

Click on DONE

5) clean column that has heights data.

Click on Filter & Shape to open the query editor

A) let’s rename it. Select column > Right Click > Rename to Height > press ENTER

B) let’s remove the values in brackets. Select Column > right click > split column > By delimiter > At each occurrence of the delimiter > Custom and enter “(” > OK

excel split a columnThis should transform the data like this:

excel data explorer split a column

Hide height.2 and rename the height.1 to height

Click on DONE

6) Let’s just have city names in the location column

click on Filter & shape to load query editor:

A) select location > right click > split column > by delimiter > Custom – Enter: ° in the text box like this:

an excel split by delimiter dataclick on OK

Hide Location.2, Location.3, Location.4 & Location.5

Select Location.1 > Right Click > Split Column > by Number of characters > Number of characters: 2 > Once, as far right as possible > OK

cleaning data in excel shaping filtering

Hide Location.1.2 and rename Location.1.1 to Location

One last thing! making sure that the data type of height is numbers.

Select height > change type > number

Also,

Select floors > change type > number

click on DONE. Here’s our filtered and shaped data!

filter data excel shape clean

7) LET”S VISUALIZE IT!

For the purpose of visualization I copied first 20 rows to a separate excel sheet and created a chart:

z excel data visualization

That’s about it for this post. Here are some related Posts on Data Explorer:
Unpivoting data using the data explorer preview for Excel 2010/2013
Merging/Joining datasets in Excel using Data Explorer add-in
Remove Duplicates in Excel Tables using Data Explorer Add-in
Web Scraping Tables using Excel add-in Data Explorer preview:

Your comments are very welcome!

How conditionally formatting your data in Excel can help you save time in answering business questions?

Visual analytics is amazing – it helps “data enthusiasts” save time in answering questions using Data. Let’s see one such example. For the purpose of the blog post, I am going to show how to do it in Excel 2010:

Problem:

Here’s the Business Question: What was sales of Tea in North Region in 2012 Q1

Here’s the data:

SALES DATA(2012 Q1)  East West Central North South
Coffee  $  7,348.00  $  7,238.00  $  1,543.00  $  9,837.00  $    1,823.00
Tea  $  9,572.00  $  8,235.00  $  3,057.00  $  8,934.00  $  13,814.00
Herbal Tea  $  5,782.00  $  8,941.00  $  9,235.00  $     392.00  $    1,268.00
Espresso  $  9,012.00  $  2,590.00  $  4,289.00  $  7,848.00  $       340.00

So it’s easy to give out answer using the data: $8934

But let me CHANGE the business question:

WHICH Products in WHAT regions are doing the best?

Now this questions is not as easy as the previous one? WHY? because you’ll have to manually go through each number in a linear fashion to answer the question. Now imagine a bigger data-set. It’ll take even more time.

Solution

What can Excel Power users and Data Enthusiasts do to answer the new business question in an efficient way? Well, let’s see what conditional formatting can do it:

Excel Visual Analytics Conditional formatting

Now with the Data Bars, it’s easier to just glance at the report and see best performing products and regions. For instance, it’s very easy to spot that Tea is performing best in South among all products and region.

So how do you create data bars?

1. Select the data

2. Home > Conditional Formatting > Data Bars

Excel Visual Analytics Conditional formatting 2

3.Done! you’ll see this:

Excel Visual Analytics Conditional formatting

4. You can play with other options here to see what suits the best for your needs. But I just wanted to point out that there is a way for you to highlight the data in a way that helps you save time in answering business questions using data

Conclusion:

Visual analytics is a great way to quickly analyze data. In most cases, Human brain is much faster at interpreting the visual results as oppose to text/numbers – so why not use it to your advantage. And tools like Excel have inbuilt functionality to help you do that!

Business Analytics project is like an iceberg:

Business persons may not realize that Business Analytics project may involve significant efforts for the under the hood technical tasks like Data Cleaning, Data Integration, Building-a-data-warehouse, creating ETL processes, gathering business requirements among other tasks. And that explains the title of this blog: Business Analytics project is like an iceberg. It’s because, a business person may just see the tool used to visualize data but may not realize the work that went into making it “analytics-ready”. From a project management standpoint – before a project is initiated, the discussion about this different aspects of the project need to communicated to the business stakeholders so that they are in the know of efforts involved in building an analytics solution. And with that, Here’s the summary of this discussion in form of an Image:

iceberg business analytics data cleaning business intelligence data warehouse

Related articles

Data Reporting ≠ Data Analysis

One of the key thing I’ve learned is importance of differentiating the concepts of “Data Reporting” and “Data Analysis”. So, let’s first see them visually:

data analysis and data reporting

Here’s the logic for putting Data Reporting INSIDE Data Analysis: if you need to do “analysis” then you need reports. But you do not have to necessarily do data analysis if you want to do data reporting.

From a process standpoint, Here’s how you can visualize Data Reporting and Data Analysis:

data analysis and data reporting process

Let’s thing about this for a moment: Why do we need “analysis”?

We need it because TOOLS are really great at generating data reports. But it requires a HUMAN BRAIN to translate those “data points/reports” into “business insights”. This process of seeing the data points and translating them into business insights is core of what is Data Analysis. Here’s how it looks visually:

Data analysis Data Reporting

Note after performing data analysis, we have information like Trends and Insights, Action items or Recommendations, Estimated impact on business that creates business value.

Conclusion:

Data Reporting ≠ Data Analysis

Examples of Machine Generated Data from “Big Data” perspective:

I just researched about Machine Generated Data from the context of “Big data”, Here’s the list I compiled:

- Data sent from Satellites

- Temperature sensing devices

- Flood Detection/Sensing devices

- web logs

- location data

- Data collected by Toll sensors (context: Road Toll)

- Phone call records

- Financial

And a Futuristic one:

Imagine sensors on human bodies that continuously “monitor” health. How about if we use them to detect diabetes/cancer/other-diseases in their early phases. Possible? May be!

Interesting Fact:

Machine can generate data “faster” than humans. This characteristics makes it interesting to think about to analyze machine generate data and in some cases, how to analyze them in real-time or near real-time

Ending Note:

Search for Machine Generated Data, you’ll be able to find much more, it’s worth reading about from the context of Big Data.

Thanks:

http://www.dbms2.com/2010/04/08/machine-generated-data-example/

http://en.wikipedia.org/wiki/Machine-generated_data

http://tdwi.org/articles/2012/10/23/machine-generated-big-data.aspx

Matching activity in Data Quality Services in action!

In this post, we’ll see matching activity in action. For the demo purpose, I’ll be using Data-Sets that I’ve found via Movies Sample & EIM tutorial .

Situation 1:

we’ve a list of movies and we want to identify “matching” movie titles.

Solution 1:

Create a Matching Policy

1) Mapping the Domains:

2 Data Quality Services matching policy

2) Configuring the Matching Policy:

3 Data Quality Services matching policy

Note: You can have Matching Rules on more than one domain. I used one domain for demo purposes.

3) View Matching results:

4 Data Quality Services matching policy de duplication

4) Publish the KB (which stores the matching policy)

Once you have the matching policy, you can use this in a Data Quality Project:

5) See How I ran a Data Quality Project (w/ matching activity) in the image below.

5 Data Quality Project matching activity policy de duplication

Note: You can export the clean data-set via Data Quality Project.

Situation 2:

we’ve a list of Supplier Names and we want to identify “matching” supplier names.

Note that in this situation, you would see how to use more than one domain to create a matching rule.

Solution 2:

Most of the steps would remain same as situation 1, But I want to show you Matching Policy & Matching Results

Matching Policy:

6 supplier ID and name two domains in matching rule data quality services

Matching results:

7 supplier ID and name two domains in matching rule data quality services

Also, I want to show that, the matching policy (DQS KB) that we created earlier can be used in Master Data Services too! For more details check out the resource: DQS, MDS & SSIS for EIM

DQS MDS Data quality services and master data services

Conclusion:

In this post, we saw how DQS can be used to clean “matching” records. For step by step tutorial, please refer to Movies Sample & EIM tutorial .

Data Profiling and SQL Server 2012 Data Quality Services

Data Profiling in Data Quality Services happens at following stages:

1) While performing Knowledge Discovery activity

1A: In the Discover step:

1 knowledge discovery profiling data quality services sql server

1b. Also in the manage domain values step:

1b knowledge discovery profiling data quality services sql server

While profiling gives you statistics at the various stages in the Data Cleaning or Matching process, it is important to understand what you can do with it. With that, Here are the statistics that we can garner at the knowledge discovery activity:

  • Newness
  • Uniqueness
  • Validity
  • Completeness

2) While Performing  Cleansing activity:

2A: on the cleansing step:

2 cleansing profiling data quality services sql server

2b: Also on the mange and view results step:

2b cleansing profiling data quality services sql server

Here the profiler gives you following statistics:

  • Corrected values
  • Suggested Values
  • Completeness
  • Accuracy

Note the Invalid records under the “source statistics” on left side. In this case 3 records didn’t pass the domain rule.

3) While performing Matching Policy activity (Knowledge Base Management)

3a. Matching policy step:

3a matching policy data quality services microsoft sql

3b. Matching Results step:

3b matching policy data quality services microsoft sql

Here the profiler gives following statistics:

  • newness
  • uniqueness
  • number of clusters
  • % of matched and unmatched records
  • avg, min & max cluster size

4) While performing Matching activity (Data Quality Project)

4a. Matching step:

4a matching activity data quality services microsoft sql

4b. Export step:

4b matching activity data quality services microsoft sql export step

Here Profiler gives following statistics:

  • Newness
  • uniqueness
  • completeness
  • number of clusters
  • % of matched and unmatched records
  • avg, min & max cluster size

Conclusion:

In this post, I listed the statistics provided by Profiler while performing Knowledge Discovery, cleansing, matching policy and matching activity in SQL Server 2012 Data Quality Services.

 

How to clean address records using third-party reference data-sets in Data Quality Services?

In this post, we’ll see how to clean address records using third party reference data-sets in Data Quality Services.

You need to perform following steps to do so

Note that they are not step by step procedures, a high level overview of how DQS works to clean address records using 3rd party reference data-sets:

1) Configure Data Quality Services to use reference data sets. Tutorial here: http://msdn.microsoft.com/en-us/library/hh213070.aspx

2) Go to datamarket.azure.com > and I subscribed to “Address check – verify, correct, Geocode US and canadian Addresses Data” ; we’ll see how to use this in next steps.

3) Note that as of now, we can only have refernece data-sets from Azure Data Market. However, the MSDN thread: http://social.msdn.microsoft.com/Forums/hu-HU/sqldataqualityservices/thread/750faef8-dd69-4a71-b0c1-18ca2f93d59d suggests that we’ll have an ability to provide our (private/self-provided) reference data-sets in next service updates. So for now we’ll have to connect to Azure data market for reference data-sets and for the purpose of the Demo, I connected to Melissa Data’s Address Check.

4) Now via DQS client, let’s create a Knowledge Base!

5) I’ve created following domains:

  • Address Line
  • City
  • State
  • Zip

And a composite domain: Full Address which contains domains: Address line, city, state and zip in the composite domains.

1 SQL server 2012 Data Quality Services Domain Management

6) And for the composite domain Full Address – this is how you configure reference data-sets:

2 1 SQL server 2012 Data Quality Services reference data services

7) After creating the Knowledge Base, start a new DQS project. Here the cleaning happens at the composite domain level and this is a demo of how DQS uses third party reference data-set to classify records as suggested, new, invalid, corrected or correct:

You can see that DQS corrected few records and mentions that the reason in this case was Melissa data’s address verify reference data-set:

SQL server 2012 Data Quality Services data quality project

That’s about it for this post. I hope it helps.