Resource: Introduction to Data Science by Prof Bill Howe, UW

Introduction to Data Science course taught by Bill Howe just started on coursera platform. Having studied the Data Intensive Computing in Cloud course at UW taught by Prof Bill Howe, I can say that this course would be great resource too!

Check it out: https://www.coursera.org/course/datasci

Introduction to Data Science

PASS Business Analytics Conference Keynote Day #2

Dr. Steven Levitt’s (Indiana Jones of economics & Author of Freakonomics) work involves finding insights from data. In the keynote, he shared some of the interesting & fun insights that he found from data.

One Example: Dr. Levitt: According to the data, It was 7 times more dangerous to sell crack in Chicago than it was being in combat in Iraq. https://twitter.com/markvsql/status/322707949158006786

He also talked about other insights that he found which could also be found in his book Freakonomics. After getting audience fascinated about what analyzing data can do – he moved to his real world experiences of analyzing data for businesses. And tied all these fascinating insights back to some tips he had for the audience. Here is a brief recap of the tips he shared:

> “Ideas don’t come out of the blue. Almost always ideas come out of the data” – Dr. Steven Levitt

> “You guys are the future. What you’re doing is the key to a business’ success or failure.”

> Experiment & Test Hypothesis using DATA

> Misconceptions can cripple you. Let the data speak, even when it might be difficult

> Most important people = who understand and know what to do with data, not those who pretend they know the answer.

> Dr. Levitt: without data any biz will be left behind, must experiment and accept failure

*Above text is linked to tweets.

That’s about it for this post. What do you think about the tips that Dr Levitt shared?

 

Excel: Swapping (reversing) the Axis of a Table Data

Data preparation (or call it pre-processing) is an essential and time-consuming part of any data analytic’s project. To that end, I was working on a data set needed some changes before I could plot it on an effective data visualization. Here’s what I did:

My Challenge:

I was working on a data set that looked like this:

Date Abu Dhabi, United Arab Emirates Adalaj, Gujarat, India Addison, TX
1/1/2013 1 4 2
1/2/2013 1 4 2
1/3/2013 1 4 3
1/4/2013 3 3 3
1/5/2013 2 2 4
1/6/2013 2 3 4
1/7/2013 2 3 3
1/8/2013 2 2 4
1/9/2013 2 2 3

BUT: I wanted my data to look like

Date 1/1/2013 1/2/2013 1/3/2013 1/4/2013 1/5/2013 1/6/2013 1/7/2013 1/8/2013 1/9/2013
Abu Dhabi, United Arab Emirates 1 1 1 3 2 2 2 2 2
Adalaj, Gujarat, India 4 4 4 3 2 3 3 2 2
Addison, TX 2 2 3 3 4 4 3 4 3

What did my real data looked liked?

it has 380 columns and 500+ Rows and so MANUAL copy pasting was NOT an option!

Excel 2010 Solution:

It’s so simple!

Step 1: Select the data > COPY (Shortcut: ctrl + c)

Step 2: Switch to a new/different excel sheet

step 3: Paste Special > Transpose (T)

excel paste special transpose swap axis data

So After doing this, This is how the Input & output looks:

excel paste special reverse axis

Conclusion:

In this post, We saw how to swap or reverse the axis of a table data in Excel 2010.

An auto-refreshing Word Document that fetches data numbers from Excel:

In this post, I’ll show you how you can create a word document that:

1) Gets numbers from Excel

2) auto refreshes when the data gets changed/updated in Excel

Note: I am using Excel 2010 & Word 2010 for the purpose of this blog.

Here are the steps:

1. Scenario:

Let’s say we have a line in word that says “The revenue for year _____ was $ ______”

and we want to fetch data numbers for Year and revenue from Excel.

2. Let’s go!

here’s the data in Excel:

excel data numbers

3. In excel; copy A2 that has value 2012

4. switch to WORD:

now this is IMPORTANT. do NOT just paste it.

move your cursor to the destination position > right-click > paste options > Link and Merge formatting 

link and merge formatting

There’s also an option of link and keep source formatting, this is how it looks:

link and keep source formatting

5. repeat the steps for revenue

6. Close Word.

7. Open Excel and change the value of year to 2011 and revenue to 4000

8 Now open Word 2010, say yes if a dialog box pops up

And here you go:

linked word file to excel data

Conclusion:

In this post, we saw how to create an auto-refreshing word document that fetches data numbers from Excel.

How to add secondary axis in a chart in Excel 2010?

In this post, I’ll show you how you can add secondary axis in a chart in Excel 2010:

First up, Why do you need secondary axis?

Look at the Data and the chart that I just created:

creating a chart from Data excel 2010

Notice something? Column A is not “visible” – Is it? So Now let’s see the effect how adding a Secondary Axes:

secondary axis in a chart in excel 2010

Looks better? Here are the steps:

1. Select the chart

2. can you see charts tool options in the menu bar? Yes? Great!

3. Switch to Format Tab

4. select the column that you want to put on a secondary axes:

select column in format area data excel chart

5. After selecting the column, click on Format Selection

6. In the dialog box, select secondary axes:

secondary axis excel 2010 chartclick on close.

7. Can you see the secondary axis now? Great!

secondary axis in a chart in excel 2010

8. Let’s do one more thing!

Let’s change the chart type of the secondary axis. Here are the steps:

a. Make sure the column is selected from the format Tab

b. Switch to the Design Tab under the charts Tool

c. Click on change chart type and select the chart of your choice. I select a column chart and this is how it looks:

more than one chart type on a single chart in excel 2010

Conclusion:

In this post, we saw how to add a secondary axis. We also saw how to change the chart type of the secondary axis.

Statistics 101: Nominal, Ordinal, Interval, Ratio Data

If you work with any statistical analysis tool, sometimes you may have run into configuring the data into either of these following categories: Nominal, Ordinal, Interval, Ratio

Here is what each term means:

Nominal Simply names or call them set of characters Example: Full name, fruits, cars, etc
Ordinal Nominal + They have order Example: Small, medium, big
Interval Ordinal + the intervals between each value are equally split Example: temperature in Fahrenheit scale:10 20 30 etc

Note that 20F is not twice as cold as 40F. So multiplication does not make sense on Interval data. But addition and subtraction works. Which brings us to next point: Ratio

Ratio Interval + multiplication makes sense Weight: 60KG, 120KG.120 KG = 2 * 60 KG

I hope the examples are of help when you are working with statistical analysis tools and need to categorize the data.

How to view error(s) that occur during Master data Service’s Staging Process?

To view error(s) that occur during MDS’s staging process, we have two views: stg.viw_name_MemberErrorDetails & stg.viw_name_RelationshipErrorDetails. For the purpose of this blog post, let me show you how I (as an administrator) view information from stg.viw_name_MemberErrorDetails.

1) Name of the entity: supplier & Name of model is Suppliers

2) I imported data into staging table stg.supplier_leaf via SSIS

3) Now here’s how you can see the errors that occur during MDS’s staging process:

we can go to MDS web application to initiate the staging process > start the batch > after completion you can see the status as well as if it has any errors or not:

Master Data Services Errors4) Now if you see that there’s an error, then you can go to stg.viw_name_MemberErrorDetails to see what are the errors. In my case, I am going to run the query:

select * from stg.viw_supplier_MemberErrorDetails where Batch_ID=2

You can get the above query via MDS web application too:

Master Data Services SQL Server Staging Process

5) And as you can imagine, you can get access to this error data via SSIS (SQL Server Integration services) too. So if you have a workflow that a. Loads data to MDS and b. initiates the Batch process via Stored Procedure; then you can program it to get access to errors from the stg.viw_name_MemberErrorDetails & stg.viw_name_RelationshipErrorDetails tables.

SQL Server Integration Services and Master Data Services

That’s about it for this post! I hope this helps. Your comments are very welcome!

Playing w/ the Occupational Employement Statistics Data-Set:

I found some data-sets on Occupational Employment Statistics on Bureau of Labor Statistics site and I played with it to see if I can find something interesting:

Few things about the data & visualization that I am going to share

  • US only
  • I downloaded the national level data But there’s also state level data available if you’re interested to drill down.
  • The reports that you see where created after I got a chance to “clean” the data-set a bit and created a data model that suited basic reporting on top of it.
  • For this blog post, I am going to play w/ May 2010 & 2011 data
  • With the help of original data-set, you can drill down to get statistics about a particular Job Category if you want. For this blog-post, I am going to share visualizations that correspond to Job categories.
  • click on images to see the higher resolution image.

With that, Here are some visualizations:

1) Job Category VS mean hourly salary:

1 Job category vs hourly salary mean bureau of labour statistics

2) Job Category VS number of employees:

2 Job category vs number of employees bureau of labour statistics

3) Scatter Plot:

X Axis: Number of employees

Y – Axis: Wage (Mean Hourly Salary May 2011)

Size of Bubble: Wage (Mean Hourly Salary May 2011)

*Note: This may not be the best approach to create the Scatter Plot as I have used the same value (Mean Hourly Salary May 2011) twice – But since I was just playing w/ it, I went with what I had in the model.

Here’s the visualization:

3 scatter plot number of employees vs mean hourly wage may 2011 employment statistics

Some of the things I observed:

1) I belong to an Industry (Computer and Mathematical occupations) which has relatively higher mean hourly wage.

2) There are few people working in “farming, fishing & forestry occupations” that do not get paid much.

3) There are lots of people working in “office administrative support occupations” that do not get paid much.

4) Management Occupations, Legal Occupations and computer & mathematical occupations have relatively higher mean hourly wages.

Conclusion:

In this post, I played w/ Occupational Employment statistics data-sets and shared some visualizations.

SQL Server 2012 Data Quality Services Term based Relation’s in action!

In data quality services, a knowledge base (KB) consists of domains. And domains has: domain rules, term based relations, domain values and reference data. In this post, we’ll see a use case of Term based relations. But before we do that, you can consider reading: Difference between Term based relations and Domain values in SQL server 2012 Data Quality Services

Now, Here’s Term Based Relation’s in action

Scenario:

After data profiling you realize that there are records such as “my Company Inc.” and “my Company Incorporated” – so you set out to automatically find these mismatches in terms inside a value and correct them.

BEFORE cleaned data, the report showed that “my company Inc” revenue is less than that of Google:

Before data cleaning company names and revenues sql server

AFTER cleaned data, the report correctly shows that “my company Inc” revenue is more than that of Google:

After data cleaning company names and revenues sql server

Steps taken to clean data:

(Just an overview, not covered in a step by step fashion)

1. Created the Knowledge Base w/ Two domains Company Names & Revenue

2. Term Based Relations Tab of Company Names domain:

2 term based relations data quality sql server

3. Published the KB > Let’s create a DQS project

4. Mapped the domains:

3. DQS Project Mapping Domain Names

5. DQS cleaned following records:

4 term based relations inc incorporated corrected records

6. Exported the data and created a report out of clean data-set!

Business user is happy :)

conclusion:

In this post, we saw how to correct a word/term within a value in a domain. The example we used was Inc. , Inc and Incorporated . It can be used to correct terms like Jr. and Junior. Sr. and Senior. etc. Things like this are difficult to catch during data entry – But using Term Based Relations, a person who knows the Data can clean it so that it generates correct reports.

After all reports like this are of little to no use, are they? So Let’s NOT create confusing reports anymore.

Please note:

  • The revenue figures shown are just for demo purposes. I pulled up these numbers from Wikipedia. Please don’t make any financial decision based on these reports and if you do, I am not responsible for that.
  • The name “my Company Inc” is a fictional firm. It’s not any firm that I am/was associated with in past, future of present. It’s a fictional name!

And Writing disclaimers like this are NO fun – sucks the joy out of “Thinking out Loud” :)

How to clean similar textual data in Excel via Fuzzy lookup add-in?

In this post, we would see how to get started with Fuzzy look-up add-in for excel.

First up, Which problem does Fuzzy Lookup add-in for excel solve? It cleans similar (a.k.a matching) textual data in Excel. E.g. “Mr Paras Doshi”, “Doshi Paras”, “Paras A Doshi” are similar and may refer to the same person – Fuzzy look-up helps you detect such similar textual data. This add-in is really handy if you are “combining” data from different systems where the data is not in the same format – using this add-in you can detect similar looking text and clean the data-set at hand. With that, Here are the steps to download, install and play with this add-in:

1) Download “Fuzzy Lookup add-in for Excel. Read: Over view, system requirements and Instructions

URL: http://www.microsoft.com/en-us/download/details.aspx?id=15011 

(If this doesn’t work – search for “Fuzzy lookup add in for excel”

2) Note the sample excel file called “Portfolio” that comes with the download files. We’ll open it after the Successful installation of the add-in:

1 excel 2010 sample excel file to play with fuzzy lookup

3)Install it.

I left the default on all dialog boxes.

4) Open the sample file: “portfolio

5) Click on Install if you see a dialog box “Microsoft Office customization folder”

(please read the message in the dialog box too)

6) Can you see Fuzzy Lookup in Excel Toolbar? Yes? Great!

7) Now if you have opened the Portfolio file then you’ll see a tutorial on how to get started on the worksheet named portfolio:

3 excel 2010 tutorial to play with fuzzy lookup

8) follow the above tutorial to get started. I just did!

2 excel 2010 tutorial to play with fuzzy lookup output

Conclusion:

In this post, we saw how Fuzzy Lookup add-in for excel 2010 help you find matching text from two data sources. I hope that helps.