Introduction to Data Science course taught by Bill Howe just started on coursera platform. Having studied the Data Intensive Computing in Cloud course at UW taught by Prof Bill Howe, I can say that this course would be great resource too!
Check it out: https://www.coursera.org/course/datasci
Dr. Steven Levitt’s (Indiana Jones of economics & Author of Freakonomics) work involves finding insights from data. In the keynote, he shared some of the interesting & fun insights that he found from data.
One Example: Dr. Levitt: According to the data, It was 7 times more dangerous to sell crack in Chicago than it was being in combat in Iraq. https://twitter.com/markvsql/status/322707949158006786
He also talked about other insights that he found which could also be found in his book Freakonomics. After getting audience fascinated about what analyzing data can do – he moved to his real world experiences of analyzing data for businesses. And tied all these fascinating insights back to some tips he had for the audience. Here is a brief recap of the tips he shared:
> “Ideas don’t come out of the blue. Almost always ideas come out of the data” – Dr. Steven Levitt
> “You guys are the future. What you’re doing is the key to a business’ success or failure.”
> Experiment & Test Hypothesis using DATA
> Misconceptions can cripple you. Let the data speak, even when it might be difficult
> Most important people = who understand and know what to do with data, not those who pretend they know the answer.
> Dr. Levitt: without data any biz will be left behind, must experiment and accept failure
*Above text is linked to tweets.
That’s about it for this post. What do you think about the tips that Dr Levitt shared?
Data preparation (or call it pre-processing) is an essential and time-consuming part of any data analytic’s project. To that end, I was working on a data set needed some changes before I could plot it on an effective data visualization. Here’s what I did:
I was working on a data set that looked like this:
||Abu Dhabi, United Arab Emirates
||Adalaj, Gujarat, India
BUT: I wanted my data to look like
|Abu Dhabi, United Arab Emirates
|Adalaj, Gujarat, India
What did my real data looked liked?
it has 380 columns and 500+ Rows and so MANUAL copy pasting was NOT an option!
Excel 2010 Solution:
It’s so simple!
Step 1: Select the data > COPY (Shortcut: ctrl + c)
Step 2: Switch to a new/different excel sheet
step 3: Paste Special > Transpose (T)
So After doing this, This is how the Input & output looks:
In this post, We saw how to swap or reverse the axis of a table data in Excel 2010.
In this post, I’ll show you how you can create a word document that:
1) Gets numbers from Excel
2) auto refreshes when the data gets changed/updated in Excel
Note: I am using Excel 2010 & Word 2010 for the purpose of this blog.
Here are the steps:
Let’s say we have a line in word that says “The revenue for year _____ was $ ______”
and we want to fetch data numbers for Year and revenue from Excel.
2. Let’s go!
here’s the data in Excel:
3. In excel; copy A2 that has value 2012
4. switch to WORD:
now this is IMPORTANT. do NOT just paste it.
move your cursor to the destination position > right-click > paste options > Link and Merge formatting
There’s also an option of link and keep source formatting, this is how it looks:
5. repeat the steps for revenue
6. Close Word.
7. Open Excel and change the value of year to 2011 and revenue to 4000
8 Now open Word 2010, say yes if a dialog box pops up
And here you go:
In this post, we saw how to create an auto-refreshing word document that fetches data numbers from Excel.
In this post, I’ll show you how you can add secondary axis in a chart in Excel 2010:
First up, Why do you need secondary axis?
Look at the Data and the chart that I just created:
Notice something? Column A is not “visible” – Is it? So Now let’s see the effect how adding a Secondary Axes:
Looks better? Here are the steps:
1. Select the chart
2. can you see charts tool options in the menu bar? Yes? Great!
3. Switch to Format Tab
4. select the column that you want to put on a secondary axes:
5. After selecting the column, click on Format Selection
6. In the dialog box, select secondary axes:
click on close.
7. Can you see the secondary axis now? Great!
8. Let’s do one more thing!
Let’s change the chart type of the secondary axis. Here are the steps:
a. Make sure the column is selected from the format Tab
b. Switch to the Design Tab under the charts Tool
c. Click on change chart type and select the chart of your choice. I select a column chart and this is how it looks:
In this post, we saw how to add a secondary axis. We also saw how to change the chart type of the secondary axis.
If you work with any statistical analysis tool, sometimes you may have run into configuring the data into either of these following categories: Nominal, Ordinal, Interval, Ratio
Here is what each term means:
||Simply names or call them set of characters
||Example: Full name, fruits, cars, etc
||Nominal + They have order
||Example: Small, medium, big
||Ordinal + the intervals between each value are equally split
||Example: temperature in Fahrenheit scale:10 20 30 etc
Note that 20F is not twice as cold as 40F. So multiplication does not make sense on Interval data. But addition and subtraction works. Which brings us to next point: Ratio
||Interval + multiplication makes sense
||Weight: 60KG, 120KG.120 KG = 2 * 60 KG
I hope the examples are of help when you are working with statistical analysis tools and need to categorize the data.
To view error(s) that occur during MDS’s staging process, we have two views: stg.viw_name_MemberErrorDetails & stg.viw_name_RelationshipErrorDetails. For the purpose of this blog post, let me show you how I (as an administrator) view information from stg.viw_name_MemberErrorDetails.
1) Name of the entity: supplier & Name of model is Suppliers
2) I imported data into staging table stg.supplier_leaf via SSIS
3) Now here’s how you can see the errors that occur during MDS’s staging process:
we can go to MDS web application to initiate the staging process > start the batch > after completion you can see the status as well as if it has any errors or not:
4) Now if you see that there’s an error, then you can go to stg.viw_name_MemberErrorDetails to see what are the errors. In my case, I am going to run the query:
select * from stg.viw_supplier_MemberErrorDetails where Batch_ID=2
You can get the above query via MDS web application too:
5) And as you can imagine, you can get access to this error data via SSIS (SQL Server Integration services) too. So if you have a workflow that a. Loads data to MDS and b. initiates the Batch process via Stored Procedure; then you can program it to get access to errors from the stg.viw_name_MemberErrorDetails & stg.viw_name_RelationshipErrorDetails tables.
That’s about it for this post! I hope this helps. Your comments are very welcome!
I found some data-sets on Occupational Employment Statistics on Bureau of Labor Statistics site and I played with it to see if I can find something interesting:
Few things about the data & visualization that I am going to share
- US only
- I downloaded the national level data But there’s also state level data available if you’re interested to drill down.
- The reports that you see where created after I got a chance to “clean” the data-set a bit and created a data model that suited basic reporting on top of it.
- For this blog post, I am going to play w/ May 2010 & 2011 data
- With the help of original data-set, you can drill down to get statistics about a particular Job Category if you want. For this blog-post, I am going to share visualizations that correspond to Job categories.
- click on images to see the higher resolution image.
With that, Here are some visualizations:
1) Job Category VS mean hourly salary:
2) Job Category VS number of employees:
3) Scatter Plot:
X Axis: Number of employees
Y – Axis: Wage (Mean Hourly Salary May 2011)
Size of Bubble: Wage (Mean Hourly Salary May 2011)
*Note: This may not be the best approach to create the Scatter Plot as I have used the same value (Mean Hourly Salary May 2011) twice – But since I was just playing w/ it, I went with what I had in the model.
Here’s the visualization:
Some of the things I observed:
1) I belong to an Industry (Computer and Mathematical occupations) which has relatively higher mean hourly wage.
2) There are few people working in “farming, fishing & forestry occupations” that do not get paid much.
3) There are lots of people working in “office administrative support occupations” that do not get paid much.
4) Management Occupations, Legal Occupations and computer & mathematical occupations have relatively higher mean hourly wages.
In this post, I played w/ Occupational Employment statistics data-sets and shared some visualizations.
In data quality services, a knowledge base (KB) consists of domains. And domains has: domain rules, term based relations, domain values and reference data. In this post, we’ll see a use case of Term based relations. But before we do that, you can consider reading: Difference between Term based relations and Domain values in SQL server 2012 Data Quality Services
Now, Here’s Term Based Relation’s in action
After data profiling you realize that there are records such as “my Company Inc.” and “my Company Incorporated” – so you set out to automatically find these mismatches in terms inside a value and correct them.
BEFORE cleaned data, the report showed that “my company Inc” revenue is less than that of Google:
AFTER cleaned data, the report correctly shows that “my company Inc” revenue is more than that of Google:
Steps taken to clean data:
(Just an overview, not covered in a step by step fashion)
1. Created the Knowledge Base w/ Two domains Company Names & Revenue
2. Term Based Relations Tab of Company Names domain:
3. Published the KB > Let’s create a DQS project
4. Mapped the domains:
5. DQS cleaned following records:
6. Exported the data and created a report out of clean data-set!
Business user is happy :)
In this post, we saw how to correct a word/term within a value in a domain. The example we used was Inc. , Inc and Incorporated . It can be used to correct terms like Jr. and Junior. Sr. and Senior. etc. Things like this are difficult to catch during data entry – But using Term Based Relations, a person who knows the Data can clean it so that it generates correct reports.
After all reports like this are of little to no use, are they? So Let’s NOT create confusing reports anymore.
- The revenue figures shown are just for demo purposes. I pulled up these numbers from Wikipedia. Please don’t make any financial decision based on these reports and if you do, I am not responsible for that.
- The name “my Company Inc” is a fictional firm. It’s not any firm that I am/was associated with in past, future of present. It’s a fictional name!
And Writing disclaimers like this are NO fun – sucks the joy out of “Thinking out Loud” :)
In this post, we would see how to get started with Fuzzy look-up add-in for excel.
First up, Which problem does Fuzzy Lookup add-in for excel solve? It cleans similar (a.k.a matching) textual data in Excel. E.g. “Mr Paras Doshi”, “Doshi Paras”, “Paras A Doshi” are similar and may refer to the same person – Fuzzy look-up helps you detect such similar textual data. This add-in is really handy if you are “combining” data from different systems where the data is not in the same format – using this add-in you can detect similar looking text and clean the data-set at hand. With that, Here are the steps to download, install and play with this add-in:
1) Download “Fuzzy Lookup add-in for Excel. Read: Over view, system requirements and Instructions
(If this doesn’t work – search for “Fuzzy lookup add in for excel”
2) Note the sample excel file called “Portfolio” that comes with the download files. We’ll open it after the Successful installation of the add-in:
I left the default on all dialog boxes.
4) Open the sample file: “portfolio“
5) Click on Install if you see a dialog box “Microsoft Office customization folder”
(please read the message in the dialog box too)
6) Can you see Fuzzy Lookup in Excel Toolbar? Yes? Great!
7) Now if you have opened the Portfolio file then you’ll see a tutorial on how to get started on the worksheet named portfolio:
8) follow the above tutorial to get started. I just did!
In this post, we saw how Fuzzy Lookup add-in for excel 2010 help you find matching text from two data sources. I hope that helps.