Resource: 12 recorded sessions from the 24hop business analytics edition are online! #passbac #msbi

Recently, PASS hosted a 24hop business analytics event.

The 12 one-hour sessions, on topics ranging from data visualization and predictive analytics to Big Data, are now online for you to watch! They also serve as a “trailer” for what you can expect at the PASS Business Analytics Conference!

Here’s the URL: http://passbaconference.com/Sessions/SneakPeeks.aspx

I followed some of these sessions live on the day of the event, and I can tell you, they are great resources!

I also participated in the Twitter contest (run by Microsoft BI) that was happening alongside the event, and this is what I won!

24 hop twitter contest prize

hoodie w/ embedded earphones!

That’s about it for this post. Enjoy the recordings!

Playing w/ the Occupational Employment Statistics Data-Set:

I found some data-sets on Occupational Employment Statistics on the Bureau of Labor Statistics site, and I played with them to see if I could find something interesting.

A few things about the data & visualizations that I am going to share:

  • US only
  • I downloaded the national-level data, but state-level data is also available if you’re interested in drilling down.
  • The reports that you see were created after I got a chance to “clean” the data-set a bit and create a data model that suited basic reporting on top of it.
  • For this blog post, I am going to play w/ May 2010 & 2011 data.
  • With the help of the original data-set, you can drill down to get statistics about a particular Job Category if you want. For this blog post, I am going to share visualizations at the Job Category level.
  • Click on an image to see the higher-resolution version.

With that, here are some visualizations:

1) Job Category VS mean hourly salary:

1 Job category vs hourly salary mean bureau of labour statistics

2) Job Category VS number of employees:

2 Job category vs number of employees bureau of labour statistics

3) Scatter Plot:

X-Axis: Number of employees

Y-Axis: Wage (Mean Hourly Salary May 2011)

Size of Bubble: Wage (Mean Hourly Salary May 2011)

*Note: This may not be the best way to build the Scatter Plot, as I have used the same value (Mean Hourly Salary May 2011) twice, for both the Y-Axis and the bubble size. But since I was just playing w/ it, I went with what I had in the model.

Here’s the visualization:

3 scatter plot number of employees vs mean hourly wage may 2011 employment statistics

Some of the things I observed:

1) I belong to an industry (Computer and Mathematical occupations) that has a relatively high mean hourly wage.

2) Relatively few people work in “farming, fishing & forestry occupations”, and they do not get paid much.

3) Lots of people work in “office and administrative support occupations”, and they do not get paid much either.

4) Management occupations, Legal occupations and Computer & Mathematical occupations have relatively high mean hourly wages.

Conclusion:

In this post, I played w/ Occupational Employment statistics data-sets and shared some visualizations.

Data Quality Services: What does a locked Knowledge Base indicate?

In this blog-post, we will see what it means when a Knowledge Base is locked in Data Quality Services. A lock on the Knowledge Base indicates that there are unsaved changes in it from when you or someone else was working on it.

In the Data Quality Client, here’s how a lock on the Knowledge Base looks:

data quality services knowledge base lock

And here are a few points about a locked Knowledge Base:

1) If you did not lock the Knowledge Base, you can open it only in read-only mode.

2) If you locked the Knowledge Base, you can open and edit it. The Knowledge Base will open in the state in which it was closed.

3) A user working on the Knowledge Base can release the lock either by publishing the Knowledge Base or by unlocking it.

4) By hovering the cursor over the Knowledge Base, you can see who locked it:

who user lock the knowledge base data quality

Conclusion:

In this blog post, we saw what a lock on a Knowledge Base in Data Quality Services means.

Let’s Install R & RStudio on a Windows Machine!

I was recently searching for a way to do some text mining on Twitter data. I was interested in a tool with a “library” that helps fetch Twitter data, and later I wanted to create visualizations such as word clouds, time series, etc. It turns out that “R” suited my needs perfectly because of libraries/packages such as twitteR and ggplot2. And so, I downloaded and installed R and RStudio on my Windows machine. Here are the steps (I am using a 64-bit Windows Server 2008 R2 machine):

1. Download R for Windows:

Install R for windows twitter analytics

2. After downloading it, install it, leaving all options at their defaults.

3. Download RStudio Desktop for Windows:

install R studio for windows desktop

4. Install RStudio, leaving all options at their defaults.

5. Open RStudio > In the Bottom Right Pane, switch to the Packages Tab > Click on Install Packages > In the packages box, type in ggplot2 > click on Install.

ggplot2 package R Rstudio

6. Check that ggplot2 was successfully unpacked and installed > Now similarly install the package twitteR > make sure it was successfully unpacked and installed.

twitteR package R Rstudio windows analytics

7. And I quickly created a chart of Twitter UserName vs Number of Tweets for #sqlpass:

We can do much more, but I just wanted to show how you can do social media analytics with R!

Twitter Analytics with R Studio windows Bar Plot
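The chart above was built in R with twitteR and ggplot2. Just to show the idea behind it, here is a minimal sketch in Python of the counting step (tweets grouped by username). The usernames and tweet texts are made up for illustration and are not real #sqlpass data, and the function name tweets_per_user is my own, not part of any library:

```python
from collections import Counter

# Hypothetical sample data: (username, tweet text) pairs.
# In the post, the real tweets were fetched in R with the twitteR package.
tweets = [
    ("user_a", "Watching the keynote #sqlpass"),
    ("user_b", "Great session on DAX #sqlpass"),
    ("user_a", "Slides are online now #sqlpass"),
    ("user_c", "See you at the summit #sqlpass"),
    ("user_a", "Thanks to all the speakers #sqlpass"),
]


def tweets_per_user(tweets):
    """Count how many tweets each username posted."""
    return Counter(user for user, _text in tweets)


counts = tweets_per_user(tweets)

# Text-only stand-in for the bar plot: one '#' per tweet, busiest user first.
for user, n in counts.most_common():
    print(f"{user:<10} {'#' * n} ({n})")
```

Once you have the counts per username, plotting them as a bar chart is the easy part in any tool.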

Conclusion:

In this blog post, we saw a step-by-step process to download and install R and RStudio on a Windows machine.

 

Examples to help clarify what’s unstructured data and what’s structured?

I have been reading and researching about BigData and BigData on the cloud. One of the concepts that keeps getting repeated is that “Big Data is about analyzing unstructured data…”, and in this blog post, I just want to show a few examples that will help you differentiate between structured data & unstructured data.

Before we begin, here’s the definition of Unstructured data:

Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined data model and/or does not fit well into relational tables – Wikipedia

I also want to point out that data is not unstructured because you cannot fit it into a schema/model. It is unstructured because even after fitting it into a model, the model does not help. For example, consider an email body. You can create a column “EMAIL BODY” and store it there. Now think of the questions that are likely to be asked. Do they get answered? If not, then fitting it into a model and calling it structured does not make sense, does it? With that, here are the examples:

1. Word Docs, PDFs & Text files

Unstructured data

Examples: Books, Articles

2. Audio files

Unstructured data

Example: Call center conversations.

3. Email body

Unstructured data

Example: you don’t need an example here!

4. Videos

Unstructured data

Example: Video footage of criminal interrogation

5. A Data Mart / Data Warehouse

Structured Data

6. XML

Semi-Structured Data

A couple of applications for your brain cells:

1. Map disease patterns by analyzing medical records (Text)

2. Tune customer support by analyzing calls (Audio)

A few quotes about unstructured data that I liked:

80 percent of business-relevant information originates in unstructured form - Justin Langseth. URL (the Wikipedia article says that even Merrill Lynch cited this)

But someone else had a nice perspective on this 80%:

but managing it (this 80%) really isn’t a significant problem……………the innovation isn’t in structuring text, it’s in applying models to discover and exploit their inherent structure. Source

My Experience with Unstructured Data (in context of BigData) and Cloud:

I have been playing with MapReduce on Windows Azure (Project Daytona), Elastic MapReduce (Amazon Web Services) and Google’s BigQuery platform. To give you one example, I’ll use Microsoft’s Project Daytona. Here I uploaded data in unstructured format, in the form of TEXT, and the goal was to run a “Word Count”. It helps you answer questions like: which word has the highest frequency? Or which is the least popular word? You could also tweak the algorithm to consider only words with length greater than four (among other constraints). This is what happens when you run the algo: the amazing MapReduce framework (an app deployed on Windows Azure in this case) does some analysis on unstructured data (TEXT in this case), and it helps you answer the question you were asking. I hope that gives you an idea of how it works.
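To make the “Word Count” idea a bit more concrete, here is a minimal single-machine sketch in Python. It is not the Project Daytona code and does not use any real MapReduce framework; it only mimics the two phases (map, then reduce) on a couple of made-up sentences. The function names and the min_length parameter are my own assumptions for illustration:

```python
import re
from collections import defaultdict


def map_phase(document, min_length=5):
    """Map step: emit a (word, 1) pair for every word longer than four characters."""
    for word in re.findall(r"[a-z']+", document.lower()):
        if len(word) >= min_length:
            yield word, 1


def reduce_phase(pairs):
    """Reduce step: add up the 1s emitted for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)


def word_count(documents, min_length=5):
    """Run the map step over every document, then reduce all emitted pairs."""
    pairs = (pair for doc in documents for pair in map_phase(doc, min_length))
    return reduce_phase(pairs)


# Made-up sample text standing in for the uploaded TEXT files.
documents = [
    "Big Data is about analyzing unstructured data",
    "Unstructured data does not fit well into relational tables",
]

counts = word_count(documents)
most_frequent = max(counts, key=counts.get)
print(counts)
print("Most frequent word:", most_frequent)
```

In a real MapReduce setup, the map step runs in parallel on many machines and the framework groups the pairs by word before the reduce step. The sketch above skips all of that and only shows the logic.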

That’s about it for this post. Do you have an example or application of unstructured data? Please do post it in the comments!