I have been reading and researching about BigData and BigData on the cloud. One concept that keeps getting repeated is that “Big Data is about analyzing unstructured data…”, so in this blog post I just want to show a few examples that will help you differentiate between structured data and unstructured data.
Before we begin, here’s the definition of Unstructured data:
Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined data model and/or does not fit well into relational tables – Wikipedia
I also want to point out that data is not unstructured just because you cannot fit it into a schema/model; it is unstructured because even after fitting it into a model, the model would not help. Consider the email body as an example. You can create a column “EMAIL BODY”. Now think of the questions that are likely to be asked. Do they get answered? If not, then fitting it into a model and calling it structured does not make sense, does it? With that, here are the examples:
1. Word docs, PDFs & text files
Examples: Books, Articles
2. Audio files
Example: Call center conversations.
3. Email body
Example: you don’t need an example here!
4. Video files
Example: Video footage of criminal interrogation
5. A Data Mart / Data Warehouse
Semi Structured Data
A couple of applications to get your brain cells going:
1. Map disease patterns by analyzing medical records (Text)
2. Tuning customer support by analyzing calls (Audio)
A few quotes about unstructured data that I liked:
80 percent of business-relevant information originates in unstructured form - Justin Langseth. URL (the Wikipedia article says that even Merrill Lynch cited this)
But someone else had a nice perspective on this 80%:
but managing it (this 80%) really isn’t a significant problem……………the innovation isn’t in structuring text, it’s in applying models to discover and exploit their inherent structure. Source
My experience with unstructured data (in the context of BigData) and the cloud:
I have been playing with MapReduce on Windows Azure (Project Daytona), Elastic MapReduce (Amazon Web Services) and Google’s BigQuery platform. To give you one example, I’ll use Microsoft’s Project Daytona. Here I uploaded data in an unstructured format, as plain text, and the goal was to run “Word Count”. It helps you answer questions like: which word has the highest frequency? Or which is the least popular word? You could also tweak the algorithm to consider only words with length greater than four (among other constraints). Now this is what happens when you run the algorithm: the amazing MapReduce framework (an app deployed on Windows Azure in this case) does some analysis on the unstructured data (text in this case) and helps you answer the question you were looking for. So I hope you now have an idea of how it works.
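To make the Word Count idea a bit more concrete, here is a minimal sketch in plain Python that imitates the map, shuffle and reduce steps on a single machine. This is not Project Daytona's API or any real MapReduce framework; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the `min_length` parameter are made up for illustration, and the sample text is just a couple of sentences borrowed from this post.

```python
import re
from collections import defaultdict


def map_phase(line, min_length=1):
    """Map step: emit a (word, 1) pair for every word that is long enough."""
    for word in re.findall(r"[a-z']+", line.lower()):
        if len(word) >= min_length:
            yield word, 1


def shuffle(pairs):
    """Shuffle step: group the emitted counts by word.

    A real framework does this between map and reduce; here it is a dict.
    """
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines, min_length=1):
    """Run map -> shuffle -> reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line, min_length))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical sample input: unstructured text, one string per line.
    text = [
        "Big data is about analyzing unstructured data",
        "Unstructured data does not fit well into relational tables",
    ]

    counts = word_count(text)
    most_frequent = max(counts, key=counts.get)
    print("Most frequent word:", most_frequent, counts[most_frequent])

    # The tweak mentioned above: only words with length greater than four.
    long_words = word_count(text, min_length=5)
    print(sorted(long_words.items(), key=lambda kv: kv[1], reverse=True))
```

On a real cluster the map and reduce steps would run in parallel across many machines and the framework would handle the shuffle for you, but the questions you can answer from the output (most frequent word, least popular word, and so on) stay the same.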
That’s about it for this post. Do you have an example or application of unstructured data? Please do post it in the comments!