Download Link Here:
SQL Saturday 185 (Trinidad): Why Big Data Matters? by Paras Doshi
(if you need the .ppt version of this talk, please contact me via http://parasdoshi.com/contact/)
HDFS and MapReduce inner workings in a nutshell.
In this post, I want to point out that HDInsight (Hadoop on Windows) comes with sample datasets (log files) that you can load using the following steps:
1. Open the Hadoop Command Line and navigate to c:\Hadoop\GettingStarted
2. Execute the following command:
powershell -ExecutionPolicy unrestricted -F importdata.ps1 w3c
After you have successfully executed the command, you can see the sample files in the /w3c/input folder:
Conclusion: In this post, we saw how to load some sample data into the Hadoop on Windows file system to get started. Your comments are very welcome.
Official Resource: http://gettingstarted.hadooponazure.com/loadingData.html
This blog post applies to Microsoft® HDInsight Preview on a Windows machine. In this post, we’ll see how you can browse HDFS (the Hadoop file system).
1. I am assuming Hadoop Services are working without issues on your machine.
2. Can you see the Hadoop Name Node Status icon on your desktop? Yes? Great! Open it (in a browser).
3. Here’s what you’ll see:
4. Can you see the “Browse the filesystem” link? Click on it. You’ll see:
5. I’ve used /user/data lately, so let me browse to see what’s inside this directory:
6. You can also type the location in the text box labeled Goto.
7. If you’re on the command line, you can list the root of the file system with the command:
hadoop fs -ls /
And if you want to browse files inside a particular directory, pass its path instead of /, for example:
hadoop fs -ls /user/data
In this post, we saw how to browse the Hadoop file system via the Hadoop Command Line and the Hadoop Name Node Status page.
Problem statement: Find the maximum temperature for each city from the input data.
File 1:
New-york, 25
Seattle, 21
New-york, 28
Dallas, 35
File 2:
New-york, 20
Seattle, 21
Seattle, 22
Dallas, 23
File 3:
New-york, 31
Seattle, 33
Dallas, 30
Dallas, 19
Let’s say Map1, Map2 & Map3 run on File1, File2 & File3 in parallel. Here is their output:
(Note how each map outputs “key-value” pairs, with the city as the key and the highest temperature it saw for that city as the value. The key is used by the reduce function later to do a “group by”.)
Map 1:
Seattle, 21
New-york, 28
Dallas, 35
Map 2:
New-york, 20
Seattle, 22
Dallas, 23
Map 3:
New-york, 31
Seattle, 33
Dallas, 30
The Reduce function takes the output of Map1, Map2 & Map3 as its input, groups it by city, and gives the final output:
New-york, 31
Seattle, 33
Dallas, 35
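To make the flow concrete, here is a minimal Python sketch of the example above. It is not Hadoop code: it runs in a single process, and the names FILES, map_max and reduce_max are my own illustrative choices. It assumes, as in the illustration, that each map emits one pair per city with the highest temperature found in its file.

```python
# Input files from the example above, as lists of "city, temperature" lines.
FILES = {
    "File 1": ["New-york, 25", "Seattle, 21", "New-york, 28", "Dallas, 35"],
    "File 2": ["New-york, 20", "Seattle, 21", "Seattle, 22", "Dallas, 23"],
    "File 3": ["New-york, 31", "Seattle, 33", "Dallas, 30", "Dallas, 19"],
}


def map_max(lines):
    """Map step: parse one file and emit a (city, local max) pair per city."""
    local_max = {}
    for line in lines:
        city, temp = line.split(",")
        city, temp = city.strip(), int(temp)
        if city not in local_max or temp > local_max[city]:
            local_max[city] = temp
    return list(local_max.items())


def reduce_max(map_outputs):
    """Reduce step: group pairs by city (the key) and keep the overall max."""
    result = {}
    for pairs in map_outputs:
        for city, temp in pairs:
            result[city] = max(temp, result.get(city, temp))
    return result


# In Hadoop the maps would run in parallel; here they simply run one by one.
map_outputs = [map_max(lines) for lines in FILES.values()]
max_by_city = reduce_max(map_outputs)
print(max_by_city)
```

The intermediate map_outputs list corresponds to the Map 1, Map 2 and Map 3 outputs shown earlier, and max_by_city corresponds to the final reduce output.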
Conclusion:
In this post, we visualized the MapReduce programming model with an example: finding the maximum temperature for a city. And as you can imagine, you can extend this example to:
1) Find the minimum temperature for a city.
2) Use a different key. In this post, the key was the city, but you could substitute any other relevant real-world entity to solve similar-looking problems.
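Both extensions come down to changing the key or the function applied to each group. Here is a small Python sketch of that idea, again in a single process rather than on Hadoop. The helper name reduce_by_key is hypothetical, and the pairs are simply the raw lines of the three input files.

```python
from collections import defaultdict

# (key, value) pairs taken from the three input files in this post.
pairs = [
    ("New-york", 25), ("Seattle", 21), ("New-york", 28), ("Dallas", 35),
    ("New-york", 20), ("Seattle", 21), ("Seattle", 22), ("Dallas", 23),
    ("New-york", 31), ("Seattle", 33), ("Dallas", 30), ("Dallas", 19),
]


def reduce_by_key(pairs, aggregate):
    """Group values by key, then apply the aggregate function to each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: aggregate(values) for key, values in grouped.items()}


# Swapping max for min answers the "minimum temperature" question.
min_by_city = reduce_by_key(pairs, min)
print(min_by_city)
```

Using a different key only requires emitting different pairs; reduce_by_key itself stays the same.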
I hope this helps.
What is Neologism?
Neologism means the coining or use of new words, and I believe it’s one of the challenges faced by IT professionals. Nowadays, we put a lot of time and energy into trying to get our heads around new terms, words and trends.
Let’s take a couple of examples:
Some time back, we had cloud computing. Nowadays, it’s Big Data. In my mind, Big Data has been coined to mean the following technologies/techniques under different contexts:
Note: The above image is just for illustration purposes. It does not comprehensively cover every technology that is now called “Big Data”. Feel free to point it out if you think I missed something important.
And neologism is a challenge because:
1) Generally, it’s a new trend and there is little to no consensus on what exactly it means.
2) It means different things in different contexts.
3) Every person can have their own “interpretation” and no one is wrong.
4) It’s a moving target. The definition used today will change in the future, so we always need a “working” definition for these terms.
Now, don’t get me wrong. It’s fun trying to figure out what it all means and trying to gauge whether it matters to me and my organization or not! What do you think: as a person in Information Technology, do you think that neologism is one of the challenges we face? Consider leaving a reply in the comment section!
Related Articles:
Want to learn about BigData? Read O’Reilly’s Book “Planning for BigData”
Quote for Big-Data / Data-Science/ Data-Analysis enthusiasts:
Who on earth is creating “Big data”?
Examples to help clarify what’s unstructured data and what’s structured?
I like to keep an eye on technology trends. One of the ways I do that is by subscribing to leading magazines. I may not always read the entire article, but I definitely read the headlines to see what the industry is talking about. During the last 12 months or so I have seen a lot of buzz around Big Data, and I thought to myself that it would be nice to see a trend line for Big Data. Taking it a step further, I was also interested in seeing whether there is a correlation between the growing trends in “Hadoop” and “Big Data”. I also wanted to see how they compare with terms like Business Intelligence, Business Analytics and Data Science. So I turned to Google Trends to quickly create a trend report.
Here’s the report:
Here are some observations:
1) There’s a correlation between the trends for Big Data and Hadoop. In fact, it looks like growing interest in Hadoop fueled interest in “Big Data”.
2) The trend lines for Big Data and Hadoop overtook that of Business Intelligence in Oct 2012 and Sep 2012 respectively.
3) The trend line for Business Intelligence is declining.
4) There seems to be a steady increase in the trend lines for Business Analytics and Data Science.
Here’s the Google Trends report URL: http://www.google.com/trends/explore#q=Big%20Data%2C%20Hadoop%2C%20Business%20Intelligence%2C%20Business%20Analytics%2C%20Data%20Science&cmpt=q
What do you think about these trends?