Book Review: R in a Nutshell

R is popular among data scientists because, for them, it’s like a Swiss Army knife (or maybe more)!

[Image: the R language as a data scientist’s Swiss Army knife]

Analogy credit: Tapping the Data Deluge with R by Jeffrey Breen

Some time back I worked on a research project that involved writing some R code. We were looking for ways to pull data from multiple social networks, perform text analysis, and create effective data visualizations. R seemed like a great tool for this, so I went looking for a book or guide that would teach me the fundamentals I needed to get a few R-related things done. One of the books I used often during the project was “R in a Nutshell”. I didn’t read it cover to cover, but it was a great reference book for me: I would read online guides and other books, then combine what I learned with information from this book to get things done. The section I liked most was the one on data visualization, which includes some great code snippets for creating effective visualizations with the ggplot2 package. I would take code snippets from the book and apply them to the data sets I had.

[Image: text analysis and sentiment]

Fun stuff!

I also liked that the book has some end-to-end examples that cover the entire life cycle of a data analysis or statistical analysis.

Summary:

I recommend this book as a “reference” for someone who has started working with R.

Note:

I received a copy of this book as part of O’Reilly’s Blogger program. Thanks, O’Reilly! If you are a blogger, you should check out that program!

Business Analytics Continuum:

Think of a “continuum” as something you start and never stop improving upon. In my mind, the Business Analytics Continuum is the continuous investment of resources to take business analytics capabilities to the next level. So what are these levels? Douglas McDowell explained this concept in a recent post. It was great food for thought for me, and hence I am posting about this particular concept here.

Here is the visual representation of the concept:

[Image: business analytics continuum]

And I would encourage you to read the entire post and other posts in the series here: PASS BAC Preview Series: Business Analytics Defined

Resource: 12 recorded sessions from the 24hop business analytics edition are online! #passbac #msbi

Recently, PASS hosted a 24hop business analytics event:

The 12 one-hour sessions, on topics ranging from data visualization and predictive analytics to Big Data, are now online for you to watch! They also serve as a “trailer” for what you can expect at the PASS Business Analytics conference!

Here’s the URL: http://passbaconference.com/Sessions/SneakPeeks.aspx

I followed some of these sessions live on the day of the event, and I can tell you that they are great resources!

I also participated in the Twitter contest (run by Microsoft BI) that took place alongside the event, and this is what I won:

[Image: 24hop Twitter contest prize]

A hoodie with embedded earphones!

That’s about it for this post. Enjoy the recordings!

Quick Post: Uploading Local Data to Hadoop file system using Hadoop Command Line

This is a quick post: I just want to share a command to upload local data to HDFS using the Hadoop Command Line.

The command looks like:

> hadoop fs -copyFromLocal input.txt input/SqrtJob/input.txt
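If you would rather script the upload than type it each time, here is a minimal Python sketch. It only assembles the same `hadoop fs -copyFromLocal` call shown above, plus a `hadoop fs -ls` on the target folder to check the result. The helper names `build_upload_commands` and `run_commands` are my own illustrative choices and are not part of Hadoop, and the sketch assumes the target folder already exists in HDFS.

```python
import posixpath
import shutil
import subprocess


def build_upload_commands(local_path, hdfs_path):
    """Return the upload command plus a listing of the target folder."""
    hdfs_dir = posixpath.dirname(hdfs_path)  # HDFS paths always use "/"
    return [
        ["hadoop", "fs", "-copyFromLocal", local_path, hdfs_path],
        ["hadoop", "fs", "-ls", hdfs_dir],
    ]


def run_commands(commands):
    """Run the commands in order; raises if hadoop is missing or a command fails."""
    hadoop = shutil.which("hadoop")  # resolves hadoop.cmd on Windows as well
    if hadoop is None:
        raise RuntimeError("hadoop executable not found on PATH")
    for cmd in commands:
        subprocess.run([hadoop] + cmd[1:], check=True)


# Same file and target path as in the command above.
commands = build_upload_commands("input.txt", "input/SqrtJob/input.txt")
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_commands(commands)` from the Hadoop Command Line would then execute both steps. I kept the execution in a separate function so that you can inspect the commands before anything touches HDFS.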


How to start Analyzing Twitter Data w/ R?

Over the past few weeks, I have posted notes about analyzing Twitter data with R. Here is the list:

1. Install R & RStudio

2. R code to download Twitter data

3. Perform Sentiment Analysis on Twitter Data (in R)

How to load some data to Hadoop on Windows to get started?

In this post, I want to point out that HDInsight (Hadoop on Windows) comes with sample datasets (log files) that you can load with the following steps:

1. Open the Hadoop Command Line and navigate to c:\Hadoop\GettingStarted

2. Execute the following command:

powershell -ExecutionPolicy unrestricted -F importdata.ps1 w3c

[Image: importing data into the Hadoop on Windows file system]

After you have successfully executed the command, you can see the sample files in the /w3c/input folder:

[Image: W3C (IIS) log files in Hadoop on Windows]

Conclusion: In this post, we saw how to load some sample data into the Hadoop on Windows file system to get started. Your comments are very welcome.

Official Resource: http://gettingstarted.hadooponazure.com/loadingData.html

Microsoft® HDInsight Preview for Windows: How to use Sqoop to load data into HDFS from SQL Server?

In this post, we’ll see how to use Sqoop to load data into HDFS from SQL Server.

With that, here are the steps:

1. Make sure you have the Microsoft® HDInsight Preview for Windows installed on your machine. Here’s a tutorial: Installing HDInsight (Microsoft’s Hadoop) on Windows 7

2. Make sure that the Cluster is up & running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine

Did you get any “wait for cluster to start..” message? No? Great! Hopefully, all your services are working perfectly and you are good to go now!

3. Before we begin, decide on three things:

3a. The username and password that Sqoop will use to log in to the SQL Server database. If you create a new username and password, test them via SSMS before you proceed.

3b. The table that you want to load into HDFS.

In my case, it’s this table:

[Image: SQL Server table to be loaded into HDFS]

3c. The target directory in HDFS. In my case, I want it to be /user/data/sqoopstudent1

You can create it with the command: hadoop fs -mkdir /user/data/sqoopstudent1

[to learn about how to create directory, read: How to create a directory in Hadoop File System? ]

4. Now let’s start the Hadoop Command Line (can you see the icon on the desktop? Yes? Great! Open that!)

5. Navigate to c:\Hadoop\sqoop-1.4.2\bin

*This path may change in the future; whatever the version, navigate to the bin folder under SQOOP_HOME.

6. Run the dir command to see the files in this directory.

[Image: files listed under the Sqoop bin directory]

You can also run sqoop help for more information on the command that we are about to run.

[Image: list of commands shown by sqoop help]

7. Now here’s the command to load data from SQL Server into HDFS:

c:\Hadoop\sqoop-1.4.2\bin>sqoop import --connect "jdbc:sqlserver://localhost;database=UniversityDB;username=sqoop;password=**********" --table student --target-dir /user/data/sqoopstudent1 -m 1

[Image: Sqoop command loading data from SQL Server into the Hadoop file system]

8. After successfully running the above command, let’s browse the file in HDFS!

[Image: content of the imported file in HDFS]
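The import command from step 7 is long and easy to mistype, so here is a small Python sketch that assembles it from its parts. It uses only the options that appear in the command above (--connect, --table, --target-dir and -m). The helper name `build_sqoop_import` and the environment variable `SQOOP_DB_PASSWORD` are hypothetical names I picked for illustration; neither comes from Sqoop or HDInsight.

```python
import os


def build_sqoop_import(host, database, username, password,
                       table, target_dir, mappers=1):
    """Assemble the argument list for the `sqoop import` call from step 7."""
    connect = (
        f"jdbc:sqlserver://{host};database={database};"
        f"username={username};password={password}"
    )
    return [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
        "-m", str(mappers),  # number of parallel map tasks
    ]


# Hypothetical variable name: read the password from the environment
# instead of hard-coding it in the script.
password = os.environ.get("SQOOP_DB_PASSWORD", "")

args = build_sqoop_import("localhost", "UniversityDB", "sqoop", password,
                          "student", "/user/data/sqoopstudent1")

# Print a masked copy so that the password never shows up on screen.
masked = build_sqoop_import("localhost", "UniversityDB", "sqoop", "**********",
                            "student", "/user/data/sqoopstudent1")
print(" ".join(masked))
```

The `args` list could then be handed to `subprocess.run` from the Sqoop bin folder. Keeping the connection string in one place also makes it easier to spot a typo in the database name or username before Sqoop reports a connection error.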

That’s about it for this post!

Thanks

Thanks to Aviad Ezra, who answered my question on this MSDN thread: An error while trying to use Sqoop on HDInsight to import data from SQL server to HDFS

Conclusion:

In this post, we saw how to load data into Hadoop from SQL Server using Sqoop (SQL-to-Hadoop).


Neologism is the new challenge for IT professionals, Here’s why:

What is Neologism?

Neologism means the coining or use of new words, and I believe it’s one of the challenges faced by IT professionals. Nowadays, we spend a lot of time and energy trying to get our heads around new terms, words, and trends.

Let’s take a couple of examples:

Some time back, it was cloud computing. Nowadays, it’s Big Data. In my mind, the term Big Data has been coined to mean the following technologies and techniques in different contexts:

[Image: Big Data (unstructured, external, text, public data)]

Note: The above image is for illustration purposes only. It does not comprehensively cover every technology that is now called “Big Data”. Feel free to point it out if you think I missed something important.

Neologism is a challenge because:

1) Generally, the term describes a new trend, and there is little to no consensus on what exactly it means.

2) It means different things in different contexts.

3) Every person can have their own “interpretation” and no one is wrong.

4) It’s a moving target. The definition used today will change in the future, so we always need a “working” definition for these terms.

Now, don’t get me wrong: it’s fun trying to figure out what it all means and to gauge whether it matters to me and my organization! What do you think? As a person in information technology, do you think that neologism is one of the challenges we face? Consider leaving a reply in the comment section!

Related Articles:

Want to learn about BigData? read Oreilly’s Book “Planning for BigData”

Quote for Big-Data / Data-Science/ Data-Analysis enthusiasts:

Who on earth is creating “Big data”?

Examples to help clarify what’s unstructured data and what’s structured?

Things I shared on Social Media Networks during Nov 12 – Dec 31 (2012)

Big Data: The Coming Sensor Data Driven Productivity Revolution http://bit.ly/TQAPsW

Check out some nice getting started tutorials at beyondrelational site: http://bit.ly/RVVHRV

Complexity is your enemy. Any fool can make something complicated. It is hard to make something simple – Richard Branson

— via Paras Doshi – Blog http://on.fb.me/WAQ5ky

“The success of companies like Google, Facebook, Amazon, and Netflix, not to mention Wall Street firms and industries from manufacturing to retail and healthcare, is increasingly driven by better tools for extracting meaning from very large quantities of data,” says Tim O’Reilly

— via Paras Doshi – Blog http://on.fb.me/WAQ5ky

Nice collection of about 20+ videos around the topic of “Data Science”: http://bit.ly/WMkZqc

Nice collection of videos by Berkeley school of information: http://bit.ly/Tf1yAD #Information #Data

Just found Facebook’s data team’s page: http://on.fb.me/ToYILO

via V Talk Tech – A Parth Acharya Blog – Nice HeatMap of stocks! http://on.fb.me/SfBbvF

what’s the biggest fear about cloud computing? via Windows Azure http://on.fb.me/VjIiHR

Resource: Presentations from the Sentiment Analysis Symposium http://bit.ly/VtPH3B

If I switched to the newest “holiday” theme on WordPress, this is how it would look: http://on.fb.me/UEuyFr

Nice! Code School now has R programming language! I have been playing with R for a while now and definitely want to learn more – here’s the link to learn R: http://bit.ly/VEAnkZ

Interesting tool from Google to optimize and analyze web page speeds: http://bit.ly/HTubNC

Performed #sentiment #Analysis on #starbucks twitter data using #R ! It was fun! http://on.fb.me/Z3qLo8

In 2002, The Data Warehousing Institute estimated that data quality problems cost U.S. businesses more than $600 billion a year. Over the past 10 years, this number has probably grown. http://bit.ly/TPT9r3

Reading: Business Analytics vs Business Intelligence? http://bit.ly/YUtJwx

Big data is a nickname for the recent increase in largely external and unstructured business and consumer information. How are businesses across industries harnessing traditional enterprise information management functions and systems to translate big data into useful business intelligence? http://www.deloitte.com/view/en_US/us/Services/additional-services/deloitte-analytics-service/217c19e69249b310VgnVCM2000003356f70aRCRD.htm

For business analytics professionals: 12 webcasts on Jan 30th 2013 http://bit.ly/RUFsZ3 #sqlpass #analytics #24hop

Some nice insights about how to build an Internet platform, from the founder of Zipcar: http://bit.ly/Yco6IP
