As a part of University of Washington’s (UW) cloud class’s assignment, I played with Google’s BigData offering BigQuery and I am writing this blog post to share what I think about it. please note that the views are my own and do not represent those of the instructor’s and fellow students at UW. And also I am not a BigData “Expert”, Think of me as a student trying to get my head around various offerings out there – So if you feel otherwise about what I have written, Just let me know in the comments section. Any-who read along to know what I think of BigQuery:
First up what is BigQuery?
It’s a platform to analyze your data (lot’s of it) by running SQL-Like Queries. And it’s really SQL-Like, and so if you are from SQL world like me – you would not face any issues in getting up and running in seconds by referring to the nicely written documentation.
And other point to consider here is that even though it’s SQL-Like, you’ll be able to analyze considerable number of rows in few seconds. Let me give you an example: I played with a sample (called gsod) which had 115M rows and as per my experiments, I was able to get answers to simple computations like max, mean, avg, etc in less than couple of seconds. And little complex queries having where, joins and group by in around 5-6 seconds. Your results may vary depending on the type of query you run but the BOTTOMLINE is that it is FAST. that’s a good news!
BigQuery is Fast!
But what bothers me is that How am I suppose to “UPLOAD” lots of data on the Google CLOUD. It takes time, right? But I guess that’s an issue with every cloud based BigData offering. But here’s what I am thinking – If your data is already on the cloud. for e.g. Amazon’s or Microsoft’s – Does it not make sense to run analytic’s on Amazon’s and Microsoft’s cloud instead of porting your data to Google’s?
[Sidenote: I like it that Hadoop on Azure allows Amazon S3 data source. Nice move!]
My concern: Time spent in uploading truckload of data to Google’s cloud just so that we can use it for BigQuery
And even if you have your data on GAE data-store, you’ll have to uplaod your data to BigQuery separately. Source
Zooming out for a moment, I feel the Goal of BigQuery was to offer an easy to use BigData platform, And I feel that’s what they have delivered:
An easy-to-use + easy-to-setup “Hadoop+Hive” Like Offering.
[Update: Aug 20th 2012: I have been thinking about it more and I realized that BigQuery is more about satisfying real-time Big Data Scenario's. And Hadoop/Hive/MapReduce is more about Batch Oriented analysis and it's great if you need to pre-process tons and tons of data]
But this “easiness” means that It is NOT as advanced as a Hadoop Installation (or Hadoop-on-Azure or Amazon’s elastic-map-reduce). But again, it’s easier and faster to get started with BigQuery. I guess, it just depends on what you are trying to achieve and based on that you’ll have to figure which is right tool for your scenario. No generic answer here, Sorry!
And BTW BigQuery supports only CSV – Talk about Variability (One of the V’s of BigData!). Let’s not get into that. I just wanted to Point that out because if you’re looking to analyze data-sets that cannot be converted to CSV for running SQL-Like Queries on top of them then BigQuery is not for you.
Try out BigQuery. It’s easy to get started. It’s powerful if SQL-Like queries are all what you’ll need to analyze your data. If you are BigData enthusiast/expert/student – It’ll be a nice exercise to mentally compare other BigData offerings with BigQuery.
If you decide to try BigQuery or have already tried it out, I’ll love to hear what you think of it. Please leave a comment!
UPDATE (based on Michael Manoochehri’s comment): I didn’t implied that it is prohibitively expensive to upload data to BigQuery. Because I know, it’s NOT! Here is the result that Michael Manoochehri shared: As a test I once ingested about 350 Gb of CSV data (split into 10gb raw files, then I gzipped each one into ~1Gb). I ingested the entire batch using the bq command line tool, and had the entire dataset in BigQuery in just a few hours. I agree that it’s not 100% trivial to move 300 Gb of data from a local cluster into Google’s cloud – but it’s not really that difficult.
[Update: Aug 20th 2012: If you are interested in the Mechanics behind BigQuery - search for "Google Dremel Whitepaper". it's an amazing read]