Thursday, November 18, 2010

First look at Google BigQuery

In this post I will mainly talk about the Google BigQuery API. It is a major step in the computing world: a database offered as a service in a cloud computing environment. The BigQuery API is designed to analyse huge amounts of data in a very short timeframe. The interface of the service is quite straightforward and involves HTTP calls to fetch data from the tables.
Since Google BigTable is a NoSQL-oriented database, BigQuery does not support JOINs, but it does support many other features such as grouping, ordering, ranges, filtering and so on.


The BigQuery API only provides a basic querying mechanism to end users. There is no functionality to manipulate the data once it is uploaded into the BigTable structure: you can only append more data or fire SELECT queries against it. This makes the BigQuery engine more of an analytics tool over trillions of rows of data.

BigTable runs on Google's own storage servers, so no additional hardware is needed on your side. Moreover, the API is exposed via plain HTTP and JSON, which are easy to integrate into any application. One can leverage these benefits to cut down on the hardware costs of setting up high-performance systems for analysis.

The whole process of carrying out analysis on some dataset involves two basic steps:

1. Importing data into the BigTable structure (Through Google Storage and import job)
2. Running the queries against that data

Importing the data:

In order to analyse your data, follow these steps to port it into the Google BigTable system:

1. Ensure that your Google account has both Google Storage and BigQuery enabled.
2. Upload the data in CSV format to Google Storage using their storage manager HTTP interface. Header information is not required, but the columns of the CSV must match the structure of the table as specified in the subsequent steps.
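As an illustration, a header-less CSV matching the table structure used in step 4 (name, age, weight, is_magic) could be produced with Python's csv module. The file name and row values here are made up for illustration:

```python
import csv

# Sketch: writing a header-less CSV whose columns match the table
# structure defined in step 4 (name, age, weight, is_magic).
# The file name and the rows are hypothetical examples.

rows = [
    ("Alice", 30, 55.5, True),
    ("Bob", 25, 72.0, False),
]

with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)  # no header row, as noted above
```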
3. Use the cURL tool to get the authorization key, which is used in the subsequent calls to the exposed HTTP service. The command to be executed is:

curl -X POST -d accountType=GOOGLE -d Email=YOUR_EMAIL@gmail.com -d Passwd=YOUR_PASSWORD -d service=ndev -d source=google-bigquery-manualimport-1 -H "Content-Type: application/x-www-form-urlencoded" https://www.google.com/accounts/ClientLogin
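The ClientLogin response is plain text with one key=value pair per line (SID, LSID, Auth). A small Python sketch of extracting the Auth value, using a made-up sample response:

```python
# Sketch: pulling the authorization key out of a ClientLogin response.
# ClientLogin replies with key=value lines (SID, LSID, Auth); the Auth
# value is the key the later calls need. The sample text below is an
# illustrative placeholder, not a real token.

def parse_auth_key(response_text):
    """Return the value of the Auth= line, or None if absent."""
    for line in response_text.splitlines():
        if line.startswith("Auth="):
            return line[len("Auth="):]
    return None

sample = "SID=abc\nLSID=def\nAuth=EXAMPLE_AUTH_KEY"
print(parse_auth_key(sample))  # -> EXAMPLE_AUTH_KEY
```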

This call will return a plain-text response whose last line has the form Auth=<key>; that value is the authorization key.

Just keep a note of the key, as it will be used in the subsequent calls. Down the line, I will refer to it as AUTH_KEY.
4. Create the table using the following command:

curl -H "Authorization: GoogleLogin auth=AUTH_KEY" -X POST -H "Content-type: application/json" -d '{"data": {"fields": [ {"type": "string", "id": "name", "mode": "REQUIRED"}, {"type": "integer", "id": "age", "mode": "REQUIRED"}, {"type": "float", "id": "weight", "mode": "REQUIRED"}, {"type": "boolean", "id": "is_magic", "mode": "REQUIRED"} ], "name": "bucket_name/testtable" }}' 'https://www.googleapis.com/bigquery/v1/tables'
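Rather than inlining the JSON in the curl command, the table-creation payload can be built programmatically. A sketch, mirroring the field names and types from the command above ("bucket_name" remains a placeholder):

```python
import json

# Sketch: building the table-creation body from step 4 in Python.
# All fields are marked REQUIRED, matching the curl command above;
# the table name "bucket_name/testtable" is a placeholder.

def table_payload(name, fields):
    """fields: list of (id, type) tuples, all marked REQUIRED."""
    return {
        "data": {
            "name": name,
            "fields": [
                {"id": fid, "type": ftype, "mode": "REQUIRED"}
                for fid, ftype in fields
            ],
        }
    }

payload = table_payload("bucket_name/testtable", [
    ("name", "string"), ("age", "integer"),
    ("weight", "float"), ("is_magic", "boolean"),
])
print(json.dumps(payload))  # body for the curl -d argument
```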

Here, bucket_name is the name of the bucket in which you want the reference to your table to be created. Buckets are simply folders inside Google Storage.
5. After you have successfully created the table, an import job needs to be triggered to port the uploaded CSV data into the BigTable structure. The following command does this:

curl -H "Authorization: GoogleLogin auth=AUTH_KEY" -X POST -H "Content-type: application/json" -d '{"data": {"sources": [{"uri": "bucket_name/your_file.csv"}]}}' 'https://www.googleapis.com/bigquery/v1/tables/bucket_name%2Ftesttable/imports'

* Please note that testtable is the table you created in step 4.
* The response to such a request is a JSON string of the format: {"data":{"kind":"bigquery#import_id","table":"/testtable","import":""}}
* The import_id is useful for checking the current state of the import process, which also reports any errors encountered. Any error simply fails the import process and rolls everything back.
6. To know the status of your import process, fire the following command, appending the import_id returned in step 5 to the URL:

curl -X GET https://www.googleapis.com/bigquery/v1/tables/bucket_name%2Ftesttable/imports/ -H "Authorization: GoogleLogin auth=AUTH_KEY"

Querying the data:

After the data is successfully ported, you can query it using BigQuery. Just download the bqshell tool and build the code with all the required dependencies. The tool is written in Python and comes with a detailed how-to for installing it.

The tool has a login console in which you specify your username and password, after which you can query the sample datasets or the datasets you created using cURL. It parses the JSON responses returned by the BigQuery API and displays them in a SQL-output fashion. Alternatively, you can use cURL calls to query the database and inspect the JSON response yourself.
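For the cURL route, the query needs to be URL-encoded. A sketch of building such a query URL in Python; the endpoint path and the "q" parameter are assumptions about the v1 API of the time, and the SELECT itself just mirrors the test table used earlier:

```python
from urllib.parse import urlencode

# Sketch: constructing a query URL to pass to cURL. The endpoint path
# and the "q" parameter name are assumptions about the v1 API, not
# confirmed details; the SQL text references the placeholder table
# from the import steps above.

def query_url(sql):
    base = "https://www.googleapis.com/bigquery/v1/query"
    return base + "?" + urlencode({"q": sql})

url = query_url("SELECT name, age FROM bucket_name/testtable LIMIT 10")
print(url)
# Then: curl -H "Authorization: GoogleLogin auth=AUTH_KEY" '<url>'
```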

Conclusion:

* The BigQuery API shows good performance in scenarios where huge amounts of data need to be processed. Running queries against almost 28 million rows uploaded in a test data set returned responses in just 2-3 seconds, including the time to post the request and receive the response.

* Google themselves provide sample data sets, one of which contains 60 billion records; queries against it execute in 4-8 seconds. For more details, please visit the following link: http://code.google.com/apis/bigquery/docs/dataset-mlab.html

* One of the drawbacks is that there is no way to insert data without the upload-and-import-job process, and records cannot be deleted once inserted. To change the data, you have to delete the table and repeat the complete import process. One thing worth mentioning: if you only need to add data to an existing table, you can do so by firing another import job for a CSV file containing the additional data.
