From: Apache Wiki
To: core-commits@hadoop.apache.org
Date: Sun, 17 May 2009 13:08:39 -0000
Subject: [Hadoop Wiki] Update of "Hive/HiveAws" by JoydeepSensarma

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by JoydeepSensarma:
http://wiki.apache.org/hadoop/Hive/HiveAws

The comment on the change is:
under construction

New page:
= Hive and Amazon Web Services =

== Background ==

This document explores the different ways of leveraging Hive on Amazon Web Services - namely [[http://aws.amazon.com/s3 S3]], [[http://aws.amazon.com/EC2 EC2]] and [[http://aws.amazon.com/elasticmapreduce/ Elastic Map-Reduce]].

Hadoop already has a rich tradition of being run on EC2 and S3. These setups are well documented here and are a must-read:
 * [[http://wiki.apache.org/hadoop/AmazonS3 Hadoop and S3]]
 * [[http://wiki.apache.org/hadoop/AmazonEC2 Hadoop and EC2]]

The second document also has pointers on how to get started using EC2 and S3. For people who are new to S3, there are a few helpful hints in the [#S3n00b S3 for n00bs] section below. The rest of the documentation below assumes that the reader can launch a Hadoop cluster in EC2 and run some simple Hadoop jobs.

== Introduction to Hive and AWS ==

There are three separate questions to consider when running Hive on AWS:
 1. Where to run the [wiki:LanguageManual/Cli Hive CLI] from, and where to store the metastore db (which contains table and schema definitions).
 1. How to define Hive tables over existing datasets (potentially in S3).
 1. How to dispatch Hive queries (each of which is executed using one or more map-reduce programs) to a Hadoop cluster running in EC2.

We walk through the choices involved here and then show some simple sample configurations.

=== Running the Hive CLI ===

The Hive CLI environment is completely independent of Hadoop. The CLI takes in queries, compiles them into a plan consisting of map-reduce jobs, and then submits them to the configured Hadoop cluster.
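Which cluster those jobs are submitted to is determined by the ordinary Hadoop client configuration that Hive picks up from the Hadoop installation it runs against. As a minimal sketch (assuming a Hadoop 0.18/0.19-style hadoop-site.xml; the EC2 host name and ports below are placeholders, not values from this document), the two relevant properties are fs.default.name and mapred.job.tracker:

{{{
<!-- hadoop-site.xml fragment on the machine running the Hive CLI.
     Sketch only: ec2-12-34-56-78.compute-1.amazonaws.com and the
     ports 9000/9001 are placeholders - substitute the public DNS
     name of your Hadoop master and the ports your cluster actually
     uses for the NameNode and JobTracker. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-12-34-56-78.compute-1.amazonaws.com:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>ec2-12-34-56-78.compute-1.amazonaws.com:9001</value>
  </property>
</configuration>
}}}

With these pointing at the EC2 master, queries typed into the CLI are compiled locally and the resulting map-reduce jobs execute on the cluster.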
For this reason the CLI can be run from any node that has a Hive distribution, a Java Runtime Environment, and connectivity to the Hadoop cluster. There are two choices on where to run the CLI from:

 1. Run the Hive CLI from within EC2 - the Hadoop master node being the obvious choice. One problem here is the lack of comprehensive AMIs that bundle different versions of Hive and Hadoop distributions (and the difficulty of producing them, considering the large number of such combinations). [[http://www.cloudera.com/hadoop-ec2 Cloudera]] provides some AMIs that bundle Hive with Hadoop - although the choice of Hive and Hadoop versions may be restricted. Another issue is that any required map-reduce scripts may also need to be copied to the master node.
 2. Run the Hive CLI from outside EC2. In this case the user installs a Hive distribution on a personal machine; the main trick with this option is connecting to the Hadoop cluster - both for submitting jobs and for reading and writing files. The section on [[http://wiki.apache.org/hadoop/AmazonEC2#FromRemoteMachine Running jobs from a remote machine]] details how this can be done. [#CaseStudyOne Case Study I] goes into this in more detail.

By default, Hive stores its metadata in a local Derby database (created under a folder named metastore_db in the directory from which hive is launched).

 1. For Option 1, the metastore db can/should be zipped up and stored persistently in S3 (before terminating the Hadoop cluster) and conversely restored from there the next time a Hadoop cluster is launched. One can also consider alternative persistent stores in AWS, such as EBS.
 2. For Option 2, the metastore db can be stored on local disk and does not need to be stored in the cloud.

=== Loading Data into Hive Tables ===

Before getting into this, it is useful to go over the main storage choices in a Hadoop/EC2 environment:

 * S3 is an excellent place to store data for the long term. There are a couple of choices on how S3 can be used:
  * Data can be stored as files within S3 using tools like aws and s3curl, as detailed in the [#S3n00b S3 for n00bs] section. This suffers from S3's 5GB limit on file size. But the nice thing is that there are probably scores of tools that can help in copying/replicating data to S3 in this manner.
  * Alternatively, Hadoop can use S3 as a backing store for HDFS. In this case data can only be read and written via HDFS.
 * An HDFS instance on the local drives of the Hadoop cluster allocated in EC2 - unlike S3, this storage lasts only as long as the cluster itself.

[[Anchor(S3n00b)]]
=== S3 for n00bs ===

For n00bs, one of the useful things to understand is how S3 is used as a file system. Each S3 bucket can be considered the root of a file system. Files within this file system become objects stored in S3, where the path name of the file (path components joined with '/') becomes the S3 key within the bucket and the file contents become the value. Tools like [[https://addons.mozilla.org/en-US/firefox/addon/3247 S3Fox]] and the native S3 FileSystem in Hadoop (s3n) show a directory structure that is implied by the common prefixes found in the keys. Not all tools are able to create an empty directory - S3Fox, in particular, can (by creating an empty key representing the directory). Other popular