hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/HiveAws" by JoydeepSensarma
Date Sun, 17 May 2009 16:22:34 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by JoydeepSensarma:
http://wiki.apache.org/hadoop/Hive/HiveAws

------------------------------------------------------------------------------
    * If the default Derby database is used, then one has to think about persisting state
beyond the lifetime of one Hadoop cluster. S3 is an obvious choice, but the user must restore
Hive metadata when the cluster is launched and back it up when it is terminated (see the sketch
after this list).
  
   2. Run the Hive CLI remotely from outside EC2. In this case, the user installs a Hive distribution
on a personal workstation. The main trick with this option is connecting to the Hadoop cluster,
both for submitting jobs and for reading and writing files to HDFS. The section on [[http://wiki.apache.org/hadoop/AmazonEC2#FromRemoteMachine
Running jobs from a remote machine]] details how this can be done, and [wiki:/HivingS3nRemotely
Case Study 1] goes into the setup in more detail. This option solves the problems
mentioned above:
-   * Stock Hadoop AMIs can be used. The user can run any version of Hive on their workstation,
launch a Hadoop cluster with the desired version etc. on EC2 and start running queries.
+   * Stock Hadoop AMIs can be used. The user can run any version of Hive on their workstation,
launch a Hadoop cluster with the desired Hadoop version etc. on EC2 and start running queries.
    * Map-reduce scripts are automatically pushed by Hive into Hadoop's distributed cache
at job submission time and do not need to be copied to the Hadoop machines.
    * Hive Metadata can be stored on local disk painlessly.
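  A minimal sketch of the backup and restore step mentioned for the Derby metadata above, assuming
s3cmd is installed and configured with the account's AWS credentials, and using a hypothetical
bucket named my-hive-metadata; {{{metastore_db}}} is the directory Derby creates by default under
the directory from which the Hive CLI is started:
  {{{
# before terminating the Hadoop cluster: copy the Derby metastore directory to S3
s3cmd sync metastore_db/ s3://my-hive-metadata/metastore_db/

# after launching a new cluster: pull it back before starting the Hive CLI
s3cmd sync s3://my-hive-metadata/metastore_db/ metastore_db/
  }}}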
  
@@ -56, +56 @@

  
  == Submitting jobs to a Hadoop cluster ==
  This applies particularly when the Hive CLI is run remotely. A single Hive CLI session can switch
across different Hadoop clusters (especially as clusters are brought up and terminated). Only
two configuration variables:
-  * fs.default.name
+  * {{{fs.default.name}}}
-  * mapred.job.tracker
+  * {{{mapred.job.tracker}}}
  need to be changed to point the CLI from one Hadoop cluster to another. Beware, though, that
tables stored in the previous HDFS instance will not be accessible once the CLI switches from one
cluster to another. Again, more details can be found in [wiki:/HivingS3nRemotely Case Study
1].
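  For illustration, a CLI session could be re-pointed at a freshly launched cluster with the
{{{set}}} command; the host name and ports below are placeholders for the new cluster's NameNode
and JobTracker addresses:
  {{{
hive> set fs.default.name=hdfs://<ec2-master-public-dns>:<namenode-port>;
hive> set mapred.job.tracker=<ec2-master-public-dns>:<jobtracker-port>;
  }}}
  Queries issued after this in the same session are submitted to the new cluster.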
  
  == Case Studies ==
   1. [wiki:/HivingS3nRemotely Querying files in S3 using EC2, Hive and Hadoop ] 
  
  == Appendix ==
- 
  [[Anchor(S3n00b)]]
  === S3 for n00bs ===
- One of the things useful to understand is how S3 is normally used as a file system. Each
S3 bucket can be considered the root of a file system. Files within this file system
become objects stored in S3, where the path name of the file (path components joined with
'/') becomes the S3 key within the bucket and the file contents become the value. Tools
like [[https://addons.mozilla.org/en-US/firefox/addon/3247 S3Fox]] and the native S3 FileSystem
in Hadoop (s3n) show a directory structure that is implied by the common prefixes found in
the keys. Not all tools are able to create an empty directory. S3Fox, in particular, can
(by creating an empty key representing the directory). Other popular tools like [[http://timkay.com/aws/
aws]], [[http://s3tools.org/s3cmd s3cmd]] and [[http://developer.amazonwebservices.com/connect/entry.jspa?externalID=128
s3curl]] provide convenient ways of accessing S3 from the command line, but don't have the
capability of creating empty directories.
+ One of the things useful to understand is how S3 is normally used as a file system. Each
S3 bucket can be considered the root of a file system. Files within this file system
become objects stored in S3, where the path name of the file (path components joined with
'/') becomes the S3 key within the bucket and the file contents become the value. Tools
like [[https://addons.mozilla.org/en-US/firefox/addon/3247 S3Fox]] and the native S3 !FileSystem
in Hadoop (s3n) show a directory structure that is implied by the common prefixes found in
the keys. Not all tools are able to create an empty directory. S3Fox, in particular, can
(by creating an empty key representing the directory). Other popular tools like [[http://timkay.com/aws/
aws]], [[http://s3tools.org/s3cmd s3cmd]] and [[http://developer.amazonwebservices.com/connect/entry.jspa?externalID=128
s3curl]] provide convenient ways of accessing S3 from the command line, but don't have the
capability of creating empty directories.
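  As a small illustration of the path-to-key mapping, assuming a hypothetical bucket named
mybucket and AWS credentials for s3n already configured in hadoop-site.xml:
  {{{
# copy a local file into the bucket through Hadoop's native S3 filesystem (s3n)
hadoop fs -put /tmp/data.txt s3n://mybucket/warehouse/tiny_table/data.txt

# list the "directory" implied by the common key prefix
hadoop fs -ls s3n://mybucket/warehouse/tiny_table/
  }}}
  The bucket then holds a single object whose key is warehouse/tiny_table/data.txt; the intermediate
directories are only implied by the '/' separators in the key.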
  
