hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HiveAws/HivingS3nRemotely" by JoydeepSensarma
Date Mon, 18 May 2009 12:50:57 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by JoydeepSensarma:
http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely

The comment on the change is:
add sample TPCH query ..

------------------------------------------------------------------------------
   * These SQL queries should be executed using compute resources provisioned from EC2. Ideally,
the compute resources can be provisioned in proportion to the compute costs of the queries.
   * Results from such queries that need to be retained for the long term can be stored back
in S3
  
- This tutorial walks through the steps required to accomplish this.
+ This tutorial walks through the steps required to accomplish this. Please send email to
the hive-users mailing list if you run into any problems with this tutorial.
  
  == Required Software ==
  On the client side (PC), the following are required:
@@ -34, +34 @@

  
The values assigned to the s3n keys are just an example and need to be filled in by the user
as per their own account details. An explanation of the rest of the values can be found in the
[#ConfigHell Configuration Guide] section below.
  
- Instead of specifying these command lines each time the CLI is bought up - we can store
these persistently within {{{hive-site.xml}}} in the {{{conf/} directory of the Hive installation
(from where they will be picked up each time the CLI is launched.
+ Instead of specifying these command lines each time the CLI is brought up - we can store
these persistently within {{{hive-site.xml}}} in the {{{conf/}}} directory of the Hive installation
(from where they will be picked up each time the CLI is launched).
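+ 
+ For illustration, here is a minimal {{{hive-site.xml}}} sketch holding just the s3n credentials
(these two property names are the standard Hadoop s3n settings; the values are placeholders to be
replaced with your own account details):
+ {{{
+ <configuration>
+   <!-- placeholder values: fill in your own AWS credentials -->
+   <property>
+     <name>fs.s3n.awsAccessKeyId</name>
+     <value>YOUR_AWS_ACCESS_KEY_ID</value>
+   </property>
+   <property>
+     <name>fs.s3n.awsSecretAccessKey</name>
+     <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
+   </property>
+ </configuration>
+ }}}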
  
  == Example Public Data Sets ==
  Some example data files are provided in the S3 bucket {{{data.s3ndemo.hive}}}. We will use
them for the SQL examples in this tutorial:
@@ -53, +53 @@

  alter table pkv add partition (insertdate='2008-01-01') location 's3n://data.s3ndemo.hive/pkv/2008-01-01';
  }}}
   * Declare a table over a TPCH table:
+ {{{
+ create external table lineitem (
+     l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int,
+     l_quantity double, l_extendedprice double, l_discount double, l_tax double,
+     l_returnflag string, l_linestatus string, l_shipdate string, l_commitdate string,
+     l_receiptdate string, l_shipinstruct string, l_shipmode string, l_comment string)
+   row format delimited fields terminated by '|'
+   location 's3n://data.s3ndemo.hive/tpch/lineitem';
+ }}}
    
+ The TPCH DDL statements are slightly modified versions of the original TPCH statements (since
Hive does not support all the data types used in TPCH).
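+ 
+ Once the tables are declared, it is easy to sanity-check them from the CLI before running
anything heavier (plain Hive commands; {{{lineitem}}} is the table declared above):
+ {{{
+ show tables;
+ describe lineitem;
+ }}}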
+ 
+ == Executing Queries ==
+ Hive can execute some queries without a Hadoop cluster. For example:
+  * {{{ select * from kv limit 10; }}}
+ 
+ {{{select *}}} queries with limit clauses can be performed on the PC itself. If you are
doing this - please note the following (a short example follows this list):
+  * {{{fs.default.name}}} should be set to {{{file:///}}}
+  * '''Please do not select all the rows from large data sets'''. This will cause a
large amount of data to be downloaded from S3 to outside AWS and incur charges on the host
account for these data sets!
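+ 
+ For example, a local run of the limit query could look like this in the CLI (a minimal sketch;
{{{kv}}} is the sample table declared earlier):
+ {{{
+ set fs.default.name=file:///;
+ select * from kv limit 10;
+ }}}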
+ 
+ Of course - the real fun is in doing some non-trivial queries using map-reduce. For this
we will need a Hadoop cluster (finally!):
+ 
+  1. Start a Hadoop cluster on EC2 (using directions from the [http://wiki.apache.org/hadoop/AmazonEC2
Hadoop-EC2 tutorial] - but making sure to use a version of the ec2 scripts with HADOOP-5839 applied!).
The user is free to allocate any number of nodes they wish - although this tutorial was tried
out with 10 nodes.
+  2. Note down the public hostname of the master node. For example - the public hostname
may be something like: {{{ec2-75-101-170-131.compute-1.amazonaws.com}}}
+  3. Point the Hive CLI to use this Hadoop cluster by executing:
+   * {{{set fs.default.name=hdfs://ec2-75-101-170-131.compute-1.amazonaws.com:50001; }}}
+   * {{{set mapred.job.tracker=ec2-75-101-170-131.compute-1.amazonaws.com:50002;}}}
+  4. Set up an ssh tunnel via port 2600 to the Hadoop master (see the note after this list).
This can be done by executing the following from another terminal/window:
+   * {{{ ssh -i <path to Hadoop private key> -D 2600 ec2-75-101-170-131.compute-1.amazonaws.com
}}}
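+ 
+ The {{{-D 2600}}} option opens a local SOCKS proxy on port 2600, which the Hadoop client inside
the Hive CLI can use to reach the cluster. A minimal sketch of the client-side settings this assumes
(the authoritative list is in the [#ConfigHell Configuration Guide] section):
+ {{{
+ set hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory;
+ set hadoop.socks.server=localhost:2600;
+ }}}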
+ 
+ Now we are all set up. The sample query from TPCH (1.sql) can be tried as follows:
+ {{{
+ insert overwrite directory '/tmp/tpcresults/1-2.sql'
+ select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
+     sum(l_extendedprice) as sum_base_price,
+     sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
+     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
+     avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
+     avg(l_discount) as avg_disc, count(1) as count_order
+ from lineitem
+ where l_shipdate <= to_date('1998-12-01')
+ group by l_returnflag, l_linestatus;
+ }}}
+ 
+ This launches one map-reduce job, and on 10 nodes with default hadoop/hive settings it
took about 10 minutes.
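+ 
+ The results end up in the cluster's HDFS under the directory named in the query. One way to
take a quick look at them (a sketch, assuming a shell on the master node - for instance the ssh
session opened in step 4):
+ {{{
+ hadoop fs -cat /tmp/tpcresults/1-2.sql/*
+ }}}
+ Results that need to be retained for the long term can then be copied back into an S3 bucket
you own, as suggested in the introduction.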
  
  
  == Appendix ==
