hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HiveAws/HivingS3nRemotely" by JoydeepSensarma
Date Tue, 19 May 2009 15:41:20 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by JoydeepSensarma:
http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely

------------------------------------------------------------------------------
  
  == Hive Setup ==
  A few Hadoop configuration variables must be specified for all Hive sessions. These can be set using the Hive CLI as follows:
- {{{
- set hadoop.socks.server=localhost:2600;
+ {{{hive> set hadoop.socks.server=localhost:2600; 
- set hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory;
+ hive> set hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory;
- set hadoop.job.ugi=root,root;
+ hive> set hadoop.job.ugi=root,root;
- set mapred.map.tasks=40;
+ hive> set mapred.map.tasks=40;
- set mapred.reduce.tasks=-1;
+ hive> set mapred.reduce.tasks=-1;
- set fs.s3n.awsSecretAccessKey=2GAHKWG3+1wxcqyhpj5b1Ggqc0TIxj21DKkidjfz
+ hive> set fs.s3n.awsSecretAccessKey=2GAHKWG3+1wxcqyhpj5b1Ggqc0TIxj21DKkidjfz;
- set fs.s3n.awsAccessKeyId=1B5JYHPQCXW13GWKHAG2
+ hive> set fs.s3n.awsAccessKeyId=1B5JYHPQCXW13GWKHAG2;}}}
- }}}
  
  The values assigned to the s3n keys above are just examples and must be replaced by the user with their own account credentials. The remaining values are explained in the [#ConfigHell Configuration Guide] section below.
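+ The values in effect can be checked from the same session (a quick sketch - {{{set}}} with no argument dumps all variables, which will include the AWS keys in plain text):
+ {{{hive> set fs.s3n.awsAccessKeyId;
+ hive> set;}}}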
  
@@ -38, +36 @@

  
  == Example Public Data Sets ==
  Some example data files are provided in the S3 bucket {{{data.s3ndemo.hive}}}. We will use them for the SQL examples in this tutorial:
-  * s3n://data.s3ndemo.hive/kv - Key Value pairs in a text file
+  * {{{s3n://data.s3ndemo.hive/kv}}} - Key-value pairs in a text file
-  * s3n://data.s3ndemo.hive/pkv - Key Value pairs in a directories that are partitioned by date
+  * {{{s3n://data.s3ndemo.hive/pkv}}} - Key-value pairs in directories that are partitioned by date
-  * s3n://data.s3ndemo.hive/tpch/* - Eight directories containing data corresponding to the eight tables used by [[http://tpc.org/tpch/ TPCH benchmark]]. The data is generated for a scale 10 (approx 10GB) database using the standard {{{dbgen}}} utility provided by TPCH.
+  * {{{s3n://data.s3ndemo.hive/tpch/*}}} - Eight directories containing data corresponding to the eight tables used by the [[http://tpc.org/tpch/ TPCH benchmark]]. The data is generated for a scale 10 (approx 10GB) database using the standard {{{dbgen}}} utility provided by TPCH.
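+ To sanity-check S3 access outside of Hive (a quick sketch - this assumes the same s3n access keys are also present in the Hadoop client configuration), the bucket contents can be listed from the shell:
+  . {{{linux> bin/hadoop fs -ls s3n://data.s3ndemo.hive/kv}}}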
  
  == Setting up tables (DDL Statements) ==
  In this example we will use HDFS as the default table store for Hive and create Hive tables over the files in S3 using Hive's {{{external tables}}} functionality. Executing these DDL commands does not require a functioning Hadoop cluster, since we are only setting up metadata:
  
   * Declare a simple table containing key-value pairs:
- {{{create external table kv (key int, values string)  location 's3n://data.s3ndemo.hive/kv';}}}
+  . {{{hive> create external table kv (key int, values string)  location 's3n://data.s3ndemo.hive/kv';}}}
   * Declare a partitioned table over a nested directory containing key-value pairs and associate table partitions with the corresponding directories:
- {{{
+  . {{{
- create external table pkv (key int, values string) partitioned by (insertdate string);
+ hive> create external table pkv (key int, values string) partitioned by (insertdate string);
- alter table pkv add partition (insertdate='2008-01-01') location 's3n://data.s3ndemo.hive/pkv/2008-01-01';
+ hive> alter table pkv add partition (insertdate='2008-01-01') location 's3n://data.s3ndemo.hive/pkv/2008-01-01';
  }}}
   * Declare a table over a TPCH table:
- {{{
- create external table lineitem (l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity double, l_extendedprice double, l_discount double, l_tax double, l_returnflag string, l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string, l_shipinstruct string, l_shipmode string, l_comment string) row format delimited fields terminated by '|' location 's3n://data.s3ndemo.hive/tpch/lineitem';
+  . {{{
+ hive> create external table lineitem (
+ l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity double,
+ l_extendedprice double, l_discount double, l_tax double, l_returnflag string, 
+ l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string,
+ l_shipinstruct string, l_shipmode string, l_comment string) 
+ row format delimited fields terminated by '|' 
+ location 's3n://data.s3ndemo.hive/tpch/lineitem';
  }}}
    
  The TPCH DDL statements are slightly modified versions of the original TPCH statements (since
Hive does not support all the data types used in TPCH).
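+ Since these DDL statements only touch Hive's metastore, the table definitions can be verified immediately from the CLI (no cluster required), for example:
+ {{{hive> show tables;
+ hive> describe lineitem;}}}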
  
  == Executing Queries ==
  Hive can execute some queries without a Hadoop cluster. For example:
-  * {{{ select * from kv limit 10; }}}
+ {{{hive> select * from kv limit 10; }}}
  
- {{{select *}}} queries with limit clauses can be performed on the PC itself. If you are doing this - please note that:
+ {{{select *}}} queries with limit clauses can be performed locally by the Hive CLI itself. If you are doing this, please note that:
-  * {{{fs.default.name}}} should be set to {{{file:///}}}
+  * {{{fs.default.name}}} should be set to {{{file:///}}} in case the CLI is not configured to use a working Hadoop cluster
   * '''Please, please do not select all the rows from large data sets'''. This will cause a large amount of data to be downloaded from S3 to hosts outside AWS and incur charges on the account hosting these data sets!
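+ For example, a purely local session (a sketch, assuming no cluster is configured) might look like:
+ {{{hive> set fs.default.name=file:///;
+ hive> select * from kv limit 10;}}}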
  
  Of course - the real fun is in doing some non-trivial queries using map-reduce. For this
we will need a Hadoop cluster (finally!):
- 
-  1. Start a Hadoop cluster on EC2 (using directions from [[http://wiki.apache.org/hadoop/AmazonEC2 Hadoop-EC2 tutorial] - but making sure to use a version of ec2 scripts with HADOOP-5839 applied! User is free to allocate any number of nodes they wish - although this tutorial was tried out with
+  1. Start a Hadoop cluster on EC2 (using directions from the [[http://wiki.apache.org/hadoop/AmazonEC2 Hadoop-EC2 tutorial]], but making sure to use a version of the ec2 scripts with HADOOP-5839 applied). Users are free to allocate any number of nodes they wish, although this tutorial was tried out with 10 nodes (see the sketch after this list for an example invocation).
-  2. Note down the public hostnames of the master node. For example - the public hostname maybe something like: {{{ec2-75-101-170-131.compute-1.amazonaws.com}}}
+  1. Note down the public hostname of the master node. For example, the public hostname may be something like:
+  . {{{ec2-75-101-170-131.compute-1.amazonaws.com}}}
-  3. Point the Hive CLI to use this Hadoop cluster by executing:
+  1.#3 Point the Hive CLI to use this Hadoop cluster by executing:
-   * {{{set fs.default.name=hdfs://ec2-75-101-170-131.compute-1.amazonaws.com:50001; }}}
+  . {{{hive> set fs.default.name=hdfs://ec2-75-101-170-131.compute-1.amazonaws.com:50001;
-   * {{{set mapred.job.tracker=ec2-75-101-170-131.compute-1.amazonaws.com:50002;}}}
+ hive> set mapred.job.tracker=ec2-75-101-170-131.compute-1.amazonaws.com:50002;}}}
-  4. Set up a ssh tunnel via port 2600 to the Hadoop master. This can be done by executing the following from another terminal/window:
+  1.#4 Set up an ssh tunnel via port 2600 to the Hadoop master. This can be done by executing the following from another terminal/window:
-   * {{{ ssh -i <path to Hadoop private key path> -D 2600 ec2-75-101-170-131.compute-1.amazonaws.com }}}
+  . {{{linux> ssh -i <path to Hadoop private key> -D 2600 ec2-75-101-170-131.compute-1.amazonaws.com}}}
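+ As an illustration of step 1 (a rough sketch only - the cluster name is a placeholder, and the exact script name and arguments depend on the Hadoop version and the local ec2 script setup), a 10 node cluster might be launched as follows:
+  . {{{linux> bin/hadoop-ec2 launch-cluster my-hive-cluster 10}}}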
  
  Now we are all set up. The sample query from TPCH (1.sql) can be tried as follows:
-  * {{{ insert overwrite directory '/tmp/tpcresults/1-2.sql' select l_returnflag, l_linestatus, sum ( l_quantity ) as sum_qty, sum ( l_extendedprice ) as sum_base_price, sum ( l_extendedprice * ( 1 - l_discount )) as sub_disc_price, sum ( l_extendedprice * ( 1 - l_discount ) * ( 1 + l_tax )) as sum_charge, avg ( l_quantity ) as avg_qty, avg ( l_extendedprice ) as avg_price, avg ( l_discount ) as avg_disc, count ( 1 ) as count_order from lineitem where l_shipdate <= to_date('1998-12-01') group by l_returnflag, l_linestatus ; }}}
  
- This launches one map-reduce job and on 10 nodes with default hadoop/hive settings - this took about 10 minutes.
+ {{{
+ hive> insert overwrite directory '/tmp/tpcresults-1.sql'
+  select l_returnflag, l_linestatus,
+  sum ( l_quantity ) as sum_qty, sum ( l_extendedprice ) as sum_base_price,
+  sum ( l_extendedprice * ( 1 - l_discount )) as sub_disc_price,
+  sum ( l_extendedprice * ( 1 - l_discount ) * ( 1 + l_tax )) as sum_charge,
+  avg ( l_quantity ) as avg_qty, avg ( l_extendedprice ) as avg_price,
+  avg ( l_discount ) as avg_disc, count ( 1 ) as count_order
+  from lineitem where l_shipdate <= to_date('1998-12-01') group by l_returnflag, l_linestatus;
+ }}}
  
+ This launches one map-reduce job which, on 10 nodes with default hadoop/hive settings, took about 10 minutes. The results in this case are stored in HDFS and can be obtained by doing a {{{dfs -cat /tmp/tpcresults-1.sql/*}}} - either from bin/hadoop or from the hive CLI, as shown below.
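+ For example, either of the following retrieves the results (a sketch, assuming the query above has completed):
+ {{{hive> dfs -cat /tmp/tpcresults-1.sql/*;
+ linux> bin/hadoop dfs -cat /tmp/tpcresults-1.sql/*}}}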

+ 
+ == Storing results back in S3 ==
+ The results could also have been stored as a file in S3 directly. For example, we could alter the previous insert clause to read:
+ {{{hive> insert overwrite directory 's3n://jssarma/tpcresults-1.sql' }}} - where {{{jssarma}}} is a pre-existing bucket in S3.
+ 
+ As another alternative, one could have created an external table over S3 and stored the results directly in it, for example:
+ {{{hive> create external table t1 (flag string, status string, double ...)
+ location 's3n://jssarma/tpcresults-1.sql';
+ hive> insert overwrite table t1 select ...}}}
+ 
+ Similarly, one could have stored the results back in a partition of a partitioned external table as well, as sketched below.
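+ A minimal sketch of that partitioned variant (the table name, columns and S3 path below are hypothetical placeholders):
+ {{{hive> create external table tpcres (flag string, status string) partitioned by (ds string) location 's3n://jssarma/tpcresults-part';
+ hive> insert overwrite table tpcres partition (ds='2009-05-19') select l_returnflag, l_linestatus from lineitem;}}}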
+ 
+ == Using tmp tables in HDFS ==
+ Currently, Hive does not have any explicit support for tmp tables. But tables defined over HDFS in EC2 behave like tmp tables since they only last for the duration of the Hadoop cluster. Since they are likely to be much faster than accessing S3 directly, they can be used to stage data that is accessed repeatedly during a session. For example, for the TPCH dataset, one may want to analyze customer attributes against order details; it would be beneficial to first materialize the join of these data sets and then run repeated queries against it. Here's some example SQL that does this:
+ {{{hive> create table cust_order (nationkey string, acctbal double, mktsegment string, orderstatus string, totalprice double);
+ hive> from customer c left outer join orders o on (c.c_custkey = o.o_custkey)
+ insert overwrite table cust_order
+ select c.c_nationkey, c.c_acctbal, c.c_mktsegment, o.o_orderstatus, o.o_totalprice; }}}
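+ Subsequent analysis can then run repeatedly against the staged HDFS table instead of S3, for example (a sketch - the actual analysis is up to the user):
+ {{{hive> select mktsegment, avg(totalprice) from cust_order group by mktsegment;}}}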
  
  == Appendix ==
  [[Anchor(ConfigHell)]]
  === Configuration Guide ===
  The socket related options allow the Hive CLI to communicate with the Hadoop cluster over an ssh tunnel (that will be established later). {{{hadoop.job.ugi}}} is specified to avoid issues with permissions on HDFS. The {{{mapred.map.tasks}}} setting is a hack that works around [[https://issues.apache.org/jira/browse/HADOOP-5861 HADOOP-5861]] and may need to be set higher for large clusters. {{{mapred.reduce.tasks}}} is set to -1 to let Hive determine the number of reducers (see [[https://issues.apache.org/jira/browse/HIVE-490 HIVE-490]]).
  
+ === Links ===
+  * [[../../HiveAws Hive and AWS]] presents the general landscape and alternatives for running Hive queries in AWS.
+  * [[http://jsensarma.com/blog/2009/05/hive-hadoop-s3-ec2-it-works On issues and lessons learned during this integration effort]]
+ 
