hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang
Date Thu, 15 Apr 2010 19:29:53 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/Tutorial" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=19&rev2=20

--------------------------------------------------

  }}}
  It is assumed that the array and map fields in the input.txt files are null fields for these
examples.
  
+ In the above examples, the user has to know which partition to insert into, and only one partition can be inserted per insert statement. If you want to load into multiple partitions, you have to use a multi-insert statement, as illustrated below.
+ {{{
+ FROM page_view_stg pvs
+ INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
+        SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
+        WHERE pvs.country = 'US'
+ INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
+        SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
+        WHERE pvs.country = 'CA'
+ INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
+        SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
+        WHERE pvs.country = 'UK';
  
+ }}}
+ 
+ In order to load data into all country partitions for a particular day, you have to add an insert statement for each country in the input data. This is very inconvenient and inefficient, since you need a priori knowledge of the list of countries that exist in the input data and must create the partitions beforehand. If the list changes on another day, you have to modify your insert DML as well as the partition-creation DDLs.
+ 
+ ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated. This is a newly added feature, available only from version 0.6.0 (currently trunk). In a dynamic partition insert, the input column values are evaluated to determine which partition each row should be inserted into. If that partition has not been created yet, it is created automatically. Using this feature, you need only one insert statement to create and populate all necessary partitions. Below is an example of loading data into all country partitions using one insert statement:
+ 
+ {{{
+ FROM page_view_stg pvs
+ INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
+ SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
+ }}}
+ 
+ There are several syntactic differences from the multi-insert statement: 
+   * country appears in the PARTITION clause but has no value associated with it. In this case, country is a ''dynamic partition column''. On the other hand, dt has a value associated with it, which means it is a ''static partition column''. If a column is a dynamic partition column, its value comes from the corresponding input column. Currently, we only allow dynamic partition columns to be the last column(s) in the partition clause, because the partition column order indicates the hierarchical order (meaning dt is the root partition and country is the child partition). You cannot specify a partition clause such as (dt, country='US'), because that would mean updating the 'US' country subpartition of every date.
+   * An additional pvs.country column is added to the select statement. This is the corresponding input column for the dynamic partition column. Note that you do not need to add an input column for the static partition column, because its value is already known from the PARTITION clause.

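+ To make the static/dynamic distinction concrete, the following sketch (reusing the page_view tables above) contrasts a partition clause Hive accepts with one it rejects under the column-ordering rule; the exact error reported may vary by version:
+ {{{
+ -- Accepted: static root partition (dt), dynamic child partition (country)
+ FROM page_view_stg pvs
+ INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
+ SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country;
+ 
+ -- Rejected: a dynamic column (dt) may not precede a static one (country='US')
+ FROM page_view_stg pvs
+ INSERT OVERWRITE TABLE page_view PARTITION(dt, country='US')
+ SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country;
+ }}}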
+ 
+ Semantics of the dynamic partition insert statement:
+   * Since a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS path format (a URI in Java). Any character having a special meaning in a URI (e.g., '%', ':', '/', '#') is escaped with '%' followed by the two hexadecimal digits of its ASCII value.
+   * If the input column is of a type other than STRING, its value is first converted to STRING and then used to construct the HDFS path.
+   * If the input column value is NULL or the empty string, the row is put into a special partition whose name is controlled by the Hive parameter hive.exec.default.dynamic.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically, this partition contains all "bad" rows whose values are not valid partition names. The caveat of this approach is that the bad value is lost and replaced by __HIVE_DEFAULT_PARTITION__ if you select it in Hive. JIRA HIVE-1309 proposes letting the user specify a "bad file" to retain the input partition column values as well.
+   * Dynamic partition insert could potentially be a resource hog, in that it could generate a large number of partitions in a short time. To guard against this, we define two parameters: hive.exec.max.dynamic.partitions (default value 1000), the total number of dynamic partitions that can be created by one DML, and hive.exec.max.dynamic.partition.pernode (default value 100), the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer exceeds its threshold, a fatal error is raised from the mapper/reducer (through a counter) and the whole job is killed. If no single mapper/reducer exceeds its limit but the total number of dynamic partitions does, an exception is raised at the end of the job, before the intermediate data are moved to the final destination.
+   * Another situation we want to protect against is the user accidentally specifying all partitions as dynamic partitions, without one static partition, when the original intention was just to overwrite the sub-partitions of one root partition. We define another parameter, hive.exec.dynamic.partition.mode=strict, to prevent the all-dynamic-partition case: in strict mode, you have to specify at least one static partition. The default mode is strict. In addition, the parameter hive.exec.dynamic.partition=true/false controls whether dynamic partitioning is allowed at all. Its default value is false.
+   * Currently, dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. The reason is that if either merge parameter is set to true, a MapReduce job is launched for each particular partition to merge all its files into one. In a dynamic partition insert, the number of partitions is not known at compile time, so no such MapReduce job can be generated. JIRA HIVE-1307 has been filed for this task.
+ 
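+ Putting these parameters together, a minimal session for the single-dynamic-column example above might look like the following sketch. Since hive.exec.dynamic.partition defaults to false, it must be enabled first; the limit values shown are the documented defaults, restated only for illustration:
+ {{{
+ set hive.exec.dynamic.partition=true;
+ set hive.exec.max.dynamic.partitions=1000;
+ set hive.exec.max.dynamic.partition.pernode=100;
+ 
+ FROM page_view_stg pvs
+ INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
+ SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country;
+ }}}
+ Because dt is static here, the default strict mode is satisfied; to make every partition column dynamic, you would also need to set hive.exec.dynamic.partition.mode=nonstrict.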
  == Simple Query ==
  For all the active users, one can use the query of the following form:
  
