hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang
Date Sat, 17 Apr 2010 04:24:25 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/Tutorial" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=22&rev2=23

--------------------------------------------------

             SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null,
pvs.ip WHERE pvs.country = 'UK';    
  }}}
  
- In order to load data into all country partitions in a particular day, you have to add an
insert statement for each country in the input data. This is very inconvenient and inefficient
since you have to have the priori knowledge of the list of countries exist in the input data
and create the partitions beforehand. If the list changed for another day, you have to modify
your insert DML as well as the partition creation DDLs. 
+ In order to load data into all country partitions in a particular day, you have to add an
insert statement for each country in the input data. This is very inconvenient since you have
to have the priori knowledge of the list of countries exist in the input data and create the
partitions beforehand. If the list changed for another day, you have to modify your insert
DML as well as the partition creation DDLs. It is also inefficient since each insert statement
may be turned into a MapReduce Job. 
  
- ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve this problem
by dynamically determining which partitions should be created and populated while scanning
the input table. This is a newly added feature that is only available from version 0.6.0 (trunk
now). In the dynamic partition insert, the input column values are evaluated to determine
which partition this row should be inserted into. If that partition has not been created,
it will create that partition automatically. Using this feature you need only one insert statement
to create and populate all necessary partitions. Below is an example of loading data to all
country partitions using one insert statement:
+ ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve this problem
by dynamically determining which partitions should be created and populated while scanning
the input table. This is a newly added feature that is only available from version 0.6.0 (trunk
now). In the dynamic partition insert, the input column values are evaluated to determine
which partition this row should be inserted into. If that partition has not been created,
it will create that partition automatically. Using this feature you need only one insert statement
to create and populate all necessary partitions. In addition, since there is only one insert
statement, there is only one corresponding MapReduce job. This significantly improves performance
and reduce the Hadoop cluster workload comparing to the multiple insert case.
+ 
+ Below is an example of loading data to all country partitions using one insert statement:
  
  {{{
      FROM page_view_stg pvs
@@ -449, +451 @@

  }}}
  
  There are several syntactic differences from the multi-insert statement: 
-   * country appears in the PARTITION clause, but it has no value associated with it. In
this case, country is a ''dynamic partition column''. On the other hand, ds has a value associated
with it, which means it is a ''static partition column''. If a column is dynamic partition
column, its value will be coming from the input column. Currently we only allow dynamic partition
columns to be the last column(s) in the partition clause because the partition column order
indicates its hierarchical order (meaning dt is the root partition, and country is the child
partition). You cannot specify a partition clause with (dt, country='US') because that means
you need to update all partitions with any date and its country subpartition is 'US'. 
+   * country appears in the PARTITION specification, but with no value associated. In this
case, country is a ''dynamic partition column''. On the other hand, ds has a value associated
with it, which means it is a ''static partition column''. If a column is dynamic partition
column, its value will be coming from the input column. Currently we only allow dynamic partition
columns to be the last column(s) in the partition clause because the partition column order
indicates its hierarchical order (meaning dt is the root partition, and country is the child
partition). You cannot specify a partition clause with (dt, country='US') because that means
you need to update all partitions with any date and its country sub-partition is 'US'. 
    * An additional pvs.country column is added in the select statement. This is the corresponding
input column for the dynamic partition column. Note that you do not need to add an input column
for the static partition column because its value is already known in the PARTITION clause.

  
  Semantics of the dynamic partition insert statement:

Mime
View raw message