From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang
Date Fri, 07 May 2010 18:57:38 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/Tutorial" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=23&rev2=24

--------------------------------------------------

    * Another situation we want to protect against is a user accidentally specifying all partitions as dynamic partitions without any static partition, when the original intention was only to overwrite the sub-partitions of one root partition. We define another parameter hive.exec.dynamic.partition.mode=strict to prevent the all-dynamic-partition case: in strict mode you have to specify at least one static partition. The default mode is strict. In addition, the parameter hive.exec.dynamic.partition=true/false controls whether dynamic partitions are allowed at all; its default value is false. A sketch of how these parameters are typically set appears after this list.
    * Currently dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. The reason is that if either merge parameter is set to true, a MapReduce job is launched for each partition to merge all of its files into one; in a dynamic partition insert the set of partitions is not known at compile time, so these merge jobs cannot be generated. JIRA HIVE-1307 has been filed for this task.
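
  To make the defaults above concrete, here is a minimal sketch of how these parameters are typically set before a dynamic partition insert. It reuses the page_view and page_view_stg tables from the examples below; the dt value is only illustrative, and keeping dt static while country stays dynamic satisfies the strict-mode requirement:
  {{{
      hive> set hive.exec.dynamic.partition=true;
      hive> set hive.exec.dynamic.partition.mode=strict;
      hive> FROM page_view_stg pvs
            INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
                   SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country;
  }}}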
  
+ Troubleshooting and best practices:
+   * As stated above, if too many dynamic partitions are created by a particular mapper/reducer, a fatal error is raised and the job is killed. The error message looks something like this:
+ {{{
+     hive> set hive.exec.dynamic.partition.mode=nonstrict;
+     hive> FROM page_view_stg pvs
+           INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
+                  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
+                         from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country;
+ 
+ ...
+ 2010-05-07 11:10:19,816 Stage-1 map = 0%,  reduce = 0%
+ [Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
+ Ended Job = job_201005052204_28178 with errors
+ ...
+ }}}
+ The problem here is that each mapper takes a random set of rows, so it is very likely that the number of distinct (dt, country) pairs it sees will exceed the limit of hive.exec.max.dynamic.partitions.pernode. One way around this is to group the rows by the dynamic partition columns in the mappers and distribute them to the reducers where the dynamic partitions are created. In this case the number of distinct dynamic partitions per node is significantly reduced. The example query above can be rewritten as:
+ {{{
+     hive> set hive.exec.dynamic.partition.mode=nonstrict;
+     hive> FROM page_view_stg pvs
+           INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
+                  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
+                         from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country
+                  DISTRIBUTE BY ds, country;
+ }}}
+ This query generates a MapReduce job rather than a map-only job. The SELECT-clause is converted into a plan for the mappers, and its output is distributed to the reducers based on the value of the (ds, country) pairs. The INSERT-clause becomes the plan in the reducers, which write into the dynamic partitions.
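+ If redistributing the data is not an option, the per-node limit mentioned above can also be inspected and raised. This is only a sketch, under the assumption that the cluster can handle the larger number of partitions being written concurrently; the value shown is illustrative, not a recommendation. The first statement prints the current setting and the second overrides it for the session:
+ {{{
+     hive> set hive.exec.max.dynamic.partitions.pernode;
+     hive> set hive.exec.max.dynamic.partitions.pernode=1000;
+ }}}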
  
  == Inserting into local files ==
  In certain situations you may want to write the output to a local file so that you can load it into an Excel spreadsheet. This can be accomplished with an INSERT OVERWRITE LOCAL DIRECTORY command along the following lines.
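  A minimal sketch, reusing the page_view_stg table from the examples above and writing to a placeholder output directory:
  {{{
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_stg_local'
  SELECT pvs.*
  FROM page_view_stg pvs;
  }}}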
