hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/Tutorial" by StevenWong
Date Thu, 21 Apr 2011 20:45:43 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/Tutorial" page has been changed by StevenWong.
The comment on this change is: Fix a wrong config parameter name..
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=35&rev2=36

--------------------------------------------------

    * When non-empty partitions already exist for the dynamic partition columns
(e.g., country='CA' exists under some ds root partition), a partition will be overwritten if the
dynamic partition insert sees the same value (say 'CA') in the input data. This is in line with the
'insert overwrite' semantics. However, if the partition value 'CA' does not appear in the
input data, the existing partition will not be overwritten.
    * Since a Hive partition corresponds to a directory in HDFS, the partition value has to
conform to the HDFS path format (URI in Java). Any character having a special meaning in URI
(e.g., '%', ':', '/', '#') will be escaped as '%' followed by the two hexadecimal digits of its
ASCII code.
 
    * If the input column is of a type other than STRING, its value will first be converted
to STRING and then used to construct the HDFS path.
-   * If the input column value is NULL or an empty string, the row will be put into a special
partition, whose name is controlled by the hive parameter hive.exec.default.dynamic.partition.name.
The default value is `__HIVE_DEFAULT_PARTITION__`. Basically this partition will contain all
"bad" rows whose values are not valid partition names. The caveat of this approach is that
the bad value is lost and replaced by `__HIVE_DEFAULT_PARTITION__` if you select these rows in
Hive. JIRA HIVE-1309 is a solution that lets the user specify a "bad file" to retain the input
partition column values as well.
+   * If the input column value is NULL or an empty string, the row will be put into a special
partition, whose name is controlled by the hive parameter hive.exec.default.partition.name.
The default value is `__HIVE_DEFAULT_PARTITION__`. Basically this partition will contain all
"bad" rows whose values are not valid partition names. The caveat of this approach is that
the bad value is lost and replaced by `__HIVE_DEFAULT_PARTITION__` if you select these rows in
Hive. JIRA HIVE-1309 is a solution that lets the user specify a "bad file" to retain the input
partition column values as well.
    * A dynamic partition insert can be a resource hog, because it can generate a
large number of partitions in a short time. To protect yourself, we define three parameters:
      * '''hive.exec.max.dynamic.partitions.pernode''' (default value being 100) is the maximum
number of dynamic partitions that can be created by each mapper or reducer. If one mapper or
reducer creates more than this threshold, a fatal error will be raised from the mapper/reducer
(through a counter) and the whole job will be killed.
      * '''hive.exec.max.dynamic.partitions''' (default value being 1000) is the total number
of dynamic partitions that can be created by one DML statement. If no mapper/reducer exceeds
its own limit but the total number of dynamic partitions does, then an exception is raised at the
end of the job before the intermediate data are moved to the final destination.
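
The rules in the list above can be modelled with a small Python sketch. This is not Hive's implementation: the helper names `partition_dir_value` and `check_limits` are hypothetical, the set of escaped characters is limited to the four examples given in the text (the real set may be larger), and the escape format is assumed to be '%' followed by two uppercase hexadecimal digits of the character code. The limits use the default values quoted above.

```python
# Default partition name and limits as quoted in the text.
DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"
MAX_PARTITIONS_PER_NODE = 100   # hive.exec.max.dynamic.partitions.pernode
MAX_PARTITIONS_TOTAL = 1000     # hive.exec.max.dynamic.partitions

# Assumption: only the example characters from the text are escaped here.
SPECIAL_CHARS = set("%:/#")


def partition_dir_value(value):
    """Hypothetical helper: map an input column value to the value used
    in the partition directory name."""
    # NULL or empty string goes to the special default partition.
    if value is None or value == "":
        return DEFAULT_PARTITION
    # Non-string values are converted to strings first.
    text = str(value)
    # Assumed escape format: '%' plus two uppercase hex digits.
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in SPECIAL_CHARS else ch
        for ch in text
    )


def check_limits(partitions_per_task):
    """Hypothetical helper: partitions_per_task holds one set of partition
    names per mapper/reducer. Returns the total number of distinct
    partitions, or raises if either limit is exceeded."""
    # Per-task limit: a single mapper/reducer over the threshold is fatal.
    for task_id, parts in enumerate(partitions_per_task):
        if len(parts) > MAX_PARTITIONS_PER_NODE:
            raise RuntimeError(
                "task %d created %d dynamic partitions" % (task_id, len(parts))
            )
    # Job-wide limit: checked over all distinct partitions of the statement.
    total = len(set().union(*partitions_per_task))
    if total > MAX_PARTITIONS_TOTAL:
        raise RuntimeError("statement created %d dynamic partitions" % total)
    return total
```

For instance, `partition_dir_value("a/b")` yields `a%2Fb` under these assumptions, and eleven tasks that each create 100 distinct partitions pass the per-task check but fail the job-wide one.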
