hive-dev mailing list archives

From "Sushanth Sowmyan (JIRA)" <>
Subject [jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
Date Wed, 05 Mar 2014 20:46:45 GMT


Sushanth Sowmyan commented on HIVE-6332:

Before I create a wiki page for this, I want to have the content checked/reviewed. [~leftylev],
[~ekoifman], could you please go through the following and suggest edits/changes? Thanks!


HCatalog job properties:

Storage directives:

hcat.pig.storer.external.location : An override that specifies where HCatStorer will write.
It is set from Pig jobs, either directly by the user or by using org.apache.hive.hcatalog.pig.HCatStorerWrapper.
HCat will write to this directory rather than to the table/partition directory
specified by, or calculated from, the metadata. It is used in lieu of the table directory
for a table-level write (unpartitioned table write), or in lieu of the partition directory
for a partition-level write. This parameter is used only for jobs that do not use dynamic
partitioning; dynamic-partitioning jobs have multiple write destinations.

hcat.dynamic.partitioning.custom.pattern : For dynamic partitioning jobs, specifying a single
custom directory is not enough, since the job writes to multiple destinations. Instead of
a directory specification, it requires a pattern specification. For example, if one had a
table that was partitioned by keys country and state, with a root directory location of
/apps/hive/warehouse/geo/ , then a dynamic partition write into it that writes partitions
(country=US,state=CA) & (country=IN,state=KA) would create two directories:
/apps/hive/warehouse/geo/country=US/state=CA/ and /apps/hive/warehouse/geo/country=IN/state=KA/ .
If we wanted a differently patterned location, and specified hcat.dynamic.partitioning.custom.pattern="/ext/geo/${country}-${state}",
it would create the following two partition dirs: /ext/geo/US-CA and /ext/geo/IN-KA . Thus,
the parameter lets us specify a custom dir location pattern for all the writes, and each
variable in the pattern is interpolated when the destination location for a partition is created.
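
To make the interpolation concrete, here is a minimal Python sketch that expands the example
pattern for the two partitions above. It only mimics the substitution step using Python's
string.Template; the helper name partition_dir is made up for illustration and this is not
HCatalog's own code.

```python
from string import Template


def partition_dir(pattern: str, partition_values: dict) -> str:
    """Fill each ${key} placeholder in the pattern with that partition's value."""
    # substitute() raises KeyError if the pattern names a key the partition lacks.
    return Template(pattern).substitute(partition_values)


pattern = "/ext/geo/${country}-${state}"
partitions = [
    {"country": "US", "state": "CA"},
    {"country": "IN", "state": "KA"},
]

# One destination directory per partition written by the job.
dirs = [partition_dir(pattern, p) for p in partitions]
print(dirs)
```

Each partition written by the job gets its own directory, which is why a single fixed
directory cannot serve a dynamic partitioning write.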

Cache behaviour directives:

HCatalog maintains a cache of HiveClients to talk to the metastore, holding 1 metastore
client per thread, with a default expiry of 120 seconds. For people who wish
to modify the behaviour of this cache, a few parameters are provided:

hcatalog.hive.client.cache.expiry.time : Overrides the expiry time. This is an int that
specifies the number of seconds. Default is 120.
hcatalog.hive.client.cache.disabled : Disables the cache altogether when set to true.
Default is false. This is useful in highly multithreaded use cases.
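
As a rough illustration of what the two settings control, the following Python sketch models
a cache that keeps one client per thread and rebuilds it after the expiry. The class
ClientCache and its arguments are hypothetical, and it assumes the expiry is measured from
when the client was created, which the description above does not state.

```python
import threading
import time


class ClientCache:
    """Toy model: one client per thread, rebuilt once it is older than the expiry."""

    def __init__(self, factory, expiry_seconds=120, disabled=False, clock=time.monotonic):
        self._factory = factory        # callable that builds a new client
        self._expiry = expiry_seconds  # models hcatalog.hive.client.cache.expiry.time
        self._disabled = disabled      # models hcatalog.hive.client.cache.disabled
        self._clock = clock
        self._entries = {}             # thread id -> (client, creation time)
        self._lock = threading.Lock()

    def get(self):
        if self._disabled:
            # Cache switched off: every call gets a fresh client.
            return self._factory()
        key = threading.get_ident()
        now = self._clock()
        with self._lock:
            entry = self._entries.get(key)
            # Assumption: age is counted from creation, not from last use.
            if entry is None or now - entry[1] >= self._expiry:
                entry = (self._factory(), now)
                self._entries[key] = entry
            return entry[0]
```

The clock is injectable only so the expiry can be exercised without waiting 120 seconds.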

Input Split Generation Behaviour:

hcat.desired.partition.num.splits : A hint that can be provided to HCatalog
to pass on to underlying InputFormats, to produce a "desired" number of splits per partition.
This is useful when we have a few large files and want to increase parallelism by increasing
the number of splits generated. It is not yet as useful when we want to reduce
the number of splits for a large number of files, and it is not useful at all when
the job reads a large number of partitions. Note that this
is merely an optimization hint, and it is not guaranteed that the underlying layer will be
capable of using it. The mapreduce parameters mapred.min.split.size and mapred.max.split.size
can be used in conjunction with this parameter to tweak/optimize jobs.
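
One way to see why the hint helps more with increasing splits than with reducing them is the
sizing rule used by the old-API Hadoop FileInputFormat, max(minSize, min(goalSize, blockSize)),
where goalSize is the total size divided by the requested number of splits. The Python sketch
below models only that rule. Whether a given underlying InputFormat treats the HCatalog hint
this way is an assumption, and the real split count can differ slightly because of how the
last split is handled.

```python
def compute_split_size(total_bytes, desired_splits, min_split_size, block_size):
    """Split size under the rule max(min_size, min(goal_size, block_size))."""
    # Goal size: the per-split size that would yield the desired number of splits.
    goal_size = total_bytes // max(desired_splits, 1)
    return max(min_split_size, min(goal_size, block_size))


def split_count(total_bytes, split_size):
    """Approximate number of splits, using ceiling division."""
    return -(-total_bytes // split_size)


MB = 1024 * 1024
total = 1024 * MB   # one 1 GiB file
block = 128 * MB    # assumed block size for the example

# Asking for more splits than there are blocks shrinks the split size.
more = split_count(total, compute_split_size(total, 32, 1, block))

# Asking for fewer splits has no effect: the block size still caps the split size.
fewer = split_count(total, compute_split_size(total, 2, 1, block))

print(more, fewer)
```

Under this rule, getting fewer and larger splits means raising the minimum split size, which
is where mapred.min.split.size comes in.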

Data Promotion Behaviour:

In some cases where a user of HCat (such as some older versions of Pig) does not support all
the datatypes supported by Hive, there are a few config parameters provided to handle data
promotions/conversions to allow them to read data through HCatalog. On the write side, it
is expected that the user passes in valid HCatRecords with correct data.

[property name missing] : promotes boolean to int on read from HCatalog, defaults to false.
[property name missing] : promotes tinyint/smallint to int on read from HCatalog, defaults to false.
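
The read-side effect of the two promotion switches can be sketched in Python as below. The
function and flag names are hypothetical stand-ins, not the actual property names, and
mapping true to 1 and false to 0 is an assumption.

```python
def promote_for_read(hive_type, value, boolean_to_int=False, small_ints_to_int=False):
    """Return the (type, value) pair a reader sees after optional promotion."""
    if boolean_to_int and hive_type == "boolean":
        # Assumption: true becomes 1 and false becomes 0; nulls stay null.
        return "int", (None if value is None else int(value))
    if small_ints_to_int and hive_type in ("tinyint", "smallint"):
        # The value is unchanged, only the declared type widens.
        return "int", value
    # Both switches default to off, so data passes through untouched.
    return hive_type, value


print(promote_for_read("boolean", True, boolean_to_int=True))
print(promote_for_read("smallint", 7, small_ints_to_int=True))
print(promote_for_read("boolean", True))
```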

HCatRecordReader Error Tolerance Behaviour:

While reading, it is understandable that data might contain errors, but we may not want to
completely abort a task due to a couple of errors. These parameters configure how many errors
we can accept before we fail the task.

hcat.input.bad.record.threshold : A float parameter, defaults to 0.0001f, which means we can
tolerate 1 error every 10,000 rows and still not error out. Any higher error rate fails the task.
hcat.input.bad.record.min : An int parameter, defaults to 2. This is the minimum number of
bad records we must encounter before hcat.input.bad.record.threshold is applied. It
prevents an initial/early bad record from aborting the task because the ratio of
errors at that point is too high.
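
The interaction between the two parameters can be sketched as follows. This Python model is
an illustration of the behaviour described above, not the HCatRecordReader code; the class
name and the exact comparison operators are assumptions.

```python
class BadRecordTracker:
    """Counts records and errors, failing once the error rate exceeds the threshold."""

    def __init__(self, threshold=0.0001, min_errors=2):
        self.threshold = threshold    # models hcat.input.bad.record.threshold
        self.min_errors = min_errors  # models hcat.input.bad.record.min
        self.records = 0
        self.errors = 0

    def record_ok(self):
        self.records += 1

    def record_bad(self):
        """Count a bad record and raise if the task should be failed."""
        self.records += 1
        self.errors += 1
        rate = self.errors / self.records
        # The rate check only applies once at least min_errors bad records were seen,
        # so a single early bad record (rate 1/1) does not abort the task.
        if self.errors >= self.min_errors and rate > self.threshold:
            raise RuntimeError(
                f"too many bad records: {self.errors} of {self.records}"
            )


tracker = BadRecordTracker()
tracker.record_bad()           # first record is bad, but below the minimum count
for _ in range(29998):
    tracker.record_ok()
tracker.record_bad()           # 2 errors in 30,000 rows stays under 0.0001
print(tracker.errors, tracker.records)
```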

> HCatConstants Documentation needed
> ----------------------------------
>                 Key: HIVE-6332
>                 URL:
>             Project: Hive
>          Issue Type: Task
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
> HCatConstants documentation is near non-existent, being defined only as comments in code
> for the various parameters. Given that a lot of API winds up being implemented as knobs that
> can be tweaked here, we should have a public-facing doc for this.
