hive-dev mailing list archives

From "Sushanth Sowmyan (JIRA)" <>
Subject [jira] [Commented] (HIVE-8808) HiveInputFormat caching cannot work with all input formats
Date Tue, 11 Nov 2014 21:24:34 GMT


Sushanth Sowmyan commented on HIVE-8808:

From a strict M/R standpoint:

Traditional M/R guarantees that state information is available to InputFormats through the
JobConf and through serialized InputSplits. Anything an InputFormat expects beyond that is
not guaranteed to work. In practice, though, M/R does not standardize a setInput-equivalent
call, so InputFormats wind up implementing their own mechanisms. It is not unheard of for
them to maintain state.
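
To make the "serialized InputSplits" channel concrete, here is a minimal sketch in plain Java.
It does not use Hadoop: RangeSplit and its fields are hypothetical, and its write/readFields
methods only imitate the shape of Hadoop's Writable contract. The point is that whatever a
task needs has to survive a round trip through bytes, because the instance that computed the
split is not the instance that reads it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SplitRoundTripSketch {

    /** Hypothetical split; write/readFields imitate the shape of Hadoop's Writable. */
    public static class RangeSplit {
        String path;
        long start;
        long length;

        RangeSplit() { }

        RangeSplit(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        void write(DataOutput out) throws IOException {
            out.writeUTF(path);
            out.writeLong(start);
            out.writeLong(length);
        }

        void readFields(DataInput in) throws IOException {
            path = in.readUTF();
            start = in.readLong();
            length = in.readLong();
        }
    }

    /** Serializes the split to bytes and rebuilds it, as if shipping it to a task. */
    public static RangeSplit roundTrip(RangeSplit split) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            split.write(new DataOutputStream(bytes));
            RangeSplit copy = new RangeSplit();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RangeSplit copy = roundTrip(new RangeSplit("/data/part-0", 128L, 64L));
        System.out.println(copy.path + " " + copy.start + " " + copy.length);
    }
}
```

Any field that write() does not emit is simply absent on the task side, which is why state
kept only on the InputFormat instance cannot be relied upon.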

In practice, though, we also absolutely need a standardized approach in order to access
InputFormats from Hive/HCatalog. HCatalog took the position that InputFormats, as currently
defined, are not well-specified enough to do all the setup needed to be effectively
stateless, and so it moved that responsibility upwards to StorageHandlers, HCat's primary
storage abstraction. (Earlier versions of HCatalog used something called StorageDriver; as
of HCat 0.3, that was replaced with Hive's StorageHandler.) While the InputFormat itself is
considered stateless, a StorageHandler is considered stateful, and HCat does the following:

a) Instantiate the appropriate StorageHandler using ReflectionUtils.newInstance (this calls
setConf if available, which it usually is).
b) Call configureInputJobProperties() on that StorageHandler to set it up for input. The
handler modifies a map of key-value properties (jobProperties), and HCat ensures that this
map is put into the JobConf/Job before any methods are called on the relevant InputFormat.
c) Call .getInputFormatClass. That class eventually gets instantiated at run time with
ReflectionUtils.newInstance(inputFormatClass, Job). This allows the InputFormat to set up
the Job (which already has the above map of key-value pairs inserted into it) any way it
wants, without itself being stateful.
d) Call .getInputSplits, again passing in the relevant JobConf as the state carrier on the
way in, with the InputSplits themselves being a serializable state carrier on the way out.
e) Call .createRecordReader on that InputFormat. Again, the InputFormat itself can be (and
is) stateless, but it gets passed an InputSplit (which has state) and a TaskAttemptContext
(which has state, including the above jobProperties map).
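
Steps a) through e) can be sketched as follows. This is an illustration in plain Java, not
HCatalog code: Conf, Split, TableStorageHandler, TableInputFormat and the "sketch.*" property
keys are all hypothetical stand-ins for JobConf/Job, InputSplit, a StorageHandler and an
InputFormat, and getSplits stands in for the split-generation call in step d). The input
format class has no fields, so all state travels through the conf and the splits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatelessInputFormatSketch {

    /** Stand-in for JobConf/Job: a plain key-value bag. */
    static class Conf {
        private final Map<String, String> values = new HashMap<>();
        void set(String key, String value) { values.put(key, value); }
        String get(String key) { return values.get(key); }
    }

    /** Stand-in for an InputSplit: it carries its own state. */
    static class Split {
        final String table;
        final int part;
        Split(String table, int part) { this.table = table; this.part = part; }
    }

    /** Stand-in for an InputFormat: no fields, everything comes from arguments. */
    static class TableInputFormat {
        List<Split> getSplits(Conf conf) {
            String table = conf.get("sketch.input.table");
            int parts = Integer.parseInt(conf.get("sketch.input.partitions"));
            List<Split> splits = new ArrayList<>();
            for (int i = 0; i < parts; i++) {
                splits.add(new Split(table, i));
            }
            return splits;
        }

        String createRecordReader(Split split, Conf taskConf) {
            return "reader[" + split.table + "#" + split.part + "]";
        }
    }

    /** Stand-in for a StorageHandler: the stateful part of the design. */
    static class TableStorageHandler {
        private Conf conf;
        void setConf(Conf conf) { this.conf = conf; }

        void configureInputJobProperties(String table, Map<String, String> jobProperties) {
            jobProperties.put("sketch.input.table", table);
            jobProperties.put("sketch.input.partitions", "2");
        }

        Class<TableInputFormat> getInputFormatClass() { return TableInputFormat.class; }
    }

    public static List<String> run(String table) {
        Conf conf = new Conf();
        // a) instantiate the handler and hand it the configuration
        TableStorageHandler handler = new TableStorageHandler();
        handler.setConf(conf);
        // b) let the handler fill jobProperties, then copy them into the conf
        Map<String, String> jobProperties = new HashMap<>();
        handler.configureInputJobProperties(table, jobProperties);
        jobProperties.forEach(conf::set);
        // c) instantiate the input format reflectively from its class
        TableInputFormat format;
        try {
            format = handler.getInputFormatClass().getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        // d) compute splits; the conf is the only state passed in
        List<Split> splits = format.getSplits(conf);
        // e) create one reader per split; state arrives via the split and the conf
        List<String> readers = new ArrayList<>();
        for (Split split : splits) {
            readers.add(format.createRecordReader(split, conf));
        }
        return readers;
    }

    public static void main(String[] args) {
        System.out.println(run("t1")); // prints [reader[t1#0], reader[t1#1]]
    }
}
```

Because TableInputFormat holds nothing between calls, a fresh instance, a reused instance, or
an instance in another process would all behave the same given the same conf and split.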

> HiveInputFormat caching cannot work with all input formats
> ----------------------------------------------------------
>                 Key: HIVE-8808
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Brock Noland
> In {{HiveInputFormat}} we implement instance caching (see {{getInputFormatFromCache}}).
> In HS2, this assumes that InputFormats are stateless but I don't think this assumption is
> true, especially with regards to HBase.
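
The concern in the issue description can be illustrated with a small plain-Java sketch. It is
not Hive's getInputFormatFromCache: the Format interface, StatefulFormat and the class-keyed
cache below are hypothetical, assumed only to show what goes wrong when a per-class instance
cache hands the same object to two callers and that object keeps per-query state in a field.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class InputFormatCacheSketch {

    /** Hypothetical stand-in for an input format with a setup call. */
    public interface Format {
        void configure(String table);
        String describe();
    }

    /** Keeps per-query state in a field, so the instance is not safe to share. */
    public static class StatefulFormat implements Format {
        private String table;
        public void configure(String table) { this.table = table; }
        public String describe() { return "scan " + table; }
    }

    private static final Map<Class<?>, Format> CACHE = new ConcurrentHashMap<>();

    /** Returns one shared instance per class, creating it on first use. */
    public static Format getFromCache(Class<? extends Format> cls) {
        return CACHE.computeIfAbsent(cls, key -> {
            try {
                return cls.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        });
    }

    public static void main(String[] args) {
        Format first = getFromCache(StatefulFormat.class);
        first.configure("table_a");

        // A second query asks for the same class and reconfigures the shared instance.
        Format second = getFromCache(StatefulFormat.class);
        second.configure("table_b");

        System.out.println(first == second);   // prints true
        System.out.println(first.describe());  // prints scan table_b
    }
}
```

The first caller now describes a scan of table_b even though it configured table_a. A
stateless format, which reads everything from the conf and split passed to each call, would
not be affected by this kind of sharing.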
