hive-user mailing list archives

From Dean Arnold <>
Subject Re: Storage handler guidance
Date Wed, 25 Jun 2014 02:56:34 GMT
Never mind, after digging through the sources I found the relevant bits to answer
my own question. An InputFormat generates arbitrary InputSplits, which define
how the input data source is partitioned, and OutputFormats are simply
instantiated in the mappers/reducers, so output is implicitly partitioned
across tasks.
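
For anyone finding this in the archive later, here is a rough sketch of the read-side idea. It is a plain Java model, not the Hadoop interface itself: RangeSplit and planSplits are hypothetical names standing in for an InputSplit and for InputFormat.getSplits(), and the row-range scheme is only one assumed way a storage handler might carve up its source. The point is that each split returned becomes a unit of work for one map task, so the number of splits bounds the read parallelism.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    /** Hypothetical stand-in for an InputSplit: a half-open row range [start, end). */
    static final class RangeSplit {
        final long start;
        final long end;

        RangeSplit(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long length() {
            return end - start;
        }
    }

    /**
     * Plays the role of InputFormat.getSplits(): divides totalRows into at most
     * desiredSplits contiguous, non-empty ranges of near-equal size.
     */
    static List<RangeSplit> planSplits(long totalRows, int desiredSplits) {
        if (totalRows < 0 || desiredSplits < 1) {
            throw new IllegalArgumentException("totalRows >= 0 and desiredSplits >= 1 required");
        }
        List<RangeSplit> splits = new ArrayList<>();
        long base = totalRows / desiredSplits;
        long remainder = totalRows % desiredSplits;
        long start = 0;
        // The first `remainder` splits take one extra row; stop once all rows are assigned.
        for (int i = 0; i < desiredSplits && start < totalRows; i++) {
            long size = base + (i < remainder ? 1 : 0);
            splits.add(new RangeSplit(start, start + size));
            start += size;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Each printed range would be read by its own task in the real framework.
        for (RangeSplit s : planSplits(10, 3)) {
            System.out.println("[" + s.start + ", " + s.end + ") rows=" + s.length());
        }
    }
}
```

In a real storage handler the split would also carry whatever the record reader needs to connect to its slice of the external source; the write side has no equivalent planning step, since each task just opens its own writer.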

On Thu, Jun 19, 2014 at 1:01 PM, Dean Arnold <> wrote:

> I haven't been able to find an explicit reference, hoping someone can
> clarify for me:
> Do storage handler reads/writes get executed in parallel, i.e.,
> in an INSERT...SELECT... from a storage handler, will multiple storage
> handler instances be created to read from the data source (using
> partitioning or some other scheme)?
> Likewise, will an INSERT into a storage handler be executed using multiple
> streams?
> FYI: I need to stream data into/out of Hive from/to parallel-efficient
> data sources, and would prefer to avoid landing everything in HDFS first,
> especially if the ultimate Hive file format is ORC, i.e., avoid multiple
> file copies, especially when moving terabytes between data sources and
> sinks. The storage handler mechanism seems a very elegant solution *if*
> it supports true parallel stream operations.
> TIA,
> Dean
