hive-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: CombineHiveInputFormat does not call getSplits on custom InputFormat
Date Wed, 25 Feb 2015 17:15:38 GMT
Hi,

There's a special interface in hive-1.0, which gives more information to
the input format.

https://hive.apache.org/javadocs/r1.0.0/api/ql/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.AvoidSplitCombination.html


But entirely skipping combination results in so many performance problems
that in Tez we were forced to abandon this approach and have Tez generate
grouped splits on the application master (which basically calls
InputFormat::getSplits(), then groups the results to get locality-aware
splits).
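
To illustrate the grouping step, here is a minimal self-contained sketch in
plain Java. It is not the Tez grouper and uses no Hadoop or Tez classes: the
Split class, groupByHost, and the target size are hypothetical names made up
for this example, and each split is assumed to have a single preferred host.
It only shows the general idea of taking the splits an InputFormat returns
and packing those that share a host into larger groups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitGroupingSketch {

    /** Hypothetical stand-in for an input split: path, byte length, one preferred host. */
    public static final class Split {
        public final String path;
        public final long length;
        public final String host;

        public Split(String path, long length, String host) {
            this.path = path;
            this.length = length;
            this.host = host;
        }
    }

    /**
     * Groups splits that share a host. A group is closed as soon as its
     * accumulated length reaches targetBytes; leftovers form a final,
     * smaller group for that host.
     */
    public static List<List<Split>> groupByHost(List<Split> splits, long targetBytes) {
        // Bucket by host, keeping first-seen host order and input order within a host.
        Map<String, List<Split>> byHost = new LinkedHashMap<>();
        for (Split s : splits) {
            byHost.computeIfAbsent(s.host, h -> new ArrayList<>()).add(s);
        }

        List<List<Split>> groups = new ArrayList<>();
        for (List<Split> hostSplits : byHost.values()) {
            List<Split> current = new ArrayList<>();
            long bytes = 0;
            for (Split s : hostSplits) {
                current.add(s);
                bytes += s.length;
                if (bytes >= targetBytes) {
                    groups.add(current);
                    current = new ArrayList<>();
                    bytes = 0;
                }
            }
            if (!current.isEmpty()) {
                groups.add(current);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
            new Split("/t/a", 64, "node1"),
            new Split("/t/b", 64, "node2"),
            new Split("/t/c", 64, "node1"),
            new Split("/t/d", 64, "node1"));
        for (List<Split> group : groupByHost(splits, 128)) {
            System.out.println(group.get(0).host + " -> " + group.size() + " split(s)");
        }
    }
}
```

A real grouper also has to deal with splits that list several replica hosts,
rack-level fallback, and bounds on group count, none of which this sketch
attempts.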

On Tez, this is controlled by hive.tez.input.format rather than by
hive.input.format.

Cheers,
Gopal

On 2/19/15, 10:09 AM, "Luke Lovett" <luke.lovett@10gen.com> wrote:

>I'm working on defining a custom InputFormat and OutputFormat for use
>with Hive. I'd like tables using these IF/OF to be native tables, so
>that I can LOAD DATA and INSERT INTO them. However, I'm finding that
>with the default CombineHiveInputFormat, the getSplits method of my
>InputFormat is not being called. If I "set
>hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;", then
>getSplits is called.
>
>What I want to know is:
>- Is this difference in behavior between CombineHiveInputFormat and
>HiveInputFormat intentional?
>- Is there any way of forcing CombineHiveInputFormat to call getSplits
>on my own InputFormat? I was reading through the code for
>CombineHiveInputFormat, and it looks like it might only call my own
>InputFormat's getSplits method if the table is non-native. I'm not sure
>if I'm interpreting this correctly.
>- Is it better to set "hive.input.format" to work around this, or to
>create a StorageHandler and make non-native tables?
>
>Thanks for any advice.


