hive-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
Date Tue, 02 Feb 2010 16:23:30 GMT
On Mon, Feb 1, 2010 at 5:31 PM, Namit Jain <njain@facebook.com> wrote:
> I will take a look –
>
> It would be great if you could file a JIRA and attach a patch for that.
>
>
>
> From: Roberto Congiu [mailto:roberto.congiu@openx.org]
> Sent: Monday, February 01, 2010 11:02 AM
> To: Namit Jain
> Cc: hive-user@hadoop.apache.org
> Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
>
>
>
> Reviving this old thread...just found the time to work on this...
>
> I have a patch for using MultiFileInputFormat on Hadoop 0.19 as
> CombineHiveInputFormat. Setting
>
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
>
> (or the equivalent setting in hive-site.xml) makes Hive use
> MultiFileInputFormat, packing many small files into
>
> mapred.multifileinputformat.splits splits (if set), or otherwise estimating
> the number of splits by dividing the total input size by the DFS block size.
>
> Patch attached...I checked that it passes all unit tests according
> to http://wiki.apache.org/hadoop/Hive/HowToContribute#Setting_up_Eclipse_Development_Environment_.28Optional.29
>
>
>
>
>
>
>
> On Wed, Sep 30, 2009 at 4:34 AM, Namit Jain <njain@facebook.com> wrote:
>
> That’s right
>
>
>
> On 9/30/09 12:07 AM, "Roberto Congiu" <roberto.congiu@openx.org> wrote:
>
> Hi Namit,
> that's what I thought. Right now unfortunately we can't migrate to 0.20.
> I realize we lose data locality but as you said, it would still be
> considerably better than now.
>
> I had a look at the shim code, shouldn't be difficult since it would
> be basically mimicking CombineFileInputFormat.
>
> Once I add the appropriate logic to the shim, I have to set
> hive.input.format to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat to have hive
> actually use it, right ?
>
> Roberto
>
> 2009/9/29 Namit Jain <njain@facebook.com>:
>> Hi Roberto,
>>
>> Talked with Raghu and Dhruba – it is possible to do so using
>> MultiFileInputFormat, but the performance will not be very good since
>> MultiFileInputFormat does not provide any locality. However, it will
>> still be much better than the problem you are running into right now.
>>
>> Can you move to hadoop-0.20? That might be simpler.
>>
>> If not, you can definitely implement the shim using MultiFileInputFormat
>> for 0.19 (which should work even with 0.17). Do you need some help in
>> understanding the current shim code?
>>
>> Thanks,
>> -namit
>>
>>
>>
>>
>>
>> On 9/29/09 10:53 AM, "Namit Jain" <njain@facebook.com> wrote:
>>
>> Just checked – CombineFileInputFormat and a lot of other related stuff
>> went into hadoop 0.20, so it would be very difficult to add this for 0.19.
>>
>>
>>
>> From: Namit Jain [mailto:njain@facebook.com]
>> Sent: Monday, September 28, 2009 10:30 PM
>> To: hive-user@hadoop.apache.org; roberto.congiu@openx.org
>> Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
>>
>> I am not sure whether CombineFileInputFormat (in hadoop) is available in
>> 0.19. If it is, we can add it; otherwise it will be very difficult.
>>
>>
>>
>> On 9/28/09 7:06 PM, "Raghu Murthy" <rmurthy@facebook.com> wrote:
>> Can we add MultiFileInputFormat as the CombineFileInputFormatShim for
>> hadoop-0.19?
>>
>> On 9/28/09 6:57 PM, "Roberto Congiu" <roberto.congiu@openx.org> wrote:
>>
>>> Hi guys,
>>> I've been working on integrating hive with a legacy file format we use
>>> here. I wrote the appropriate InputFormat and SerDe and everything
>>> works, but it's painfully slow.
>>> The reason is that the files I am reading are many and hive uses one
>>> mapper for every file.
>>> I saw the HIVE-74 patches but those use CombineFileInputFormat which
>>> is available on hadoop 0.20...but we use 0.19. Is there any reason the
>>> same goal could not be achieved using the deprecated (but present before
>>> 0.20) MultiFileInputFormat?
>>>
>>> Thanks,
>>> Roberto
>>
>>
>>
>
>

Has this been implemented for the 0.18 shims as well?
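
For reference, the split-count behavior Roberto describes in the quoted message (use mapred.multifileinputformat.splits if it is set, otherwise divide the total input size by the DFS block size) can be sketched roughly as below. This is only an illustration and not the code from the attached patch: the class name SplitEstimate, the method estimateSplits, the use of 0 to mean "unset", the rounding up, and the minimum of one split are all assumptions made for the example.

```java
// Hypothetical sketch of the split-count heuristic described in the thread.
// Not taken from the attached patch; names and edge-case handling are assumptions.
public class SplitEstimate {

    /**
     * Returns the number of splits to pack the input files into.
     *
     * @param totalInputBytes  combined size of all input files, in bytes
     * @param dfsBlockBytes    DFS block size, in bytes
     * @param configuredSplits value of mapred.multifileinputformat.splits,
     *                         or 0 if unset (the 0 sentinel is an assumption)
     */
    static int estimateSplits(long totalInputBytes, long dfsBlockBytes, int configuredSplits) {
        if (configuredSplits > 0) {
            return configuredSplits; // an explicit setting takes precedence
        }
        if (dfsBlockBytes <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        // Ceiling division, so a partial trailing block still gets a split (assumption).
        long splits = (totalInputBytes + dfsBlockBytes - 1) / dfsBlockBytes;
        // Never return fewer than one split, and stay within the int range.
        return (int) Math.max(1L, Math.min(splits, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB block
        long total = 10000L * 100 * 1024;     // 10,000 files of 100 KB each
        System.out.println(estimateSplits(total, blockSize, 0)); // derived from sizes
        System.out.println(estimateSplits(total, blockSize, 4)); // explicit setting
    }
}
```

With one mapper per file, the example input above would need 10,000 mappers; packing by block size brings that down to a few, which is the improvement the thread is after even without data locality.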
