hive-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Support multilines in 1 record in Hive
Date Fri, 31 Oct 2014 14:58:18 GMT
There is a configuration value in hive-site.xml called hive.input.format;
that is where CombineHiveInputFormat is set. You can set it to the
non-combined version and, if I remember correctly, you can also set it to
other input formats such as NLineInputFormat.
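
For reference, a session-level override might look like the fragment below.
It is a sketch that assumes the stock class names under
org.apache.hadoop.hive.ql.io; check them against your Hive version. The same
property can be set in hive-site.xml instead.

```sql
-- Use the plain (non-combining) input format for this session.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Switch back to the combining input format.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```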

On Fri, Oct 31, 2014 at 9:09 AM, Petter von Dolwitz (Hem) <
petter.von.dolwitz@gmail.com> wrote:

> Hi CL,
>
> as you noticed, the getSplits method is not called on your InputFormat. I
> don't know the reason for this. The CombineHiveInputFormat only delegates
> some functionality to your own implementation, so you cannot control
> this without re-implementing CombineHiveInputFormat. It will, however,
> call the isSplitable() method on your InputFormat, so you can return false
> from that method to keep your data files from being split. It is then up to
> the RecordReader in your implementation to decide how many lines to read to
> build up one item (which is then sent to the SerDe). Unless you have very
> large input files, this solution should work for you.
>
> You can set a Hadoop property (mapred.max.split.size) to control how much
> data is accepted for each map task, which in turn determines how many
> data files each map task will accept (if you mark your data as not
> splittable). I think the default value here is around 256 MB.
>
> Br,
> Petter
>
>
>
> 2014-10-29 19:49 GMT+01:00 ltcuong211 <ltcuong211@gmail.com>:
>
>>  Hi all,
>>
>> Does Hive support records that span multiple lines? E.g., XML data with
>> more than one line per record. Two XML data records:
>> <a id='1'>
>>     <b>
>>     </b>
>> </a>
>> <a id='2'>
>>     <b>
>>     </b>
>> </a>
>>
>> I tried to implement an InputFormat (getSplits and RecordReader) and use it
>> in a Hive table:
>> *STORED AS INPUTFORMAT org....myinputformat*
>>
>> But Hive only calls my RecordReader, not getSplits. I see that Hive calls
>> CombineHiveInputFormat.getSplits instead.
>> Do you know how to make Hive use my getSplits method? Or another way to
>> support multiple lines in one record in Hive?
>>
>> Thank you.
>> CL
>>
>
>

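The RecordReader approach Petter describes in the quoted message (keep each
file unsplit, then decide in the reader how many lines make up one item) can
be sketched in plain Java. This is only an illustration of the grouping loop,
not Hive or Hadoop code: the class MultiLineRecordGrouper, the method
readRecords, and the rule "a record ends at a line equal to the closing tag"
are all assumptions made for the example. In a real InputFormat the same loop
would sit inside the RecordReader and emit one value per group.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper for illustration; not part of Hive or Hadoop. */
class MultiLineRecordGrouper {

    /**
     * Groups consecutive lines into records. A record ends at the first line
     * whose trimmed text equals endTag (assumed convention, e.g. "</a>").
     * Trailing lines that never reach an end tag are dropped.
     */
    static List<String> readRecords(BufferedReader reader, String endTag) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Iterator<String> lines = reader.lines().iterator();
        while (lines.hasNext()) {
            String line = lines.next();
            if (current.length() > 0) {
                current.append('\n'); // keep the original line breaks inside a record
            }
            current.append(line);
            if (line.trim().equals(endTag)) {
                records.add(current.toString()); // one complete multi-line record
                current.setLength(0);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // The two sample records from the original question.
        String data = "<a id='1'>\n    <b>\n    </b>\n</a>\n"
                    + "<a id='2'>\n    <b>\n    </b>\n</a>\n";
        List<String> records =
            readRecords(new BufferedReader(new StringReader(data)), "</a>");
        for (String record : records) {
            System.out.println("--- record ---");
            System.out.println(record);
        }
    }
}
```

Matching on a closing tag only works when records are not nested and the tag
sits on its own line, as in the sample data. It also relies on the file not
being split, since a split boundary could otherwise fall in the middle of a
record.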