hive-user mailing list archives

From "Petter von Dolwitz (Hem)" <>
Subject Re: Support multilines in 1 record in Hive
Date Fri, 31 Oct 2014 13:09:44 GMT
Hi CL,

as you noticed, the getSplits method is not called on your InputFormat. I
don't know the reason for this. The CombineHiveInputFormat delegates only
some functionality to your own implementation, so you basically cannot
control splitting without re-implementing CombineHiveInputFormat. It will,
however, call the isSplitable() method (spelled with a single "t" in
Hadoop's FileInputFormat) on your InputFormat, so you can return false from
this method to keep your data files from being split. It is then up to your
RecordReader implementation how many lines it reads to build up one item
(which is then sent to the SerDe). Unless you have very large input files,
this solution should work for you.
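
To illustrate the RecordReader part, here is a minimal sketch in plain Java
of the line-grouping logic only. It is not tied to the Hadoop API: the class
name MultiLineRecordGrouper and its methods are hypothetical, and it assumes
that every record ends on a line whose last token is the closing tag (</a>
in your example). In a real RecordReader, something like nextRecord() would
sit inside next() and read from the file split instead of a StringReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups physical lines into one logical record per closing tag. */
class MultiLineRecordGrouper {
    private final BufferedReader reader;
    private final String endTag;

    MultiLineRecordGrouper(BufferedReader reader, String endTag) {
        this.reader = reader;
        this.endTag = endTag;
    }

    /** Returns the next multi-line record, or null when the input is exhausted. */
    String nextRecord() throws IOException {
        StringBuilder record = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (record.length() > 0) {
                record.append('\n');
            }
            record.append(line);
            // Assumption: a record is complete once a line ends with the closing tag.
            if (line.trim().endsWith(endTag)) {
                return record.toString();
            }
        }
        // A trailing fragment without a closing tag is returned as-is if non-empty.
        return record.length() > 0 ? record.toString() : null;
    }

    /** Convenience wrapper for experiments: splits a whole string into records. */
    static List<String> readAll(String input, String endTag) {
        MultiLineRecordGrouper grouper =
                new MultiLineRecordGrouper(new BufferedReader(new StringReader(input)), endTag);
        List<String> records = new ArrayList<>();
        try {
            String record;
            while ((record = grouper.nextRecord()) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

Because the file is never split, the reader always starts at the beginning
of a record, which is why this simple end-tag check is sufficient here.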

You can set a Hadoop property (mapred.max.split.size) to control how much
data each map task accepts, which in turn determines how many data files
each map task processes (if you mark your data as not splittable). I think
the default value here is around 256MB.


2014-10-29 19:49 GMT+01:00 ltcuong211 <>:

>  Hi all,
> Does Hive support records that span multiple lines? E.g. XML data with
> more than 1 line per record. 2 XML data records:
> <a id='1'>
>     <b>
>     </b>
> </a>
> <a id='2'>
>     <b>
>     </b>
> </a>
> I tried to implement an InputFormat (getSplits and RecordReader) and use it
> in a Hive table:
> *STORED AS INPUTFORMAT org....myinputformat*
> But Hive only calls my RecordReader, not getSplits. I see that Hive calls
> CombineHiveInputFormat.getSplits instead.
> Do you know how to make Hive use my getSplits method? Or another way to
> support multi-line records in Hive?
> Thank you.
> CL
