hadoop-mapreduce-user mailing list archives

From Devin Suiter RDX <dsui...@rdx.com>
Subject Re: Logic of isSplittable() of class FileInputFormat
Date Wed, 26 Feb 2014 13:08:37 GMT
Or, as another example, I'm writing a program to analyze a large email
dump. Each email spans more than one line. TextInputFormat will split them
up by line, in addition to deserializing them to text. I'm going to need to
customize RecordReader to split based on the MIME metadata length of the
emails instead of the newline character, and also preserve them in stream
form so the parser can parse them properly.

Or, I could subclass the InputFormat and override isSplittable() to return
false, and then only have to handle preserving the emails as an
InputStream. Incidentally, tips on that are welcome if anyone on the
list wants to help.
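
A minimal sketch of the length-based reading described above, in plain Java
with no Hadoop dependency. The record layout is an assumption made up for
illustration (a "Content-Length: <n>" line followed by exactly <n> bytes of
body), and the class name LengthDelimitedReader is hypothetical. In a real
job the same loop would sit inside a custom RecordReader, reading from the
stream opened for the split.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical format (an assumption, not the layout of any real mail dump):
//   Content-Length: <n>\n   followed by exactly <n> bytes of message body.
final class LengthDelimitedReader {
    private static final String HEADER = "Content-Length:";
    private final DataInputStream in;

    LengthDelimitedReader(InputStream in) {
        this.in = new DataInputStream(in);
    }

    // Reads bytes up to the next '\n'; returns null once the stream is exhausted.
    private String readHeaderLine() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            buf.write(b);
        }
        if (b == -1 && buf.size() == 0) {
            return null;
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8).trim();
    }

    // Returns the next message body as raw bytes, or null at end of stream.
    byte[] next() throws IOException {
        String line = readHeaderLine();
        while (line != null && line.isEmpty()) {
            line = readHeaderLine(); // tolerate blank lines between records
        }
        if (line == null) {
            return null;
        }
        if (!line.startsWith(HEADER)) {
            throw new IOException("expected a length header but found: " + line);
        }
        int length = Integer.parseInt(line.substring(HEADER.length()).trim());
        byte[] body = new byte[length];
        in.readFully(body); // throws EOFException if the record is truncated
        return body;
    }

    // Convenience wrapper: decodes every record of an in-memory buffer as UTF-8.
    static List<String> readAll(byte[] data) {
        List<String> records = new ArrayList<>();
        try {
            LengthDelimitedReader reader =
                    new LengthDelimitedReader(new ByteArrayInputStream(data));
            for (byte[] body = reader.next(); body != null; body = reader.next()) {
                records.add(new String(body, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String dump = "Content-Length: 11\nline1\nline2\nContent-Length: 3\nabc\n";
        List<String> records = readAll(dump.getBytes(StandardCharsets.UTF_8));
        System.out.println(records.size() + " records read");
    }
}
```

Because next() hands back the raw bytes of each message, a parser that wants
a stream can wrap them in a ByteArrayInputStream. Reading by length only
works when the reader starts at a record boundary, which is what keeping the
file unsplit guarantees.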

So, there are some reasons isSplittable() can be overridden. There is a
performance trade-off at some point, too, once the files get big, I think:
an unsplittable file is processed by a single mapper, and that mapper has
to spill records to disk if the data being mapped gets too big for the JVM
memory...
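
To make the trade-off concrete, here is a rough model of how a file is cut
into input splits, in plain Java. It is a sketch under assumptions, not
Hadoop's code: the class SplitPlanner and its method are hypothetical names,
and the 1.1 slack factor is taken from my reading of
FileInputFormat.getSplits(), so check it against your Hadoop version. Note
that the method in Hadoop is spelled isSplitable(), with one "t".

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPlanner {
    // Assumed slack factor: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Returns {offset, length} pairs covering a file of fileLength bytes.
    static List<long[]> planSplits(long fileLength, long splitSize, boolean splittable) {
        List<long[]> splits = new ArrayList<>();
        if (!splittable) {
            // One split for the whole file, so a single mapper reads all of it.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            // The tail, which may be slightly larger than splitSize.
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("300 MB, splittable:   " + planSplits(300 * mb, 128 * mb, true).size());
        System.out.println("300 MB, unsplittable: " + planSplits(300 * mb, 128 * mb, false).size());
        System.out.println("129 MB, splittable:   " + planSplits(129 * mb, 128 * mb, true).size());
    }
}
```

Under this model a 300 MB file with a 128 MB split size yields three splits
when splittable and one when not. The 129 MB file from the original question
falls within the slack factor, so it ends up as a single split either way,
even though HDFS stores it in two blocks.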

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Wed, Feb 26, 2014 at 6:04 AM, Dieter De Witte <drdwitte@gmail.com> wrote:

> if you have a simple one-line record format you should allow files to be
> split, since your simulations will be better balanced.
>
>
> 2014-02-26 11:31 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:
>
>> Oh. Ok. Thanks. So basically, to be on the safer side, one can always set
>> its value as false and keep the data of records consistent. I mean, the
>> length of all the records should be the same.
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 3:57 PM, Dieter De Witte <drdwitte@gmail.com> wrote:
>>
>>> No. An example could be records that have a variable number of lines. If
>>> you then allowed a file to be split, a record might be broken across two
>>> splits, so you could override isSplittable to always return false.
>>>
>>>
>>> 2014-02-26 11:22 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:
>>>
>>>> So basically what I can deduce from it is, isSplittable() only applies
>>>> to stream-compressed files. Right?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 2:06 PM, Jeff Zhang <jezhang@gopivotal.com> wrote:
>>>>
>>>>> Hi Sugandha,
>>>>>
>>>>> Take a gz file as an example. It is not splittable because of the
>>>>> compression algorithm it uses: a gzip stream cannot be decompressed
>>>>> starting from an arbitrary offset. There is no guarantee that one
>>>>> record is located in one block, and if one record is in 2 blocks, your
>>>>> program will fail since it cannot get the whole record.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 1:24 PM, Sugandha Naolekar <
>>>>> sugandha.n87@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Suppose a single file of size 129 MB is split into two HDFS blocks,
>>>>>> as the max block size is 128 MB, and each of the blocks is read
>>>>>> according to the InputFormat in use. What is the significance of the
>>>>>> isSplittable() method then?
>>>>>>
>>>>>> If it is set to false, will the entire block be considered a single
>>>>>> input split? How will TextInputFormat react to it?
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
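
On the original question of how TextInputFormat copes with a split boundary
that falls in the middle of a line: as far as I understand its
LineRecordReader, a reader whose split does not start at offset 0 discards
everything up to the first newline, and every reader keeps going past the
end of its split to finish the last line it started. The sketch below
models that rule in plain Java on an in-memory buffer. LineSplitModel and
readSplit are hypothetical names, and this is a simplification, not the
Hadoop implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class LineSplitModel {
    // Returns the lines owned by the split [start, end) of data.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Skip the first (possibly partial) line; the previous split reads it.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // A line belongs to this split if it starts at or before 'end'.
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbb\nccc\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at offset 6 falls in the middle of "bbb".
        System.out.println(readSplit(data, 0, 6));
        System.out.println(readSplit(data, 6, data.length));
    }
}
```

With this rule every line is read exactly once, no matter where the
boundary falls, which is why one-line records can be split safely. It does
nothing for records that span several lines, which is the case where
keeping the file unsplit or writing a custom RecordReader comes in.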
