hadoop-common-dev mailing list archives

From Chris Douglas <cdoug...@apache.org>
Subject Re: Change proposal for FileInputFormat isSplitable
Date Sat, 31 May 2014 21:12:31 GMT
On Fri, May 30, 2014 at 11:05 PM, Niels Basjes <Niels@basjes.nl> wrote:
> How would someone create the situation you are referring to?

By adopting a naming convention where the filename suffix doesn't
imply that the raw data are compressed with that codec.

For example, if a user named SequenceFiles foo.lzo and foo.gz to
record which codec was used, then isSplitable would spuriously return
false. -C
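
To make the failure mode above concrete, here is a minimal sketch in Python. It is not Hadoop code and not the patch under discussion: the suffix table, the function names, and the content-sniffing alternative are all assumptions made for illustration. The only format detail relied on is that a SequenceFile starts with the bytes "SEQ".

```python
import os

# Hypothetical suffix table for illustration only; this is not Hadoop's
# codec registry, and real splittability depends on the codec in use.
UNSPLITTABLE_SUFFIXES = {".gz", ".lzo"}

# A SequenceFile begins with the magic bytes "SEQ" followed by a version byte.
SEQUENCE_FILE_MAGIC = b"SEQ"


def is_splitable_by_suffix(filename):
    """Guess splittability from the filename suffix alone."""
    _, ext = os.path.splitext(filename)
    return ext.lower() not in UNSPLITTABLE_SUFFIXES


def is_splitable_by_content(filename, header):
    """Hypothetical alternative: trust the container header over the name.

    `header` holds the first few bytes of the file. A SequenceFile stays
    splittable whatever codec compresses its records or blocks, so the
    suffix is ignored when the magic bytes match.
    """
    if header.startswith(SEQUENCE_FILE_MAGIC):
        return True
    return is_splitable_by_suffix(filename)


if __name__ == "__main__":
    # A SequenceFile named foo.gz only to record which codec was used.
    seq_header = b"SEQ\x06"
    print(is_splitable_by_suffix("foo.gz"))               # suffix-only guess
    print(is_splitable_by_content("foo.gz", seq_header))  # header-aware guess
```

With a suffix-only check, the SequenceFile named foo.gz is treated as a single unsplittable input, which is the silent slowdown described in this thread. The header-aware variant is just one possible way around it and has its own cost, since it needs to read the start of every input file.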

> On May 31, 2014 1:06 AM, "Doug Cutting" <cutting@apache.org> wrote:
>
>> I was trying to explain my comment, where I stated that, "changing the
>> default implementation to return false would be an incompatible
>> change".  The patch was added 6 months after that comment, so the
>> comment didn't address the patch.
>>
>> The patch does not appear to change the default implementation to
>> return false unless the suffix of the file name is that of a known
>> unsplittable compression format.  So the folks who'd be harmed by this
>> are those who used a suffix like ".gz" for an Avro, Parquet or
>> other-format file.  Their applications might suddenly run much slower
>> and it would be difficult for them to determine why.  Such folks are
>> probably few, but perhaps exist.  I'd prefer a change that avoided
>> that possibility entirely.
>>
>> Doug
>>
>> On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <Niels@basjes.nl> wrote:
>> > Hi,
>> >
>> > The way I see the effects of the original patch on existing subclasses:
>> > - implemented isSplitable
>> >    --> no performance difference.
>> > - did not implement isSplitable
>> >    --> then there is no performance difference if the container is either
>> > not compressed or uses a splittable compression.
>> >    --> If it uses a common non splittable compression (like gzip) then the
>> > output will suddenly be different (which is the correct answer) and the
>> > jobs will finish sooner because the input is not processed multiple times.
>> >
>> > Where do you see a performance impact?
>> >
>> > Niels
>> > On May 30, 2014 8:06 PM, "Doug Cutting" <cutting@apache.org> wrote:
>> >
>> >> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <Niels@basjes.nl> wrote:
>> >> > For arguments I still do not fully understand this was rejected by Todd
>> >> > and Doug.
>> >>
>> >> Performance is a part of compatibility.
>> >>
>> >> Doug
>> >>
>>
