hadoop-common-dev mailing list archives

From Chris Douglas <cdoug...@apache.org>
Subject Re: Change proposal for FileInputFormat isSplitable
Date Sun, 01 Jun 2014 23:21:04 GMT
On Sat, May 31, 2014 at 10:53 PM, Niels Basjes <Niels@basjes.nl> wrote:
> The Hadoop framework uses the filename extension to automatically insert
> the "right" decompression codec in the read pipeline.

This would be the new behavior, incompatible with existing code.

> So if someone does what you describe then they would need to unload all
> compression codecs or face decompression errors. And if it really was
> gzipped then it would not be splittable at all.

Suppose an InputFormat configured for a job relies on isSplitable
returning true because it extends FileInputFormat. After the change, it
could spuriously return false based on the suffix of the input files.
In the aforementioned example, SequenceFile is splittable, even if the
codec used in each block is not. -C
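
The incompatibility described here can be sketched in a few lines. This is a
minimal model, not Hadoop code: the class SplitableSketch, the methods
currentDefault and proposedDefault, and the suffix list are all hypothetical
stand-ins, assuming the proposed default returns false whenever the file name
ends with the suffix of a known unsplittable codec.

```java
// Hypothetical model of the two defaults discussed in this thread.
// None of these names come from Hadoop; they only illustrate the behavior.
class SplitableSketch {

    // Assumed stand-in for "suffixes of known unsplittable codecs".
    static final String[] UNSPLITTABLE_SUFFIXES = {".gz", ".lzo"};

    // Models the existing default: every file is treated as splittable.
    static boolean currentDefault(String fileName) {
        return true;
    }

    // Models the proposed default: the file name suffix alone decides.
    static boolean proposedDefault(String fileName) {
        for (String suffix : UNSPLITTABLE_SUFFIXES) {
            if (fileName.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // foo.gz and foo.lzo stand for SequenceFiles named after their block codec.
        for (String name : new String[] {"foo.seq", "foo.gz", "foo.lzo"}) {
            System.out.println(name
                    + " current=" + currentDefault(name)
                    + " proposed=" + proposedDefault(name));
        }
    }
}
```

Under these assumptions, a SequenceFile named foo.gz keeps its contents and
its container format, yet the answer flips from true to false purely because
of its name. A subclass that never overrode isSplitable would see that change
without any change to its own code.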

> Niels
> On May 31, 2014 11:12 PM, "Chris Douglas" <cdouglas@apache.org> wrote:
>
>> On Fri, May 30, 2014 at 11:05 PM, Niels Basjes <Niels@basjes.nl> wrote:
>> > How would someone create the situation you are referring to?
>>
>> By adopting a naming convention where the filename suffix doesn't
>> imply that the raw data are compressed with that codec.
>>
>> For example, if a user named SequenceFiles foo.lzo and foo.gz to
>> record which codec was used, then isSplitable would spuriously return
>> false. -C
>>
>> > On May 31, 2014 1:06 AM, "Doug Cutting" <cutting@apache.org> wrote:
>> >
>> >> I was trying to explain my comment, where I stated that, "changing the
>> >> default implementation to return false would be an incompatible
>> >> change".  The patch was added 6 months after that comment, so the
>> >> comment didn't address the patch.
>> >>
>> >> The patch does not appear to change the default implementation to
>> >> return false unless the suffix of the file name is that of a known
>> >> unsplittable compression format.  So the folks who'd be harmed by this
>> >> are those who used a suffix like ".gz" for an Avro, Parquet or
>> >> other-format file.  Their applications might suddenly run much slower
>> >> and it would be difficult for them to determine why.  Such folks are
>> >> probably few, but perhaps exist.  I'd prefer a change that avoided
>> >> that possibility entirely.
>> >>
>> >> Doug
>> >>
>> >> On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <Niels@basjes.nl> wrote:
>> >> > Hi,
>> >> >
>> >> > The way I see the effects of the original patch on existing subclasses:
>> >> > - implemented isSplitable
>> >> >    --> no performance difference.
>> >> > - did not implement isSplitable
>> >> >    --> then there is no performance difference if the container is
>> >> >        either not compressed or uses a splittable compression.
>> >> >    --> If it uses a common non splittable compression (like gzip) then
>> >> >        the output will suddenly be different (which is the correct
>> >> >        answer) and the jobs will finish sooner because the input is
>> >> >        not processed multiple times.
>> >> >
>> >> > Where do you see a performance impact?
>> >> >
>> >> > Niels
>> >> > On May 30, 2014 8:06 PM, "Doug Cutting" <cutting@apache.org> wrote:
>> >> >
>> >> >> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <Niels@basjes.nl> wrote:
>> >> >> > For arguments I still do not fully understand this was rejected by
>> >> >> > Todd and Doug.
>> >> >>
>> >> >> Performance is a part of compatibility.
>> >> >>
>> >> >> Doug
>> >> >>
>> >>
>>
