hadoop-common-dev mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Re: Change proposal for FileInputFormat isSplitable
Date Sat, 31 May 2014 06:05:53 GMT
Ok, got it.

If someone has an Avro file (foo.avro) and gzips it (foo.avro.gz), then
the framework will select the GzipCodec, which is not capable of splitting,
and that causes the problem. So gzipping a splittable file makes it
non-splittable.
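
To make the scenario concrete, here is a minimal sketch of a suffix-based
decision. It is a simplified model and not Hadoop's actual code: the
CODECS_BY_SUFFIX table and the is_splitable function are hypothetical names
made up for this illustration.

```python
import os

# Hypothetical lookup table for this sketch: suffix -> (codec, splittable?).
# gzip streams cannot be split; bzip2 is listed as a splittable example.
CODECS_BY_SUFFIX = {
    ".gz": ("gzip", False),
    ".bz2": ("bzip2", True),
}


def is_splitable(filename: str) -> bool:
    """Decide splittability from the file name suffix alone."""
    suffix = os.path.splitext(filename)[1].lower()
    codec = CODECS_BY_SUFFIX.get(suffix)
    if codec is None:
        # No codec matched the suffix: treat the file as splittable.
        return True
    _name, splittable = codec
    return splittable


for name in ("foo.avro", "foo.avro.gz", "foo.txt.bz2"):
    print(name, is_splitable(name))
```

In this model the decision depends only on the last suffix, so the content
format underneath (Avro or anything else) plays no role once ".gz" is present.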

At my workplace we have applied gzip to Avro, but there the compression
applies to the blocks inside the Avro file. So there are multiple gzipped
blocks inside an Avro container, which remains a splittable file without any
changes.
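
The difference between the two layouts can be sketched with a toy example.
This is not the Avro file format: the list of independently compressed
blocks below only stands in for a container, and can_start_at is a
hypothetical helper written for this illustration.

```python
import gzip
import zlib

records = [f"record-{i}".encode() for i in range(1000)]

# Layout 1: the whole file is one gzip stream (like foo.avro.gz).
# mtime=0 keeps the gzip header bytes deterministic.
whole = gzip.compress(b"\n".join(records), mtime=0)


def can_start_at(data: bytes, offset: int) -> bool:
    """Return True if gzip decompression can begin at this byte offset."""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Layout 2: a toy container whose blocks are compressed independently.
BLOCK_SIZE = 250
blocks = [
    zlib.compress(b"\n".join(records[i:i + BLOCK_SIZE]))
    for i in range(0, len(records), BLOCK_SIZE)
]

# A reader can decode the third block without touching the first two.
third = zlib.decompress(blocks[2]).split(b"\n")

print("single stream readable from offset 0:", can_start_at(whole, 0))
print("single stream readable from offset 1:", can_start_at(whole, 1))
print("first record of third block:", third[0])
```

With the single stream a reader has to start at the beginning, while each
block of the toy container can be handed to a different reader.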

How would someone create the situation you are referring to?
On May 31, 2014 1:06 AM, "Doug Cutting" <cutting@apache.org> wrote:

> I was trying to explain my comment, where I stated that, "changing the
> default implementation to return false would be an incompatible
> change".  The patch was added 6 months after that comment, so the
> comment didn't address the patch.
>
> The patch does not appear to change the default implementation to
> return false unless the suffix of the file name is that of a known
> unsplittable compression format.  So the folks who'd be harmed by this
> are those who used a suffix like ".gz" for an Avro, Parquet or
> other-format file.  Their applications might suddenly run much slower
> and it would be difficult for them to determine why.  Such folks are
> probably few, but perhaps exist.  I'd prefer a change that avoided
> that possibility entirely.
>
> Doug
>
> On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <Niels@basjes.nl> wrote:
> > Hi,
> >
> > The way I see the effects of the original patch on existing subclasses:
> > - implemented isSplitable
> >    --> no performance difference.
> > - did not implement isSplitable
> >    --> then there is no performance difference if the container is either
> > not compressed or uses a splittable compression.
> >    --> If it uses a common non splittable compression (like gzip) then
> > the output will suddenly be different (which is the correct answer) and
> > the jobs will finish sooner because the input is not processed multiple
> > times.
> >
> > Where do you see a performance impact?
> >
> > Niels
> > On May 30, 2014 8:06 PM, "Doug Cutting" <cutting@apache.org> wrote:
> >
> >> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <Niels@basjes.nl> wrote:
> >> > For arguments I still do not fully understand this was rejected by
> >> > Todd and Doug.
> >>
> >> Performance is a part of compatibility.
> >>
> >> Doug
> >>
>
