hadoop-common-dev mailing list archives

From Chris Douglas <cdoug...@apache.org>
Subject Re: Change proposal for FileInputFormat isSplitable
Date Tue, 10 Jun 2014 18:10:20 GMT
On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes <Niels@basjes.nl> wrote:
> and if you then give the file the .gz extension this breaks all common
> sense / conventions about file names.

That the suffix should determine, for all compression codecs in every
context and for all future codecs, whether a file can be split is not an
assumption we can make safely. Again, that assumption did not hold when
people built their current systems, and they would be justly annoyed
with the project for changing it.
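
To make the assumption concrete, here is a minimal sketch of a rule that
decides splittability from the file name alone. It is not Hadoop code: the
class SuffixSplitCheck, its method, and the two-entry suffix list are
hypothetical and exist only for illustration. The only facts it relies on
are that gzip streams cannot be split and bzip2 streams can.

```java
import java.util.Locale;

// Hypothetical illustration only; this is not part of the Hadoop API.
final class SuffixSplitCheck {

    private SuffixSplitCheck() {
    }

    // Decides splittability from the file name alone, ignoring both the
    // file contents and the input format that will read the file.
    static boolean isSplitableBySuffix(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".gz")) {
            return false; // gzip streams cannot be split
        }
        if (lower.endsWith(".bz2")) {
            return true; // bzip2 streams can be split at block boundaries
        }
        // Assumption: any other name is treated as uncompressed input.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSplitableBySuffix("logs/part-00000.gz"));
        System.out.println(isSplitableBySuffix("logs/part-00000.bz2"));
        System.out.println(isSplitableBySuffix("logs/part-00000"));
    }
}
```

Under such a rule, every file whose name ends in .gz is reported as
non-splittable regardless of which input format reads it, so an existing
system that names its files differently from the convention would see its
behavior change.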

> I hold "correct data" much higher than performance and scalability; so the
> performance impact is a concern but it is much less important than the list
> of bugs we are facing right now.

These are not bugs. NLineInputFormat doesn't support compressed input,
and why would it? -C

> The safest way would be either 2 or 4. Solution 3 would effectively be the
> same as the current implementation, yet it would catch the problem
> situations as long as people stick to normal file name conventions.
> Solution 3 would also allow removing some code duplication in several
> subclasses.
>
> I would go for solution 3.
>
> Niels Basjes
