hadoop-mapreduce-user mailing list archives

From: Greg Roelofs <roel...@yahoo-inc.com>
Subject: concatenated gzip support: default on or not?
Date: Tue, 15 Jun 2010 19:55:18 GMT

As some folks have found out the hard way, only the first member of a
concatenated gzip file is recognized by current versions of Hadoop,
including trunk; the remainder is silently ignored.  I'm working on
the fix (MAPREDUCE-469), and the question has come up whether to make
the fixed version the default, which would represent a behavior change.
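
For anyone who hasn't run into this, here is a rough sketch of the two
behaviors in plain Python. It is not Hadoop code; it only uses the standard
zlib and gzip modules to show the difference between stopping after the
first gzip member and reading every member. The function names are made up
for illustration.

```python
import gzip
import zlib

# wbits value that tells zlib to expect a gzip header and trailer.
GZIP_WBITS = 16 + zlib.MAX_WBITS


def first_member_only(data: bytes) -> bytes:
    """Decode only the first gzip member; later members are dropped silently."""
    d = zlib.decompressobj(wbits=GZIP_WBITS)
    return d.decompress(data) + d.flush()


def all_members(data: bytes) -> bytes:
    """Decode every gzip member in a concatenated stream."""
    out = []
    while data:
        d = zlib.decompressobj(wbits=GZIP_WBITS)
        out.append(d.decompress(data))
        out.append(d.flush())
        # Bytes after the end of this member belong to the next one.
        data = d.unused_data
    return b"".join(out)


# Two independently compressed members, as produced by `cat a.gz b.gz`.
concatenated = gzip.compress(b"first\n") + gzip.compress(b"second\n")

print(first_member_only(concatenated))
print(all_members(concatenated))
```

The first function returns without any error even though half of the input
was never decoded, which is the silent-failure case described above.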

So, three options:

(1) configurable; concatenation support not enabled by default
(2) configurable; concatenation support enabled by default (behavior change)
(3) not configurable; concatenation support always enabled (behavior change)

Opinions?  The current proto-patch makes it configurable but leaves the
default unchanged from previous behavior (option 1).  However, since the
failure is silent (and there doesn't appear to be an easy way to emit a
warning due to buffering effects: MAPREDUCE-1795), a number of users have
argued that this is purely a bug that needs to be fixed, in which case
perhaps (3) would be appropriate.  I'm personally sympathetic to this
view, FWIW; on the other hand, unanticipated, user-visible behavior
changes can lead to unhappiness, too.

Note that concatenated bzip2 streams are not supported in 0.20 but are
in trunk (reportedly--I haven't yet verified for myself), thanks to the
splittable-codec support.  AFAIK, this is not configurable--i.e., it's
similar to option (3) except with the benefit of extra functionality
included on top.

