hadoop-mapreduce-user mailing list archives

From Hong Tang <ht...@yahoo-inc.com>
Subject Re: concatenated gzip support: default on or not?
Date Tue, 15 Jun 2010 23:23:28 GMT
+1 for (3).

On Jun 15, 2010, at 12:55 PM, Greg Roelofs wrote:

> As some folks have found out the hard way, only the first member of a
> concatenated gzip file is recognized by current versions of Hadoop,
> including trunk; the remainder is silently ignored.  I'm working on
> the fix (MAPREDUCE-469), and the question has come up whether to make
> the fixed version the default, which would represent a behavior change.
> So, three options:
> (1) configurable; concatenation support not enabled by default
> (2) configurable; concatenation support enabled by default (behavior change)
> (3) not configurable; concatenation support always enabled (behavior change)
> Opinions?  The current proto-patch makes it configurable but leaves the
> default unchanged from previous behavior (option 1).  However, since the
> failure is silent (and there doesn't appear to be an easy way to emit a
> warning due to buffering effects: MAPREDUCE-1795), a number of users have
> argued that this is purely a bug that needs to be fixed, in which case
> perhaps (3) would be appropriate.  I'm personally sympathetic to this
> view, FWIW; on the other hand, unanticipated, user-visible behavior
> changes can lead to unhappiness, too.
> Note that concatenated bzip2 streams are not supported in 0.20 but are
> in trunk (reportedly--I haven't yet verified for myself), thanks to the
> splittable-codec support.  AFAIK, this is not configurable--i.e., it's
> similar to option (3) except with the benefit of extra functionality
> included on top.
> Thanks,
>  Greg
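
For readers unfamiliar with the failure mode described in the quoted message, here is a minimal sketch of it. It is not Hadoop code and not the MAPREDUCE-469 patch; it uses Python's standard zlib module as a stand-in, and the function names are made up for illustration. The first function stops after one gzip member and silently drops the rest, which mirrors the behavior being reported; the second keeps decoding until the input is exhausted.

```python
import gzip
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tells zlib to expect a gzip header and trailer


def decompress_first_member(data: bytes) -> bytes:
    """Hypothetical helper: decode only the first gzip member.

    Anything after the first member ends up in unused_data and is
    ignored without any error, i.e. the silent truncation case.
    """
    decoder = zlib.decompressobj(GZIP_WBITS)
    return decoder.decompress(data)


def decompress_all_members(data: bytes) -> bytes:
    """Hypothetical helper: decode every member of a concatenated gzip stream."""
    chunks = []
    while data:
        decoder = zlib.decompressobj(GZIP_WBITS)
        chunks.append(decoder.decompress(data))
        chunks.append(decoder.flush())
        if not decoder.eof:
            raise ValueError("truncated gzip member")
        # Bytes left over after this member's trailer start the next member.
        data = decoder.unused_data
    return b"".join(chunks)


# Equivalent of `cat a.gz b.gz > both.gz`
concatenated = gzip.compress(b"first member\n") + gzip.compress(b"second member\n")

first_only = decompress_first_member(concatenated)
everything = decompress_all_members(concatenated)

print(first_only)
print(everything)
```

The gzip format (RFC 1952) allows a file to consist of several members back to back, so the second function reflects what tools such as gunzip do with the same input.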
