hadoop-common-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6835) Support concatenated gzip and bzip2 files
Date Fri, 25 Jun 2010 18:39:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882674#action_12882674 ]

Chris Douglas commented on HADOOP-6835:
---------------------------------------

bq. Currently the new BuiltInGzipDecompressor class inherits directly from JDK Inflater, but
I suspect I should extend BuiltInZlibInflater instead.

I'd lean the other way. It's not really a subtype of {{BuiltInZlibInflater}}, and neither is
a public type. It's not really an {{Inflater}}, either; it may be worth either supporting
that interface or using an {{Inflater}} member rather than inheritance, since it only calls
public methods. Either way, this is an existing confusion in the compression type hierarchy,
and if untangling it would call for additional testing, it can be left out of this issue.
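
The composition alternative might look like the following sketch; the class name and the
particular set of delegated methods are illustrative, not taken from the patch. The point is
that the gzip decompressor only needs the {{Inflater}} public API, so it can hold one as a
member rather than extend it:

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical sketch (not the patch's code): composition instead of
// inheritance. The decompressor owns an Inflater member and delegates to its
// public methods; gzip header/trailer handling would sit around the delegation.
public class GzipSketch {
    // Raw deflate, as found inside a gzip member (RFC 1952 wraps RFC 1951).
    private final Inflater inflater = new Inflater(true);

    public void setInput(byte[] b, int off, int len) { inflater.setInput(b, off, len); }
    public int inflate(byte[] b) throws DataFormatException { return inflater.inflate(b); }
    public boolean finished() { return inflater.finished(); }
    public void reset() { inflater.reset(); }
    public void end() { inflater.end(); }
}
```

This avoids tying the class to either {{Inflater}}'s or {{BuiltInZlibInflater}}'s type, at
the cost of a handful of forwarding methods.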

bq. Is it worthwhile to encapsulate the state label and associated variables into a private
inner class (BuiltInGzipDecompressor.java, first FIXME comment)?

Since the code is already implemented and tested, refactoring it for a slightly cleaner implementation
of a user-opaque, RFC-compliant library doesn't seem like a reasonable condition for committing
it.

bq. DecompressorStream currently supports two concatenation modes via a pseudo-ifdef ("final
boolean useResetPartially"): resetPartially(), which avoids any additional buffer copies at
a cost of uglifying the Decompressor interface with this new method; or regular reset() +
setInput() to recopy any "excess" bytes (that is, from stream N+1) at the end of stream N.
The amount of recopying in the latter case is dependent on the buffer sizes (typically 64KB
around here) and sizes of the concatenated gzip streams/members, but in general it won't be
much. Barring strong disagreement, I'll go with the latter approach and clean up all the resetPartially()
stuff in the next (hopefully final) version of the patch.

Agreed; the penalty of re-copying once per stream is light enough to endure for the sake of
keeping the API clean for the other codecs.
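
The reset() + setInput() approach can be sketched against plain {{java.util.zip}}; all class
and method names here are hypothetical stand-ins, not code from the patch. After stream N
finishes, {{Inflater.getRemaining()}} reports the "excess" bytes already read past the end of
that stream, which get re-copied into a fresh input buffer for stream N+1:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch of the reset() + setInput() concatenation mode: the
// excess bytes left over at the end of stream N are re-copied (one extra
// copy) and handed back as the input for stream N+1.
public class ConcatReset {

    // Compress data into one complete zlib-wrapped deflate stream.
    public static byte[] deflate(byte[] data) {
        Deflater d = new Deflater();
        d.setInput(data);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    // Decompress a buffer holding two concatenated deflate streams.
    public static String demo() throws DataFormatException {
        byte[] s1 = deflate("hello ".getBytes());
        byte[] s2 = deflate("world".getBytes());
        byte[] input = new byte[s1.length + s2.length];
        System.arraycopy(s1, 0, input, 0, s1.length);
        System.arraycopy(s2, 0, input, s1.length, s2.length);

        Inflater inf = new Inflater();
        inf.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (true) {
            out.write(buf, 0, inf.inflate(buf));
            if (inf.finished()) {
                int rem = inf.getRemaining(); // excess bytes: start of stream N+1
                if (rem == 0) break;          // no further streams
                // The one extra copy under discussion: plain reset() +
                // setInput(), no resetPartially() needed on the interface.
                byte[] excess = Arrays.copyOfRange(input, input.length - rem, input.length);
                inf.reset();
                inf.setInput(excess);
                input = excess;
            }
        }
        inf.end();
        return out.toString();
    }
}
```

The copy is bounded by the read-buffer size, which is why its per-stream cost stays small
relative to keeping the {{Decompressor}} interface unchanged.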

bq. Any last-minute qualms about hardcoding the concatenation behavior? It would simplify
the patch slightly and seems to be the preferred approach, so that's my plan for the next
version.

Sounds fine to me. It may cause faults in some containers, but those are probably bugs covered
over by this one.

> Support concatenated gzip and bzip2 files
> -----------------------------------------
>
>                 Key: HADOOP-6835
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6835
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tom White
>            Assignee: Greg Roelofs
>         Attachments: grr-hadoop-common.dif.20100614c, grr-hadoop-mapreduce.dif.20100614c,
MR-469.v2.yahoo-0.20.2xx-branch.patch
>
>
> When running MapReduce with concatenated gzip files as input only the first part is read,
which is confusing, to say the least. Concatenated gzip is described in http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
and in http://www.ietf.org/rfc/rfc1952.txt. (See original report at http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)
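
The multiple-member behavior the issue asks for can be demonstrated with the JDK alone:
recent versions of {{java.util.zip.GZIPInputStream}} decode concatenated gzip members
(RFC 1952 allows a file to be a series of members), which is the behavior this issue brings
to Hadoop's codecs. A minimal sketch, with illustrative class and method names:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: two independently gzipped payloads concatenated into one buffer
// should decompress to both payloads, not just the first.
public class ConcatGzip {

    // Compress a string into one complete gzip member.
    public static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes());
        }
        return bos.toByteArray();
    }

    // Decompress a buffer that may hold several concatenated gzip members.
    public static String gunzipAll(byte[] data) throws IOException {
        GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        int n;
        while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] first = gzip("first ");
        byte[] second = gzip("second");
        byte[] concat = new byte[first.length + second.length];
        System.arraycopy(first, 0, concat, 0, first.length);
        System.arraycopy(second, 0, concat, first.length, second.length);
        System.out.println(gunzipAll(concat)); // both members decoded: first second
    }
}
```

This mirrors `cat a.gz b.gz | gunzip` from the gzip manual's "Advanced usage" section cited
above: the decompressed output is the concatenation of the members' contents.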

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

