hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Greg Roelofs (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm
Date Sat, 07 Aug 2010 05:47:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896222#action_12896222

Greg Roelofs commented on HADOOP-6349:

OK, with respect to the 20100730 patch, the biggest issue by far is the buffering, both compression
and decompression, as we've discussed in person:

 * since FastLZ is block-based with an uncompressed block size of 64 KB, it should never require
a buffer larger than 64 KB for either input or output, and certainly never the entirety of
the file
   ** it sounds like a (relatively) quick fix is possible, but otherwise it might be worth
looking at SplittableCompressionCodec and/or BlockCompressorStream, per comments below

The other items are mostly style-related:

 * move imported code to src/contrib, per offline discussion with Owen O'Malley
 * boilerplate should probably mention original MIT license
 * no src/java/core-default.xml change? (add to io.compression.codecs)
 * [cosmetic: nuke commented-out System.currentTimeMillis(), debug printfs, etc.]
 * [cosmetic: strip all trailing blanks on new, non-3rdparty/imported (i.e., non-src/contrib)
 * prefer to drop "J" prefix on fastlz/*.java (consistent with C, other codecs)
 * add package.html
   ** 64 KB block size
   ** mainly level 2 with fallback to level 1 for partial blocks
   ** weird triple-self-inclusion of fastlz.c
   ** possible future speed/etc. improvements:
     *** use existing Adler-32 in java.util.zip or zlib or wherever?
     *** parallel Adler-32 (and CRC-32) capability in zlib, pigz
 * FastLZCodec.java
   ** originally was going to suggest that it should extend DefaultCodec (like GzipCodec)
rather than copy it, but as a block-based codec, Splittable*Codec is better: we'll file separate
JIRA for splittable support later (also/alternatively, per Arun Murthy, BlockCompressorStream?)
 * JFastLZ.java
   ** readU32() BigInteger version should be nuked in favor of (fixed) readU32Fast() (not
used much in any case, but...ewww)
   ** why isn't all of "Moved." block (line 686) commented out/deleted?
   ** invert "Never runs b/c" block so true path is first
     *** does !FASTLZ_STRICT_ALIGN option even work? if not, nuke altogether
   ** no need for double parens on while condition (line 768)
 * JFastLZCompressor.java
   ** lose empty else in static block
   ** "v. low" -> "very low", "b/c" -> "because", etc. (in general, avoid abbreviations
in comments; English isn't everyone's primary language)
   ** had hoped b/iBuff arraycopy ("so CompressorStream doesn't muck w/ our data") was unnecessary,
but doesn't look like it: b[] comes straight from user via write() call
   ** "XXX" ?
   ** growOutputBuffer() used for apparently tiny deltas: should simply double size (unless
strong statistical or theoretical arg why small increments are rare) [this is subsumed by
whole fix-the-buffering issue above]
   ** use "L" for long-constant appendage: lowercase looks too much like "1"
   ** [cosmetic: need spaces on outside of for-loop and if-conditional parens]
   ** why isn't "if (totalCompressed == 26)" block (line 356) deleted?
   ** "enought"
   ** looks like native checksum call is wasted if incompressible data: should be moved down
to other half of "if (!nativeLoaded)" conditional (line 416) (more symmetric that way, too)
   ** [cosmetic: prefer variables at top of class (SAFE_PACK)]
 * JFastLZDecompressor.java
   ** [cosmetic: wrap long lines (native methods)]
   ** lose empty else in static block
   ** "v." -> "very", etc. (as above)
   ** DEFAULT_BLOCK_SIZE hardcoded at 64K ... configured value completely ignored (OK, probably
related to "native" FastLZ block size...never mind)

Btw, this one is clearly much, much simpler than the LZMA patch, so if it comes down to one
or the other, please concentrate on LZMA. If necessary, I (or someone) can pick this up and
finish it without much trouble, I think.

> Implement FastLZCodec for fastlz/lzo algorithm
> ----------------------------------------------
>                 Key: HADOOP-6349
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6349
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: William Kinney
>         Attachments: hadoop-6349-1.patch, hadoop-6349-2.patch, hadoop-6349-3.patch, HADOOP-6349-TestFastLZCodec.patch,
HADOOP-6349.patch, TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv
> Per  [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ is a good
(speed, license) alternative to LZO. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message