hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo
Date Mon, 17 Nov 2008 22:55:44 GMT


Chris Douglas updated HADOOP-4640:
----------------------------------

    Status: Open  (was: Patch Available)

bq. Will only skip verifying the checksums in the close method if we haven't decompressed
the whole block. That block will be verified by another split later anyway.
The data is already decompressed, but it hasn't been read out of the codec's buffer. Adding
a new, public method instead of calculating the checksum for the remainder of the buffered
block seems like the wrong tradeoff. Something like:
{code}
public void close() throws IOException {
  // Drain the rest of the current block out of the codec's buffer so its
  // checksum can still be verified before closing.
  byte[] b = new byte[4096];
  while (!decompressor.finished()) {
    decompressor.decompress(b, 0, b.length);
  }
  super.close();
  verifyChecksums();
}
{code}
should work, right? Allocating in the close is less efficient than, say, passing the Checksum
object to the codec, but it requires fewer changes to the interfaces.

* Using a TreeSet of Long seems unnecessary when the indices are sorted. Since the number
of blocks stored in the index can be calculated from its length, a type wrapping a long[]
seems more appropriate (the member function on said type can use Arrays::binarySearch instead
of TreeSet::ceiling).
* It doesn't need to be part of this patch, but it's worth noting that splittable lzop inputs
will create hot spots of the blocks storing the headers. If this were abstracted, then the
split could be annotated with the properties of the file and the RecordReader initialized
with block properties.
* The count of checksums should include both compressed and decompressed checksums.
* Instead of {{pos + 8}} in createIndex, it would make more sense to record the position in
the stream after reading the two ints (so skipping the block uses the more readable {{pos
+ compressedBlockSize + 4 * numChecksums}}).
* The only termination condition in LzoTextInputFormat::createIndex is uncompressedBlockSize
== 0. Values < 0 for uncompressedBlockSize should throw EOFException while values <=
0 for compressedBlockSize should throw IOException.
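For the first point, the wrapper type could be as small as something like this (a rough
sketch; the name {{LzoIndex}} and the method name are only illustrative):
{code}
import java.util.Arrays;

// Rough sketch of a type wrapping a sorted long[] of compressed-block
// offsets; replaces TreeSet<Long> since the index is already sorted.
class LzoIndex {
  private final long[] blockPositions;   // sorted ascending

  LzoIndex(long[] blockPositions) {
    this.blockPositions = blockPositions;
  }

  // Smallest recorded offset >= pos, or -1 if none; the binarySearch
  // insertion point stands in for TreeSet::ceiling.
  long findNextPosition(long pos) {
    int i = Arrays.binarySearch(blockPositions, pos);
    if (i < 0) {
      i = -i - 1;   // convert to insertion point for a missed key
    }
    return i < blockPositions.length ? blockPositions[i] : -1;
  }
}
{code}
The number of entries follows from the index file's length, so no count field is needed.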
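And for the last point, the per-block loop in createIndex might check its termination
conditions along these lines (a sketch only; {{numChecksums}} and the surrounding stream
handling are assumed, and recording the actual index entries is omitted):
{code}
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Sketch of the termination/corruption checks described above; the
// stream layout follows the lzop block framing assumed in this patch.
class BlockScan {
  static void indexBlocks(DataInputStream in, int numChecksums)
      throws IOException {
    while (true) {
      int uncompressedBlockSize = in.readInt();
      if (uncompressedBlockSize == 0) {
        break;   // normal end-of-stream marker
      }
      if (uncompressedBlockSize < 0) {
        throw new EOFException(
            "Corrupted uncompressed block size: " + uncompressedBlockSize);
      }
      int compressedBlockSize = in.readInt();
      if (compressedBlockSize <= 0) {
        throw new IOException(
            "Corrupted compressed block size: " + compressedBlockSize);
      }
      // pos would be recorded here, after both ints, so skipping a block
      // is simply pos + compressedBlockSize + 4 * numChecksums
      in.skipBytes(compressedBlockSize + 4 * numChecksums);
    }
  }
}
{code}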

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame
since the lzo algorithm would be very suitable for large log files and similar common hadoop
data sets. The compression rate is not the best out there but the decompression speed is amazing.
 Since lzo writes compressed data in blocks it would be possible to make an input format that
can split the files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

