hadoop-common-dev mailing list archives

From "Johan Oskarsson (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-4640) Add ability to split text files compressed with lzo
Date Fri, 14 Nov 2008 14:24:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-4640:
------------------------------------

    Attachment: HADOOP-4640.patch

Updated patch with most of the suggestions incorporated.
* If the index is missing, it will continue with the whole file as one split
* Verifying the checksums in the close method is only skipped if we haven't decompressed the whole block. That block will be verified by another split later anyway
* Removed lzop from the codecs list in the config
* The indexer method is now aware of the number of checksum algorithms used, so it seeks to the next block properly
* Changed the unit test to write an lzop compressed file, index it and read it back again
* As suggested, the RecordReaders don't have to read the index; it's done when getting the splits instead (see the sketch after this list)
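
Roughly what that index-driven split calculation might look like, as a sketch under assumptions rather than the attached patch: the index is assumed to be a side file containing one big-endian long offset per compressed block, and every class and method name below is made up for illustration.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only: turn a side-car index of compressed-block start offsets into
 * split ranges that begin on lzop block boundaries. The index file format
 * (plain big-endian longs) and all names here are assumptions.
 */
public class LzoSplitSketch {

  /** A [start, start+length) byte range within the compressed file. */
  static final class Range {
    final long start;
    final long length;
    Range(long start, long length) { this.start = start; this.length = length; }
    public String toString() { return "split(start=" + start + ", length=" + length + ")"; }
  }

  /** Read the block start offsets from the hypothetical index file. */
  static List<Long> readIndex(String indexFile) throws IOException {
    List<Long> offsets = new ArrayList<Long>();
    DataInputStream in = new DataInputStream(new FileInputStream(indexFile));
    try {
      while (true) {
        offsets.add(in.readLong());      // one offset per compressed block
      }
    } catch (EOFException eof) {
      // end of index reached
    } finally {
      in.close();
    }
    return offsets;
  }

  /**
   * Grow each split until it reaches the target size, then cut it at the next
   * block boundary. If the index is empty or missing, fall back to treating
   * the whole file as one split.
   */
  static List<Range> splitsFromIndex(List<Long> blockOffsets, long fileLength, long targetSplitSize) {
    List<Range> splits = new ArrayList<Range>();
    if (blockOffsets.isEmpty()) {
      splits.add(new Range(0, fileLength));          // whole file as one split
      return splits;
    }
    long splitStart = 0;
    for (long offset : blockOffsets) {
      if (offset - splitStart >= targetSplitSize) {
        splits.add(new Range(splitStart, offset - splitStart));
        splitStart = offset;                          // next split begins on a block boundary
      }
    }
    splits.add(new Range(splitStart, fileLength - splitStart)); // tail split
    return splits;
  }
}

The point of aligning every split to a block boundary is that each mapper can start decompressing at its own offset without having to read any of the preceding data.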

I haven't done any work on an output format; I'd rather leave that for another ticket, since it will require more extensive modifications of the compression classes. The option I'm leaning towards is to register a class that implements an Indexer interface in the stream classes (LzopOutputStream and BlockCompressorStream).
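
For illustration, a minimal sketch of the kind of callback interface I have in mind; the name Indexer comes from the paragraph above, but the method names and signatures are assumptions only.

import java.io.IOException;

/**
 * Sketch only: a callback the compressed output streams could invoke each
 * time they finish writing a block, so an index of block start offsets can
 * be built while the file is written. Method names are hypothetical.
 */
public interface Indexer {

  /** Called after a compressed block has been flushed to the underlying stream. */
  void blockWritten(long compressedBlockStartOffset) throws IOException;

  /** Called when the output stream is closed, so the index can be finalized. */
  void close() throws IOException;
}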

As before, this will give one findbugs error.

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame
> since the lzo algorithm would be very suitable for large log files and similar common Hadoop
> data sets. The compression rate is not the best out there, but the decompression speed is
> amazing. Since lzo writes compressed data in blocks, it would be possible to make an input
> format that can split the files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

