hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4640) Add ability to split text files compressed with lzo
Date Wed, 19 Nov 2008 02:27:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648887#action_12648887
] 

Chris Douglas commented on HADOOP-4640:
---------------------------------------

bq. As for the close() I did as suggested, although it rubs me the wrong way to read all those
bytes without needing to. I guess the practical performance impact will be minimal though.

It's only calculating a checksum of the remaining bytes from a direct buffer. For the default
64k block, I'd guess it adds somewhere between 20 and 50ms in the close. If it had to make
another trip to the native code, I agree that would be improper, but this should be a trivial
cost.
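
To make the cost concrete, here is a minimal sketch of draining whatever is left in a direct
buffer into a checksum. It is not the code from the patch: the class name
RemainingChecksumSketch, the method checksumRemaining and the 4k chunk size are hypothetical,
and the choice of checksum (e.g. java.util.zip.Adler32) is left to the caller.

```java
import java.nio.ByteBuffer;
import java.util.zip.Checksum;

// Hypothetical helper, for illustration only; not taken from the HADOOP-4640 patch.
final class RemainingChecksumSketch {

    // Feeds every remaining byte of buf into sum and returns the checksum value.
    // The bytes are copied out in small chunks so this works for direct buffers,
    // which have no accessible backing array. No native call is made.
    static long checksumRemaining(ByteBuffer buf, Checksum sum) {
        byte[] chunk = new byte[4096]; // chunk size is an arbitrary assumption
        while (buf.hasRemaining()) {
            int n = Math.min(chunk.length, buf.remaining());
            buf.get(chunk, 0, n);      // advances the buffer position by n
            sum.update(chunk, 0, n);
        }
        return sum.getValue();
    }
}
```

The work is a single pass over at most one block's worth of bytes already held in memory,
which is why it should stay cheap relative to another round trip into native code.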

I'm not sure I follow LzoIndex::findIndexPosition. Given {{\{0, 5, 10, 15\}}} as block positions,
findIndexPosition(1) will return 10, but findIndexPosition(5) returns 5. Should the former
case also return 5? findIndexPosition(11) returns -1, which also seems contrary to its javadoc
explanation.
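
For reference, here is a sketch of the behavior I would have expected, assuming the intent is
"return the first block start at or after the given position, or -1 if there is none". That
reading of the javadoc is an assumption on my part, and the class name BlockIndexSketch and the
static signature below are hypothetical, not what the patch uses.

```java
import java.util.Arrays;

// Hypothetical sketch; assumes blockPositions is sorted in ascending order.
final class BlockIndexSketch {
    static final long NOT_FOUND = -1;

    // Returns the first block start >= pos, or NOT_FOUND if pos lies past the last block start.
    static long findIndexPosition(long[] blockPositions, long pos) {
        int i = Arrays.binarySearch(blockPositions, pos);
        if (i >= 0) {
            return blockPositions[i];      // pos is exactly a block start
        }
        int insertion = -i - 1;            // index of the first element greater than pos
        if (insertion == blockPositions.length) {
            return NOT_FOUND;              // no block starts at or after pos
        }
        return blockPositions[insertion];
    }
}
```

Under that assumption, with block positions 0, 5, 10, 15, a position of 1 maps to 5, 5 maps to
5, 11 maps to 15, and only positions beyond 15 give -1.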

> Add ability to split text files compressed with lzo
> ---------------------------------------------------
>
>                 Key: HADOOP-4640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4640
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Trivial
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-4640.patch, HADOOP-4640.patch, HADOOP-4640.patch
>
>
> Right now any file compressed with lzop will be processed by one mapper. This is a shame
> since the lzo algorithm would be very suitable for large log files and similar common hadoop
> data sets. The compression rate is not the best out there but the decompression speed is amazing.
> Since lzo writes compressed data in blocks it would be possible to make an input format that
> can split the files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.