hadoop-common-issues mailing list archives

From "Nicholas Carlini (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6837) Support for LZMA compression
Date Tue, 03 Aug 2010 01:18:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894773#action_12894773 ]

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

Responding to the major comments -- I will soon upload a patch that fixes these and the smaller comments.

FakeInputStream LinkedList:
This LinkedList can grow fairly long, depending on how write() is called. In the worst case it can hold upwards of 12 million elements, which is far beyond acceptable. That happens when write(single_byte) is called over and over, since each call adds a new link. Looking back at this, a linked list probably wasn't the best way to go.

There are two obvious ways write() could have worked. One is using linked lists, as I did. The other is to allocate a byte array that can hold forceWriteLen bytes and just copy into it; however, that array can be as large as 12MB. There are then two ways to manage that allocation. The first is to allocate the full 12MB up front. The other is to start with maybe just 64k and grow the array by powers of two until it reaches 12MB; this ends up arraycopying a little under 12MB in total more than the up-front allocation. I will implement one of these for the patch; the growing variant is sketched below.
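
A minimal sketch of the grow-by-doubling option, assuming a plain byte array replaces the LinkedList (the class name and the 64k starting size are illustrative; forceWriteLen is the bound from the discussion above):

    /**
     * Hypothetical replacement for the LinkedList in FakeInputStream:
     * a buffer that starts at 64k and doubles until it can hold the
     * bytes written so far (bounded by forceWriteLen, up to ~12MB).
     */
    class GrowableBuffer {
      private byte[] buf = new byte[64 * 1024]; // start small
      private int count = 0;                    // bytes currently stored

      void write(int b) {
        ensureCapacity(count + 1);
        buf[count++] = (byte) b;
      }

      void write(byte[] b, int off, int len) {
        ensureCapacity(count + len);
        System.arraycopy(b, off, buf, count, len);
        count += len;
      }

      private void ensureCapacity(int needed) {
        if (needed <= buf.length) return;
        int newLen = buf.length;
        while (newLen < needed) {
          newLen <<= 1;                         // grow by powers of two
        }
        byte[] bigger = new byte[newLen];
        System.arraycopy(buf, 0, bigger, 0, count); // the ~12MB total copy cost
        buf = bigger;
      }
    }

Since repeated write(single_byte) calls now append into one array, the worst case costs amortized O(1) per byte instead of one link object per byte.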


FakeOutputStream LinkedList:
This linked list has a more reasonable use. Its purpose is to hold extra bytes in case the input stream supplies too many. I am fairly confident that at most 272 bytes (the maximum number of fast bytes, minus one) can be written to it, but I couldn't formally prove that bound after going through the code, which is why I used a linked list. I wanted to be safe: even if the SDK code doesn't behave as it should, everything will still work on the OutputStream end.
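
A sketch of that safety net, assuming the stream fills a caller-supplied buffer and parks any surplus in a LinkedList (the class and field names here are hypothetical, not the patch's actual code):

    import java.io.OutputStream;
    import java.util.LinkedList;

    /**
     * Illustration of the FakeOutputStream idea: bytes beyond the
     * caller's buffer are kept in a LinkedList so nothing is lost
     * even if the encoder emits more than expected (<= 272 bytes).
     */
    class OverflowOutputStream extends OutputStream {
      private final byte[] dest;           // caller's buffer
      private int pos = 0;                 // next free slot in dest
      private final LinkedList<Byte> overflow = new LinkedList<Byte>();

      OverflowOutputStream(byte[] dest) {
        this.dest = dest;
      }

      @Override
      public void write(int b) {
        if (pos < dest.length) {
          dest[pos++] = (byte) b;          // normal path
        } else {
          overflow.add((byte) b);          // safety net for surplus bytes
        }
      }
    }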


Code(..., len):
I think I remember determining that Code(...) will return at least (but possibly more than) len bytes, with the one exception that when the end of the stream is reached it will only read up to the end of the stream. I will modify the decompressor to no longer assume this and to use the actual number of bytes read instead.
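
In outline, the change looks like the following, where the decoder interface is a stand-in (the SDK's real Code(...) signature differs):

    import java.io.IOException;

    /** Stand-in for the SDK decoder; the real Code(...) signature differs. */
    interface DecoderStub {
      /**
       * Asks for at least len bytes; may produce more (the surplus goes
       * to an overflow buffer, as above) or fewer at end of stream.
       * Returns the number of bytes actually produced.
       */
      int code(int len) throws IOException;
    }

    class DecodeStep {
      private final DecoderStub decoder;
      private long decompressed = 0;      // running total of real output

      DecodeStep(DecoderStub decoder) {
        this.decoder = decoder;
      }

      /** Advance by what was actually produced, not by what was requested. */
      int step(int len) throws IOException {
        int produced = decoder.code(len); // may exceed len, or fall short at EOF
        decompressed += produced;         // track the real count, never assume len
        return produced;
      }

      long total() {
        return decompressed;
      }
    }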


Fixed the inStream.read() bug (the fix will be in the patch I upload). Added a while loop that reads until EOF is reached, so the assumption of a full read actually holds; the standard pattern is sketched below.
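
For reference, this is the usual shape of that loop, since a single InputStream.read() call may legally return fewer bytes than requested (a generic sketch, not the exact patch code):

    import java.io.IOException;
    import java.io.InputStream;

    class ReadFully {
      /**
       * Read until buf[off..off+len) is full or EOF is reached; a lone
       * read() call may return fewer bytes than asked for.
       */
      static int readFully(InputStream in, byte[] buf, int off, int len)
          throws IOException {
        int total = 0;
        while (total < len) {
          int n = in.read(buf, off + total, len - total);
          if (n == -1) {
            break;               // EOF: return however many bytes we got
          }
          total += n;
        }
        return total;
      }
    }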


Tail-recursive methods -> while loops. Java should add tail-call optimization for methods that only call themselves recursively (it would require no changes to the bytecode); until then, the rewrite is done by hand, as in the generic example below.
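
As a generic illustration of that rewrite (not code from the patch), a self tail-call and its loop equivalent:

    class TailCall {
      // Before: tail-recursive; each call adds a stack frame.
      static int countTrailingZeros(byte[] b, int i, int acc) {
        if (i < 0 || b[i] != 0) {
          return acc;
        }
        return countTrailingZeros(b, i - 1, acc + 1); // self tail call
      }

      // After: the same logic as a while loop, reusing one frame.
      static int countTrailingZerosLoop(byte[] b, int i, int acc) {
        while (i >= 0 && b[i] == 0) {
          i--;
          acc++;
        }
        return acc;
      }
    }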


Fixed memory leaks.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

