Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-issues@hadoop.apache.org
Message-ID: <4329798.99991280547743556.JavaMail.jira@thor>
Date: Fri, 30 Jul 2010 23:42:23 -0400 (EDT)
From: "Nicholas Carlini (JIRA)" <jira@apache.org>
To: common-issues@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo
 algorithm
In-Reply-To: <289817036.1256941439448.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6349:
-------------------------------------

    Attachment: hadoop-6349-3.patch

Another patch. Some commented-out debug code is still scattered about, since I may need it again at some point. This patch has not been tested as thoroughly as the previous one.

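Adds support for native compression/decompression. In the log-file test below, native compression is about 2.4x as fast as the pure-Java implementation (193 MBps vs. 81 MBps), and native decompression is about 1.8x as fast (223 MBps vs. 126 MBps).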

Somewhat large redesign of the compressor. Compression is now about fifty times faster when compressing around 64MB. The compressor used to keep in memory all the input it had previously processed, and arraycopy it to a new array every time it needed more space. Compressing 64MB of data with a write call every 64k therefore ended up copying ~32GB through memory (this was my test case). Compress 128MB of data with a write every 1k instead, and you copy 8.8TB through memory.
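
The copy totals above can be checked with a little arithmetic. The following is a minimal sketch, not code from the patch: the class and method names are hypothetical, and it assumes that the i-th write copies roughly i * writeSize bytes (everything buffered so far).

```java
// Hypothetical helper, not part of the patch: estimates how many bytes get
// copied when every write re-copies all previously buffered input.
final class CopyCost {
    /**
     * Assumes the i-th write (1-based) moves about i * writeSize bytes,
     * so the total is writeSize * (1 + 2 + ... + writes).
     */
    static long bytesCopied(long totalBytes, long writeSize) {
        long writes = totalBytes / writeSize;
        return writeSize * writes * (writes + 1) / 2;
    }

    public static void main(String[] args) {
        long kb = 1024;
        long mb = 1024 * kb;
        // 64MB in 64k writes: a little over 32 * 2^30 bytes.
        System.out.println(bytesCopied(64 * mb, 64 * kb));
        // 128MB in 1k writes: about 8.8 * 10^12 bytes.
        System.out.println(bytesCopied(128 * mb, kb));
    }
}
```

The total grows with the square of the number of writes, which is why halving the write size roughly doubles the copying for the same input.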

Also modified the compressor to write an end-of-stream marker. This way the decompressor can set itself to "finished" so the stream can return -1. The marker is a last chunk of length 0 in which the four unused bytes after the input size are set to high. A decompressor that does not support the end-of-stream marker never reads those bytes; it just decompresses an empty block and does not notice that anything is different.
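
To make the marker idea concrete, here is a minimal sketch under stated assumptions. The header layout (a 4-byte big-endian input size followed by 4 reserved bytes), the reading of "set to high" as all bits set, and every name below are assumptions for illustration, not the patch's actual format or code.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of an end-of-stream marker carried in reserved header bytes.
final class EndOfStreamMarker {
    static final int HEADER_LEN = 8;        // assumed: 4-byte size + 4 reserved bytes
    static final int END_MARK = 0xFFFFFFFF; // assumed meaning of "set to high"

    /** Header for an ordinary chunk: reserved bytes stay zero. */
    static byte[] chunkHeader(int inputSize) {
        return ByteBuffer.allocate(HEADER_LEN).putInt(inputSize).putInt(0).array();
    }

    /** Header for the final chunk: length 0, reserved bytes all set. */
    static byte[] endOfStreamHeader() {
        return ByteBuffer.allocate(HEADER_LEN).putInt(0).putInt(END_MARK).array();
    }

    /** Marker-aware reader: finished only on an empty chunk with the mark set. */
    static boolean isEndOfStream(byte[] header) {
        ByteBuffer b = ByteBuffer.wrap(header);
        return b.getInt() == 0 && b.getInt() == END_MARK;
    }

    /** Older reader: looks only at the size, so the marker is just an empty block. */
    static int legacyInputSize(byte[] header) {
        return ByteBuffer.wrap(header).getInt();
    }
}
```

An older reader sees a size of 0 and skips the reserved bytes, which is what keeps the change backward compatible.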

Adds another method to TestCodecPerformance which has it load a (relatively small) input file into memory and generate 64MB of data to compress from it. (It does this by taking random substrings of 16 to 128 bytes at random offsets until there are 64MB.) It then compresses the 64MB directly from memory to memory and times that. These times seem to be more representative than timing the compression of "key %d value %d" or of random data. Right now this mode is enabled by calling it with the -input flag.
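
The generation step described above could look roughly like the following. This is a sketch only, not the code in TestCodecPerformance; the class and method names are hypothetical, and it assumes the source file is at least 128 bytes long.

```java
import java.util.Random;

// Hypothetical sketch: build a large buffer from random substrings of a small one.
final class SampleDataGenerator {
    static final int MIN_LEN = 16;
    static final int MAX_LEN = 128;

    static byte[] generate(byte[] source, int targetSize, Random rng) {
        if (source.length < MAX_LEN) {
            throw new IllegalArgumentException("source must hold at least " + MAX_LEN + " bytes");
        }
        byte[] out = new byte[targetSize];
        int filled = 0;
        while (filled < targetSize) {
            // Substring length in [16, 128], taken at a random offset that fits.
            int len = MIN_LEN + rng.nextInt(MAX_LEN - MIN_LEN + 1);
            int offset = rng.nextInt(source.length - len + 1);
            // The last substring is truncated so the buffer ends exactly at targetSize.
            int n = Math.min(len, targetSize - filled);
            System.arraycopy(source, offset, out, filled, n);
            filled += n;
        }
        return out;
    }
}
```

Calling generate(fileBytes, 64 * 1024 * 1024, new Random()) would give the 64MB buffer; passing a seeded Random makes a run repeatable.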

Ported the Adler32 code to C; it is used when the native libraries are in use.
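
For reference, the Adler-32 checksum itself is short. The sketch below is a plain, unoptimized Java version written for illustration (it is neither the patch's Java code nor its C port), checked against the JDK's java.util.zip.Adler32.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

// Illustrative Adler-32: two running sums modulo 65521, packed as (b << 16) | a.
final class SimpleAdler32 {
    private static final int MOD = 65521; // largest prime below 2^16

    static long checksum(byte[] data) {
        int a = 1;
        int b = 0;
        for (byte x : data) {
            a = (a + (x & 0xFF)) % MOD; // sum of bytes, plus one
            b = (b + a) % MOD;          // sum of the running values of a
        }
        return ((long) b << 16) | a;
    }

    public static void main(String[] args) {
        byte[] data = "Wikipedia".getBytes(StandardCharsets.US_ASCII);
        Adler32 reference = new Adler32();
        reference.update(data, 0, data.length);
        // Both lines should print the same value.
        System.out.println(Long.toHexString(checksum(data)));
        System.out.println(Long.toHexString(reference.getValue()));
    }
}
```

Taking the modulus on every byte is the simple form; faster implementations usually defer it across a batch of bytes.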

Added a constant in the compressor that allows incompressible data to be copied over byte for byte instead. This decreases the speed of the compressor by ~10%, as it results in another memcpy, but it can more than double the speed of decompression.
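
The fallback idea can be sketched as follows. This is not the patch's code: it uses java.util.zip.Deflater as a stand-in for the FastLZ compressor, and the one-byte block flag, the MAX_RATIO constant, and all names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: store a block raw when compression does not shrink it.
final class StoredBlockFallback {
    static final byte STORED = 0;         // assumed flag: body is a raw copy
    static final byte COMPRESSED = 1;     // assumed flag: body is compressed
    static final double MAX_RATIO = 1.0;  // assumed constant: store if packed >= raw * MAX_RATIO

    static byte[] encode(byte[] raw) {
        byte[] packed = deflate(raw);
        boolean store = packed.length >= raw.length * MAX_RATIO;
        byte[] body = store ? raw : packed;
        byte[] out = new byte[body.length + 1];
        out[0] = store ? STORED : COMPRESSED;
        System.arraycopy(body, 0, out, 1, body.length); // the extra copy on the compress side
        return out;
    }

    static byte[] decode(byte[] block) {
        byte[] body = Arrays.copyOfRange(block, 1, block.length);
        return block[0] == STORED ? body : inflate(body); // stored blocks skip decompression
    }

    private static byte[] deflate(byte[] raw) {
        Deflater d = new Deflater();
        d.setInput(raw);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    private static byte[] inflate(byte[] packed) {
        Inflater inf = new Inflater();
        try {
            inf.setInput(packed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inf.finished()) {
                int n = inf.inflate(buf);
                if (n == 0 && inf.needsInput()) {
                    throw new IllegalArgumentException("truncated block");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt block", e);
        } finally {
            inf.end();
        }
    }
}
```

The trade-off is the one described above: the compressor does the work of compressing and then copies the raw bytes anyway, while the decompressor only has to copy a stored block.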


Here's what the new part of TestCodecPerformance reports when given a log file. For comparison: DefaultCodec gets the size down to 11% and BZip2Codec down to 8%.

Previous patch:
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total decompressed size: 640 MB.
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total compressed size: 177 MB (27% of original).
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total compression time: 381868 ms (1716 KBps).
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total decompression time: 5051 ms (126 MBps).

Current patch:
Native C:
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total decompressed size: 640 MB.
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total compressed size: 177 MB (27% of original).
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total compression time: 3314 ms (193 MBps).
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total decompression time: 2861 ms (223 MBps).

Current patch:
Pure Java:
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total decompressed size: 640 MB.
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total compressed size: 177 MB (27% of original).
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total compression time: 7891 ms (81 MBps).
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total decompression time: 5077 ms (126 MBps).

> Implement FastLZCodec for fastlz/lzo algorithm
> ----------------------------------------------
>
>                 Key: HADOOP-6349
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6349
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: William Kinney
>         Attachments: hadoop-6349-1.patch, hadoop-6349-2.patch, hadoop-6349-3.patch, HADOOP-6349-TestFastLZCodec.patch, HADOOP-6349.patch, TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv
>
>
> Per  [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ is a good (speed, license) alternative to LZO. 
