hadoop-common-issues mailing list archives

From "Nicholas Carlini (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm
Date Sat, 31 Jul 2010 03:42:23 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6349:
-------------------------------------

    Attachment: hadoop-6349-3.patch

Another patch. There is still some debug code scattered about (commented out), as I might need it again at some point. This code isn't tested as thoroughly as the last patch.

Adds support for native compression/decompression. Native compression is about 230% faster than the Java implementation; native decompression is about 70% faster.

Somewhat-large redesign of the compressor. Compression is now fifty times faster when compressing around 64MB. The compressor used to keep all previously processed input in memory and arraycopy it to a new array every time it needed more space, so compressing 64MB of data with a write every 64k ended up copying ~32GB through memory (this was my test case). Had you instead compressed 128MB of data with a write every 1k, you would have copied ~8.8TB through memory.
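The arithmetic behind those figures can be checked with a short sketch (illustrative only, not code from the patch): if every write of a fixed-size chunk re-copies everything buffered so far, the total bytes moved grow quadratically in the number of writes.

```java
// Illustrative sketch: total bytes moved when a compressor re-copies its
// whole buffer on every write (the pre-redesign behavior described above).
public class CopyCost {
    // Bytes copied if each write of `chunk` bytes triggers an arraycopy of
    // everything buffered so far, until `total` bytes have been written.
    static long bytesCopied(long total, long chunk) {
        long copied = 0;
        for (long buffered = chunk; buffered <= total; buffered += chunk) {
            copied += buffered; // each write copies the whole buffer again
        }
        return copied;
    }

    public static void main(String[] args) {
        // 64 MB written in 64 KB chunks: about 32 GiB moved through memory.
        System.out.printf("64MB/64k writes: %.1f GiB copied%n",
                bytesCopied(64L << 20, 64 * 1024) / (double) (1L << 30));
        // 128 MB written in 1 KB chunks: about 8.8 TB moved through memory.
        System.out.printf("128MB/1k writes: %.1f TB copied%n",
                bytesCopied(128L << 20, 1024) / 1e12);
    }
}
```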

Also modified the compressor to include an end-of-stream marker, which lets the decompressor set itself to "finished" so the stream can return -1. The end-of-stream marker is a final chunk of length 0 whose four unused bytes after the input size are set high. That way, any decompressor which does not support the end-of-stream marker never reads those bytes; it just decompresses an empty block and notices nothing wrong.
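A hypothetical sketch of that framing (the actual header layout in the patch may differ; assumed here: four length bytes followed by the four unused bytes):

```java
// Hypothetical sketch of the end-of-stream chunk described above; the real
// chunk header layout in the patch may differ.
public class EosMarker {
    // Zero-length chunk with the four unused bytes set high (0xFF).
    static byte[] eosChunk() {
        return new byte[] {0, 0, 0, 0,
                           (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
    }

    // A marker-aware decompressor treats this as end-of-stream; an older one
    // sees a zero-length chunk, never reads the flag bytes, and decodes an
    // empty block without noticing anything unusual.
    static boolean isEos(byte[] header) {
        int length = ((header[0] & 0xFF) << 24) | ((header[1] & 0xFF) << 16)
                   | ((header[2] & 0xFF) << 8) | (header[3] & 0xFF);
        // AND of the flag bytes is 0xFF only if every flag byte is 0xFF
        boolean flagsHigh =
                (header[4] & header[5] & header[6] & header[7] & 0xFF) == 0xFF;
        return length == 0 && flagsHigh;
    }
}
```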

Adds another method to TestCodecPerformance which has it load a (relatively small) input file into memory and generate 64MB of data to compress from it. (It does this by taking random substrings of 16 to 128 bytes at random offsets until it has 64MB.) It then compresses the 64MB directly from memory to memory and times that. These times seem more representative than timing the compression of "key %d value %d" data or of random data. Right now this mode is enabled by calling it with the -input flag.
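The generation step might look roughly like this (a sketch; the names are illustrative, not from the patch, and the seed is assumed to be larger than 128 bytes):

```java
import java.util.Random;

// Sketch of the test-data generation described above: sample random
// 16-128 byte substrings of a small seed buffer at random offsets until
// the target size is reached. Names here are illustrative.
public class TestDataGen {
    static byte[] generate(byte[] seed, int targetSize, long rngSeed) {
        Random rng = new Random(rngSeed);
        byte[] out = new byte[targetSize];
        int filled = 0;
        while (filled < targetSize) {
            int len = 16 + rng.nextInt(113);          // 16..128 bytes
            int off = rng.nextInt(seed.length - len); // random offset in seed
            int n = Math.min(len, targetSize - filled);
            System.arraycopy(seed, off, out, filled, n);
            filled += n;
        }
        return out;
    }
}
```

Data built this way shares the seed file's byte statistics but with repeated substrings, so it compresses more like real input than either a printf pattern or pure random bytes.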

Ported the Adler32 code to C; it is used when the native libraries are loaded.
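For reference, the JDK already ships a Java Adler32 (java.util.zip.Adler32), which illustrates the checksum being ported to C for the native path:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

// Demonstrates the Adler32 checksum using the JDK's built-in implementation.
public class AdlerDemo {
    public static void main(String[] args) {
        Adler32 sum = new Adler32();
        sum.update("Wikipedia".getBytes(StandardCharsets.US_ASCII));
        System.out.println(Long.toHexString(sum.getValue())); // prints 11e60398
    }
}
```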

Added a constant in the compressor that lets uncompressible data be copied over byte for byte instead. This slows the compressor by ~10% because it costs another memcpy, but it can more than double decompression speed.
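The decision logic amounts to a threshold check like the following (names and the ratio constant are hypothetical, not from the patch):

```java
// Illustrative sketch: fall back to storing a block byte-for-byte when
// compression barely shrinks it, so decompression degenerates to a memcpy.
public class StoredFallback {
    // Hypothetical threshold: keep compressed output only if it is smaller
    // than maxRatio times the input; otherwise emit the raw block.
    static byte[] pickOutput(byte[] raw, byte[] compressed, double maxRatio) {
        return (compressed.length < raw.length * maxRatio) ? compressed : raw;
    }
}
```

The trade-off described above follows directly: choosing the stored form costs one extra copy at compression time, but the decompressor then has no literals/matches to decode at all.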



Here's what the new part of TestCodecPerformance gives when fed a log file. For comparison: DefaultCodec gets the size down to 11% of the original and BZip2Codec down to 8%.

Previous patch:
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total decompressed size: 640 MB.
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total compressed size: 177 MB (27% of original).
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total compression time: 381868 ms (1716 KBps).
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total decompression time: 5051 ms (126 MBps).

Current patch:
Native C:
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total decompressed size: 640 MB.
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total compressed size: 177 MB (27% of original).
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total compression time: 3314 ms (193 MBps).
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total decompression time: 2861 ms (223 MBps).

Current patch:
Pure Java:
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total decompressed size: 640 MB.
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total compressed size: 177 MB (27% of original).
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total compression time: 7891 ms (81 MBps).
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total decompression time: 5077 ms (126 MBps).

> Implement FastLZCodec for fastlz/lzo algorithm
> ----------------------------------------------
>
>                 Key: HADOOP-6349
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6349
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: William Kinney
>         Attachments: hadoop-6349-1.patch, hadoop-6349-2.patch, hadoop-6349-3.patch, HADOOP-6349-TestFastLZCodec.patch, HADOOP-6349.patch, TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv
>
>
> Per [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ is a good (speed, license) alternative to LZO.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

