hadoop-common-issues mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6837) Support for LZMA compression
Date Tue, 29 Jun 2010 18:08:53 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883630#action_12883630
] 

Scott Carey commented on HADOOP-6837:
-------------------------------------

What happens if you compress a tarball of those files instead?

Here are my results, on a directory with 4.1GB of ~64MB files.  The content is mixed binary/text
(key/value data, binary keys, mixed binary/text values).

This is on CentOS 5.5, with the 'xz' and 'bzip2' packages installed via yum.

Compression / decompression speed.  Disk is capable of 200 MB/sec read/write; the machine has 16GB
RAM and a Nehalem-based processor (Xeon E5620, 2.4 GHz).
Tests were confirmed to be CPU bound with no iowait.  Measurements are in MB/sec of uncompressed
data.
Source tarball: 4130 MB (100%)
||type||compressed size||compressed size as percent||time to compress||compression rate||time to decompress||decompression rate||
|gzip -1|1430 MB|34.6%|105 s|39.3 MB/sec|42 s|98.3 MB/sec|
|gzip -6|1240 MB|30.0%|251 s|16.5 MB/sec|41.5 s|99.5 MB/sec|
|bzip2 -2|1003 MB|24.3%|656 s|6.3 MB/sec|168 s|24.6 MB/sec|
|bzip2 -6|942 MB|22.8%|725 s|5.7 MB/sec|176 s|23.5 MB/sec|
|bzip2 -9|926 MB|22.4%|763 s|5.4 MB/sec|181 s|22.8 MB/sec|
|xz -2|993 MB|24.0%|429 s|9.63 MB/sec|95 s|43.5 MB/sec|
|xz -6|794 MB|19.2%|2861 s|1.44 MB/sec|83 s|49.7 MB/sec|
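
The rate and percent columns follow directly from the sizes and times. A minimal sketch of that arithmetic, assuming awk is available; the helper names `rate` and `pct` are made up for illustration and are not part of the original test:

```shell
# Throughput in MB/sec: uncompressed size (MB) divided by wall-clock seconds.
rate() {
  awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.1f\n", mb / s }'
}

# Compressed size as a percentage of the original size.
pct() {
  awk -v c="$1" -v o="$2" 'BEGIN { printf "%.1f\n", 100 * c / o }'
}

rate 4130 105   # gzip -1 compression
rate 4130 42    # gzip -1 decompression
pct 1430 4130   # gzip -1 compressed size
```

With the gzip -1 numbers this prints 39.3, 98.3 and 34.6, matching that row of the table.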

Note that on today's newest processors, gzip decompresses at gigabit ethernet speeds.  xz
is half that, and bzip2 about half that again.  gzip and xz decompress faster at higher compression
ratios; bzip2 decompresses slower at higher ratios.  All compress slower the higher the ratio,
but bzip2 only slows down by ~20% or so from the fast to slow settings, while gzip and xz
slow down by a factor of 10+ (I did not do -9 tests here for those, they are very slow).

IMO, since xz -2 is almost 2x as fast at compression and decompression as bzip2, with a similar
compression ratio, it leaves little room for bzip2's use.
At higher compression levels, xz is very slow to compress, but achieves compression ratios
significantly better than anything else and still decompresses very fast, so it's great for
archival storage.
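
For reference, the xz -2 versus bzip2 -2 comparison can be worked out from the measured times in the table. This is only a sketch of the arithmetic; the helper name `speedup` is hypothetical:

```shell
# How many times faster the second timing is than the first (both in seconds).
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

speedup 656 429   # compression:   bzip2 -2 vs xz -2
speedup 168 95    # decompression: bzip2 -2 vs xz -2
```

On this data set that comes out to 1.53 for compression and 1.77 for decompression.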

For faster compression, gzip -1 or lzo and other compression types without an entropy coder
are the only options.

The link I provided above has several cases where xz is 3 or more times faster than bzip2
at decompression, but my data doesn't behave that way.


Raw Data:

$ time cat packed.tar | gzip -c1 > packed.gz1
real	1m44.938s
user	1m42.200s
sys	0m5.300s

$ time cat packed.tar | gzip -c6 > packed.gz6
real	4m11.051s
user	4m8.438s
sys	0m5.317s

$ time cat packed.tar | bzip2 -2 > packed.bz2-2
real	10m55.795s
user	10m52.989s
sys	0m5.030s

$ time cat packed.tar | bzip2 -6 > packed.bz2-6
real	12m4.847s
user	12m2.049s
sys	0m5.345s

$ time cat packed.tar | bzip2 -9 > packed.bz2-9
real	12m43.063s
user	12m40.353s
sys	0m4.797s


$ time cat packed.tar | xz -zv -2 - > packed.xz
  100.0 %             991.1 MiB / 4,125.0 MiB = 0.240   9.6 MiB/s         7:09
real	7m9.369s
user	7m6.985s
sys	0m7.140s

$ time cat packed.tar | xz -zv -6 - > packed.xz6
  100.0 %             792.6 MiB / 4,125.0 MiB = 0.192   1.4 MiB/s        47:41
real	47m41.033s
user	47m37.794s
sys	0m8.371s

------
Tests of decompression: 

$ time cat packed.gz1 | gunzip  > /dev/null
real	0m42.081s
user	0m41.814s
sys	0m1.361s

$ time cat packed.gz6 | gunzip  > /dev/null
real	0m41.512s
user	0m41.021s
sys	0m1.086s


$ time cat packed.bz2-2 | bunzip2  > /dev/null
real	2m48.528s
user	2m48.014s
sys	0m1.455s

$ time cat packed.bz2-6 | bunzip2  > /dev/null
real	2m56.511s
user	2m55.999s
sys	0m1.302s

$ time cat packed.bz2-9 | bunzip2  > /dev/null
real	3m1.064s
user	3m0.559s
sys	0m1.409s

$ time cat packed.xz | xz -dc  > /dev/null
real	1m35.239s
user	1m34.873s
sys	0m1.301s

$ time cat packed.xz6 | xz -dc  > /dev/null
real	1m23.219s
user	1m22.771s
sys	0m1.126s
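
The real/user/sys figures above can be turned into the seconds used in the table, and into a rough check of the CPU-bound claim. A sketch, assuming the `XmY.YYYs` format that bash's `time` prints; `to_seconds` and `cpu_ratio` are illustrative helper names, not anything used in the original runs:

```shell
# Convert a bash `time` value such as "1m44.938s" into seconds.
to_seconds() {
  awk -v t="$1" 'BEGIN {
    split(t, part, "m")          # part[1] = minutes, part[2] = "SS.mmms"
    sub(/s$/, "", part[2])       # drop the trailing "s"
    printf "%.3f\n", part[1] * 60 + part[2]
  }'
}

# Ratio of CPU time (user + sys) to wall-clock time (real).
cpu_ratio() {
  real=$(to_seconds "$1"); user=$(to_seconds "$2"); sys=$(to_seconds "$3")
  awk -v r="$real" -v u="$user" -v s="$sys" 'BEGIN { printf "%.2f\n", (u + s) / r }'
}

to_seconds 47m41.033s                     # xz -6 compression, real
cpu_ratio 1m44.938s 1m42.200s 0m5.300s    # gzip -1 compression
```

A ratio near 1 is consistent with the run being CPU bound rather than waiting on disk. It can slightly exceed 1 here because `time` covers the whole pipeline, and `cat` and the compressor run concurrently.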



> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves
> higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

