hadoop-common-issues mailing list archives

From "Ryan Blue (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13126) Add Brotli compression codec
Date Tue, 10 May 2016 18:51:12 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278673#comment-15278673 ]

Ryan Blue commented on HADOOP-13126:

The results in the issue description compare Brotli with Snappy: the Brotli file is less
than half the size of the Snappy file, and compression took about the same amount of time.
Comparing to LZ4 would be interesting, but LZ4 isn't supported by Parquet, so it's a bit
harder for me to drop into my test case.

> Add Brotli compression codec
> ----------------------------
>                 Key: HADOOP-13126
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13126
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>         Attachments: HADOOP-13126.1.patch
> I've been testing [Brotli|https://github.com/google/brotli/], a new compression library
> based on LZ77 from Google. Google's [brotli benchmarks|https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf]
> look really good and we're also seeing a significant improvement in compression size, compression
> speed, or both.
> {code:title=Brotli preliminary test results}
> [blue@work Downloads]$ time parquet from test.parquet -o test.snappy.parquet --compression-codec snappy --overwrite
> real    1m17.106s
> user    1m30.804s
> sys     0m4.404s
> [blue@work Downloads]$ time parquet from test.parquet -o test.br.parquet --compression-codec brotli --overwrite
> real    1m16.640s
> user    1m24.244s
> sys     0m6.412s
> [blue@work Downloads]$ time parquet from test.parquet -o test.gz.parquet --compression-codec gzip --overwrite
> real    3m39.496s
> user    3m48.736s
> sys     0m3.880s
> [blue@work Downloads]$ ls -l
> -rw-r--r-- 1 blue blue 1068821936 May 10 11:06 test.br.parquet
> -rw-r--r-- 1 blue blue 1421601880 May 10 11:10 test.gz.parquet
> -rw-r--r-- 1 blue blue 2265950833 May 10 10:30 test.snappy.parquet
> {code}
> Brotli, at quality 1, is as fast as Snappy and ends up smaller than gzip-9. Another test
> resulted in a slightly larger Brotli file than gzip produced, but Brotli was 4x faster. I'd
> like to get this compression codec into Hadoop.
> [Brotli is licensed with the MIT license|https://github.com/google/brotli/blob/master/LICENSE],
> and the [JNI library jbrotli is ALv2|https://github.com/MeteoGroup/jbrotli/blob/master/LICENSE].
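
The size and timing claims in the quoted test results can be checked with a little arithmetic. The following is a minimal sketch in Python, not part of the original message. It uses only the byte counts from the `ls -l` listing and the `real` times from the `time` output. The dictionary names and the `relative_to` helper are hypothetical, introduced only for illustration.

```python
# Figures copied from the quoted `ls -l` and `time` output above;
# nothing here is newly measured.
SIZES_BYTES = {
    "brotli": 1068821936,
    "gzip": 1421601880,
    "snappy": 2265950833,
}

# Wall-clock ("real") times from the quoted output, converted to seconds.
REAL_SECONDS = {
    "snappy": 1 * 60 + 17.106,
    "brotli": 1 * 60 + 16.640,
    "gzip": 3 * 60 + 39.496,
}


def relative_to(values, baseline):
    """Return each codec's value divided by the baseline codec's value."""
    base = values[baseline]
    return {codec: value / base for codec, value in values.items()}


# Express every codec's output size and wall-clock time relative to Snappy.
size_vs_snappy = relative_to(SIZES_BYTES, "snappy")
time_vs_snappy = relative_to(REAL_SECONDS, "snappy")

for codec in sorted(SIZES_BYTES):
    print(
        f"{codec:>6}: size x{size_vs_snappy[codec]:.3f}, "
        f"real time x{time_vs_snappy[codec]:.3f} (relative to snappy)"
    )
```

With these numbers, the Brotli file is a bit under half the size of the Snappy file at nearly the same wall-clock time, while gzip takes well over twice as long as Snappy. This reflects a single run on one input file, so the ratios are indicative, not a general benchmark.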

This message was sent by Atlassian JIRA

