hbase-issues mailing list archives

From "Jonathan Gray (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2987) Avoid compressing flush files
Date Sat, 11 Sep 2010 19:09:33 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908354#action_12908354 ]

Jonathan Gray commented on HBASE-2987:

I think the update wall can come up when the memstore reaches its max limit and you can't flush.
 You can't flush because you've reached the blockingStoreFiles count and have to wait for
a compaction to complete.  Or is there another situation, where you have snapshotted and are
waiting for the flush, and the memstore fills before the snapshot gets flushed?

Prioritizing compactions has been shown to help with the first case.  I don't think I've seen
the second case in practice.

In any case, I'd still be +1 on making these things configurable.
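
The two blocking scenarios described above can be sketched as a toy model. This is a simplified illustration, not HBase's actual code: the class and function names are hypothetical, and the thresholds are illustrative stand-ins for settings such as hbase.hregion.memstore.flush.size, hbase.hregion.memstore.block.multiplier and hbase.hstore.blockingStoreFiles.

```python
from dataclasses import dataclass


@dataclass
class StoreState:
    """Toy snapshot of one store; all fields are hypothetical."""
    memstore_bytes: int      # bytes currently held in the active memstore
    store_files: int         # number of store files on disk
    snapshot_pending: bool   # a snapshot is still waiting to be flushed


def flush_blocked(state, blocking_store_files):
    """Case 1: too many store files, so the flush waits for a compaction."""
    return state.store_files >= blocking_store_files


def updates_blocked(state, flush_size, block_multiplier, blocking_store_files):
    """Return True when writers would hit the update wall in this model."""
    memstore_full = state.memstore_bytes >= flush_size * block_multiplier
    if not memstore_full:
        return False
    # Case 1: the memstore is full and the flush is waiting on a compaction.
    # Case 2: the memstore refilled before the previous snapshot was flushed.
    return flush_blocked(state, blocking_store_files) or state.snapshot_pending
```

In this model, prioritizing compactions shortens the time flush_blocked stays true, which is the first case; the second case depends only on how long a single flush takes.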

> Avoid compressing flush files
> -----------------------------
>                 Key: HBASE-2987
>                 URL: https://issues.apache.org/jira/browse/HBASE-2987
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Minor
>         Attachments: HBASE-2987.patch
> I've extended Hadoop compression to use the LZMA algorithm and HFile to provide an option
for selecting it. With typical input, the LZMA algorithm produces 30% smaller output than
GZIP at max compression (which is currently the best available option for HFiles) and 15%
smaller output than BZIP2. I'm aware of the "disk is cheap" mantra but for a multi-peta-scale
archival application, where we still want random read and random update capabilities, 30%
less disk is a substantial cost savings. LZMA compression speed is ~1 MB/second on a 2 GHz
CPU, decompression speed is ~20 MB/second. This is 4x slower than BZIP2 to compress but at
least 2x faster to decompress for 15% better results. For an archival application these properties
would be acceptable if not for the very significant problem of flushing. Obviously the low
throughput of the LZMA compressor means it is unsuitable for foreground processing. In HBase
terms, it can be used for compaction but not for flush files. 
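
To make the flush problem concrete, the throughput figures quoted above can be turned into rough flush durations. This is a back-of-the-envelope sketch: the 64 MB flush size is an assumed example value, the BZIP2 rate is derived from the "4x slower" statement, and the model assumes the compressor is the only bottleneck (disk and network are ignored).

```python
# Throughput figures taken from the description above (MB per second).
LZMA_COMPRESS_MB_S = 1.0
# "4x slower than BZIP2 to compress" implies roughly 4 MB/s for BZIP2.
BZIP2_COMPRESS_MB_S = LZMA_COMPRESS_MB_S * 4

# Assumed example memstore flush size; not a figure from the description.
FLUSH_SIZE_MB = 64


def seconds_to_compress(size_mb, throughput_mb_per_s):
    """Time to push size_mb through a compressor-bound writer."""
    return size_mb / throughput_mb_per_s


def flush_times(size_mb=FLUSH_SIZE_MB):
    """Estimated flush duration in seconds for each codec."""
    return {
        "lzma": seconds_to_compress(size_mb, LZMA_COMPRESS_MB_S),
        "bzip2": seconds_to_compress(size_mb, BZIP2_COMPRESS_MB_S),
    }


if __name__ == "__main__":
    for codec, secs in flush_times().items():
        print(f"{codec}: {secs:.0f} s to flush {FLUSH_SIZE_MB} MB")
```

Under these assumptions a single 64 MB flush takes about a minute with LZMA, during which the memstore keeps filling.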
> Attached patch, against the 0.20 branch, turns off compression for flushes. This could be
implemented as a config option, but I wonder whether, with the possible exception of LZO, we
should be compressing flushes at all. Any significant reduction in flush throughput can stall
writers during periods of high write activity. Maybe globally disabling compression on flush
files is a good thing? 
> I have tested this and confirmed the result is the desired behavior: 'file' shows flush
files as uncompressed data, compacted files as compressed. Compaction merges files with different
compression properties. LZMA provides rather extreme space savings over the other available
options without slowing down writers if the regionservers are configured with enough write
buffering to ride over the significantly lengthened compaction times.
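
The behavior described above, uncompressed flush files and compressed compacted files, amounts to choosing the codec by the kind of write. The following is a minimal sketch of that policy only; it is not the attached patch, and WriteKind, codec_for and the compress_flushes switch are hypothetical names standing in for the config option under discussion.

```python
from enum import Enum


class WriteKind(Enum):
    """Why a store file is being written (hypothetical classification)."""
    FLUSH = "flush"
    COMPACTION = "compaction"


def codec_for(kind, family_codec, compress_flushes=False):
    """Pick the codec for a new store file.

    family_codec is the compression configured for the column family;
    compress_flushes is a hypothetical switch that restores the old
    behavior of compressing flush files as well.
    """
    if kind is WriteKind.FLUSH and not compress_flushes:
        return "none"
    return family_codec
```

For example, codec_for(WriteKind.FLUSH, "lzma") yields "none" while codec_for(WriteKind.COMPACTION, "lzma") yields "lzma", so the slow compressor only runs in the background.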

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
