hbase-issues mailing list archives

From "Duo Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13408) HBase In-Memory Memstore Compaction
Date Wed, 29 Jul 2015 14:21:04 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646101#comment-14646101 ]

Duo Zhang commented on HBASE-13408:

The things we are talking about here are all 'in memory', so I do not think we need to modify the WAL...

I think all the logic could be done in a special memstore implementation. For example, you
can set the flush-size to 128M and introduce a compact-size of 32M that considers only the
active set size. When the active set reaches 32M, you put it into the pipeline and try to
compact the segments in the pipeline to reduce memory usage. The upper layer does not care
how many segments you have; it only cares about the total memstore size. If that reaches
128M, a flush request comes in and you flush all data to disk. If there are many redundant
cells, the total memstore size will never reach 128M, and I think this is exactly what we
want here. This way you do not change the semantics of flush, and log truncation should
still work as well.
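
The scheme above can be sketched roughly as follows. This is only an illustration under
simplifying assumptions, not the HBase MemStore API: the class name CompactingMemstoreSketch
and all of its methods are hypothetical, a cell is just a (row, value) pair of strings, sizes
are counted in characters, and "compaction" means keeping the newest value per row. The flush
decision stays with the caller, which only looks at size().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of an active set plus a pipeline of read-only segments. */
class CompactingMemstoreSketch {
    private final long compactSize;                      // threshold for the active set only
    private List<String[]> active = new ArrayList<>();   // {row, value} cells in arrival order
    private long activeBytes = 0;
    private final List<Map<String, String>> pipeline = new ArrayList<>(); // oldest first
    private long pipelineBytes = 0;

    CompactingMemstoreSketch(long compactSize) {
        this.compactSize = compactSize;
    }

    void put(String row, String value) {
        active.add(new String[] {row, value});
        activeBytes += row.length() + value.length();
        if (activeBytes >= compactSize) {
            pushActiveToPipeline();
            compactPipeline();
        }
    }

    /** Total size; this is the only number the upper layer compares with the flush size. */
    long size() {
        return activeBytes + pipelineBytes;
    }

    int segmentCount() {
        return pipeline.size();
    }

    /** Flushes everything (active set and pipeline) and returns it as one sorted snapshot. */
    Map<String, String> flush() {
        pushActiveToPipeline();
        compactPipeline();
        Map<String, String> snapshot =
            pipeline.isEmpty() ? new TreeMap<String, String>() : pipeline.get(0);
        pipeline.clear();
        pipelineBytes = 0;
        return snapshot;
    }

    private void pushActiveToPipeline() {
        if (active.isEmpty()) {
            return;
        }
        Map<String, String> segment = new TreeMap<>();
        for (String[] cell : active) {
            segment.put(cell[0], cell[1]);   // a later put to the same row replaces the earlier one
        }
        pipeline.add(segment);
        pipelineBytes += bytesOf(segment);
        active = new ArrayList<>();
        activeBytes = 0;
    }

    private void compactPipeline() {
        if (pipeline.size() < 2) {
            return;
        }
        Map<String, String> merged = new TreeMap<>();
        for (Map<String, String> segment : pipeline) {
            merged.putAll(segment);          // oldest first, so newer segments win
        }
        pipeline.clear();
        pipeline.add(merged);
        pipelineBytes = bytesOf(merged);
    }

    private static long bytesOf(Map<String, String> segment) {
        long n = 0;
        for (Map.Entry<String, String> e : segment.entrySet()) {
            n += e.getKey().length() + e.getValue().length();
        }
        return n;
    }

    public static void main(String[] args) {
        // Tiny thresholds stand in for 32M (compact-size) and 128M (flush-size).
        CompactingMemstoreSketch memstore = new CompactingMemstoreSketch(32);
        for (int i = 0; i < 100; i++) {
            memstore.put("r1", String.format("v%03d", i));  // hot row, many redundant cells
        }
        System.out.println("size=" + memstore.size() + " segments=" + memstore.segmentCount());
        System.out.println("flushed=" + memstore.flush());
    }
}
```

With a hot row like the one in main, the total size stays far below the flush threshold,
so the caller never has to request a flush, and flush() itself keeps its usual meaning of
writing out everything that is in memory.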

And I think you can use a more compact data structure than a skip list, since the segments
in the pipeline are read-only? This may bring some benefits even if we do not have many redundant
cells.
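
As a rough illustration of the read-only idea, a frozen segment could be copied from the
skip list into two parallel sorted arrays and searched with binary search. The class
FlatSegmentSketch and its methods are hypothetical, string rows and values stand in for
cells, and java.util.concurrent.ConcurrentSkipListMap stands in for the skip list; the
sketch assumes nobody writes to the source map while it is being copied.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical sketch of an immutable, array-backed segment. */
final class FlatSegmentSketch {
    private final String[] rows;    // sorted ascending, copied in skip-list order
    private final String[] values;  // values[i] belongs to rows[i]

    /** Assumes the source segment is already frozen (no concurrent writers). */
    FlatSegmentSketch(ConcurrentSkipListMap<String, String> source) {
        int n = source.size();
        rows = new String[n];
        values = new String[n];
        int i = 0;
        for (Map.Entry<String, String> e : source.entrySet()) {
            rows[i] = e.getKey();
            values[i] = e.getValue();
            i++;
        }
    }

    /** Point lookup via binary search; returns null when the row is absent. */
    String get(String row) {
        int idx = Arrays.binarySearch(rows, row);
        return idx >= 0 ? values[idx] : null;
    }

    /** Index of the first row >= the given row, i.e. where a scan would start. */
    int firstIndexAtOrAfter(String row) {
        int idx = Arrays.binarySearch(rows, row);
        return idx >= 0 ? idx : -(idx + 1);
    }

    int size() {
        return rows.length;
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, String> skipList = new ConcurrentSkipListMap<>();
        skipList.put("row-b", "2");
        skipList.put("row-a", "1");
        skipList.put("row-c", "3");
        FlatSegmentSketch segment = new FlatSegmentSketch(skipList);
        System.out.println(segment.get("row-b"));
        System.out.println(segment.firstIndexAtOrAfter("row-bb"));
    }
}
```

The arrays avoid the per-entry node and index objects of a skip list, which is where the
memory saving would come from even when few cells are redundant.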

What do you think, [~eshcar]? Sorry for being a bit late. Thanks.

> HBase In-Memory Memstore Compaction
> -----------------------------------
>                 Key: HBASE-13408
>                 URL: https://issues.apache.org/jira/browse/HBASE-13408
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Eshcar Hillel
>         Attachments: HBaseIn-MemoryMemstoreCompactionDesignDocument-ver02.pdf, HBaseIn-MemoryMemstoreCompactionDesignDocument.pdf,
> A store unit holds a column family in a region, where the memstore is its in-memory component.
> The memstore absorbs all updates to the store; from time to time these updates are flushed
> to a file on disk, where they are compacted. Unlike disk components, the memstore is not compacted
> until it is written to the filesystem and optionally to block-cache. This may result in underutilization
> of the memory due to duplicate entries per row, for example, when hot data is continuously
> updated.
> Generally, the faster the data is accumulated in memory, the more flushes are triggered and
> the more frequently the data sinks to disk, slowing down retrieval of data, even if very recent.
> In high-churn workloads, compacting the memstore can help maintain the data in memory,
> and thereby speed up data retrieval.
> We suggest a new compacted memstore with the following principles:
> 1.	The data is kept in memory for as long as possible
> 2.	Memstore data is either compacted or in process of being compacted 
> 3.	Allow a panic mode, which may interrupt an in-progress compaction and force a flush
> of part of the memstore.
> We suggest applying this optimization only to in-memory column families.
> A design document is attached.
> This feature was previously discussed in HBASE-5311.

This message was sent by Atlassian JIRA
