Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of igals@wix.com designates
 209.85.212.179 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAA7+SiDRi_EtjrUPW67BaqeDNXKNPw=3TUnvMYkZc8qLyHYhtA@mail.gmail.com>
References: 
 <CAA7+SiDRi_EtjrUPW67BaqeDNXKNPw=3TUnvMYkZc8qLyHYhtA@mail.gmail.com>
Date: Wed, 2 May 2012 08:11:46 +0300
Message-ID: 
 <CAFebPXAbq45BuAV3TUdyqWwuw4=2PpEZqJx0XTk6M54a4ek4nA@mail.gmail.com>
Subject: Re: Understanding compacting memstore/HLog before flush
From: Igal Shilman <igals@wix.com>
To: dev@hbase.apache.org
Content-Type: multipart/alternative; boundary=f46d044306744b321204bf06b986

--f46d044306744b321204bf06b986
Content-Type: text/plain; charset=ISO-8859-1

Hi Alex,
Have you seen: https://issues.apache.org/jira/browse/HBASE-4241 ?

Igal.
On May 2, 2012 7:01 AM, "Alex Baranau" <alex.baranov.v@gmail.com> wrote:

> Hello,
>
> Could you please tell me if I understand this problem correctly...
>
> Example behavior 1:
> * create table
> * do 10 operations: insert a cell, then overwrite it repeatedly (given
> that the number of versions is configured to 1).
> * after the memstore with these edits is flushed, all of them get written
> to HFiles
>
> Ideally, in this situation only one edit (the resulting value of the cell)
> should be written. I.e. only the "current visible state" of the memstore
> should be flushed, as opposed to flushing all the edits from the HLog.
> This would have a lot of benefits (e.g. less data to flush -> possibly
> less frequent flushing -> less frequent compactions, etc.), especially in
> particular use cases (like using counters, or updating some "aggregated
> values").
>
> The problem, as I understand it (please correct me if I'm wrong), is that
> this is not an easy thing to do, mainly because:
> 1) additional resource management burden (flushing a large memstore isn't
> cheap)
> 2) compaction may add a lot of unnecessary overhead (so that in some cases
> there will be no actual benefit from it) and may make flushing much
> slower, which can cause a lot of issues
> 3) edits flushed from the memstore and HLog edits should be kept in sync,
> because we want the flush process to be reliable. I.e. if it fails in the
> middle, we should be able to restore the state from the HLog. Keeping the
> memstore and HLog in sync during compaction (and we would need partial
> compaction of some older data in the memstore) is difficult.
> 4) anything else?
>
> Especially the 3rd point: am I getting it right?
>
> Thanx,
> Alex Baranau
>
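
To make "Example behavior 1" above concrete, here is a minimal Python sketch
of collapsing in-memory edits down to the newest version(s) of each cell
before they are written out. This is a toy model only: the tuple layout
(row, column, timestamp, value) and the function name compact_edits are
assumptions made for illustration, not HBase data structures or APIs, and the
sketch does not address keeping the memstore and the HLog in sync (point 3).

```python
from collections import defaultdict


def compact_edits(edits, max_versions=1):
    """Keep only the newest `max_versions` values per cell.

    `edits` is an iterable of (row, column, timestamp, value) tuples, a
    hypothetical stand-in for memstore entries (not an HBase structure).
    """
    by_cell = defaultdict(list)
    for row, column, timestamp, value in edits:
        by_cell[(row, column)].append((timestamp, value))

    compacted = []
    for row, column in sorted(by_cell):
        # Sort newest first, then drop everything past the version limit.
        versions = sorted(by_cell[(row, column)],
                          key=lambda tv: tv[0], reverse=True)
        for timestamp, value in versions[:max_versions]:
            compacted.append((row, column, timestamp, value))
    return compacted


# Ten writes to the same cell: one insert followed by overwrites.
edits = [("row1", "cf:counter", ts, ts * 10) for ts in range(1, 11)]
flushed = compact_edits(edits, max_versions=1)
print(len(edits), "edits in memory ->", len(flushed), "cell(s) to flush")
```

With the version limit set to 1, only the latest write to the cell survives,
which is the "current visible state" described in the quoted message. The
extra grouping and sorting pass is also a small illustration of the overhead
mentioned in point 2.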

--f46d044306744b321204bf06b986--