hbase-user mailing list archives

From Kevin Peterson <kevin...@gmail.com>
Subject Re: hlogs do not get cleared
Date Tue, 15 Dec 2009 23:17:07 GMT
On Tue, Dec 15, 2009 at 10:43 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Too many hlogs means that the inserts are hitting a lot of regions,
> that those regions aren't filled enough to flush so that we have to
> force flush them to give some room. When you added region servers, it
> spread the regions load so that hlogs were getting filled at a slower
> rate.
> Could you tell us more about the rate of insertion, size of data, and
> number of regions per region server?

This makes some sense now. I currently have 2200 regions across 3 tables. My
largest table accounts for about 1600 of those regions and is mostly active
at one end of the keyspace -- our key is based on date, but data only
roughly arrives in order. I also write to two secondary indexes, which have
no pattern to the key at all. One of these secondary tables has 488 regions
and the other has 96 regions.

We write about 10M items per day to the main table (articles). All of these
get written to one of the secondary indexes (article-ids). About a third get
written to the other secondary index. Total volume of data is about 10GB /
day written.
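
To put those figures on a per-second basis, here is a back-of-the-envelope calculation in Python. It assumes the 10GB/day covers all three tables combined, that a GB is 10^9 bytes, and that writes are spread evenly over the day; none of that is measured, and the function name is made up for illustration.

```python
ITEMS_PER_DAY = 10_000_000       # articles written per day (from the figures above)
BYTES_PER_DAY = 10 * 10**9       # assumption: 10GB is decimal and covers all tables
SECONDS_PER_DAY = 24 * 60 * 60


def daily_write_profile(items=ITEMS_PER_DAY, total_bytes=BYTES_PER_DAY):
    """Rough write volume across the main table and both secondary indexes."""
    main_writes = items            # every item goes to the articles table
    index_all_writes = items       # every item also goes to article-ids
    index_third_writes = items // 3  # about a third go to the other index
    total_writes = main_writes + index_all_writes + index_third_writes
    return {
        "total_writes": total_writes,
        "writes_per_second": total_writes / SECONDS_PER_DAY,
        "bytes_per_write": total_bytes / total_writes,
    }


profile = daily_write_profile()
print(profile)
```

Under those assumptions this works out to roughly 23 million writes a day, or a few hundred per second on average.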

I think the key is as you say that the regions aren't filled enough to
flush. The articles table gets mostly written to near one end and I see
splits happening regularly. The index tables have no pattern so the 10
million writes get scattered across the different regions. I've looked more
closely at a log file (linked below), and if I forget about my main table
(which would tend to get flushed), and look only at the indexes, this seems
to be what's happening:

1. Up to maxLogs HLogs, it doesn't do any flushes.
2. Once it gets above maxLogs, it will start flushing one region each time
it creates a new HLog.
3. If the first HLog had edits for, say, 50 regions, it will need 50 flushes,
each of the region with the oldest edits at that point, before that HLog can
be removed.

If N is the number of regions getting written to, but not getting enough
writes to flush on their own, then I think this converges to maxLogs + N
logs on average. If I think of maxLogs as "number of logs to start flushing
regions at" this makes sense.
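
The three steps above can be tried out as a toy simulation. This is only a sketch of the behaviour as I read it from the log, not HBase code: it assumes every one of the N regions gets at least one edit in every HLog, that none of them ever flushes on its own, and that a flushed region's next edits land in the newest log. The function name and the maxLogs value in the usage lines are made up for illustration.

```python
def simulate_hlogs(max_logs, n_regions, rolls):
    """Return the number of live HLogs seen at each log roll (toy model)."""
    # oldest[r] = index of the log holding region r's oldest unflushed edit.
    # Log 0 is the first log, and every region has edits in it.
    oldest = [0] * n_regions
    counts = []
    for current in range(1, rolls + 1):
        # A roll creates log `current`. Any log older than the oldest
        # unflushed edit of every region can be removed at this point.
        first_live = min(oldest)
        n_logs = current - first_live + 1
        counts.append(n_logs)
        if n_logs > max_logs:
            # Above the threshold: flush the one region with the oldest
            # edits. Its next edits go to the newest log.
            region = oldest.index(first_live)
            oldest[region] = current
    return counts


# Hypothetical settings: a threshold of 32 logs and 50 scattered regions.
counts = simulate_hlogs(max_logs=32, n_regions=50, rolls=500)
print(max(counts))
```

With those inputs the count peaks at 82 logs, i.e. maxLogs + N, just before the last of the 50 regions is flushed and the first HLog can go. The model is deliberately crude, so the real numbers will depend on how evenly the index writes are actually spread.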

