hbase-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-3323) OOME in master splitting logs
Date Fri, 10 Dec 2010 07:50:01 GMT

     [ https://issues.apache.org/jira/browse/HBASE-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-3323:
-------------------------------

    Attachment: hbase-3323.txt

Here's a patch which basically redoes the way log splitting happens. It needs to be commented
up and I want to rename some things, but the basic architecture is this:

- Main thread reads logs in order and writes into a structure called EntrySink (I want to
rename this to EntryBuffer or something)
- EntrySink maintains an approximate total heap size of the buffered edits (I don't think I
calculated it quite right, but c'est la vie) and also takes care of managing a RegionEntryBuffer
for each region key.
-- The RegionEntryBuffer just has a LinkedList of Entries right now, but it does size accounting,
and I think we could change it to a fancier data structure for more efficient memory usage
(e.g. a linked list of 10000-entry arrays)
- If the main thread tries to append into the EntrySink but the heap usage has hit a max threshold,
it waits. (A rough sketch of this buffering side follows below.)
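
A minimal sketch of that buffering side, just to make the design concrete (this is not the
patch itself; the Entry type, the String region key, the per-edit heapSize() estimate, and
the 128MB threshold are placeholders):

import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Map;
import java.util.Set;

interface Entry {
  long heapSize();   // approximate heap footprint of one WAL edit
}

class RegionEntryBuffer {
  final String regionKey;
  final LinkedList<Entry> entries = new LinkedList<Entry>();
  long heapSize = 0;

  RegionEntryBuffer(String regionKey) { this.regionKey = regionKey; }

  void add(Entry e) {
    entries.add(e);
    heapSize += e.heapSize();   // per-region size accounting
  }
}

class EntrySink {
  private final Map<String, RegionEntryBuffer> buffers = new HashMap<String, RegionEntryBuffer>();
  private final Set<String> regionsBeingWritten = new HashSet<String>();
  private long totalHeap = 0;
  private final long maxHeap = 128L * 1024 * 1024;   // assumed threshold

  // Producer side: the single log-reading thread appends here and blocks once
  // the buffered edits exceed the heap threshold.
  synchronized void appendEntry(String regionKey, Entry e) throws InterruptedException {
    while (totalHeap >= maxHeap) {
      wait();
    }
    RegionEntryBuffer buf = buffers.get(regionKey);
    if (buf == null) {
      buf = new RegionEntryBuffer(regionKey);
      buffers.put(regionKey, buf);
    }
    buf.add(e);
    totalHeap += e.heapSize();
  }

  // Consumer side: hand out the buffer with the most outstanding edits whose
  // region is not already being written, so appends for a region stay in order.
  synchronized RegionEntryBuffer pollLargestBuffer() {
    RegionEntryBuffer best = null;
    for (RegionEntryBuffer buf : buffers.values()) {
      if (regionsBeingWritten.contains(buf.regionKey)) continue;
      if (best == null || buf.heapSize > best.heapSize) best = buf;
    }
    if (best != null) {
      buffers.remove(best.regionKey);
      regionsBeingWritten.add(best.regionKey);
    }
    return best;
  }

  // Called by a writer thread after draining a buffer: frees the accounted
  // memory and unblocks the producer.
  synchronized void doneWriting(RegionEntryBuffer buf) {
    regionsBeingWritten.remove(buf.regionKey);
    totalHeap -= buf.heapSize;
    notifyAll();
  }
}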

Meanwhile, there are N threads called WriterThread-n which do the following in a loop:
- poll the EntrySink to grab a RegionEntryBuffer
-- The EntrySink returns the one with the most outstanding edits (the hope is to write larger
sequential chunks when possible)
-- The EntrySink also keeps track of which regions already have some thread working on them,
so we don't end up with out-of-order appends
- The writer thread then drains the RegionEntryBuffer into the "OutputSink", which maintains the
map from region key to WriterAndPath (bug in the uploaded patch: this map needs to be a synchronizedMap)
- Once the buffer is drained, it notifies the EntrySink that the memory is no longer in use
(hence unblocking the producer thread); see the sketch below
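
A sketch of that consumer loop, building on the EntrySink sketch above (again not the actual
patch; the OutputSink interface here is a stand-in for the real map of region key to
WriterAndPath):

import java.io.IOException;

interface OutputSink {
  // Appends the buffered edits for one region to that region's output file.
  void append(RegionEntryBuffer buf) throws IOException;
}

class WriterThread extends Thread {
  private final EntrySink sink;
  private final OutputSink out;
  private volatile boolean shouldStop = false;

  WriterThread(EntrySink sink, OutputSink out, int n) {
    super("WriterThread-" + n);
    this.sink = sink;
    this.out = out;
  }

  @Override
  public void run() {
    try {
      while (!shouldStop) {
        // Grab the region buffer with the most outstanding edits that no
        // other writer is currently handling.
        RegionEntryBuffer buf = sink.pollLargestBuffer();
        if (buf == null) {
          Thread.sleep(100);   // nothing ready; real code would wait/notify
          continue;
        }
        // Drain the buffer into this region's writer, then release the
        // accounted memory so the reader thread can unblock.
        out.append(buf);
        sink.doneWriting(buf);
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    } catch (IOException ioe) {
      // a real implementation would record the failure and abort the split
    }
  }

  void finish() { shouldStop = true; }
}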

In summary, it's a fairly standard producer-consumer pattern with some trickery to make a
separate queue per region so as not to reorder edits.
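
To show how the pieces hang together, here's a hypothetical driver (the in-memory Entry and
OutputSink stand-ins just count edits; the real code writes recovered-edits files per region):

public class SplitSketch {
  public static void main(String[] args) throws Exception {
    final EntrySink sink = new EntrySink();
    OutputSink out = new OutputSink() {
      public void append(RegionEntryBuffer buf) {
        System.out.println(buf.regionKey + ": wrote " + buf.entries.size() + " edits");
      }
    };

    // Start a few consumers (the "WriterThread-n" threads).
    WriterThread[] writers = new WriterThread[3];
    for (int i = 0; i < writers.length; i++) {
      writers[i] = new WriterThread(sink, out, i);
      writers[i].start();
    }

    // Producer: pretend to read log entries in order and buffer them per region key.
    Entry dummy = new Entry() {
      public long heapSize() { return 536; }   // rough payload + overhead per edit
    };
    for (int i = 0; i < 100000; i++) {
      sink.appendEntry("region-" + (i % 10), dummy);
    }

    Thread.sleep(1000);                        // give the writers time to drain
    for (WriterThread w : writers) {
      w.finish();
    }
  }
}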

As a non-scientific test, I patched this into my cluster that was hitting the OOME on master
startup; not only did it start up fine, the log splits also ran about 50% faster than they did
before!

Known bug: the "log N of M" always says "log 1 of M"

Thoughts?

> OOME in master splitting logs
> -----------------------------
>
>                 Key: HBASE-3323
>                 URL: https://issues.apache.org/jira/browse/HBASE-3323
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: hbase-3323.txt, sizes.png
>
>
> In testing a RS failure under heavy increment workload I ran into an OOME when the master
> was splitting the logs.
> In this test case, I have exactly 136 bytes per log entry in all the logs, and the logs
> are all around 66-74MB. With a batch size of 3 logs, this means the master is loading about
> 500K-600K edits per log file. Each edit ends up creating 3 byte[] objects, the references
> for which are each 8 bytes of RAM, so we have 160 (136+8*3) bytes per edit used by the byte[].
> For each edit we also allocate a bunch of other objects: one HLog$Entry, one WALEdit, one
> ArrayList, one LinkedList$Entry, one HLogKey, and one KeyValue. Overall this works out to
> 400 bytes of overhead per edit. So, with the default settings on this fairly average workload,
> the 1.5M log entries take about 770MB of RAM. Since I had a few log files that were a bit
> larger (around 90MB) it exceeded 1GB of RAM and I got an OOME.
> For one, the 400 bytes per edit overhead is pretty bad, and we could probably be a lot
> more efficient. For two, we should actually account for this rather than simply having a
> configurable "batch size" in the master.
> I think this is a blocker because I'm running with fairly default configs here and just
> killing one RS made the cluster fall over due to master OOME.
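
A quick re-check of the arithmetic in the description above (the per-edit numbers are taken
straight from it; real per-object overheads vary by JVM and pointer size):

public class SplitMemoryEstimate {
  public static void main(String[] args) {
    long edits = 1500000L;          // ~500K-600K edits per log x batch of 3 logs
    long payloadPerEdit = 136L;     // raw bytes per log entry
    long overheadPerEdit = 400L;    // byte[] refs/headers plus HLog$Entry, WALEdit, etc.
    long totalBytes = edits * (payloadPerEdit + overheadPerEdit);
    // Prints roughly 767 MB, in line with the ~770MB figure above.
    System.out.printf("%.0f MB%n", totalBytes / (1024.0 * 1024));
  }
}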

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

