hbase-issues mailing list archives

From "Cosmin Lehene (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3323) OOME in master splitting logs
Date Thu, 09 Dec 2010 19:44:03 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969895#action_12969895 ]

Cosmin Lehene commented on HBASE-3323:
--------------------------------------


Here's the object distribution tlipcon mentioned:

{code}
The values of this map hold the 1.5M+ edits (in Entry objects) tlipcon mentioned:

Map<byte[], LinkedList<Entry>> editsByRegion
  |
  |-- key:   byte[] encodedRegionName
  |
  `-- value: LinkedList<Entry>, each Entry holding:
               |
               |-- WALEdit edit
               |     `-- ArrayList<KeyValue> kvs
               |           `-- byte[] bytes                 <- the actual data
               |
               `-- HLogKey key
                     |-- byte[] encodedRegionName           <- duplicates the map key
                     `-- byte[] tableName                   <- also redundant: we could keep it in the map key instead
{code}

The splitLog workflow loads all the edits into a map indexed by region, and then uses a thread
pool to write them out to per-region directories.


As you can see from this diagram, each edit duplicates the tableName and the encodedRegionName
(hence the two extra byte[] references per edit).

*One simple, partial solution:*
We can reduce the memory footprint by moving the tableName into the map key alongside the encodedRegionName
(that's essentially free). This would leave us with a LinkedList of WALEdit objects per region (ArrayList + KeyValue
+ the actual info: byte[]). Of course this could be compressed further, but it might not be
worth it (WALEdit carries a replication scope as well, IIUC).
This is only a partial solution, since it still doesn't cover the case where there is simply too much data
in the HLogs.
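
Roughly what I have in mind, as a minimal sketch (the RegionKey class and its names are made up
for illustration here, not taken from the codebase):

{code}
import java.util.Arrays;

// A composite map key that stores both identifiers once per region
// instead of once per edit inside every HLogKey.
final class RegionKey {
  private final byte[] tableName;
  private final byte[] encodedRegionName;

  RegionKey(byte[] tableName, byte[] encodedRegionName) {
    this.tableName = tableName;
    this.encodedRegionName = encodedRegionName;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof RegionKey)) return false;
    RegionKey other = (RegionKey) o;
    return Arrays.equals(tableName, other.tableName)
        && Arrays.equals(encodedRegionName, other.encodedRegionName);
  }

  @Override
  public int hashCode() {
    return 31 * Arrays.hashCode(tableName) + Arrays.hashCode(encodedRegionName);
  }
}

// The split map would then only need the payload per entry, e.g. something like
//   Map<RegionKey, LinkedList<WALEdit>> editsByRegion = new HashMap<RegionKey, LinkedList<WALEdit>>();
// (the per-edit sequence number from HLogKey still has to be kept next to each WALEdit).
{code}

A byte[]-keyed TreeMap with a byte[] comparator would work just as well; the only point is that
tableName and encodedRegionName are stored once per region instead of once per edit.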


*A second solution/suggestion:*

We can change the split process a bit. Let me explain how HLogs are organized and how we split
(please correct me if I'm wrong):

*Context:*
* Each region server has one HLog directory in HDFS (under /hbase/.logs)
* In each region server's directory there's a bunch of HLog files (see the layout sketch right after this list).
* There's a strict order of the HLog files within a region server's dir, and the edits inside each file
are ordered as well.
* We first read all the files into memory, because we need all the edits for a particular region
together and in the right order.
* Only after everything is read do we use a thread pool to distribute the log entries per region.
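
For reference, this is the layout I mean (placeholder names only, not actual file names):

{code}
/hbase/.logs/
    <region server 1 dir>/
        <hlog file 1>        <- oldest
        <hlog file 2>
        ...
    <region server 2 dir>/
        <hlog file 1>
        ...
{code}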


*Suggestion:*
We could read the files in parallel and, instead of writing a single file into each HRegion's
directory, write one file per source HLog. This keeps all the edits in strict order.
The HRegionServer could then safely load the files in the same order and apply the edits.

While reading the files in parallel we don't have to hold the entire content in memory: we
can simply read each entry and write it straight to the corresponding destination file (sketched
below). This should solve the memory footprint problem.
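
Very roughly, a per-HLog streaming worker could look something like the sketch below. The
LogEntry/LogReader/LogWriter/SplitIO interfaces are stand-ins I made up for the sketch, not the
real HLog reader/writer APIs:

{code}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Minimal stand-ins for the real HLog entry/reader/writer abstractions.
interface LogEntry { byte[] encodedRegionName(); }
interface LogReader { LogEntry next() throws IOException; void close() throws IOException; }
interface LogWriter { void append(LogEntry entry) throws IOException; void close() throws IOException; }

interface SplitIO {
  LogReader openLog(String hlogPath) throws IOException;
  // One output file per (region, source HLog) pair, so per-HLog ordering is preserved.
  LogWriter openRegionOutput(byte[] encodedRegionName, String hlogName) throws IOException;
}

// One worker per HLog: read sequentially, write each entry straight to the
// destination file for its region. At most one entry is held in memory at a time.
class StreamingLogSplitter implements Runnable {
  private final SplitIO io;
  private final String hlogPath;
  private final String hlogName;

  StreamingLogSplitter(SplitIO io, String hlogPath, String hlogName) {
    this.io = io;
    this.hlogPath = hlogPath;
    this.hlogName = hlogName;
  }

  public void run() {
    Map<String, LogWriter> writers = new HashMap<String, LogWriter>();
    try {
      LogReader reader = io.openLog(hlogPath);
      try {
        LogEntry entry;
        while ((entry = reader.next()) != null) {
          String region = new String(entry.encodedRegionName(), "UTF-8");
          LogWriter out = writers.get(region);
          if (out == null) {
            out = io.openRegionOutput(entry.encodedRegionName(), hlogName);
            writers.put(region, out);
          }
          out.append(entry);   // stream through; no per-region buffering
        }
      } finally {
        reader.close();
      }
    } catch (IOException e) {
      throw new RuntimeException("split of " + hlogPath + " failed", e);
    } finally {
      for (LogWriter w : writers.values()) {
        try { w.close(); } catch (IOException ignored) { }
      }
    }
  }
}
{code}

Since each output file maps back to exactly one source HLog, the region server can replay the
per-region files in the original HLog order and still see its edits in the order they were written.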


I haven't spent too much time analyzing the second option; it might have been discussed in
the past, so if I'm missing something let me know.


Cosmin


> OOME in master splitting logs
> -----------------------------
>
>                 Key: HBASE-3323
>                 URL: https://issues.apache.org/jira/browse/HBASE-3323
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: sizes.png
>
>
> In testing a RS failure under heavy increment workload I ran into an OOME when the master
> was splitting the logs.
> In this test case, I have exactly 136 bytes per log entry in all the logs, and the logs
> are all around 66-74MB. With a batch size of 3 logs, this means the master is loading about
> 500K-600K edits per log file. Each edit ends up creating 3 byte[] objects, the references
> for which are each 8 bytes of RAM, so we have 160 (136+8*3) bytes per edit used by the byte[].
> For each edit we also allocate a bunch of other objects: one HLog$Entry, one WALEdit, one
> ArrayList, one LinkedList$Entry, one HLogKey, and one KeyValue. Overall this works out to
> 400 bytes of overhead per edit. So, with the default settings on this fairly average workload,
> the 1.5M log entries take about 770MB of RAM. Since I had a few log files that were a bit
> larger (around 90MB) it exceeded 1GB of RAM and I got an OOME.
> For one, the 400 bytes per edit overhead is pretty bad, and we could probably be a lot
> more efficient. For two, we should actually account for this rather than simply having a configurable
> "batch size" in the master.
> I think this is a blocker because I'm running with fairly default configs here and just
> killing one RS made the cluster fall over due to master OOME.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

