hbase-issues mailing list archives

From "Ted Yu (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2646) Compaction requests should be prioritized to prevent blocking
Date Tue, 28 Sep 2010 22:41:34 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915952#action_12915952
] 

Ted Yu commented on HBASE-2646:
-------------------------------

Summarizing discussion with Jeff:
Ted: In PriorityCompactionQueue.addToRegionsInQueue(), I noticed the following call which
is not synchronized:
      queue.remove(queuedRequest);

Now suppose PriorityCompactionQueue.take() is called before the line above is executed, so queuedRequest
is returned to the caller of take().  Later, this line in take():
removeFromRegionsInQueue(cr.getHRegion());
would remove the newly added, higher-priority request from regionsInQueue.
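
To make the window concrete, here is a minimal sketch of the shapes involved. The method and field names are taken from this discussion; everything else (the class layout, the priority convention, the HRegion/CompactionRequest types) is an assumption for illustration, not the code in the attached patch:

{noformat}
// Sketch only -- assumes CompactionRequest and HRegion come from the patch,
// that CompactionRequest is Comparable on its priority, and that a lower
// numeric priority value means a more urgent request.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.PriorityBlockingQueue;

class PriorityCompactionQueueSketch {
  private final PriorityBlockingQueue<CompactionRequest> queue =
      new PriorityBlockingQueue<CompactionRequest>();
  // One entry per region: the request currently believed to be on the queue.
  private final Map<HRegion, CompactionRequest> regionsInQueue =
      new HashMap<HRegion, CompactionRequest>();

  public void add(CompactionRequest newRequest) {
    CompactionRequest queuedRequest;
    synchronized (regionsInQueue) {
      queuedRequest = regionsInQueue.get(newRequest.getHRegion());
      if (queuedRequest != null
          && newRequest.getPriority() >= queuedRequest.getPriority()) {
        return;  // an equal or better request is already queued
      }
      // Record the new, higher-priority request for this region.
      regionsInQueue.put(newRequest.getHRegion(), newRequest);
    }
    // <-- window: take() can run right here and pull queuedRequest off the queue.
    if (queuedRequest != null) {
      queue.remove(queuedRequest);  // the unsynchronized call noted above
    }
    queue.add(newRequest);          // add() completes; the queue now holds a request
                                    // with no matching regionsInQueue entry
  }

  public CompactionRequest take() throws InterruptedException {
    CompactionRequest cr = queue.take();        // may return queuedRequest
    removeFromRegionsInQueue(cr.getHRegion());  // removes whatever is mapped --
                                                // possibly the newer request
    return cr;
  }

  private void removeFromRegionsInQueue(HRegion r) {
    synchronized (regionsInQueue) {
      regionsInQueue.remove(r);
    }
  }
}
{noformat}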

Jeff:
That is an astute observation.  Stepping through the code with the threads stopping execution
at the points you suggest would indeed make it so take() returns the lower-priority
compactionRequest, removes the higher-priority compaction request from regionsInQueue, and
finally the add() method completes and adds the higher-priority compaction onto the queue
with no corresponding entry in the regionsInQueue hash (this is bad).  Even if I move the
queue.remove(queuedRequest) call into the synchronized(regionsInQueue) block, we will run into
the same problem.

Fortunately, the worst thing that can happen is a request without an entry in regionsInQueue
that eventually gets executed, costing the system nothing worse than some extra work.  It
won't actually break anything, but PriorityCompactionQueue will be left in an inconsistent
state, which should be fixed.  An immediate solution is not jumping out at me, so I need to
think through the problem and see if I can come up with a way to prevent the inconsistency.

Ted:
Except for remove(Object r), all callers of removeFromRegionsInQueue() have the CompactionRequest
information, so the CompactionRequest, cr, can be passed to removeFromRegionsInQueue() and used
for a sanity check:
if cr.getPriority() is lower than the priority of the CompactionRequest currently in regionsInQueue,
removeFromRegionsInQueue() can return null, which indicates inconsistency.
The caller can discard cr upon seeing null from removeFromRegionsInQueue() and try to get
the next request from the queue.

The above avoids introducing another synchronization between accesses to queue and regionsInQueue.
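
Under the same assumptions as the sketch above (and continuing its fields), the check Ted describes might look roughly like this; the real signatures in the patch may differ:

{noformat}
// Sketch only -- same assumed types as above; lower numeric value == higher priority.
private CompactionRequest removeFromRegionsInQueue(HRegion r, CompactionRequest cr) {
  synchronized (regionsInQueue) {
    CompactionRequest queued = regionsInQueue.get(r);
    if (queued != null && cr.getPriority() > queued.getPriority()) {
      // cr is lower priority than what regionsInQueue now holds: the race fired.
      // Leave the map entry alone and report the inconsistency.
      return null;
    }
    return regionsInQueue.remove(r);
  }
}

public CompactionRequest take() throws InterruptedException {
  while (true) {
    CompactionRequest cr = queue.take();
    if (removeFromRegionsInQueue(cr.getHRegion(), cr) != null) {
      return cr;  // we took off exactly what we expected to
    }
    // Stale, lower-priority request: discard it and take the next one.
  }
}
{noformat}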

Jeff:
I was thinking along the same lines.  Adding an additional synchronization didn't seem like
the right approach.  So if we make sure we are taking off what we are expecting to, there
won't be a problem.

> Compaction requests should be prioritized to prevent blocking
> -------------------------------------------------------------
>
>                 Key: HBASE-2646
>                 URL: https://issues.apache.org/jira/browse/HBASE-2646
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.20.4
>         Environment: ubuntu server 10; hbase 0.20.4; 4 machine cluster (each machine is an 8 core xeon with 16 GB of ram and 6TB of storage); ~250 Million rows;
>            Reporter: Jeff Whiting
>            Priority: Critical
>             Fix For: 0.20.7
>
>         Attachments: 2646-v2.txt, 2646-v3.txt, prioritycompactionqueue-0.20.4.patch, PriorityQueue-r996664.patch
>
>
> While testing the write capacity of a 4 machine hbase cluster we were getting long and
> frequent client pauses as we attempted to load the data.  Looking into the problem, we'd see
> a relatively large compaction queue, and when a region hit the "hbase.hstore.blockingStoreFiles"
> limit it would block the client while its compaction request was put at the back of the
> queue, waiting behind many other less important compactions.  The client is basically stuck
> at that point until a compaction is done.  Prioritizing the compaction requests and allowing
> the request that is blocking other actions to go first would help solve the problem.
> You can see the problem by looking at our log files:
> You'll first see an event such as a "too many hlogs" message, which will put a lot of requests
> on the compaction queue.
> {noformat}
> 2010-05-25 10:53:26,570 INFO org.apache.hadoop.hbase.regionserver.HLog: Too many hlogs: logs=33, maxlogs=32; forcing flush of 22 regions(s): responseCounts,RS_6eZzLtdwhGiTwHy,1274232223324, responses,RS_0qhkL5rUmPCbx3K-1274213057242,1274513189592, responses,RS_1ANYnTegjzVIsHW-1274217741921,1274511001873, responses,RS_1HQ4UG5BdOlAyuE-1274216757425,1274726323747, responses,RS_1Y7SbqSTsZrYe7a-1274328697838,1274478031930, responses,RS_1ZH5TB5OdW4BVLm-1274216239894,1274538267659, responses,RS_3BHc4KyoM3q72Yc-1274290546987,1274502062319, responses,RS_3ra9BaBMAXFAvbK-1274214579958,1274381552543, responses,RS_6SDrGNuyyLd3oR6-1274219941155,1274385453586, responses,RS_8AGCEMWbI6mZuoQ-1274306857429,1274319602718, responses,RS_8C8T9DN47uwTG1S-1274215381765,1274289112817, responses,RS_8J5wmdmKmJXzK6g-1274299593861,1274494738952, responses,RS_8e5Sz0HeFPAdb6c-1274288641459,1274495868557, responses,RS_8rjcnmBXPKzI896-1274306981684,1274403047940, responses,RS_9FS3VedcyrF0KX2-1274245971331,1274754745013, responses,RS_9oZgPtxO31npv3C-1274214027769,1274396489756, responses,RS_a3FdO2jhqWuy37C-1274209228660,1274399508186, responses,RS_a3LJVxwTj29MHVa-12742
> {noformat}
> Then you see the "too many store files" messages:
> {noformat}
> 2010-05-25 10:53:31,364 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862/783020138 because: regionserver/192.168.0.81:60020.cacheFlusher
> 2010-05-25 10:53:32,364 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862 has too many store files, putting it back at the end of the flush queue.
> {noformat}
> Which leads to this: 
> {noformat}
> 2010-05-25 10:53:27,061 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 60 on 60020' on region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore size 128.0m is >= than blocking 128.0m size
> 2010-05-25 10:53:27,061 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 84 on 60020' on region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore size 128.0m is >= than blocking 128.0m size
> 2010-05-25 10:53:27,065 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 1 on 60020' on region responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore size 128.0m is >= than blocking 128.0m size
> {noformat}
> Once the compaction / split is done, a flush is able to happen, which unblocks the IPC and
> allows writes to continue.  Unfortunately this process can take upwards of 15 minutes (the
> specific case shown here from our logs took about 4 minutes).
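
As an aside for readers of the report quoted above, here is a toy illustration of the idea of letting a blocking region's compaction jump ahead of routine work by deriving a request's priority from how close the region is to hbase.hstore.blockingStoreFiles. The class names and the priority formula here are assumptions for illustration, not the code in the attached patch:

{noformat}
// Toy example only: names and formula are assumptions, not the attached patch.
import java.util.concurrent.PriorityBlockingQueue;

public class CompactionPriorityExample {
  // Smaller value == more urgent, matching PriorityBlockingQueue's natural ordering.
  // Once a region reaches blockingStoreFiles the result drops to <= 0.
  static int priority(int storeFileCount, int blockingStoreFiles) {
    return blockingStoreFiles - storeFileCount;
  }

  static class Request implements Comparable<Request> {
    final String region;
    final int priority;
    Request(String region, int priority) { this.region = region; this.priority = priority; }
    public int compareTo(Request other) { return Integer.compare(this.priority, other.priority); }
  }

  public static void main(String[] args) throws InterruptedException {
    PriorityBlockingQueue<Request> queue = new PriorityBlockingQueue<Request>();
    queue.add(new Request("routine-region", priority(4, 7)));  // priority 3
    queue.add(new Request("blocked-region", priority(8, 7)));  // priority -1, writers blocked
    System.out.println(queue.take().region);  // "blocked-region" comes off first
  }
}
{noformat}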

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

