hbase-dev mailing list archives

From Cosmin Lehene <cleh...@adobe.com>
Subject Re: Hbase corrupts data after reporting MSG_REPORT_CLOSE to master during compaction and split process
Date Tue, 09 Sep 2008 18:08:06 GMT

On 9/9/08 6:56 PM, "Jim Kellerman" <jim@powerset.com> wrote:

> Comments inline below:
>> -----Original Message-----
>> From: Cosmin Lehene [mailto:clehene@adobe.com]
>> Sent: Tuesday, September 09, 2008 7:25 AM
>> To: hbase-dev@hadoop.apache.org
>> Subject: Re: Hbase corrupts data after reporting MSG_REPORT_CLOSE to master
>> during compaction and split process
>> Hi,
>> I managed to reproduce the corruption and also have full debug logs, but
>> first I'll explain the whys and hows of the bug and also how I think it can
>> be fixed.
>> (I'm going to send my takeaways on how we managed to insert 300GB in less
>> than 6 hours on a 5-node cluster, and also some advice/issues, in another
>> mail.)
>> The following assumptions are based on my reading of the actual code (don't
>> worry if I didn't get them all right, please read the entire mail).
>> - The master _assigns_ a region to a server by sending a MSG_REGION_OPEN
>> - On heartbeat region servers report the current load and a list of MLR -
>> most loaded regions (in fact just a list of first N online regions).
>> - Upon opening a newly assigned region, a region server will try to compact
>> and split that region.
>> - The region is NOT marked offline when compaction starts
>> - The region is marked OFFLINE:true, SPLIT:true during a SPLIT
>> Our scenario goes this way:
>> Master (M) assigns region A to region server R1
>> R1 starts compaction and split of A
>> R1 on heartbeat sends its load and a list of MLR that contains A
> This list should only be a list of open regions and should not include any
> regions in the process of being opened.
Yes, but A has already been opened. The compact/split is part of the opening
process (in the openRegion method, I think).

> In addition, the region server should
> attach a number of MSG_REPORT_PROCESS_OPEN to the heartbeat (one for each
> region being opened). This should prevent the master from reassigning those
> regions.
True. However, this region is already OPEN when the compaction starts.
When it starts compacting a region, the region server neither changes the
state of the region to something like MSG_REPORT_PROCESS_COMPACT nor takes
it off the online regions list. Only when doing a split does HRegionServer
remove the region from the online regions (removeFromOnlineRegions method).

So we are now in the situation of having started to compact a region while
also having reported it as a candidate for reassignment.
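
To make the point concrete, here is a rough sketch of the kind of bookkeeping
I mean. It is not HBase code: the class, the State enum and the method names
are made up for illustration. The only idea it shows is that a region that is
busy compacting is left out of the list reported on heartbeat.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not HBase code: tracks a per-region state so that
// regions busy with a compaction are left out of the heartbeat load list.
class OnlineRegionsSketch {
    enum State { OPEN, COMPACTING, SPLITTING }

    // Insertion-ordered, so "first N online regions" is well defined.
    private final Map<String, State> regions = new LinkedHashMap<>();

    synchronized void opened(String region) {
        regions.put(region, State.OPEN);
    }

    // No-op if the region is not tracked.
    synchronized void startCompaction(String region) {
        regions.replace(region, State.COMPACTING);
    }

    // No-op if the region is not tracked (e.g. it was removed by a split).
    synchronized void finishCompaction(String region) {
        regions.replace(region, State.OPEN);
    }

    // Returns at most n regions that are plainly OPEN; anything that is
    // compacting or splitting is never offered for reassignment.
    synchronized List<String> mostLoadedRegions(int n) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, State> e : regions.entrySet()) {
            if (out.size() == n) {
                break;
            }
            if (e.getValue() == State.OPEN) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

With something like this, R1 would not have listed A while the compaction was
running, so M would have had no reason to pick A for reassignment.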

>> M decides to reassign the extra regions and sends a MSG_CLOSE_REGION A to R1
>> R1 finishes the compaction and splits A into A1 and A2 (A1 has the same start
>> key as A)
> If, in fact, the region server is including regions that are not completely
> open in the load list, this is a bug.
Right, so during the compaction the region server gets a MSG_CLOSE but
finishes the split. Now the region is offline, split, but will be

>> M assigns A to R2
>> R2 starts compaction and split of A
>> R2 finishes the compaction and splits A into A_clone_1 and A_clone_2
>> (A_clone_1 has the same start key as A and IMPORTANT the same start key as
>> A1)
> Whenever two region servers start working on the same region, chaos ensues. It
> is rare that corruption *will not* happen in this case.
>> Now we get A1 and A_clone_1, almost identical and starting with the same key.
>> The cluster is corrupted. What happens next matters less; the point is
>> that they are both in .META.
>> I found several places where this could be avoided and I'm going to state a
>> few disjoint questions. Both Master and Region could be held responsible in
>> my opinion but I guess it's a matter of architectural philosophy. Please note
>> that any of these questions would be a starting point for the fix.
>> - Why, when getting a MSG_CLOSE_REGION A, doesn't the region server abort the
>> current compact/split operation, leaving A in its original state, and close
>> it immediately?
> MSG_CLOSE_REGION is sent for various different purposes. Maybe, if the master
> has timed out the region server, it should send something like MSG_ABORT_OPEN.
>> - Why doesn't a region server DELETE a region after a SPLIT? (I guess it
>> could be offline by then and it's not up to the region server to decide
>> that, but still...)
> The reason splits are fast is because the two children use the parent until
> they do a compaction. Thus the parent region must remain around until both
> children are no longer using the parent region. The master then garbage
> collects the parent.
>> - Why, when assigning a region to a new region server, doesn't the master
>> check the region status? It might be splitting or already split. I guess
>> this would need a new state.
> The master does check to see if a region is split or offline and will not
> assign it. This information is only available after the split is complete.
Actually, looking into assignRegions, it seems the master will compare the
load of the servers, figure out what needs to be reassigned and then call
unassignSomeRegions with the list that it got from our region server R1.
This adds the region to the local closingRegions and sends a message to the
region server.
Then, from assignRegions, assignRegionsToMultipleServers is called, which
will send the MSG_OPEN_REGION to R2.
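
One way the master side could guard against this, as a rough sketch only.
This is not the real master code; the class and all method names below are
hypothetical. The idea is that a region that was told to close is not handed
to another server until the first server confirms the close, and never if a
split was reported in the meantime.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the HBase master: decides whether a region may
// be assigned to a server, based on what has been reported so far.
class ReassignGuardSketch {
    private final Map<String, String> owner = new HashMap<>(); // region -> server
    private final Set<String> closing = new HashSet<>();       // close sent, not confirmed
    private final Set<String> split = new HashSet<>();         // parents already split

    void assign(String region, String server) {
        owner.put(region, server);
    }

    // Master asked the current owner to close the region.
    void requestClose(String region) {
        closing.add(region);
    }

    // Owner confirmed the close: the region is free to be assigned again.
    void reportClosed(String region) {
        closing.remove(region);
        owner.remove(region);
    }

    // Owner finished a split instead: the parent must never be reassigned.
    void reportSplit(String region) {
        closing.remove(region);
        owner.remove(region);
        split.add(region);
    }

    boolean canAssign(String region) {
        return !owner.containsKey(region)
            && !closing.contains(region)
            && !split.contains(region);
    }
}
```

In our scenario, canAssign("A") would stay false between the close request
and R1's report, and would remain false once R1 reports the split, so A
would never be sent to R2.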


>> - Why, when opening/compacting/splitting a region, doesn't the region server
>> check if the region is OFFLINE:true or SPLIT:true?
> A region server should never receive an open message for a split or offline
> region. When the region server is told to open a region, it assumes it has
> exclusive rights to all the files of the region.

Well, when A reaches R2, A has the properties SPLIT:true and OFFLINE:true
(see above).
When R2 opens it, it can double check. When it starts the compaction it
could double check, and most of all when doing a split it could double check
that it hasn't already been split. This would be a dirty way to avoid the whole
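
A minimal sketch of that double check, for illustration only. RegionInfoSketch
is a made-up stand-in for the region metadata (it is not HRegionInfo), and
checkUsable is a hypothetical helper that would be called before each of the
open, compact and split steps.

```java
// Hypothetical stand-in for the region metadata discussed in this thread.
class RegionInfoSketch {
    final String name;
    final boolean offline;
    final boolean split;

    RegionInfoSketch(String name, boolean offline, boolean split) {
        this.name = name;
        this.offline = offline;
        this.split = split;
    }
}

// Hypothetical guard, not HBase code.
class OpenGuardSketch {
    // Refuses to continue with the given step if the region is already
    // marked offline or split; otherwise returns normally.
    static void checkUsable(RegionInfoSketch info, String step) {
        if (info.offline || info.split) {
            throw new IllegalStateException(
                "refusing to " + step + " region " + info.name
                    + " (offline=" + info.offline + ", split=" + info.split + ")");
        }
    }
}
```

The check only helps if the flags are re-read from .META. right before each
step; a copy received with the open message could already be stale.
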
>> I have the logs available, however they are pretty large and I might need to
>> clean them a little, but I could make them available if that's really needed.
>> However I think the scenario and questions might be enough for a bug and a
>> fix.
>> Thanks,
>> Cosmin
