hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cosmin Lehene <cleh...@adobe.com>
Subject Re: Hbase corrupts data after reporting MSG_REPORT_CLOSE to master during compaction and split process
Date Wed, 10 Sep 2008 18:05:20 GMT
Just to follow up on the discussion.

I pasted the logs in the wrong order. There's a lease expired entry followed by a process
server shutdown log entry.

It appears that the region assignment is caused by ProcessServerShutdown.scanMetaRegion()

   // Get regions reassigned    for (HRegionInfo info: regions) {      master.regionManager.setUnassigned(info);
   }

And from here they get reassigned by calling assignRegionsToMultipleServers in RegionManager.assignRegions

If this is correct, than the corruption (in my case) happens when the region server shuts
down.  However, both setUnnasigned() method and other methods that change the state of RegionManager.unnassignedRegions
are called from many places that could be potential points where this scenario could happen.

Can we talk with certainty about a clear bug now?

Thanks,
Cosmin

P.S. All 5 machines were under heavy load for more than 14 hours (30-100% waiting for disk).
I inserted more than 300GB of data during a 6 hour period, however there were only 60-70 regions
at the end of the six hours. After that I restarted the cluster and then it took forever to
compact the data (there were many compactions that had between 50 and 100 files to compact).
This is (I guess) another bug that we hit.




On 9/10/08 2:06 AM, "Cosmin Lehene" <clehene@adobe.com> wrote:





On 9/9/08 11:19 PM, "Jim Kellerman" <jim@powerset.com> wrote:

> I think there is a problem, but maybe not the one we surmised. Are
> there any lease timeout reports in your master log?

Like this?

2008-09-08 10:43:01,640 INFO org.apache.hadoop.hbase.master.ServerManager:
10.131.237.54:60020 lease expired


>
> When a lease is timed out, all the regions being served by the region
> server get reassigned almost immediately, which could be the cause of
> a region being assigned to another server while the original server
> still thought it had exclusive access to the region.


The region A is users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108


Here the master decides that the region server is overloaded

2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Server 10.131.237.54:60020 is overloaded. Server load: 52 avg: 48.0
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Choosing to reassign 4 regions. mostLoadedRegions has 10 regions in it.
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,2FF6F81E-1197-4BEB-97D2-E0AFEB529B22,1220891020447
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,24F026E2-DDE1-469F-8A1A-0638759E6A2A,1220887834020
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,F167933A-BCBE-4202-B453-0EA1A629F8D5,1220888091404

Then the split finishes

2008-09-08 10:01:34,385 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_SPLIT:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [B@5fdeb4f from
10.131.237.54:60020

Then many..

2008-09-08 10:43:13,541 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0A
A6A0A62C', ENCODED => 1126800163, OFFLINE => true, SPLIT => true, TABLE =>
{{NAME => 'users', IS_ROOT => 'false', IS_META => 'false', FAMILIES =>
[{NAME => 'video', BLOOMFILTER => 'false', VERSIONS => '3', COMPRESSION =>
'NONE', LENGTH => '2147483647', TTL => '-1', IN_M
EMORY => 'false', BLOCKCACHE => 'false'}]}}}, SERVER =>
'10.131.237.54:60020', STARTCODE => 1220892350279



Then the server shuts down

2008-09-08 10:44:06,552 DEBUG
org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanRootRegion: process
server shutdown scanning root region on 10.131.236.91
...

Then the lease expires

2008-09-08 10:43:01,640 INFO org.apache.hadoop.hbase.master.ServerManager:
10.131.237.54:60020 lease expired



And then it assigns the region ..


2008-09-08 10:44:12,656 INFO org.apache.hadoop.hbase.master.RegionManager:
assigning region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 to
server 10.131.237.51:60020

Then...

2008-09-08 10:46:41,493 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_PROCESS_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 from
10.131.237.51:60020
.................................
2008-09-08 10:47:23,753 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 from
10.131.237.51:60020
2008-09-08 10:47:23,753 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_OPEN:
users,8C4A27C9-B808-406A-A722-F1BE374F89EF,1220857848626 from
10.131.237.51:60020
2008-09-08 10:47:23,753 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Server 10.131.237.51:60020 is overloaded. Server load: 92 avg: 82.0
2008-09-08 10:47:23,753 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Choosing to reassign 10 regions. mostLoadedRegions has 10 regions in it.

This is interesting (notice that everything above and beneath is in the same
second):

The region is open but

2008-09-08 10:47:23,753 INFO org.apache.hadoop.hbase.master.RegionManager:
Skipping region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
because it is already closing.
...
2008-09-08 10:48:12,140 INFO org.apache.hadoop.hbase.master.RegionManager:
Skipping region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
because it is already closing.

The region had been actually opened on the machine but

2008-09-08 10:48:13,374 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0A
A6A0A62C', ENCODED => 1126800163, OFFLINE => true, SPLIT => true, TABLE =>
{{NAME => 'users', IS_ROOT => 'false', IS_META => 'false', FAMILIES =>
[{NAME => 'video', BLOOMFILTER => 'false', VERSIONS => '3', COMPRESSION =>
'NONE', LENGTH => '2147483647', TTL => '-1', IN_M
EMORY => 'false', BLOCKCACHE => 'false'}]}}}, SERVER =>
'10.131.237.51:60020', STARTCODE => 1220882303813



2008-09-08 11:03:19,849 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
users,1652503D-DF0F-464A-B6AD-C6DA4707AE77,1220893282341 no longer has
references to users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108

And then the second split

2008-09-08 11:07:07,906 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_SPLIT:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [B@2e34c6e4 from
10.131.237.51:60020

And then after a cluster shutdown

2008-09-08 11:24:00,379 DEBUG
org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanRootRegion: process
server shutdown scanning root region on 10.131.236.91

The the lease expires

2008-09-08 11:22:49,504 INFO org.apache.hadoop.hbase.master.ServerManager:
10.131.237.51:60020 lease expired

It assigns it again..

2008-09-08 11:24:04,994 INFO org.apache.hadoop.hbase.master.RegionManager:
assigning region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 to
server 10.131.237.53:60020

2008-09-08 11:26:45,307 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 from
10.131.237.53:60020


2008-09-08 11:40:14,452 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220897189382 no longer has
references to users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:40:14,455 INFO org.apache.hadoop.hbase.master.BaseScanner:
Deleting region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
because daughter splits no longer hold references





Log grep on 10.131.237.54 aka R1

2008-09-08 09:45:57,306 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:46:00,847 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:46:00,848 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
2008-09-08 09:46:01,394 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Next sequence id for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 is 201561483
2008-09-08 09:46:01,449 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
available
2008-09-08 09:46:01,449 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for region:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:50:41,743 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [B@1ae49bbd
2008-09-08 09:56:34,587 INFO org.apache.hadoop.hbase.regionserver.HRegion:
starting compaction on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:21,868 INFO org.apache.hadoop.hbase.regionserver.HRegion:
compaction completed on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 in 4mins, 47sec
2008-09-08 10:01:21,938 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Starting split of region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,342 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Compactions and cache flushes disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,343 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Scanners disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,415 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more active scanners for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,416 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,416 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more row locks outstanding on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,416 INFO org.apache.hadoop.hbase.regionserver.HRegion:
closed users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:31,322 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split, META
updated, and report to master all successful. Old region=REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0AA6A0A62C', ENCODED => 1126800163, OFFLINE =>
true, SPLIT => true, TABLE => {{NAME => 'users', IS_ROOT => 'false', IS_META
=> 'false', FAMILIES => [{NAME => 'video', BLOOMFILTER => 'false',
COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1',
IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}, new regions:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220893282341,
users,1652503D-DF0F-464A-B6AD-C6DA4707AE77,1220893282341. Split took 9sec
2008-09-08 10:11:58,606 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [B@1ae49bbd


The grep of the region log on 10.131.237.51 aka R2

2008-09-08 09:22:37,183 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:22:37,196 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:22:37,196 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
2008-09-08 09:22:37,468 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Next sequence id for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 is 201561483
2008-09-08 09:22:37,469 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
available
2008-09-08 09:22:37,469 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for region:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,447 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [B@2157d3b
2008-09-08 09:39:11,448 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [B@2157d3b
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Compactions and cache flushes disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Scanners disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more active scanners for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more row locks outstanding on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 INFO org.apache.hadoop.hbase.regionserver.HRegion:
closed users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:44:12,689 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:47:21,796 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:47:21,797 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
2008-09-08 10:47:22,308 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Next sequence id for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 is 201561483
2008-09-08 10:47:22,309 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
available
2008-09-08 10:47:22,309 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for region:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:23,207 INFO org.apache.hadoop.hbase.regionserver.HRegion:
starting compaction on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:24,305 INFO org.apache.hadoop.hbase.regionserver.HRegion:
compaction completed on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 in 1sec
2008-09-08 11:06:24,385 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Starting split of region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,601 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Compactions and cache flushes disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Scanners disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more active scanners for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more row locks outstanding on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 INFO org.apache.hadoop.hbase.regionserver.HRegion:
closed users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:07:06,367 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split, META
updated, and report to master all successful. Old region=REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0AA6A0A62C', ENCODED => 1126800163, OFFLINE =>
true, SPLIT => true, TABLE => {{NAME => 'users', IS_ROOT => 'false', IS_META
=> 'false', FAMILIES => [{NAME => 'video', BLOOMFILTER => 'false',
COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1',
IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}, new regions:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220897189382,
users,1652503D-DF0F-464A-B6AD-C6DA4707AE77,1220897189382. Split took 41sec


I understand now how unassignedRegions gets updated. However on a
leaseExpired situation the only method to modify it is unassignRootRegion
and this should only happen if the server with the expired lease holds the
root region.

Thanks,
Cosmin
>
> There are a couple of problems going on here if you see lease
> timeouts. One is that the region server is not sending its heartbeat
> soon enough (See HBASE-616 "We slept XXXXXX ms, ten times longer than
> scheduled: 3000" happens frequently."). This can happen on both the
> master and the region server meaning either the master does not
> process the heartbeat until after the lease times out, or the region
> server does not send the heartbeat until after the lease times out.
>
> This can be caused by thread starvation due to either too many
> runnable threads on a machine, not enough CPUs to handle the thread
> load, or just a bad thread scheduler.
>
> Hopefully, lease timeout will work better after HBase is integrated
> with Zookeeper.
>
> =====
>
> In the normal case, the master will not reassign a region due to load
> balancing until the region server reports that it has closed the
> region:
>
> Nothing happens with the mostLoadedRegions until it gets to
> RegionManager.unassignSomeRegions which is called by
>   RegionManager.assignRegions
>
> In unassignSomeRegions, any regions selected are put into the
> closingRegions Set and a MSG_REGION_CLOSE will get sent to the region
> server. Candidates for assignment are only taken from the
> unassignedRegions Map.
>
> Not until the master receives a MSG_REPORT_CLOSE does any further
> action take place on that region.
>
> First the region is removed from the closingRegions Set.
>
> If the region was being split, the HRegionInfo received by the master
> will indicate that that the region is offline and split. In this case,
> it does not get reassigned.
>
> Otherwise it is added to the unassignedRegions Map and is now a
> candidate for reassignment.
>



Mime
View raw message