hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From tim robertson <timrobertson...@gmail.com>
Subject Checksum exception on batch load
Date Fri, 24 Jul 2009 14:30:12 GMT
Hi all,

I'm loosing RegionServer's when trying to do a batch data load.

My setup (this is a dev environment):
- 174million record sparsely populated tab file (74 or so fields
typically) in HDFS, loading in the Map phase on a mapred job
- 0.20.0 hadoop and HBase built from the trunk (0.20.0) on the 20th July
- 4 mac minis (each running 1 datanode, 1 regionserver, 1 tasktracker,
1 map job)
  - 2000M HBase heap
- 1 mac pro acting as master for all HBase and Hadoop (and ZK)

The error seems to occur during a region split (?) and relates to a
checksum issue.  (perhaps mac minis with inadequate disks / memory?)
I have seen this several times, so is reasonably repeatable.  This
time it occurred after about 1hr and 2 million rows processed in the
map.

I attach the master log below.  Please let me know if I can diagnose
anything that may help.

Any insights are greatly appreciated!

Cheers,
Tim



2009-07-24 16:13:58,522 INFO
org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 0
dead, average load 8.5
2009-07-24 16:13:58,575 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner
scanning meta region {server: 192.168.2.9:60020, regionname:
-ROOT-,,0, startKey: <>}
2009-07-24 16:13:58,580 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner
scanning meta region {server: 192.168.2.10:60020, regionname:
.META.,,1, startKey: <>}
2009-07-24 16:13:58,583 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner
scan of 1 row(s) of meta region {server: 192.168.2.9:60020,
regionname: -ROOT-,,0, startKey: <>} complete
2009-07-24 16:13:58,696 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner
scan of 32 row(s) of meta region {server: 192.168.2.10:60020,
regionname: .META.,,1, startKey: <>} complete
2009-07-24 16:13:58,696 INFO
org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s)
scanned
2009-07-24 16:14:58,523 INFO
org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 0
dead, average load 8.5
2009-07-24 16:14:58,575 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner
scanning meta region {server: 192.168.2.9:60020, regionname:
-ROOT-,,0, startKey: <>}
2009-07-24 16:14:58,580 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner
scanning meta region {server: 192.168.2.10:60020, regionname:
.META.,,1, startKey: <>}
2009-07-24 16:14:58,581 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner
scan of 1 row(s) of meta region {server: 192.168.2.9:60020,
regionname: -ROOT-,,0, startKey: <>} complete
2009-07-24 16:14:58,701 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner
scan of 32 row(s) of meta region {server: 192.168.2.10:60020,
regionname: .META.,,1, startKey: <>} complete
2009-07-24 16:14:58,701 INFO
org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s)
scanned
2009-07-24 16:15:58,522 INFO
org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 0
dead, average load 8.5
2009-07-24 16:15:58,576 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner
scanning meta region {server: 192.168.2.9:60020, regionname:
-ROOT-,,0, startKey: <>}
2009-07-24 16:15:58,580 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner
scanning meta region {server: 192.168.2.10:60020, regionname:
.META.,,1, startKey: <>}
2009-07-24 16:15:58,582 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner
scan of 1 row(s) of meta region {server: 192.168.2.9:60020,
regionname: -ROOT-,,0, startKey: <>} complete
2009-07-24 16:15:58,662 INFO
org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner
scan of 32 row(s) of meta region {server: 192.168.2.10:60020,
regionname: .META.,,1, startKey: <>} complete
2009-07-24 16:15:58,662 INFO
org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s)
scanned
2009-07-24 16:16:24,021 INFO
org.apache.hadoop.hbase.master.ServerManager:
192.168.2.11,60020,1248434121474 znode expired
2009-07-24 16:16:24,043 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown
of server 192.168.2.11,60020,1248434121474: logSplit: false,
rootRescanned: false, numberOfMetaRegions: 1,
onlineMetaRegions.size(): 1
2009-07-24 16:16:24,060 INFO
org.apache.hadoop.hbase.regionserver.HLog: Splitting 12 hlog(s) in
hdfs://hdfs-master.local:8020/hbase/.logs/192.168.2.11,60020,1248434121474
2009-07-24 16:16:32,192 INFO org.apache.hadoop.fs.FSInputChecker:
Found checksum error: b[0,
512]=346133332d343030642d393662322d66363065363236613532356508756e7061727365646f6363757272656e63655f6461746500000122ad1e79c6045c4e000000b0000000593d6f6363757272656e63652c39303236363865632d306335352d343235652d626538622d3266656338396466613737362c313234383434343530363331360a6f6363757272656e63650000000002fbd0fc00000122ad1e79c6000000530000004800000003002439336537386461322d346133332d343030642d393662322d66363065363236613532356508756e706172736564646174615f7265736f757263655f696400000122ad1e79c604353930000000a6000000593d6f6363757272656e63652c39303236363865632d306335352d343235652d626538622d3266656338396466613737362c313234383434343530363331360a6f6363757272656e63650000000002fbd0fd00000122ad1e79c6000000490000003f00000002002439336537386461322d346133332d343030642d393662322d66363065363236613532356508756e70617273656464656c6574656400000122ad1e79c6045c4e000000b1000000593d6f6363757272656e63652c39303236363865632d306335352d343235652d626538622d3266656338396466613737362c313234383434343530363331360a6f6363757272656e63650000000002fbd8fe000001
org.apache.hadoop.fs.ChecksumException: Checksum error:
/blk_5189144202965489285:of:/hbase/.logs/192.168.2.11,60020,1248434121474/hlog.dat.1248444965962
at 25162752
	at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
	at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
	at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
	at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
	at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
	at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1117)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1666)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1716)
	at java.io.DataInputStream.readFully(DataInputStream.java:178)
	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:869)
	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
	at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
	at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:492)
	at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:426)
2009-07-24 16:16:32,193 WARN org.apache.hadoop.hdfs.DFSClient: Found
Checksum error for blk_5189144202965489285_3890 from
192.168.2.11:50010 at 25162752
2009-07-24 16:16:47,645 WARN
org.apache.hadoop.hbase.regionserver.HLog: Empty hlog, continuing:
org.apache.hadoop.fs.FileStatus@1b457bed count=0
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readFully(DataInputStream.java:152)
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:865)
	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
	at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
	at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:492)
	at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:426)
2009-07-24 16:16:48,745 INFO
org.apache.hadoop.hbase.regionserver.HLog: hlog file splitting
completed in 24701 millis for
hdfs://hdfs-master.local:8020/hbase/.logs/192.168.2.11,60020,1248434121474
2009-07-24 16:16:48,746 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: Log split
complete, meta reassignment and scanning:
2009-07-24 16:16:50,634 INFO
org.apache.hadoop.hbase.master.RegionManager: Assigning region
occurrence,a7e9fe0e-2929-49ca-9f4e-2b53fa3dfd2d,1248444145759 to
192.168.2.10,60020,1248434113865
2009-07-24 16:16:50,634 INFO
org.apache.hadoop.hbase.master.RegionManager: Assigning region
occurrence,effca8da-3cd1-486e-9c7d-cc3a8b7a8ef4,1248444411035 to
192.168.2.10,60020,1248434113865
2009-07-24 16:16:50,634 INFO
org.apache.hadoop.hbase.master.RegionManager: Assigning region
occurrence,9809ffb1-58ad-404c-b4fa-302e770a81a3,1248444506316 to
192.168.2.10,60020,1248434113865
2009-07-24 16:16:50,634 INFO
org.apache.hadoop.hbase.master.RegionManager: Assigning region
occurrence,685ba5a9-fdb9-45ff-9f70-ae37048efadd,1248444347119 to
192.168.2.10,60020,1248434113865
2009-07-24 16:16:50,634 INFO
org.apache.hadoop.hbase.master.RegionManager: Assigning region
occurrence,504e736c-d27e-4140-ab31-d2396933f490,1248444202623 to
192.168.2.10,60020,1248434113865
2009-07-24 16:16:50,634 INFO
org.apache.hadoop.hbase.master.RegionManager: Assigning region
occurrence,902668ec-0c55-425e-be8b-2fec89dfa776,1248444506316 to
192.168.2.10,60020,1248434113865
2009-07-24 16:16:50,634 INFO
org.apache.hadoop.hbase.master.RegionManager: Assigning region
occurrence,4853e283-e747-48a6-b125-8d0e91bfa100,1248444182967 to
192.168.2.10,60020,1248434113865
2009-07-24 16:16:50,634 INFO
org.apache.hadoop.hbase.master.RegionManager: Assigning region
occurrence,4048326d-56de-48e6-9da8-36e2d2686f7e,1248444182967 to
192.168.2.10,60020,1248434113865
2009-07-24 16:16:50,635 INFO
org.apache.hadoop.hbase.master.RegionManager: Assigning region
occurrence,9ff5dffe-70a1-4a0b-9ff1-cdeb5b142cf3,1248444145759 to
192.168.2.10,60020,1248434113865
2009-07-24 16:16:53,670 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,effca8da-3cd1-486e-9c7d-cc3a8b7a8ef4,1248444411035 from
192.168.2.10,60020,1248434113865; 1 of 8
2009-07-24 16:16:53,671 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,9809ffb1-58ad-404c-b4fa-302e770a81a3,1248444506316 from
192.168.2.10,60020,1248434113865; 2 of 8
2009-07-24 16:16:53,671 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,685ba5a9-fdb9-45ff-9f70-ae37048efadd,1248444347119 from
192.168.2.10,60020,1248434113865; 3 of 8
2009-07-24 16:16:53,671 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,504e736c-d27e-4140-ab31-d2396933f490,1248444202623 from
192.168.2.10,60020,1248434113865; 4 of 8
2009-07-24 16:16:53,671 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,902668ec-0c55-425e-be8b-2fec89dfa776,1248444506316 from
192.168.2.10,60020,1248434113865; 5 of 8
2009-07-24 16:16:53,671 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,4853e283-e747-48a6-b125-8d0e91bfa100,1248444182967 from
192.168.2.10,60020,1248434113865; 6 of 8
2009-07-24 16:16:53,671 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,4048326d-56de-48e6-9da8-36e2d2686f7e,1248444182967 from
192.168.2.10,60020,1248434113865; 7 of 8
2009-07-24 16:16:53,671 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,9ff5dffe-70a1-4a0b-9ff1-cdeb5b142cf3,1248444145759 from
192.168.2.10,60020,1248434113865; 8 of 8
2009-07-24 16:16:56,679 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,effca8da-3cd1-486e-9c7d-cc3a8b7a8ef4,1248444411035 from
192.168.2.10,60020,1248434113865; 1 of 8
2009-07-24 16:16:56,679 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,9809ffb1-58ad-404c-b4fa-302e770a81a3,1248444506316 from
192.168.2.10,60020,1248434113865; 2 of 8
2009-07-24 16:16:56,679 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,685ba5a9-fdb9-45ff-9f70-ae37048efadd,1248444347119 from
192.168.2.10,60020,1248434113865; 3 of 8
2009-07-24 16:16:56,679 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,504e736c-d27e-4140-ab31-d2396933f490,1248444202623 from
192.168.2.10,60020,1248434113865; 4 of 8
2009-07-24 16:16:56,679 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,902668ec-0c55-425e-be8b-2fec89dfa776,1248444506316 from
192.168.2.10,60020,1248434113865; 5 of 8
2009-07-24 16:16:56,679 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,4853e283-e747-48a6-b125-8d0e91bfa100,1248444182967 from
192.168.2.10,60020,1248434113865; 6 of 8
2009-07-24 16:16:56,679 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,4048326d-56de-48e6-9da8-36e2d2686f7e,1248444182967 from
192.168.2.10,60020,1248434113865; 7 of 8
2009-07-24 16:16:56,679 INFO
org.apache.hadoop.hbase.master.ServerManager: Received
MSG_REPORT_PROCESS_OPEN:
occurrence,9ff5dffe-70a1-4a0b-9ff1-cdeb5b142cf3,1248444145759 from
192.168.2.10,60020,1248434113865; 8 of 8
2009-07-24 16:16:58,522 INFO
org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0
dead, average load 8.333333333333334

Mime
View raw message