hbase-dev mailing list archives

From Konstantin Shvachko <shv.had...@gmail.com>
Subject Re: HBase 0.92/Hadoop 0.22 test results
Date Tue, 15 Nov 2011 02:19:19 GMT
Guys,

In the log file attached by Roman I see the exception below, which I
think is the reason for the failures.
It says "could only be replicated to 0 nodes, instead of 1", which means
that HDFS could not find any targets for the block.
This can happen either if the disks are full, which I think Roman has
already ruled out, or if there is a spike in new block creations or in
write activity generally.
If there is an instantaneous increase in block creation, a node's load can
go over 2 * avgLoad (the threshold used when choosing replica targets),
because DataNodes do not keep up with reporting their load. It would be
interesting to turn on debug-level logging; then we should see why
locations are not being chosen.
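Roughly, the check the NameNode applies when considerLoad is on is
equivalent to the sketch below (simplified and paraphrased, not the actual
0.22 code; the class, method and parameter names are made up):

    // Simplified sketch of the load-based target check. A DataNode whose
    // active transfer (xceiver) count is more than twice the cluster-wide
    // average is skipped as a replica target.
    class LoadCheckSketch {
        static boolean tooBusy(int nodeXceiverCount, int totalClusterLoad, int numDataNodes) {
            double avgLoad = (numDataNodes == 0) ? 0.0 : (double) totalClusterLoad / numDataNodes;
            return nodeXceiverCount > 2.0 * avgLoad;
        }
    }

With only a handful of DNs, a short burst of writes could push every node
past that threshold at the same time, which would leave no eligible
targets at all.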
It may also make sense to run the same test with
dfs.namenode.replication.considerLoad = false.
Then the average load will not be taken into account when choosing
targets; by default it is true.
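Concretely, that would be something like the following in hdfs-site.xml on
the NameNode (the property name is the one above; the NameNode will need a
restart to pick it up):

    <property>
      <name>dfs.namenode.replication.considerLoad</name>
      <value>false</value>
    </property>

And for the debug output, package-level DEBUG on the namenode classes in
log4j.properties should be enough to see the placement decisions (I am not
sure off-hand which logger prints them in 0.22, so the broad setting is the
safe bet):

    log4j.logger.org.apache.hadoop.hdfs.server.namenode=DEBUG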

11/11/08 18:50:04 WARN regionserver.SplitLogWorker: log splitting of
hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063
failed, returning error
java.io.IOException: File
/hbase/splitlog/domU-12-31-39-09-E8-31.compute-1.internal,60020,1320792889412_hdfs%3A%2F%2Fip-10-46-114-25.ec2.internal%3A17020%2Fhbase%2F.logs%2Fip-10-245-191-239.ec2.internal%2C60020%2C1320792860210-splitting%2Fip-10-245-191-239.ec2.internal%252C60020%252C1320792860210.1320796004063/TestLoadAndVerify_1320795370905/26bcb17daa237ce131b83e43ee48224c/recovered.edits/0000000000000048153
could only be replicated to 0 nodes, instead of 1
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1646)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:829)
	at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:349)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1482)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1478)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1153)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1476)

	at org.apache.hadoop.ipc.Client.call(Client.java:1028)
	at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
	at $Proxy7.addBlock(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:84)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at $Proxy7.addBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:975)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:847)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:447)



On Tue, Nov 8, 2011 at 4:10 PM, Roman Shaposhnik <rvs@apache.org> wrote:
> +Konstantin (there's something weird in append handling)
>
> Some more updates. Hope this will help. I had this hunch that
> I was seeing those weird issues when an HDFS DN was at 80%
> capacity (but nowhere near full!). So I quickly spun up a cluster
> that had 5 DNs with modest (and unbalanced!) amounts of
> storage. Here's what started happening towards the end of
> loading 2M records into HBase:
>
> On the master:
>
> {"statustimems":-1,"status":"Waiting for distributed tasks to finish.
> scheduled=4 done=0
> error=3","starttimems":1320796207862,"description":"Doing distributed
> log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"RUNNING","statetimems":-1},{"statustimems":1320796275317,"status":"Waiting
> for distributed tasks to finish. scheduled=4 done=0
> error=1","starttimems":1320796206563,"description":"Doing distributed
> log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},{"statustimems":1320796275317,"status":"Waiting
> for distributed tasks to finish. scheduled=4 done=0
> error=2","starttimems":1320796205304,"description":"Doing distributed
> log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},{"statustimems":1320796275317,"status":"Waiting
> for distributed tasks to finish. scheduled=4 done=0
> error=3","starttimems":1320796203957,"description":"Doing distributed
> log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317}]
>
> 11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing
> distributed log split in
> [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]:
> status=Waiting for distributed tasks to finish.  scheduled=4 done=0
> error=3, state=RUNNING, startTime=1320796203957, completionTime=-1
> appears to have been leaked
> 11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing
> distributed log split in
> [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]:
> status=Waiting for distributed tasks to finish.  scheduled=4 done=0
> error=2, state=RUNNING, startTime=1320796205304, completionTime=-1
> appears to have been leaked
> 11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing
> distributed log split in
> [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]:
> status=Waiting for distributed tasks to finish.  scheduled=4 done=0
> error=1, state=RUNNING, startTime=1320796206563, completionTime=-1
> appears to have been leaked
>
> And the behavior on the DNs was even weirder. I'm attaching a log
> from one of the DNs. The last exception is a shocker to me:
>
> 11/11/08 18:51:07 WARN regionserver.SplitLogWorker: log splitting of
> hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063
> failed, returning error
> java.io.IOException: Failed to open
> hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063
> for append
>
> But perhaps it is cascading from some of the earlier ones.
>
> Anyway, take a look at the attached log.
>
> Now, this is a tricky issue to reproduce. Just before it started failing
> again I had a completely clean run over here:
>    http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/33/testReport/
>
> Which makes me believe it is NOT configuration related.
>
> Thanks,
> Roman.
>
