hadoop-common-user mailing list archives

From "Kareem Dana" <kareem.d...@gmail.com>
Subject Re: Re: HBase PerformanceEvaluation failing
Date Fri, 16 Nov 2007 18:27:07 GMT
I am using Xen with Linux 2.6.18. dfs -put works fine; I can read the
data I have put, and all other dfs operations work. They work before I
run the PE test, and after the PE test fails dfs still works fine on
its own. However, I found some more DFS errors in the logs from right
before the PE test fails. My DFS datanodes are hadoop08-12.
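
(To be concrete, "dfs -put works fine" means a write/read round trip
against DFS succeeds. The sketch below is only an illustration of that
same check through the FileSystem API -- the class name and path are
made up, and it is not part of the PE test.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsSanityCheck {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from hadoop-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file, read it back, then clean up.
    Path p = new Path("/tmp/dfs-sanity-check");  // made-up path
    FSDataOutputStream out = fs.create(p);
    out.writeUTF("hello dfs");
    out.close();

    FSDataInputStream in = fs.open(p);
    System.out.println(in.readUTF());
    in.close();

    fs.delete(p);
  }
}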

On hadoop08:
2007-11-15 19:13:52,751 INFO org.apache.hadoop.dfs.DataNode: Starting
thread to transfer block blk_6384396336224061547 to
[Lorg.apache.hadoop.dfs.DatanodeInfo;@1d349e2
2007-11-15 19:13:52,755 WARN org.apache.hadoop.dfs.DataNode: Failed to
transfer blk_6384396336224061547 to 172.16.6.56:50010 got
java.net.SocketException: Connection reset

hadoop09:
2007-11-15 19:13:58,788 ERROR org.apache.hadoop.dfs.DataNode:
DataXceiver: java.io.IOException: Block blk_6384396336224061547 has
already been started (though not completed), and thus cannot be created.

hadoop10:
2007-11-15 19:14:13,119 WARN org.apache.hadoop.dfs.DataNode:
Unexpected error trying to delete block blk_-6070535147471430901.
Block not found in blockMap.
2007-11-15 19:14:13,120 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_4063930368628711897 file
/tmp/hadoop-kcd/dfs/data/current/blk_4063930368628711897
2007-11-15 19:14:13,136 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_-2206654761004087942 file
/tmp/hadoop-kcd/dfs/data/current/blk_-2206654761004087942
2007-11-15 19:14:13,157 WARN org.apache.hadoop.dfs.DataNode:
java.io.IOException: Error in deleting blocks.

hadoop12:
2007-11-15 19:14:13,119 WARN org.apache.hadoop.dfs.DataNode:
Unexpected error trying to delete block blk_-6070535147471430901.
Block not found in blockMap.
2007-11-15 19:14:13,120 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_4063930368628711897 file
/tmp/hadoop-kcd/dfs/data/current/blk_4063930368628711897
2007-11-15 19:14:13,136 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_-2206654761004087942 file
/tmp/hadoop-kcd/dfs/data/current/blk_-2206654761004087942
2007-11-15 19:14:13,157 WARN org.apache.hadoop.dfs.DataNode:
java.io.IOException: Error in deleting blocks.

hadoop07 Namenode:
2007-11-15 19:10:33,090 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 54310, call
open(/tmp/hadoop-kcd/hbase/hregion_TestTable,4204932,5347114880093364680/info/info/5829185525592087769,
0, 671088640) from 172.16.6.58:57409: error: java.io.IOException:
Cannot open filename
/tmp/hadoop-kcd/hbase/hregion_TestTable,4204932,5347114880093364680/info/info/5829185525592087769
java.io.IOException: Cannot open filename
/tmp/hadoop-kcd/hbase/hregion_TestTable,4204932,5347114880093364680/info/info/5829185525592087769

It looks like something is wrong with DFS, but DFS works fine
otherwise, and when I run a PE with just 1 client it runs to
completion. Does that put the same stress on DFS, or does a 2-client
test effectively double the I/O through DFS?
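
(If a code sketch makes the question clearer: as far as I understand,
"2 clients" means PE runs two writers in parallel as map tasks, so the
concurrent write load against DFS looks roughly like the toy example
below -- two threads each streaming their own file. This is not what PE
actually does internally; the paths and sizes are made up and it is
only meant to illustrate doubling the number of concurrent writers.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoWriters {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    Thread[] writers = new Thread[2];  // "2 clients"
    for (int i = 0; i < writers.length; i++) {
      final int id = i;
      writers[i] = new Thread(new Runnable() {
        public void run() {
          try {
            FileSystem fs = FileSystem.get(conf);
            // Each writer streams its own file into DFS concurrently.
            Path p = new Path("/tmp/writer-" + id);  // made-up paths
            FSDataOutputStream out = fs.create(p);
            byte[] row = new byte[1024];
            for (int n = 0; n < 10000; n++) {
              out.write(row);
            }
            out.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
      writers[i].start();
    }
    for (Thread t : writers) {
      t.join();
    }
  }
}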

Regards,
Kareem

On Nov 15, 2007 9:08 PM, 闫雪冰 <yanxuebing@alibaba-inc.com> wrote:
> Are you working on FreeBSD 4.11? Did you ever succeed in doing a 'dfs -put'
> operation?
>
> I ran into very similar trouble a few days ago. In my case, I got an
> "only be replicated to 0 nodes, instead of 1" message when I tried to run the
> PE program. I found that I couldn't even manage to do a 'dfs -put', which
> would also give me the same error message, though I succeeded in doing 'dfs
> -mkdir'.
>
> The reason is that SecureRandom doesn't work on my FreeBSD 4.11. I ended up
> with two solutions:
>         a) Go back to hadoop-0.14.3, which works fine with the same
> configuration, or
>         b) Comment out the SecureRandom block as below:
> ----------------------------------------------------------
>  /*
>     try {
>       rand = SecureRandom.getInstance("SHA1PRNG").nextInt(Integer.MAX_VALUE);
>     } catch (NoSuchAlgorithmException e) {
>       LOG.warn("Could not use SecureRandom");
>       rand = (new Random()).nextInt(Integer.MAX_VALUE);
>     }
> */
>     rand = (new Random()).nextInt(Integer.MAX_VALUE);
> ----------------------------------------------------------
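> (As a self-contained illustration of the pattern in that block -- prefer
> SHA1PRNG, fall back to java.util.Random -- here is a standalone sketch,
> not Hadoop's actual code. Note the catch only helps when getInstance()
> throws; if SecureRandom fails in some other way (e.g. by hanging), then
> commenting the whole block out as above is the practical fix.)
> ----------------------------------------------------------
> import java.security.NoSuchAlgorithmException;
> import java.security.SecureRandom;
> import java.util.Random;
>
> public class RandFallback {
>   public static void main(String[] args) {
>     int rand;
>     try {
>       // Preferred path: a SHA1PRNG-backed SecureRandom.
>       rand = SecureRandom.getInstance("SHA1PRNG").nextInt(Integer.MAX_VALUE);
>     } catch (NoSuchAlgorithmException e) {
>       // Fallback path: plain java.util.Random.
>       rand = new Random().nextInt(Integer.MAX_VALUE);
>     }
>     System.out.println("rand = " + rand);
>   }
> }
> ----------------------------------------------------------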
> Hope it helps.
> -Xuebing Yan
>
> -----Original Message-----
> From: Kareem Dana [mailto:kareem.dana@gmail.com]
> Sent: November 16, 2007 9:32
> To: hadoop-user@lucene.apache.org
> Subject: Re: HBase PerformanceEvaluation failing
>
> My DFS appears healthy. After the PE fails, the datanodes are still
> running but all the HRegionServers have exited. My initial concern is
> free hard drive space or memory. Each node has ~1.5 GB of free space for
> DFS and 400 MB RAM / 256 MB swap. Is this enough for the PE? I
> monitored the free space as the PE ran and it never completely filled
> up, but it is kind of tight.
>
>
> On Nov 15, 2007 8:01 PM, stack <stack@duboce.net> wrote:
> > Your DFS is healthy?  This seems odd: "File
> > /tmp/hadoop-kcd/hbase/hregion_TestTable,2102165,6843477525281170954/info/mapfiles/6464987859396543981/data
> > could only be replicated to 0 nodes, instead of 1;"  In my experience,
> > IIRC, it means no datanodes are running.
> >
> > (I just tried the PE from TRUNK and it ran to completion).
> >
> > St.Ack
> >
> >
> > Kareem Dana wrote:
> > > I'm trying to run the HBase PerformanceEvaluation program on a cluster
> > > of 5 hadoop nodes (on virtual machines).
> > >
> > > hadoop07 is a DFS Master and HBase master
> > > hadoop08-12 are HBase region servers
> > >
> > > I start the test as follows:
> > >
> > > $ bin/hadoop jar
> > > ${HADOOP_HOME}build/contrib/hbase/hadoop-0.15.0-dev-hbase-test.jar
> > > sequentialWrite 2
> > >
> > > This starts the sequentialWrite test with 2 clients. After about 25
> > > minutes, with the map tasks about 25% complete and reduce at 6%, the test
> > > fails with the following error:
> > > 2007-11-15 17:06:35,100 INFO org.apache.hadoop.mapred.TaskInProgress:
> > > TaskInProgress tip_200711151626_0001_m_000002 has failed 1 times.
> > > 2007-11-15 17:06:35,100 INFO org.apache.hadoop.mapred.JobInProgress:
> > > Aborting job job_200711151626_0001
> > > 2007-11-15 17:06:35,101 INFO org.apache.hadoop.mapred.TaskInProgress:
> > > Error from task_200711151626_0001_m_000006_0:
> > > org.apache.hadoop.hbase.NoServerForRegionException: failed to find
> > > server for TestTable after 5 retries
> > >       at org.apache.hadoop.hbase.HConnectionManager$TableServers.scanOneMetaRegion(HConnectionManager.java:761)
> > >       at org.apache.hadoop.hbase.HConnectionManager$TableServers.findServersForTable(HConnectionManager.java:521)
> > >       at org.apache.hadoop.hbase.HConnectionManager$TableServers.reloadTableServers(HConnectionManager.java:317)
> > >       at org.apache.hadoop.hbase.HTable.commit(HTable.java:671)
> > >       at org.apache.hadoop.hbase.HTable.commit(HTable.java:636)
> > >       at org.apache.hadoop.hbase.PerformanceEvaluation$SequentialWriteTest.testRow(PerformanceEvaluation.java:493)
> > >       at org.apache.hadoop.hbase.PerformanceEvaluation$Test.test(PerformanceEvaluation.java:356)
> > >       at org.apache.hadoop.hbase.PerformanceEvaluation.runOneClient(PerformanceEvaluation.java:529)
> > >       at org.apache.hadoop.hbase.PerformanceEvaluation$EvaluationMapTask.map(PerformanceEvaluation.java:184)
> > >       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > >       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
> > >
> > >
> > > An HBase region server log shows these errors:
> > > 2007-11-15 17:03:00,017 ERROR org.apache.hadoop.hbase.HRegionServer:
> > > error closing region TestTable,2102165,6843477525281170954
> > > org.apache.hadoop.hbase.DroppedSnapshotException: java.io.IOException:
> > > File /tmp/hadoop-kcd/hbase/hregion_TestTable,2102165,6843477525281170954/info/mapfiles/6464987859396543981/data
> > > could only be replicated to 0 nodes, instead of 1
> > >         at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
> > >         at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
> > >         at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >         at java.lang.reflect.Method.invoke(Method.java:585)
> > >         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
> > >         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
> > >
> > >         at org.apache.hadoop.hbase.HRegion.internalFlushcache(HRegion.java:886)
> > >         at org.apache.hadoop.hbase.HRegion.close(HRegion.java:388)
> > >         at org.apache.hadoop.hbase.HRegionServer.closeAllRegions(HRegionServer.java:978)
> > >         at org.apache.hadoop.hbase.HRegionServer.run(HRegionServer.java:593)
> > >         at java.lang.Thread.run(Thread.java:595)
> > > 2007-11-15 17:03:00,615 ERROR org.apache.hadoop.hbase.HRegionServer:
> > > error closing region TestTable,3147654,8929124532081908894
> > > org.apache.hadoop.hbase.DroppedSnapshotException: java.io.IOException:
> > > File /tmp/hadoop-kcd/hbase/hregion_TestTable,3147654,8929124532081908894/info/mapfiles/3451857497397493742/data
> > > could only be replicated to 0 nodes, instead of 1
> > >         at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
> > >         at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
> > >         at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >         at java.lang.reflect.Method.invoke(Method.java:585)
> > >         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
> > >         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
> > >
> > >         at org.apache.hadoop.hbase.HRegion.internalFlushcache(HRegion.java:886)
> > >         at org.apache.hadoop.hbase.HRegion.close(HRegion.java:388)
> > >         at org.apache.hadoop.hbase.HRegionServer.closeAllRegions(HRegionServer.java:978)
> > >         at org.apache.hadoop.hbase.HRegionServer.run(HRegionServer.java:593)
> > >         at java.lang.Thread.run(Thread.java:595)
> > > 2007-11-15 17:03:00,639 ERROR org.apache.hadoop.hbase.HRegionServer:
> > > Close and delete failed
> > > java.io.IOException: java.io.IOException: File
> > > /tmp/hadoop-kcd/hbase/log_172.16.6.57_-3889232888673408171_60020/hlog.dat.005
> > > could only be replicated to 0 nodes, instead of 1
> > >         at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1003)
> > >         at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
> > >         at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >         at java.lang.reflect.Method.invoke(Method.java:585)
> > >         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
> > >         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
> > >
> > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > >         at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
> > >         at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
> > >         at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:48)
> > >         at org.apache.hadoop.hbase.HRegionServer.run(HRegionServer.java:597)
> > >         at java.lang.Thread.run(Thread.java:595)
> > > 2007-11-15 17:03:00,640 INFO org.apache.hadoop.hbase.HRegionServer:
> > > telling master that region server is shutting down at:
> > > 172.16.6.57:60020
> > > 2007-11-15 17:03:00,643 INFO org.apache.hadoop.hbase.HRegionServer:
> > > stopping server at: 172.16.6.57:60020
> > > 2007-11-15 17:03:00,643 INFO org.apache.hadoop.hbase.HRegionServer:
> > > regionserver/0.0.0.0:60020 exiting
> > >
> > > I can provide more logs if necessary. Any ideas or suggestions
> > > about how to track this down? Running the sequentialWrite test with just 1
> > > client works fine, but using 2 or more causes these errors.
> > >
> > > Thanks for any help,
> > > Kareem Dana
> > >
> >
> >
>