hbase-user mailing list archives

From schubert zhang <zson...@gmail.com>
Subject Re: Data lost during intensive writes
Date Thu, 26 Mar 2009 11:58:36 GMT
Thanks Andrew. I will set "dfs.datanode.max.xcievers=1024" (default is 256).
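
A minimal hadoop-site.xml sketch of that change (assuming the 0.19-era single
config file on each datanode, and using only the value mentioned above, 1024,
up from the 256 default):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1024</value>
  <description>Upper bound on concurrent DataXceiver threads per datanode;
  intensive write loads can exhaust the 256 default.</description>
</property>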

I am using branch-0.19.
Do you think "dfs.datanode.socket.write.timeout=0" is necessary in
release-0.19?

Schubert


On Thu, Mar 26, 2009 at 7:57 AM, Andrew Purtell <apurtell@apache.org> wrote:

>
> You may need to increase the maximum number of xceivers allowed
> on each of your datanodes.
>
> Best regards,
>
>   - Andy
>
> > From: schubert zhang <zsongbo@gmail.com>
> > Subject: Re: Data lost during intensive writes
> > To: hbase-user@hadoop.apache.org
> > Date: Wednesday, March 25, 2009, 2:01 AM
> > Hi all,
> > I also meet the same problems/exceptions.
> > I also have a 5+1 machine setup, and the system has been running
> > for about 4 days, and there are 512 regions now. But the two
> > exceptions started to happen earlier.
> >
> > hadoop-0.19
> > hbase-0.19.1 (with patch https://issues.apache.org/jira/browse/HBASE-1008).
> >
> > I want to try to set dfs.datanode.socket.write.timeout=0
> > and watch it later.
> >
> > Schubert
> >
> > On Sat, Mar 7, 2009 at 3:15 AM, stack <stack@duboce.net> wrote:
> >
> > > On Wed, Mar 4, 2009 at 9:18 AM, <jthievre@ina.fr> wrote:
> > >
> > > > <property>
> > > >   <name>dfs.replication</name>
> > > >   <value>2</value>
> > > >   <description>Default block replication.
> > > >   The actual number of replications can be specified when the file is created.
> > > >   The default is used if replication is not specified in create time.
> > > >   </description>
> > > > </property>
> > > >
> > > > <property>
> > > >   <name>dfs.block.size</name>
> > > >   <value>8388608</value>
> > > >   <description>The hbase standard size for new files.</description>
> > > >   <!--<value>67108864</value>-->
> > > >   <!--<description>The default block size for new files.</description>-->
> > > > </property>
> > > >
> > >
> > >
> > > The above are non-standard.  A replication of 3 might lessen the incidence
> > > of HDFS errors seen since there will be another replica to go to.  Why
> > > non-standard block size?
> > >
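
A sketch of what putting those two properties back at the stock values might
look like, using only numbers already present in the quoted config (3 is the
usual HDFS replication default stack alludes to; 67108864 is the commented-out
default block size above) -- a hypothetical change, not something jthievre ran:

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Three replicas leaves the client another datanode to fall back
  to when one replica misbehaves.</description>
</property>

<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
  <description>The stock 64 MB block size, per the commented-out default in the
  original config.</description>
</property>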
> > > I did not see *dfs.datanode.socket.write.timeout* set to 0.  Is that because
> > > you are running w/ 0.19.0?  You might try with it especially because in the
> > > below I see complaint about the timeout (but more below on this).
> > >
> > >
> > >
> > > >  <property>
> > > >    <name>hbase.hstore.blockCache.blockSize</name>
> > > >    <value>65536</value>
> > > >    <description>The size of each block in the block cache.
> > > >    Enable blockcaching on a per column family basis; see the BLOCKCACHE setting
> > > >    in HColumnDescriptor.  Blocks are kept in a java Soft Reference cache so are
> > > >    let go when high pressure on memory.  Block caching is not enabled by default.
> > > >    Default is 16384.
> > > >    </description>
> > > >  </property>
> > > >
> > >
> > >
> > > Are you using blockcaching?  If so, 64k was problematic in my testing
> > > (OOMEing).
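
If the OOME problem stack describes is a concern, a sketch of dropping the
block cache block size back to the documented default (16384, per the
description quoted above) could look like this -- an illustration, not a tuning
recommendation:

<property>
  <name>hbase.hstore.blockCache.blockSize</name>
  <value>16384</value>
  <description>Size of each block in the block cache; 64k blocks OOMEd in
  stack's testing above, 16384 is the documented default.</description>
</property>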
> > >
> > >
> > >
> > >
> > > > Case 1:
> > > >
> > > > On HBase Regionserver:
> > > >
> > > > 2009-02-27 04:23:52,185 INFO org.apache.hadoop.hdfs.DFSClient:
> > > > org.apache.hadoop.ipc.RemoteException:
> > > > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated
> > > > yet:/hbase/metadata_table/compaction.dir/1476318467/content/mapfiles/260278331337921598/data
> > > >        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
> > > >        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
> > > >        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
> > > >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > > >        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
> > > >        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
> > > >
> > > >        at org.apache.hadoop.ipc.Client.call(Client.java:696)
> > > >        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
> > > >        at $Proxy1.addBlock(Unknown Source)
> > > >        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> > > >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > > >        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> > > >        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> > > >        at $Proxy1.addBlock(Unknown Source)
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> > > >
> > > >
> > > > On Hadoop Datanode:
> > > >
> > > > 2009-02-27 04:22:58,110 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > > infoPort=50075, ipcPort=50020):Got exception while serving
> > > > blk_5465578316105624003_26301 to /10.1.188.249:
> > > > java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> > > > channel to be ready for write. ch : java.nio.channels.SocketChannel[connected
> > > > local=/10.1.188.249:50010 remote=/10.1.188.249:48326]
> > > >        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> > > >        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > > >        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > > >
> > > > 2009-02-27 04:22:58,110 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > > infoPort=50075, ipcPort=50020):DataXceiver
> > > > java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> > > > channel to be ready for write. ch : java.nio.channels.SocketChannel[connected
> > > > local=/10.1.188.249:50010 remote=/10.1.188.249:48326]
> > > >        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> > > >        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > > >        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > >
> > >
> > > Are you sure the regionserver error matches the datanode error?
> > >
> > > My understanding is that in 0.19.0, the DFSClient in the regionserver is
> > > supposed to reestablish timed-out connections.  If that is not happening in
> > > your case -- and we've speculated some that there might be holes in this
> > > mechanism -- try with the timeout set to zero (see citation above; be sure the
> > > configuration can be seen by the DFSClient running in hbase by either adding
> > > it to hbase-site.xml or somehow getting the hadoop-site.xml onto the hbase
> > > CLASSPATH (hbase-env.sh#HBASE_CLASSPATH or with a symlink into the
> > > HBASE_HOME/conf dir).
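
A sketch of the first option stack names, putting the property straight into
hbase-site.xml so the DFSClient embedded in the regionserver actually sees it
(the HBASE_CLASSPATH or conf-dir symlink routes mentioned above are equivalent
ways of making the same setting visible):

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
  <description>Setting the datanode socket write timeout to 0 disables the
  480000 ms timeout seen in the logs; it only takes effect if this file is on
  HBase's classpath.</description>
</property>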
> > >
> > >
> > >
> > > > Case 2:
> > > >
> > > > HBase Regionserver:
> > > >
> > > > 2009-03-02 09:55:11,929 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception for block
> > > > blk_-6496095407839777264_96895java.io.IOException: Bad response 1 for block
> > > > blk_-6496095407839777264_96895 from datanode 10.1.188.182:50010
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_-6496095407839777264_96895 bad datanode[1] 10.1.188.182:50010
> > > > 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_-6496095407839777264_96895 in pipeline 10.1.188.249:50010,
> > > > 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception for block
> > > > blk_-7585241287138805906_96914java.io.IOException: Bad response 1 for block
> > > > blk_-7585241287138805906_96914 from datanode 10.1.188.182:50010
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_-7585241287138805906_96914 bad datanode[1] 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,363 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_-7585241287138805906_96914 in pipeline 10.1.188.249:50010,
> > > > 10.1.188.182:50010, 10.1.188.141:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,445 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception for block
> > > > blk_8693483996243654850_96912java.io.IOException: Bad response 1 for block
> > > > blk_8693483996243654850_96912 from datanode 10.1.188.182:50010
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_8693483996243654850_96912 bad datanode[1] 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_8693483996243654850_96912 in pipeline 10.1.188.249:50010,
> > > > 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,923 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception for block
> > > > blk_-8939308025013258259_96931java.io.IOException: Bad response 1 for block
> > > > blk_-8939308025013258259_96931 from datanode 10.1.188.182:50010
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_-8939308025013258259_96931 bad datanode[1] 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_-8939308025013258259_96931 in pipeline 10.1.188.249:50010,
> > > > 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception for block
> > > > blk_7417692418733608681_96934java.io.IOException: Bad response 1 for block
> > > > blk_7417692418733608681_96934 from datanode 10.1.188.182:50010
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_7417692418733608681_96934 bad datanode[2] 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_7417692418733608681_96934 in pipeline 10.1.188.249:50010,
> > > > 10.1.188.203:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception for block
> > > > blk_6777180223564108728_96939java.io.IOException: Bad response 1 for block
> > > > blk_6777180223564108728_96939 from datanode 10.1.188.182:50010
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_6777180223564108728_96939 bad datanode[1] 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_6777180223564108728_96939 in pipeline 10.1.188.249:50010,
> > > > 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception for block
> > > > blk_-6352908575431276531_96948java.io.IOException: Bad response 1 for block
> > > > blk_-6352908575431276531_96948 from datanode 10.1.188.182:50010
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_-6352908575431276531_96948 bad datanode[2] 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_-6352908575431276531_96948 in pipeline 10.1.188.249:50010,
> > > > 10.1.188.30:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,988 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > Worker: MSG_REGION_SPLIT: metadata_table,r:
> > > > http://com.over-blog.www/_cdata/img/footer_mid.gif@20070505132942-20070505132942,1235761772185
> > > > 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception for block
> > > > blk_-1071965721931053111_96956java.io.IOException: Bad response 1 for block
> > > > blk_-1071965721931053111_96956 from datanode 10.1.188.182:50010
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_-1071965721931053111_96956 bad datanode[2] 10.1.188.182:50010
> > > > 2009-03-02 09:55:16,009 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_-1071965721931053111_96956 in pipeline 10.1.188.249:50010,
> > > > 10.1.188.203:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
> > > > 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception for block
> > > > blk_1004039574836775403_96959java.io.IOException: Bad response 1 for block
> > > > blk_1004039574836775403_96959 from datanode 10.1.188.182:50010
> > > >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
> > > > for block blk_1004039574836775403_96959 bad datanode[1] 10.1.188.182:50010
> > > >
> > > >
> > > > Hadoop datanode:
> > > >
> > > > 2009-03-02 09:55:10,201 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > PacketResponder blk_-5472632607337755080_96875 1 Exception java.io.EOFException
> > > >        at java.io.DataInputStream.readFully(DataInputStream.java:180)
> > > >        at java.io.DataInputStream.readLong(DataInputStream.java:399)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:833)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > > >
> > > > 2009-03-02 09:55:10,407 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > PacketResponder 1 for block blk_-5472632607337755080_96875 terminating
> > > > 2009-03-02 09:55:10,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > > infoPort=50075, ipcPort=50020):Exception writing block
> > > > blk_-5472632607337755080_96875 to mirror 10.1.188.182:50010
> > > > java.io.IOException: Broken pipe
> > > >        at sun.nio.ch.FileDispatcher.write0(Native Method)
> > > >        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
> > > >        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
> > > >        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
> > > >        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
> > > >        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
> > > >        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > > >        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > > >        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > > >        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > > >        at java.io.DataOutputStream.write(DataOutputStream.java:90)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > > >
> > > > 2009-03-02 09:55:10,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > Exception in receiveBlock for block blk_-5472632607337755080_96875
> > > > java.io.IOException: Broken pipe
> > > > 2009-03-02 09:55:10,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > writeBlock blk_-5472632607337755080_96875 received exception
> > > > java.io.IOException: Broken pipe
> > > > 2009-03-02 09:55:10,517 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > > infoPort=50075, ipcPort=50020):DataXceiver
> > > > java.io.IOException: Broken pipe
> > > >        at sun.nio.ch.FileDispatcher.write0(Native Method)
> > > >        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
> > > >        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
> > > >        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
> > > >        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
> > > >        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
> > > >        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > > >        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > > >        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > > >        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > > >        at java.io.DataOutputStream.write(DataOutputStream.java:90)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > > > 2009-03-02 09:55:11,174 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
> > > > src: /10.1.188.249:49063, dest: /10.1.188.249:50010, bytes: 312, op: HDFS_WRITE,
> > > > cliID: DFSClient_1091437257, srvID: DS-1180278657-127.0.0.1-50010-1235652659245,
> > > > blockid: blk_5027345212081735473_96878
> > > > 2009-03-02 09:55:11,177 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > PacketResponder 2 for block blk_5027345212081735473_96878 terminating
> > > > 2009-03-02 09:55:11,185 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > Receiving block blk_-3992843464553216223_96885 src: /10.1.188.249:49069 dest: /10.1.188.249:50010
> > > > 2009-03-02 09:55:11,186 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > Receiving block blk_-3132070329589136987_96885 src: /10.1.188.30:33316 dest: /10.1.188.249:50010
> > > > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > Exception in receiveBlock for block blk_8782629414415941143_96845
> > > > java.io.IOException: Connection reset by peer
> > > > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > PacketResponder 0 for block blk_8782629414415941143_96845 Interrupted.
> > > > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > PacketResponder 0 for block blk_8782629414415941143_96845 terminating
> > > > 2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > writeBlock blk_8782629414415941143_96845 received exception
> > > > java.io.IOException: Connection reset by peer
> > > > 2009-03-02 09:55:11,187 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > > infoPort=50075, ipcPort=50020):DataXceiver
> > > > java.io.IOException: Connection reset by peer
> > > >        at sun.nio.ch.FileDispatcher.read0(Native Method)
> > > >        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> > > >        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
> > > >        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
> > > >        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> > > >        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> > > >        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > > >        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
> > > >        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
> > > >        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > > >        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > > >        at java.io.DataInputStream.read(DataInputStream.java:132)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:251)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > > >        etc.............................
> > >
> > >
> > >
> > > This looks like an HDFS issue where it won't move on past the bad server
> > > 182.  On the client side, they are reported as WARN in the dfsclient but don't
> > > make it up to the regionserver, so there is not much we can do about it.
> > >
> > >
> > > > I have other exceptions related to DataXceiver problems. These errors
> > > > don't make the region server go down, but I can see that I lost some
> > > > records (about 3.10e6 out of 160.10e6).
> > > >
> > >
> > >
> > > Any regionserver crashes during your upload?  I'd think this more the reason
> > > for dataloss; i.e. edits that were in memcache didn't make it out to the
> > > filesystem because there is still no working flush in hdfs -- hopefully 0.21
> > > hadoop... see HADOOP-4379.... (though your scenario 2 above looks like we
> > > could have handed hdfs the data but it dropped it anyways....)
> > >
> > >
> > >
> > > >
> > > > As you can see in my conf files, I upped dfs.datanode.max.xcievers to 8192
> > > > as suggested in several mails. And my ulimit -n is at 32768.
> > >
> > >
> > > Make sure you can see that above is for sure in place by looking at the
> > > head of your regionserver log on startup.
> > >
> > >
> > >
> > > > Do these problems come from my configuration, or my hardware?
> > > >
> > >
> > >
> > > Lets do some more back and forth and make sure we have done all we can as
> > > regards the software configuration.  Its probably not hardware, going by the
> > > above.
> > >
> > > Tell us more about your uploading process and your schema.  Did it all load?
> > > If so, on your 6 servers, how many regions?  How did you verify how much was
> > > loaded?
> > >
> > > St.Ack
> > >
>
>
>
>
