hadoop-common-user mailing list archives

From Florian Leibert <...@leibert.de>
Subject Re: DataXceiver error
Date Thu, 24 Sep 2009 22:39:47 GMT
This happens maybe 4-5 times a day on an arbitrary node - it usually occurs
during very intense jobs where tens of thousands of map tasks are
scheduled...
From what I gather in the code, this results from a write attempt - the
selector seems to wait until it can write to a channel. Setting the timeout to
0 might impact our cluster reliability, hence I'm not keen on doing that.
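
In essence it looks like a timed selector poll - roughly the following
simplified sketch (class and method names are mine, not the actual
SocketIOWithTimeout code, which also pools selectors):

import java.net.SocketTimeoutException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class TimedWriteWait {
  // Block until the channel is writable, or fail after timeoutMs.
  static void waitForWritable(SocketChannel ch, long timeoutMs)
      throws Exception {
    Selector selector = Selector.open();
    try {
      ch.configureBlocking(false);
      ch.register(selector, SelectionKey.OP_WRITE);
      // select(t) returns 0 if the channel never became writable within
      // t ms; select(0) blocks forever, which is what a timeout of 0 buys.
      if (selector.select(timeoutMs) == 0) {
        throw new SocketTimeoutException(timeoutMs + " millis timeout while"
            + " waiting for channel to be ready for write");
      }
    } finally {
      selector.close();
    }
  }
}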

On Thu, Sep 24, 2009 at 3:16 PM, Amandeep Khurana <amansk@gmail.com> wrote:

> What were you doing when you got this error? Did you monitor resource
> consumption while it was happening?
>
> The reason I suggested it is that sometimes file handles stay open for
> longer than the timeout (intentionally, though), and that causes trouble.
> So people set the timeout to 0 to avoid this problem.
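>
> That change would look like this in hdfs-site.xml (using the property name
> given further down the thread):
>
>  <property>
>    <name>dfs.datanode.socket.write.timeout</name>
>    <value>0</value>
>  </property>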
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Thu, Sep 24, 2009 at 3:12 PM, Florian Leibert <flo@leibert.de> wrote:
>
> > I don't think setting the timeout to 0 is a good idea - after all, we have
> > a lot of writes going on, so at times a resource won't be available
> > immediately. Am I missing something, or what's your reasoning for assuming
> > that the timeout value is the problem?
> >
> > On Thu, Sep 24, 2009 at 2:19 PM, Amandeep Khurana <amansk@gmail.com>
> > wrote:
> >
> > > When do you get this error?
> > >
> > > Try setting the timeout to 0. That'll remove the 480s timeout.
> > > Property name: dfs.datanode.socket.write.timeout
> > >
> > > -ak
> > >
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > >
> > > On Thu, Sep 24, 2009 at 1:36 PM, Florian Leibert <flo@leibert.de> wrote:
> > >
> > > > Hi,
> > > > recently we've been seeing frequent STEs (SocketTimeoutExceptions) in
> > > > our datanodes. We had previously fixed this issue by upping the handler
> > > > counts and max.xcievers (note this is misspelled in the code as well -
> > > > so we're just being consistent).
> > > > We're using 0.19 with a couple of patches - none of which should affect
> > > > any of the areas in the stacktrace.
> > > >
> > > > We've seen this before upping the limits on the xcievers - but these
> > > > settings already seem very high. We're running 102 nodes.
> > > >
> > > > Any hints would be appreciated.
> > > >
> > > >  <property>
> > > >    <name>dfs.datanode.handler.count</name>
> > > >    <value>300</value>
> > > >  </property>
> > > >  <property>
> > > >    <name>dfs.namenode.handler.count</name>
> > > >    <value>300</value>
> > > >  </property>
> > > >  <property>
> > > >    <name>dfs.datanode.max.xcievers</name>
> > > >    <value>2000</value>
> > > >  </property>
> > > >
> > > >
> > > > 2009-09-24 17:48:13,648 ERROR
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > > > 10.16.160.79:50010,
> > > > storageID=DS-1662533511-10.16.160.79-50010-1219665628349,
> > > > infoPort=50075, ipcPort=50020):DataXceiver
> > > > java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > > > for channel to be ready for write. ch :
> > > > java.nio.channels.SocketChannel[connected local=/10.16.160.79:50010
> > > > remote=/10.16.134.78:34280]
> > > >        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> > > >        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > > >        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> > > >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> > > >        at java.lang.Thread.run(Thread.java:619)
