hadoop-common-user mailing list archives

From Amandeep Khurana <ama...@gmail.com>
Subject Re: DataXceiver error
Date Thu, 24 Sep 2009 22:48:02 GMT
On Thu, Sep 24, 2009 at 3:39 PM, Florian Leibert <flo@leibert.de> wrote:

> This happens maybe 4-5 times a day on an arbitrary node - it usually occurs
> during very intense jobs where tens of thousands of map tasks are
> scheduled...
>

Right. So the most probable reason is that the particular file being read
is kept open for the duration of the computation, and that's what causes
the timeouts. You can try altering your jobs and the number of tasks and
see if that gives you a workaround.
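For example, one way to get fewer, larger map tasks (just a sketch - the
exact property depends on your input format; mapred.min.split.size and the
256MB value here are only an illustration, not something we've tuned for
your workload):

 <property>
   <!-- assumption: a larger minimum split size means fewer map tasks,
        each reading a bigger chunk of the input -->
   <name>mapred.min.split.size</name>
   <value>268435456</value>
 </property>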


> From what I gather in the code, this results from a write attempt - the
> selector seems to wait until it can write to a channel - setting this to 0
> might impact our cluster reliability, hence I'm not
>
>
Setting the timeout to 0 doesn't impact cluster reliability. We have it
set to 0 on our clusters as well, and it's a pretty normal thing to do.
However, we do it because we are also using HBase, which is known to keep
file handles open for long periods. Setting the timeout to 0 doesn't
affect any of our non-HBase applications/jobs at all, so it's not a
problem.
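If you want to try it, a minimal entry along these lines should be enough
(assuming a 0.19-style hadoop-site.xml on the datanodes - adjust the file
name to your deployment - followed by a datanode restart):

 <property>
   <!-- 0 disables the 480000 ms (480s) DataXceiver write timeout;
        putting it in hadoop-site.xml is an assumption for a 0.19 setup -->
   <name>dfs.datanode.socket.write.timeout</name>
   <value>0</value>
 </property>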


> On Thu, Sep 24, 2009 at 3:16 PM, Amandeep Khurana <amansk@gmail.com>
> wrote:
>
> > What were you doing when you got this error? Did you monitor the resource
> > consumption during whatever you were doing?
> >
> > The reason I asked is that sometimes file handles stay open for longer than
> > the timeout (intentionally, though), and that causes trouble. So people
> > keep the timeout at 0 to solve this problem.
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Thu, Sep 24, 2009 at 3:12 PM, Florian Leibert <flo@leibert.de> wrote:
> >
> > > I don't think setting the timeout to 0 is a good idea - after all, we have
> > > a lot of writes going on, so at times a resource won't be available
> > > immediately. Am I missing something, or what's your reasoning for assuming
> > > that the timeout value is the problem?
> > >
> > > On Thu, Sep 24, 2009 at 2:19 PM, Amandeep Khurana <amansk@gmail.com>
> > > wrote:
> > >
> > > > When do you get this error?
> > > >
> > > > Try setting the timeout to 0. That'll remove the 480s timeout.
> > > > Property name: dfs.datanode.socket.write.timeout
> > > >
> > > > -ak
> > > >
> > > >
> > > >
> > > > Amandeep Khurana
> > > > Computer Science Graduate Student
> > > > University of California, Santa Cruz
> > > >
> > > >
> > > > On Thu, Sep 24, 2009 at 1:36 PM, Florian Leibert <flo@leibert.de>
> > wrote:
> > > >
> > > > > Hi,
> > > > > recently we've been seeing frequent SocketTimeoutExceptions (STEs) in
> > > > > our datanodes. We had previously fixed this issue by upping the handler
> > > > > counts and max.xciever (note this is misspelled in the code as well - so
> > > > > we're just being consistent).
> > > > > We're using 0.19 with a couple of patches - none of which should affect
> > > > > any of the areas in the stacktrace.
> > > > >
> > > > > We've seen this before upping the limits on the xcievers - but these
> > > > > settings seem very high already. We're running 102 nodes.
> > > > >
> > > > > Any hints would be appreciated.
> > > > >
> > > > > <property>
> > > > >   <name>dfs.datanode.handler.count</name>
> > > > >   <value>300</value>
> > > > > </property>
> > > > > <property>
> > > > >   <name>dfs.namenode.handler.count</name>
> > > > >   <value>300</value>
> > > > > </property>
> > > > > <property>
> > > > >   <name>dfs.datanode.max.xcievers</name>
> > > > >   <value>2000</value>
> > > > > </property>
> > > > >
> > > > >
> > > > > 2009-09-24 17:48:13,648 ERROR
> > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > > DatanodeRegistration(10.16.160.79:50010,
> > > > > storageID=DS-1662533511-10.16.160.79-50010-1219665628349,
> > > > > infoPort=50075, ipcPort=50020):DataXceiver
> > > > > java.net.SocketTimeoutException: 480000 millis timeout while waiting
> > > > > for channel to be ready for write. ch :
> > > > > java.nio.channels.SocketChannel[connected local=/10.16.160.79:50010
> > > > > remote=/10.16.134.78:34280]
> > > > >        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> > > > >        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > > > >        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > > > >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> > > > >        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> > > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> > > > >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> > > > >        at java.lang.Thread.run(Thread.java:619)
> > > > >
> > > >
> > >
> >
>
