hadoop-common-user mailing list archives

From lohit <lohit...@yahoo.com>
Subject Re: Socket closed Exception
Date Wed, 01 Apr 2009 23:00:08 GMT

Thanks Koji, Raghu.
This seems to have solved our problem; we haven't seen it happen in the past 2 days. What is the
typical value of ipc.client.idlethreshold on big clusters?
Does the default value of 4000 suffice?

Lohit



----- Original Message ----
From: Koji Noguchi <knoguchi@yahoo-inc.com>
To: core-user@hadoop.apache.org
Sent: Monday, March 30, 2009 9:30:04 AM
Subject: RE: Socket closed Exception

Lohit, 

You're right. We saw "java.net.SocketTimeoutException: timed out
waiting for rpc response" and not the "Socket closed" exception.

If you're getting the "Socket closed" exception, then I don't remember
seeing that problem on our clusters.

Our users often report "Socket closed exception" as a problem, but in
most cases those failures are due to jobs failing for completely
different reasons, plus a race condition between 1) the JobTracker
removing the directory/killing tasks and 2) tasks failing with the
closed exception before they get killed.

Koji



-----Original Message-----
From: lohit [mailto:lohit_bv@yahoo.com] 
Sent: Monday, March 30, 2009 8:51 AM
To: core-user@hadoop.apache.org
Subject: Re: Socket closed Exception


Thanks Koji. 
If I look at the code, the NameNode (RPC server) seems to tear down idle
connections. Did you see the 'Socket closed' exception instead of 'timed
out waiting for socket'?
We seem to hit the 'Socket closed' exception where clients do not
time out, but get back a 'Socket closed' exception when they make RPC
calls for create/open/getFileInfo.

I will give this a try. Thanks again,
Lohit



----- Original Message ----
From: Koji Noguchi <knoguchi@yahoo-inc.com>
To: core-user@hadoop.apache.org
Sent: Sunday, March 29, 2009 11:44:29 PM
Subject: RE: Socket closed Exception

Hi Lohit,

My initial guess would be
https://issues.apache.org/jira/browse/HADOOP-4040

When this happened on our 0.17 cluster, all of our (task) clients were
using the max idle time of 1 hour due to this bug instead of the
configured value of a few seconds.
Thus each client kept the connection up much longer than we expected.
(Not sure if this applies to your 0.15 cluster, but it sounds similar to
what we observed.)

This worked until the namenode started hitting the limit set by
'ipc.client.idlethreshold'.

  <property>
    <name>ipc.client.idlethreshold</name>
    <value>4000</value>
    <description>Defines the threshold number of connections after which
                 connections will be inspected for idleness.
    </description>
  </property>

When inspecting for idleness, namenode uses

  <property>
    <name>ipc.client.maxidletime</name>
    <value>120000</value>
    <description>Defines the maximum idle time for a connected client
                 after which it may be disconnected.
    </description>
  </property>

As a result, many connections got disconnected at once.
Clients only see the timeouts when they try to re-use those sockets the
next time and wait for 1 minute.  That's why the failures are not at
exactly the same time, but *almost* the same time.
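
To make the mechanism above concrete, here is a minimal toy model of the
server-side idle sweep. It is a sketch, not Hadoop's actual code: the class
IdleSweepSketch and its methods are hypothetical names, the comparison
against the threshold is assumed to be "strictly more than", and any
per-sweep limit on how many connections get closed is ignored. The two
constructor arguments stand in for ipc.client.idlethreshold and
ipc.client.maxidletime.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model of the idle sweep described above; NOT Hadoop's real implementation. */
class IdleSweepSketch {
    private final int idleThreshold;   // stands in for ipc.client.idlethreshold
    private final long maxIdleMillis;  // stands in for ipc.client.maxidletime
    // One last-contact timestamp (in milliseconds) per open connection.
    private final List<Long> lastContact = new ArrayList<>();

    IdleSweepSketch(int idleThreshold, long maxIdleMillis) {
        this.idleThreshold = idleThreshold;
        this.maxIdleMillis = maxIdleMillis;
    }

    /** Registers a new client connection whose last activity is nowMillis. */
    void accept(long nowMillis) {
        lastContact.add(nowMillis);
    }

    int openConnections() {
        return lastContact.size();
    }

    /**
     * Inspects connections for idleness only when the connection count
     * exceeds the threshold (assumed semantics), then closes every
     * connection idle for longer than maxIdleMillis.
     * Returns the number of connections closed.
     */
    int sweep(long nowMillis) {
        if (lastContact.size() <= idleThreshold) {
            return 0; // at or below the threshold: nothing is inspected
        }
        int closed = 0;
        for (Iterator<Long> it = lastContact.iterator(); it.hasNext();) {
            if (nowMillis - it.next() > maxIdleMillis) {
                it.remove();
                closed++;
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        IdleSweepSketch server = new IdleSweepSketch(4000, 120_000L);
        // 4000 clients connect at t=0 and then sit idle.
        for (int i = 0; i < 4000; i++) {
            server.accept(0L);
        }
        // Ten minutes later the count is still at the threshold: no sweep.
        System.out.println(server.sweep(600_000L));
        // One more client pushes the count over the threshold.
        server.accept(600_000L);
        System.out.println(server.sweep(600_000L));
        System.out.println(server.openConnections());
    }
}
```

In this model nothing is closed while the connection count stays at or
below the threshold, however long clients sit idle. The first connection
past the threshold triggers a sweep that drops every long-idle connection
in one pass, which matches the "many connections got disconnected at once"
behaviour described above.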


# If this solves your problem, Raghu should get the credit. 
  He spent so many hours to solve this mystery for us. :)


Koji


-----Original Message-----
From: lohit [mailto:lohit_bv@yahoo.com] 
Sent: Sunday, March 29, 2009 11:56 AM
To: core-user@hadoop.apache.org
Subject: Socket closed Exception


Recently we have been seeing a lot of 'Socket closed' exceptions in our
cluster. Many tasks' open/create/getFileInfo calls get back a
'SocketException' with the message 'Socket closed'. We seem to see many
tasks fail with the same error around the same time. There are no warning
or info messages in the NameNode/TaskTracker/Task logs. (This is on HDFS
0.15.) Are there cases where the NameNode closes sockets due to heavy load
or during contention for resources of any kind?

Thanks,
Lohit

