hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: ipc.client.timeout
Date Thu, 13 Sep 2007 20:49:58 GMT
There is a retry for the 'complete' operation - those are erroring out
as well (DFSClient.java: methodNameToPolicyMap.put("complete", methodPolicy);).
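
For reference, the 0.13 wiring in DFSClient looks roughly like the
sketch below (from memory - the method list, policy values, and the
wrapper class are illustrative, not the exact source):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.dfs.ClientProtocol;
    import org.apache.hadoop.io.retry.RetryPolicies;
    import org.apache.hadoop.io.retry.RetryPolicy;
    import org.apache.hadoop.io.retry.RetryProxy;

    // Sketch: map RPC method names to a retry policy, then wrap the raw
    // namenode proxy so only the listed (idempotent) methods get retried.
    class RetryWiringSketch {
      static ClientProtocol wrap(ClientProtocol rpcNamenode) {
        // Retry up to 5 times, sleeping 1s between attempts (values illustrative).
        RetryPolicy methodPolicy = RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            5, 1000, TimeUnit.MILLISECONDS);

        Map<String, RetryPolicy> methodNameToPolicyMap =
            new HashMap<String, RetryPolicy>();
        methodNameToPolicyMap.put("complete", methodPolicy);    // retried
        methodNameToPolicyMap.put("renewLease", methodPolicy);  // retried
        // No entries for "addBlock" or "mkdirs": those fall through to
        // the default policy (try once, then fail).

        return (ClientProtocol) RetryProxy.create(
            ClientProtocol.class, rpcNamenode, methodNameToPolicyMap);
      }
    }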

Quite likely it's because the namenode is also a data/task node. 

-----Original Message-----
From: Dhruba Borthakur [mailto:dhruba@yahoo-inc.com] 
Sent: Thursday, September 13, 2007 1:38 PM
To: hadoop-user@lucene.apache.org
Subject: RE: ipc.client.timeout

Hi Joydeep,

The idea is to retry only those operations that are idempotent. addBlock
and mkdirs are non-idempotent, and that's why there are no retries for
these calls.
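
To make that concrete, here is a hypothetical sequence (not actual
DFSClient code - 'namenode', 'src', and 'clientName' stand in for the
real proxy and arguments) showing why blindly retrying addBlock would
be unsafe:

    // Attempt 1: the namenode allocates a block, but the reply is lost
    // to a client-side timeout, so the client never sees b1.
    LocatedBlock b1 = namenode.addBlock(src, clientName);
    // Attempt 2 (a blind retry): the namenode allocates a *second*
    // block for the same file, leaving a stray block behind.
    LocatedBlock b2 = namenode.addBlock(src, clientName);
    // An idempotent call like renewLease() is safe to replay: running
    // it twice leaves the namenode in the same state as running it once.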

Can you tell me if a CPU bottleneck on your Namenode is causing you to
encounter all these timeouts?

Thanks,
dhruba


-----Original Message-----
From: Joydeep Sen Sarma [mailto:jssarma@facebook.com] 
Sent: Thursday, September 13, 2007 12:14 PM
To: hadoop-user@lucene.apache.org
Subject: RE: ipc.client.timeout

I would love to use a lower timeout. It seems that retries are either
buggy or missing in some cases - and that causes lots of failures. The
cases I can see right now (0.13.1):

- namenode.complete: looks like it retries - but may not be idempotent?

org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
complete write to file
/user/facebook/profiles/binary/users_joined/_task_0018_r_000003_0/.part-00003.crc by DFSClient_task_0018_r_000003_0
	at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:353)


- namenode.addBlock: no retry policy (looking at DFSClient.java)
- namenode.mkdirs: no retry policy (ditto)

We see plenty of all of these with a lowered timeout. With a high
timeout, we have seen very slow recovery from some failures (jobs would
hang on submission).

Don't understand the fs protocol well enough - any idea if these are
fixable?

Thx,

Joydeep

-----Original Message-----
From: Devaraj Das [mailto:ddas@yahoo-inc.com] 
Sent: Wednesday, September 05, 2007 1:00 AM
To: hadoop-user@lucene.apache.org
Subject: RE: ipc.client.timeout

This is to take care of cases where a particular server is too loaded to
respond to client RPCs quickly enough. Setting the timeout to a large
value ensures that RPCs won't time out that often, which potentially
leads to fewer failures and retries (e.g., a map/reduce task kills
itself when it fails to invoke an RPC on the tasktracker three times in
a row).
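
If you want to experiment with the value, it is an ordinary client-side
setting. A minimal sketch (the key name comes from this thread; the 10s
value and the TimeoutSketch class are just an illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TimeoutSketch {
      public static void main(String[] args) throws Exception {
        // Override ipc.client.timeout for this client only; the same
        // key can also be set in hadoop-site.xml.
        Configuration conf = new Configuration();
        conf.setInt("ipc.client.timeout", 10000);  // default: 60000 (60s)

        // RPCs made through this FileSystem now use the lower timeout.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/")));
      }
    }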

> -----Original Message-----
> From: Joydeep Sen Sarma [mailto:jssarma@facebook.com] 
> Sent: Wednesday, September 05, 2007 12:26 PM
> To: hadoop-user@lucene.apache.org
> Subject: ipc.client.timeout
> 
> The default is set to 60s. Many of my dfs -put commands would 
> seem to hang - and lowering the timeout (to 1s) seems to 
> have made things a whole lot better.
> 
>  
> 
> General curiosity - isn't 60s just huge for an RPC timeout? (a 
> web search indicates that nutch may be setting it to 10s - 
> and even that seems fairly large). Would love to get a 
> backgrounder on why the default is set to so large a value ..
> 
>  
> 
> Thanks,
> 
>  
> 
> Joydeep