incubator-s4-user mailing list archives

From Matthieu Morel
Subject Re: Fault tolerance and communication
Date Thu, 21 Mar 2013 09:25:43 GMT
Please ask user questions on the s4-user list (cc'ed), thanks!

On Mar 21, 2013, at 04:06 , Dingyu Yang wrote:

> Hi all,
> I am testing the fault tolerance feature, but I cannot recover the state of
> a failed node.
> I have one adapter, one app node, and one standby node. Checkpointing
> uses the base configuration, with a 20-second interval.
> When the app node is stopped, the standby node acquires the task, but the
> state is not recovered.
> Could you check this, or do I have to set some other configuration?

This looks like a configuration/environment issue. Which version are you using? (S4 0.6 RC3
is recommended.)

If you use the file system checkpointing backend, make sure the files are accessible from
failover nodes.
You can also specify where the containing directory is, e.g. -p=s4.checkpointing.filesystem.storageRootPath=/path/to/shared-dir
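
As a quick sanity check of the "accessible from failover nodes" condition, here is a minimal sketch (not part of S4) that can be run on both the app node and the standby node against the same directory. The class name CheckpointDirCheck and the method isUsable are hypothetical; the default path is just the placeholder from the option above. It only verifies that the current process can create, read back and delete a file in the directory; it does not prove that both nodes see the same underlying storage.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not an S4 class: probes a candidate checkpoint directory.
class CheckpointDirCheck {

    // Returns true if 'dir' is an existing directory and this process can
    // create a file in it, read the content back, and delete it again.
    static boolean isUsable(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try {
            Path probe = Files.createTempFile(dir, "s4-probe-", ".tmp");
            Files.write(probe, new byte[] {42});
            byte[] back = Files.readAllBytes(probe);
            Files.delete(probe);
            return back.length == 1 && back[0] == 42;
        } catch (IOException e) {
            // Any I/O failure (permissions, stale mount, ...) counts as unusable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the directory you gave to storageRootPath as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/path/to/shared-dir");
        System.out.println(dir + " usable: " + isUsable(dir));
    }
}
```

If this reports the directory as unusable on the standby node only, the checkpoint files written by the failed node are probably not reachable from the node that takes over.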

> Another problem concerns the communication between the adapter and the app.
> I ran a word count experiment on a 500 MB file containing 80775764 words,
> with one adapter node and multiple nodes for the app partitions.
> With one adapter node and one app node, the adapter finishes sending all
> the words in 35 seconds.
> With one adapter node and two app nodes, it finishes in 61 seconds.
> With one adapter node and three app nodes, it finishes in 95 seconds.
> The adapter runs on the same node with the same program in every case.
> The adapter time should stay the same or decrease as app nodes are added,
> since the processing capacity has increased.
> I don't know what the problem is.

There were some extra copies in S4 0.5, so if you are using that version, that could be an explanation.

The pattern is quite clear though (a linear increase with the number of nodes), so the issue
should be easy to spot. It looks like some operation is repeated for each target node. Are
you broadcasting to all nodes? Are the events from the adapter keyed? Is there anything specifically
related to this in your adapter app code or adapter app graph?
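
To illustrate why keyed versus broadcast matters here, the following is a rough sketch that counts how many sends an adapter would perform under each scheme. It does not use the S4 API: the class DispatchCost and its methods are made up for illustration, and hashing the key modulo the partition count is only an assumed stand-in for however S4 actually routes keyed events.

```java
// Illustrative model only, not S4 code: compares the number of sends for
// keyed dispatch and for broadcast as the number of app partitions grows.
class DispatchCost {

    // Keyed dispatch: the key alone decides the single target partition.
    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    // Each word is sent once, to the partition chosen from its key.
    static long keyedSends(String[] words, int partitions) {
        long[] perPartition = new long[partitions];
        for (String w : words) {
            perPartition[partitionFor(w, partitions)]++;
        }
        long total = 0;
        for (long c : perPartition) {
            total += c;
        }
        return total;
    }

    // Broadcast: each word is sent once to every partition.
    static long broadcastSends(String[] words, int partitions) {
        return (long) words.length * partitions;
    }

    public static void main(String[] args) {
        String[] words = {"the", "quick", "brown", "fox", "the"};
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + " partition(s): keyed=" + keyedSends(words, n)
                    + " broadcast=" + broadcastSends(words, n));
        }
    }
}
```

In this model the keyed count stays equal to the number of words whatever the partition count, while the broadcast count grows in proportion to it, which has the same shape as the 35 / 61 / 95 second timings reported for one, two and three app nodes.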


