reef-dev mailing list archives

From "Dhruv Mahajan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1224) IMRU Fault Tolerance - Separate Data downloading from Task injection
Date Mon, 28 Mar 2016 18:49:25 GMT

    [ https://issues.apache.org/jira/browse/REEF-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214667#comment-15214667 ]

Dhruv Mahajan commented on REEF-1224:
-------------------------------------

The main issue I am seeing when the Network service is added along with the Task is the
following. Up to 500 mappers it runs fine. However, the moment I specify 1000 mappers, one
of the evaluators fails with the log message below; the relevant part is the SocketException.
The target machine in this case is the driver, and I think the NameClient is trying to contact
the NameServer there. According to
[http://stackoverflow.com/questions/2972600/no-connection-could-be-made-because-the-target-machine-actively-refused-it],
this can happen when there is a large backlog of connections at the server. Interestingly, the
error never happens when the Network service is injected as a separate service or with the data;
it happens only when it is injected with the Task, and even then only at 1000 nodes or so. I
repeated the runs multiple times to confirm this.

INFO: ContextRuntime::StartTask(TaskConfiguration) task is present: False
Org.Apache.REEF.Tang.Implementations.InjectionPlan.InjectorImpl Error: 0 : 2016-03-28T11:19:34.1105970-07:00
0006

ERROR: ExceptionCaught TargetInvocationException encountered error [System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 10.200.145.212:9009
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at System.Net.Sockets.TcpClient.Connect(IPEndPoint remoteEP)
   at Org.Apache.REEF.Wake.Remote.Impl.Link`1..ctor(IPEndPoint remoteEndpoint, ICodec`1 codec)
   at Org.Apache.REEF.Wake.Remote.Impl.TransportClient`1..ctor(IPEndPoint remoteEndpoint, ICodec`1 codec)
   at Org.Apache.REEF.Wake.Remote.Impl.TransportClient`1..ctor(IPEndPoint remoteEndpoint, ICodec`1 codec, IObserver`1 observer)
   at Org.Apache.REEF.Network.Naming.NameClient.Initialize(IPEndPoint serverEndpoint)
   at Org.Apache.REEF.Network.Naming.NameClient..ctor(String remoteAddress, Int32 remotePort, NameCache cache)
   --- End of inner exception stack trace ---
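
If the refusal really is caused by many clients connecting to the NameServer at the same moment, one common way to tolerate it is to retry the connect with exponential backoff and jitter. The sketch below only illustrates that general technique; it is written in Python rather than .NET, it is not REEF code, and it is not a fix proposed in this thread. The helper name connect_with_retry and its parameters are hypothetical.

```python
import random
import time


def connect_with_retry(connect, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is exhausted.

    connect is any zero-argument callable that raises ConnectionRefusedError
    when the server refuses the connection (hypothetical stand-in for the
    NameClient connecting to the NameServer).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionRefusedError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the original error.
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            delay = base_delay * (2 ** (attempt - 1))
            # Random jitter so that many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay))
```

With 1000 evaluators starting at once, the jitter matters as much as the backoff: without it, all clients that were refused would retry at the same instant and hit the same backlog again.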


> IMRU Fault Tolerance - Separate Data downloading from Task injection
> --------------------------------------------------------------------
>
>                 Key: REEF-1224
>                 URL: https://issues.apache.org/jira/browse/REEF-1224
>             Project: REEF
>          Issue Type: Improvement
>          Components: IMRU, REEF.NET
>            Reporter: Julia
>            Assignee: Dhruv Mahajan
>
> Currently in IMRU, data downloading happens during Task injection, which couples the
> data and the Task object. In the fault-tolerant case, we would like to resubmit only the
> task and reuse the data that has already been downloaded. That requires us to decouple
> those two portions. For example, the data downloading portion can be attached to the
> Context, and we can then resubmit a task on the same context.
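
The decoupling described in the issue can be pictured with a small language-neutral sketch. It is written in Python for brevity and does not use the REEF API; the names Context, run_task and run_with_resubmit are hypothetical and only model the idea that the context owns the downloaded data while tasks can be resubmitted on top of it.

```python
class Context:
    """Hypothetical stand-in for a context that owns the downloaded data."""

    def __init__(self, download):
        self._download = download  # zero-argument callable that fetches the data
        self._loaded = False
        self._data = None
        self.download_count = 0

    def data(self):
        # Download once and cache; later tasks on this context reuse the result.
        if not self._loaded:
            self._data = self._download()
            self._loaded = True
            self.download_count += 1
        return self._data


def run_task(context, task_fn):
    """Run one task against the data held by the context."""
    return task_fn(context.data())


def run_with_resubmit(context, task_fn, max_attempts=3):
    """Resubmit a failed task on the same context without re-downloading."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_task(context, task_fn)
        except RuntimeError:
            # Task failure is modeled as RuntimeError in this sketch.
            if attempt == max_attempts:
                raise
```

Because the data lives on the context rather than inside the task, a failed task costs only the task itself on resubmission, not a second download.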



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
