reef-dev mailing list archives

From Stephen Weller <swel...@microsoft.com.INVALID>
Subject Help on reef error messages from log output
Date Tue, 01 Aug 2017 21:42:27 GMT
While attempting to run our REEF application in 'yarn' mode on our HDI cluster, we are getting
some exceptions that seem strange. Can anyone help debug these or suggest what we should
check on our end?


1). At the start of the output from the worker node, we are seeing several
TangApplicationException errors like these:

Container: container_1501218565459_0005_01_000004 on workernode0.reefhdijulia1.g10.internal.cloudapp.net_45454
================================================================================================================
LogType:evaluator.stderr
Log Upload Time:Mon Jul 31 22:08:42 +0000 2017
LogLength:0
Log Contents:
End of LogType:evaluator.stderr

LogType:evaluator.stdout
Log Upload Time:Mon Jul 31 22:08:42 +0000 2017
LogLength:37013
Log Contents:
Org.Apache.REEF.Tang.Util.AssemblyLoader Error: 0 : 2017-07-31T22:08:26.7020343+00:00 0001
ERROR: ExceptionThrowing TangApplicationException
Encountered error [Org.Apache.REEF.Tang.Exceptions.TangApplicationException: Not able to get Type from the name provided: org.apache.reef.runtime.common.evaluator.parameters.ApplicationIdentifier]
Org.Apache.REEF.Tang.Util.AssemblyLoader Error: 0 : 2017-07-31T22:08:26.7176597+00:00 0001
ERROR: ExceptionThrowing TangApplicationException
Encountered error [Org.Apache.REEF.Tang.Exceptions.TangApplicationException: Not able to get Type from the name provided: org.apache.reef.runtime.common.evaluator.parameters.DriverRemoteIdentifier]
Org.Apache.REEF.Tang.Util.AssemblyLoader Error: 0 : 2017-07-31T22:08:26.7176597+00:00 0001
ERROR: ExceptionThrowing TangApplicationException
Encountered error [Org.Apache.REEF.Tang.Exceptions.TangApplicationException: Not able to get Type from the name provided: org.apache.reef.runtime.common.evaluator.parameters.EvaluatorConfiguration]
Org.Apache.REEF.Tang.Util.AssemblyLoader Error: 0 : 2017-07-31T22:08:26.7176597+00:00 0001
ERROR: ExceptionThrowing TangApplicationException
Encountered error [Org.Apache.REEF.Tang.Exceptions.TangApplicationException: Not able to get Type from the name provided: org.apache.reef.runtime.common.evaluator.parameters.EvaluatorIdentifier]


Further down is this exception:
WARNING: ExceptionCaught UnauthorizedAccessException Cannot obtain machine status due to error
Encountered error [System.UnauthorizedAccessException: Access to the registry key 'Global' is denied.
   at Microsoft.Win32.RegistryKey.Win32Error(Int32 errorCode, String str)
   at Microsoft.Win32.RegistryKey.InternalGetValue(String name, Object defaultValue, Boolean doNotExpand, Boolean checkSecurity)
   at Microsoft.Win32.RegistryKey.GetValue(String name)
   at System.Diagnostics.PerformanceMonitor.GetData(String item)
   at System.Diagnostics.PerformanceCounterLib.GetPerformanceData(String item)
   at System.Diagnostics.PerformanceCounterLib.get_CategoryTable()
   at System.Diagnostics.PerformanceCounterLib.CounterExists(String category, String counter, Boolean& categoryExists)
   at System.Diagnostics.PerformanceCounterLib.CounterExists(String machine, String category, String counter)
   at System.Diagnostics.PerformanceCounter.InitializeImpl()
   at System.Diagnostics.PerformanceCounter.NextSample()
   at System.Diagnostics.PerformanceCounter.NextValue()
   at Org.Apache.REEF.Common.Runtime.MachineStatus.get_CurrentNodeCpuUsage()
   at Org.Apache.REEF.Common.Runtime.MachineStatus.ToString()]
Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Stop: 0 : 2017-07-31T22:08:27.1551796+00:00 0001
EXIT: 7/31/2017 10:08:27 PM HeartBeatManager::HeartBeatManager. Duration: [00:00:00.0192656].


We are running the REEF application as a superuser on the cluster with full admin privileges.

Any thoughts on why we are seeing these errors?



2). We are also getting a severe exception that the Bridge receives from the CLR:

Jul 31, 2017 10:08:34 PM org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener exceptionCaught
WARNING: ExceptionEvent: local: /10.2.0.8:9769 remote: /10.2.0.8:53595 :: java.io.IOException: An existing connection was forcibly closed by the remote host
Jul 31, 2017 10:08:36 PM org.apache.reef.javabridge.generic.JobDriver$CompletedTaskHandler onNext
INFO: Completed task: SpinTask
Jul 31, 2017 10:08:36 PM org.apache.reef.javabridge.generic.JobDriver$CompletedTaskHandler onNext
INFO: Return results to the client:
ReturnValue
Jul 31, 2017 10:08:36 PM org.apache.reef.runtime.common.driver.client.LoggingJobStatusHandler onNext
INFO: In-process JobStatus:
identifier: "Fluid"
state: RUNNING
message: "\254\355\000\005t\000\vReturnValue"

Jul 31, 2017 10:08:36 PM org.apache.reef.javabridge.generic.JobDriver$CompletedTaskHandler onNext
INFO: CLR CompletedTaskHandler handler set, handling things with CLR handler.
Jul 31, 2017 10:08:36 PM org.apache.reef.javabridge.NativeBridge onError
SEVERE: Bridge received error from CLR: Exception in Call_ClrSystemCompletedTask_OnNext
Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   at Org.Apache.REEF.Wake.Remote.Impl.Channel.Write(Byte[] message)
   at Org.Apache.REEF.Wake.Remote.Impl.Link`1.Write(T value)
   at Org.Apache.REEF.Wake.Remote.Impl.TransportClient`1.Send(T message)
   at Org.Apache.REEF.Wake.Remote.Impl.DefaultRemoteManager`1.ProxyObserver.OnNext(T message)
   at Org.Apache.REEF.Network.NetworkService.NsConnection`1.Write(T message)
   at Org.Apache.REEF.Fluid.Network.MessageService.Send(Object message)
   at Org.Apache.REEF.Fluid.DriverHandler.OnNext(ICompletedTask value)
   at Org.Apache.REEF.Driver.Bridge.ClrSystemHandler`1.OnNext(T value)
   at Org.Apache.REEF.Driver.Bridge.ClrSystemHandlerWrapper.Call_ClrSystemCompletedTask_OnNext(UInt64 handle, ICompletedTaskClr2Java clr2Java)
   at Java_org_apache_reef_javabridge_NativeInterop_clrSystemCompletedTaskHandlerOnNext(JNIEnv_* env, _jclass* cls, Int64 handler, _jobject* jcompletedTask, _jobject* jlogger)
Inner Exception:
An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
Inner Exception: null
Jul 31, 2017 10:08:36 PM org.apache.reef.runtime.common.driver.DriverStatusManager onError
WARNING: Shutting down the Driver with an exception:
java.lang.RuntimeException: Bridge received error from CLR: Exception in Call_ClrSystemCompletedTask_OnNext
Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   at Org.Apache.REEF.Wake.Remote.Impl.Channel.Write(Byte[] message)
   at Org.Apache.REEF.Wake.Remote.Impl.Link`1.Write(T value)
   at Org.Apache.REEF.Wake.Remote.Impl.TransportClient`1.Send(T message)
   at Org.Apache.REEF.Wake.Remote.Impl.DefaultRemoteManager`1.ProxyObserver.OnNext(T message)
   at Org.Apache.REEF.Network.NetworkService.NsConnection`1.Write(T message)
   at Org.Apache.REEF.Fluid.Network.MessageService.Send(Object message)
   at Org.Apache.REEF.Fluid.DriverHandler.OnNext(ICompletedTask value)
   at Org.Apache.REEF.Driver.Bridge.ClrSystemHandler`1.OnNext(T value)
   at Org.Apache.REEF.Driver.Bridge.ClrSystemHandlerWrapper.Call_ClrSystemCompletedTask_OnNext(UInt64 handle, ICompletedTaskClr2Java clr2Java)
   at Java_org_apache_reef_javabridge_NativeInterop_clrSystemCompletedTaskHandlerOnNext(JNIEnv_* env, _jclass* cls, Int64 handler, _jobject* jcompletedTask, _jobject* jlogger)
Inner Exception:
An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
Inner Exception: null
                at org.apache.reef.javabridge.NativeBridge.onError(NativeBridge.java:36)
                at org.apache.reef.javabridge.NativeInterop.clrSystemCompletedTaskHandlerOnNext(Native Method)
                at org.apache.reef.javabridge.generic.JobDriver$CompletedTaskHandler.onNext(JobDriver.java:397)
                at org.apache.reef.javabridge.generic.JobDriver$CompletedTaskHandler.onNext(JobDriver.java:378)
                at org.apache.reef.runtime.common.utils.BroadCastEventHandler.onNext(BroadCastEventHandler.java:40)
                at org.apache.reef.util.ExceptionHandlingEventHandler.onNext(ExceptionHandlingEventHandler.java:46)
                at org.apache.reef.runtime.common.utils.DispatchingEStage$1.onNext(DispatchingEStage.java:72)
                at org.apache.reef.runtime.common.utils.DispatchingEStage$1.onNext(DispatchingEStage.java:69)
                at org.apache.reef.wake.impl.ThreadPoolStage$1.run(ThreadPoolStage.java:182)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
                at java.util.concurrent.FutureTask.run(FutureTask.java:262)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                at java.lang.Thread.run(Thread.java:745)



Any pointers you can provide are appreciated as always...


Thanks!


Stephen Weller





