flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Niels Basjes <Ni...@basjes.nl>
Subject Re: Flink job on secure Yarn fails after many hours
Date Wed, 02 Dec 2015 15:03:56 GMT
No, I was just asking.
No upgrade is possible for the next month or two.

This week is our busiest day of the year ...
Our shop is doing about 10 orders per second these days ...

So they won't upgrade until next January/February

Niels

On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels <mxm@apache.org> wrote:

> Hi Niels,
>
> You mentioned you have the option to update Hadoop and redeploy the
> job. Would be great if you could do that and let us know how it turns
> out.
>
> Cheers,
> Max
>
> On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <Niels@basjes.nl> wrote:
> > Hi,
> >
> > I posted the entire log from the first log line at the moment of failure
> to
> > the very end of the logfile.
> > This is all I have.
> >
> > As far as I understand the Kerberos and Keytab mechanism in Hadoop Yarn
> is
> > that it catches the "Invalid Token" and then (if keytab) gets a new
> Kerberos
> > ticket (or tgt?).
> > When the new ticket has been obtained it retries the call that previously
> > failed.
> > To me it seemed that this call can fail over the invalid Token yet it
> cannot
> > be retried.
> >
> > At this moment I'm thinking a bug in Hadoop.
> >
> > Niels
> >
> > On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <mxm@apache.org>
> wrote:
> >>
> >> Hi Niels,
> >>
> >> Sorry for hear you experienced this exception. From a first glance, it
> >> looks like a bug in Hadoop to me.
> >>
> >> > "Not retrying because the invoked method is not idempotent, and unable
> >> > to determine whether it was invoked"
> >>
> >> That is nothing to worry about. This is Hadoop's internal retry
> >> mechanism that re-attempts to do actions which previously failed if
> >> that's possible. Since the action is not idempotent (it cannot be
> >> executed again without risking to change the state of the execution)
> >> and it also doesn't track its execution states, it won't be retried
> >> again.
> >>
> >> The main issue is this exception:
> >>
> >> >org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
> >> > AMRMToken from >appattempt_1443166961758_163901_000001
> >>
> >> From the stack trace it is clear that this exception occurs upon
> >> requesting container status information from the Resource Manager:
> >>
> >> >at
> >> >
> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
> >>
> >> Are there any more exceptions in the log? Do you have the complete
> >> logs available and could you share them?
> >>
> >>
> >> Best regards,
> >> Max
> >>
> >> On Wed, Dec 2, 2015 at 11:47 AM, Niels Basjes <Niels@basjes.nl> wrote:
> >> > Hi,
> >> >
> >> >
> >> > We have a Kerberos secured Yarn cluster here and I'm experimenting
> with
> >> > Apache Flink on top of that.
> >> >
> >> > A few days ago I started a very simple Flink application (just stream
> >> > the
> >> > time as a String into HBase 10 times per second).
> >> >
> >> > I (deliberately) asked our IT-ops guys to make my account have a max
> >> > ticket
> >> > time of 5 minutes and a max renew time of 10 minutes (yes,
> ridiculously
> >> > low
> >> > timeout values because I needed to validate this
> >> > https://issues.apache.org/jira/browse/FLINK-2977).
> >> >
> >> > This job is started with a keytab file and after running for 31 hours
> it
> >> > suddenly failed with the exception you see below.
> >> >
> >> > I had the same job running for almost 400 hours until that failed too
> (I
> >> > was
> >> > too late to check the logfiles but I suspect the same problem).
> >> >
> >> >
> >> > So in that time span my tickets have expired and new tickets have been
> >> > obtained several hundred times.
> >> >
> >> >
> >> > The main error I see is that in the process of a ticket expiring and
> >> > being
> >> > renewed I see this message:
> >> >
> >> >      Not retrying because the invoked method is not idempotent, and
> >> > unable
> >> > to determine whether it was invoked
> >> >
> >> >
> >> > Yarn on the cluster is 2.6.0 ( HDP 2.6.0.2.2.4.2-2 )
> >> >
> >> > Flink is version 0.10.1
> >> >
> >> >
> >> > How do I fix this?
> >> > Is this a bug (in either Hadoop or Flink) or am I doing something
> wrong?
> >> > Would upgrading Yarn to 2.7.1  (i.e. HDP 2.3) fix this?
> >> >
> >> >
> >> > Niels Basjes
> >> >
> >> >
> >> >
> >> > 21:30:27,821 WARN  org.apache.hadoop.security.UserGroupInformation
> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE)
> >> >
> >> >
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
> >> > 21:30:27,861 WARN  org.apache.hadoop.ipc.Client
> >> > - Exception encountered while connecting to the server :
> >> >
> >> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
> >> > 21:30:27,861 WARN  org.apache.hadoop.security.UserGroupInformation
> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE)
> >> >
> >> >
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
> >> > 21:30:27,891 WARN  org.apache.hadoop.io.retry.RetryInvocationHandler
> >> > - Exception while invoking class
> >> >
> >> >
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate.
> >> > Not retrying because the invoked method is not idempotent, and unable
> to
> >> > determine whether it was invoked
> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
> >> > AMRMToken from appattempt_1443166961758_163901_000001
> >> >       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >> > Method)
> >> >       at
> >> >
> >> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> >> >       at
> >> >
> >> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> >> >       at
> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> >> >       at
> >> >
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
> >> >       at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >> >       at
> >> >
> >> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >       at java.lang.reflect.Method.invoke(Method.java:606)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> >> >       at com.sun.proxy.$Proxy14.allocate(Unknown Source)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
> >> >       at
> >> >
> >> >
> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
> >> >       at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> >> >       at
> >> >
> >> >
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
> >> >       at
> >> >
> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> >> >       at
> >> >
> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> >> >       at
> >> >
> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> >> >       at
> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
> >> >       at
> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
> >> >       at
> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> >> >       at
> >> >
> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
> >> >       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> >> >       at
> >> >
> >> >
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
> >> >       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> >> >       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> >> >       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> >> >       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> >> >       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> >> >       at
> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >> >       at
> >> >
> >> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> >> >       at
> >> >
> >> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> >> >       at
> >> >
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> >> >       at
> >> >
> >> >
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> >> > Caused by:
> >> >
> >> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1406)
> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1359)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> >> >       at com.sun.proxy.$Proxy13.allocate(Unknown Source)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> >> >       ... 29 more
> >> > 21:30:27,943 ERROR akka.actor.OneForOneStrategy
> >> > - Invalid AMRMToken from appattempt_1443166961758_163901_000001
> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
> >> > AMRMToken from appattempt_1443166961758_163901_000001
> >> >       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >> > Method)
> >> >       at
> >> >
> >> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> >> >       at
> >> >
> >> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> >> >       at
> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> >> >       at
> >> >
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
> >> >       at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >> >       at
> >> >
> >> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >       at java.lang.reflect.Method.invoke(Method.java:606)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> >> >       at com.sun.proxy.$Proxy14.allocate(Unknown Source)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
> >> >       at
> >> >
> >> >
> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
> >> >       at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> >> >       at
> >> >
> >> >
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
> >> >       at
> >> >
> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> >> >       at
> >> >
> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> >> >       at
> >> >
> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> >> >       at
> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
> >> >       at
> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
> >> >       at
> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> >> >       at
> >> >
> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
> >> >       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> >> >       at
> >> >
> >> >
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
> >> >       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> >> >       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> >> >       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> >> >       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> >> >       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> >> >       at
> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >> >       at
> >> >
> >> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> >> >       at
> >> >
> >> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> >> >       at
> >> >
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> >> >       at
> >> >
> >> >
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> >> > Caused by:
> >> >
> >> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1406)
> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1359)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> >> >       at com.sun.proxy.$Proxy13.allocate(Unknown Source)
> >> >       at
> >> >
> >> >
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> >> >       ... 29 more
> >> > 21:30:28,075 INFO  org.apache.flink.yarn.YarnJobManager
> >> > - Stopping JobManager
> >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager.
> >> > 21:30:28,088 INFO
> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph
> >> > - Source: Custom Source -> Sink: Unnamed (1/1)
> >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to CANCELING
> >> > 21:30:28,113 INFO
> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph
> >> > - Source: Custom Source -> Sink: Unnamed (1/1)
> >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to FAILED
> >> > 21:30:28,184 INFO  org.apache.flink.runtime.blob.BlobServer
> >> > - Stopped BLOB server at 0.0.0.0:41281
> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
> >> > - Actor akka://flink/user/jobmanager#403236912 terminated, stopping
> >> > process...
> >> > 21:30:28,286 INFO
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
> >> > - Removing web root dir
> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
> >> >
> >> >
> >> > --
> >> > Best regards / Met vriendelijke groeten,
> >> >
> >> > Niels Basjes
> >
> >
> >
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Mime
View raw message