flink-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Re: Flink job on secure Yarn fails after many hours
Date Fri, 14 Apr 2017 13:14:34 GMT
Hi,

No, this issue is now gone for us.
The fix in 1.2.0 ensures that we are now able to run jobs on our cluster
beyond the 7-day limit.

Niels

On Wed, Apr 12, 2017 at 5:35 PM, Robert Metzger <rmetzger@apache.org> wrote:

> Niels, are you still facing this issue?
>
> As far as I understood it, the security changes in Flink 1.2.0 use a new
> Kerberos mechanism that allows infinite token renewal.
>
> On Thu, Mar 17, 2016 at 7:30 AM, Maximilian Michels <mxm@apache.org>
> wrote:
>
>> Hi Niels,
>>
>> Thanks for the feedback. As far as I know, Hadoop deliberately
>> defaults to the one week maximum life time of delegation tokens. Have
>> you tried increasing the maximum token life time or was that not an
>> option?
>>
>> I wonder why you use a while loop. Would it be possible to use the
>> Yarn failover mechanism, which starts a new ApplicationMaster and
>> resubmits the job?
>>
>> Thanks,
>> Max
>>
>>
>> On Thu, Mar 17, 2016 at 12:43 PM, Niels Basjes <Niels@basjes.nl> wrote:
>> > Hi,
>> >
>> > In my environment doing the "proxy" thing didn't work.
>> > With a token expiry of 168 hours (1 week) the job consistently terminates
>> > at 173.5 hours (within a margin of 10 seconds).
>> > So far we have not been able to solve this problem.
>> >
>> > Our teams now simply assume the thing fails once in a while and have an
>> > automatic restart feature (i.e. shell script with a while true loop).
>> > The best guess at a root cause is this
>> > https://issues.apache.org/jira/browse/HDFS-9276
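The "while true" restart loop mentioned above could look roughly like the following Python sketch. It is only an illustration of the workaround, not the script that was actually used: the function name run_forever, its parameters, and the submit command in the usage note are all assumptions.

```python
import subprocess
import sys
import time


def run_forever(command, pause_seconds=30.0, max_runs=None):
    """Run `command` again every time it exits.

    `max_runs` is a hypothetical safety valve for testing; leave it as None
    to restart indefinitely, like a shell `while true` loop.
    Returns the list of exit codes once `max_runs` runs have completed.
    """
    exit_codes = []
    while max_runs is None or len(exit_codes) < max_runs:
        started = time.time()
        code = subprocess.call(command)  # blocks until the job process exits
        exit_codes.append(code)
        hours = (time.time() - started) / 3600.0
        print(f"job exited with code {code} after {hours:.1f} h, restarting",
              file=sys.stderr)
        time.sleep(pause_seconds)  # short pause so a broken job does not spin
    return exit_codes
```

It would be called with whatever command submits the job, for example run_forever(["./submit-job.sh"]), where submit-job.sh is a placeholder name. Note that this only hides the token expiry; it does not address the root cause.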
>> >
>> > If you have a real solution or a reference to a related bug report for this
>> > problem then please share!
>> >
>> > Niels Basjes
>> >
>> >
>> >
>> > On Thu, Mar 17, 2016 at 10:20 AM, Thomas Lamirault
>> > <thomas.lamirault@ericsson.com> wrote:
>> >>
>> >> Hi Max,
>> >>
>> >> I will try these workarounds.
>> >> Thanks
>> >>
>> >> Thomas
>> >>
>> >> ________________________________________
>> >> From: Maximilian Michels [mxm@apache.org]
>> >> Sent: Tuesday, 15 March 2016 16:51
>> >> To: user@flink.apache.org
>> >> Cc: Niels Basjes
>> >> Subject: Re: Flink job on secure Yarn fails after many hours
>> >>
>> >> Hi Thomas,
>> >>
>> >> Niels (CC) and I found out that you need at least Hadoop version 2.6.1
>> >> to properly run Kerberos applications on Hadoop clusters. Versions
>> >> before that have critical bugs related to the internal security token
>> >> handling that may expire the token although it is still valid.
>> >>
>> >> That said, there is another limitation of Hadoop that the maximum
>> >> internal token life time is one week. To work around this limit, you
>> >> have two options:
>> >>
>> >> a) increasing the maximum token life time
>> >>
>> >> In yarn-site.xml:
>> >>
>> >> <property>
>> >>   <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
>> >>   <value>9223372036854775807</value>
>> >> </property>
>> >>
>> >> In hdfs-site.xml:
>> >>
>> >> <property>
>> >>   <name>dfs.namenode.delegation.token.max-lifetime</name>
>> >>   <value>9223372036854775807</value>
>> >> </property>
>> >>
>> >>
>> >> b) setting up the Yarn ResourceManager as a proxy for the HDFS Namenode:
>> >>
>> >> From
>> >> http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html
>> >>
>> >> "You can work around this by configuring the ResourceManager as a
>> >> proxy user for the corresponding HDFS NameNode so that the
>> >> ResourceManager can request new tokens when the existing ones are past
>> >> their maximum lifetime."
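To check which lifetime a cluster is actually configured with under option a), a small script can read the relevant properties from a Hadoop configuration file. The sketch below assumes the max-lifetime values are given in milliseconds (Hadoop's default of 604800000 corresponds to the one week mentioned above); the function name token_lifetimes and the SAMPLE text are illustrative only. The value 9223372036854775807 used above is 2^63 - 1, the largest signed 64-bit integer.

```python
import xml.etree.ElementTree as ET

MS_PER_DAY = 24 * 60 * 60 * 1000


def token_lifetimes(xml_text):
    """Return {property name: lifetime in days} for every
    *.delegation.token.max-lifetime property in a Hadoop config file.

    Assumes the values are expressed in milliseconds.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name.endswith("delegation.token.max-lifetime"):
            result[name] = int(value) / MS_PER_DAY
    return result


# Illustrative input using the one-week default, not a real cluster config.
SAMPLE = """<configuration>
  <property>
    <name>dfs.namenode.delegation.token.max-lifetime</name>
    <value>604800000</value>
  </property>
</configuration>"""

print(token_lifetimes(SAMPLE))
# prints {'dfs.namenode.delegation.token.max-lifetime': 7.0}
```

The same function works on the contents of yarn-site.xml, since the ResourceManager property name ends with the same suffix.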
>> >>
>> >> @Niels: Could you comment on what worked best for you?
>> >>
>> >> Best,
>> >> Max
>> >>
>> >>
>> >> On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
>> >> <thomas.lamirault@ericsson.com> wrote:
>> >> >
>> >> > Hello everyone,
>> >> >
>> >> >
>> >> >
>> >> > We are now facing the same problem in our Flink applications, launched
>> >> > using YARN.
>> >> >
>> >> > Is there any update on this exception?
>> >> >
>> >> >
>> >> >
>> >> > Thanks
>> >> >
>> >> >
>> >> >
>> >> > Thomas
>> >> >
>> >> >
>> >> >
>> >> > ________________________________
>> >> >
>> >> > From: niels@basj.es [niels@basj.es] on behalf of Niels Basjes
>> >> > [Niels@basjes.nl]
>> >> > Sent: Friday, 4 December 2015 10:40
>> >> > To: user@flink.apache.org
>> >> > Subject: Re: Flink job on secure Yarn fails after many hours
>> >> >
>> >> > Hi Maximilian,
>> >> >
>> >> > I just downloaded the version from your google drive and used that to
>> >> > run my test topology that accesses HBase.
>> >> > I deliberately started it twice to double the chance to run into this
>> >> > situation.
>> >> >
>> >> > I'll keep you posted.
>> >> >
>> >> > Niels
>> >> >
>> >> >
>> >> > On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <mxm@apache.org>
>> >> > wrote:
>> >> >>
>> >> >> Hi Niels,
>> >> >>
>> >> >> Just got back from our CI. The build above would fail with a
>> >> >> Checkstyle error. I corrected that. Also I have built the binaries for
>> >> >> your Hadoop version 2.6.0.
>> >> >>
>> >> >> Binaries:
>> >> >>
>> >> >>
>> >> >> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
>> >> >>
>> >> >> Thanks,
>> >> >> Max
>> >> >>
>> >> >> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <0.0.0.0:41281
>> >> >> >>>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
>> >> >> >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated, stopping process...
>> >> >> >>>> >> >> > 21:30:28,286 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> >> >> >>>> >> >> > - Removing web root dir /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>> >> >> >>>> >> >> >
>> >> >> >>>> >> >> >
>> >> >> >>>> >> >> > --
>> >> >> >>>> >> >> > Best regards / Met vriendelijke groeten,
>> >> >> >>>> >> >> >
>> >> >> >>>> >> >> > Niels Basjes


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
