hadoop-common-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Long running Yarn Applications on a secured HA cluster?
Date Thu, 28 Jan 2016 15:13:48 GMT
Hi,

I'm working on a project that uses Apache Flink (stream processing) on top
of a secured HA Yarn cluster.
The test application only uses HBase (it writes the current time to a
column every minute).

The problem I have is that after 173.5 hours (exactly) my application dies.
The best assessment we have right now is that the Hadoop Delegation Tokens
are expiring.
I know for sure the Kerberos tickets are correctly renewed/recreated in the
cluster using my keytab file because I had our IT-ops guys drop the max
ticket life to 5 minutes and the max renew to 10 minutes.

Following the advice on these two web sites, we changed our settings (the
current values that seem relevant are posted below).

http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html

https://forge.puppetlabs.com/cesnet/hadoop/2.1.0#long-running-applications

Yet this has not changed the situation: the job still dies after 173.5
hours with this exception.

15:47:55,283 INFO  org.apache.flink.yarn.YarnJobManager - Status of job 2e4a3516d8e4876b705eaff4a52fc272 (Long running Flink application) changed to FAILING.
java.lang.Exception: Serialized representation of org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: FailedServerException: 1 time,
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:224)
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1700(AsyncProcess.java:204)
	at org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1597)
	at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:1069)
	at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1344)
	at org.apache.hadoop.hbase.client.HTable.put(HTable.java:1001)
	at nl.basjes.flink.experiments.SetHBaseRowSink.invoke(SetHBaseRowSink.java:58)


My colleagues and I did some searching, and these two issues seem to
describe a problem similar to ours (except that these reports are about
HDFS instead of HBase):

Failed to Update HDFS Delegation Token for long running application in HA
mode
       https://issues.apache.org/jira/browse/HDFS-9276
HDFS Delegation Token will be expired when calling
"UserGroupInformation.getCurrentUser.addCredentials" in HA mode
       https://issues.apache.org/jira/browse/SPARK-11182


My question to you guys, simply put: how do I fix this problem?
And how do I figure out what the problem really is?

Thanks for any suggestions you have for us.


<property>
<name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
<value>true</value>
<source>yarn-site.xml</source>
</property>

<property>
<name>dfs.namenode.delegation.token.renew-interval</name>
<value>86400000</value>
<source>hdfs-default.xml</source>
</property>

<property>
<name>dfs.namenode.delegation.key.update-interval</name>
<value>86400000</value>
<source>hdfs-default.xml</source>
</property>

<property>
<name>dfs.namenode.delegation.token.max-lifetime</name>
<value>604800000</value>
<source>hdfs-default.xml</source>
</property>

<property>
<name>yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled</name>
<value>true</value>
<source>yarn-default.xml</source>
</property>

<property>
<name>yarn.resourcemanager.delayed.delegation-token.removal-interval-ms</name>
<value>30000</value>
<source>yarn-default.xml</source>
</property>

<property>
<name>hadoop.proxyuser.hbase.hosts</name>
<value>*</value>
<source>core-site.xml</source>
</property>

<property>
<name>hadoop.proxyuser.hbase.groups</name>
<value>*</value>
<source>core-site.xml</source>
</property>


<property>
<name>hadoop.proxyuser.yarn.hosts</name>
<value>*</value>
<source>core-site.xml</source>
</property>

<property>
<name>hadoop.proxyuser.yarn.groups</name>
<value>*</value>
<source>core-site.xml</source>
</property>
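
As a quick sanity check on the numbers, the delegation token intervals listed
above can be converted from milliseconds to hours and compared with the 173.5
hours after which the application dies. This is only a minimal sketch in
Python: the function and variable names are made up for illustration, the
values are copied from the listing above, and the arithmetic alone does not
show that any of these settings causes the failure.

```python
# Convert the delegation token intervals listed above from milliseconds to
# hours and compare them with the observed time until failure.
MS_PER_HOUR = 60 * 60 * 1000


def ms_to_hours(ms):
    """Return a duration given in milliseconds as hours."""
    return ms / MS_PER_HOUR


# Values copied from the property listing above.
settings_ms = {
    "dfs.namenode.delegation.token.renew-interval": 86400000,
    "dfs.namenode.delegation.key.update-interval": 86400000,
    "dfs.namenode.delegation.token.max-lifetime": 604800000,
}

# Time after which the application dies (from this report).
OBSERVED_HOURS = 173.5

settings_hours = {name: ms_to_hours(ms) for name, ms in settings_ms.items()}
hours_past_max_lifetime = (
    OBSERVED_HOURS
    - settings_hours["dfs.namenode.delegation.token.max-lifetime"]
)

for name, hours in settings_hours.items():
    print(f"{name} = {hours:g} h")
print(f"failure occurs {hours_past_max_lifetime:g} h after max-lifetime")
```

In other words, the token max-lifetime is 168 hours (7 days), the renew and
key update intervals are 24 hours each, and the application dies 5.5 hours
after the max-lifetime has passed.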


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
