hadoop-mapreduce-user mailing list archives

From Chris Nauroth <cnaur...@hortonworks.com>
Subject Re: NameNode Crashing with "flush failed for required journal" exception
Date Thu, 28 Apr 2016 18:08:25 GMT
A problem I've seen a few times is that slow lookups of the hdfs user's
groups at the JournalNode introduce delays in handling the edit logging
RPC, which then times out at the NameNode side, ultimately causing an
abort and an HA failover.  If your environment is experiencing this, then
you'll see messages in the JournalNode logs about "Potential performance
problem: getGroups".  If this is happening, then there are several
potential fixes.
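
One way to check for this symptom is to scan the JournalNode logs for that
warning.  The following is a minimal Python sketch, not something shipped
with Hadoop.  The function name count_getgroups_warnings and the script
name in the usage comment are made up for illustration, and the code
matches only the message fragment quoted above, since the exact layout of
the rest of the log line is not assumed here.

```python
import sys
from typing import Iterable

# Message fragment quoted in the text above; nothing else about the
# log line format is assumed.
MARKER = "Potential performance problem: getGroups"


def count_getgroups_warnings(lines: Iterable[str]) -> int:
    """Count log lines that contain the slow group lookup warning."""
    return sum(1 for line in lines if MARKER in line)


if __name__ == "__main__":
    # Hypothetical usage: python scan_jn_log.py <JournalNode log file>...
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            print(path, count_getgroups_warnings(log))
```

A non-zero count on the JournalNodes around the time of a NameNode abort
would be consistent with the slow group lookup scenario described above.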

1. Ultimately, the root cause is a performance problem in the infrastructure's
ability to look up the groups for a user.  This warrants investigation into
whatever that infrastructure is.  (For example, PAM/LDAP integration with
something like Active Directory is common in a lot of IT shops.)  It's
extremely helpful for the nodes of a Hadoop cluster to run nscd (Name
Service Cache Daemon) to improve performance of group lookups and reduce
load on the infrastructure in this kind of deployment.

2. A potential workaround is to use Hadoop's static group mapping feature
to define the hdfs user's list of groups in configuration.  This way, the
group lookup of the hdfs user performed by the JournalNode never hits the
group lookup infrastructure at all.  The downside is that managing group
memberships in Hadoop configuration files is much more cumbersome than
managing them externally.  For more information, see the documentation of
the configuration property hadoop.user.group.static.mapping.overrides in
core-default.xml. [1]

3. Another potential workaround is to increase the timeouts allowed for
the JournalNode RPC calls.  I haven't had as much success with this
myself, but it's possible.  For more information on how to configure this,
see the documentation of the various dfs.qjournal.*.timeout settings in
hdfs-default.xml. [2]
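
To make workarounds 2 and 3 more concrete, here is a small Python sketch
that renders the two kinds of <property> entries one might add to
core-site.xml and hdfs-site.xml.  Several things in it are assumptions
rather than facts from this thread: the helper name hadoop_property is
invented, the group names "hadoop" and "hdfs" are placeholders for
whatever groups the hdfs user really belongs to, the value format
(user=group1,group2) is as I recall it from core-default.xml,
dfs.qjournal.write-txns.timeout.ms is just one of the
dfs.qjournal.*.timeout settings, and 60000 is an arbitrary example value.
Please check the documentation linked at [1] and [2] before applying
anything like this.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name: str, value: str) -> str:
    """Render one Hadoop <property> element as an XML string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Workaround 2 (core-site.xml): define the hdfs user's groups statically.
# The group names here are placeholders, not taken from the thread.
static_groups = hadoop_property(
    "hadoop.user.group.static.mapping.overrides",
    "hdfs=hadoop,hdfs",
)

# Workaround 3 (hdfs-site.xml): raise one of the QJM RPC timeouts.
# 60000 ms is an illustrative value, not a recommendation.
write_timeout = hadoop_property(
    "dfs.qjournal.write-txns.timeout.ms",
    "60000",
)

print(static_groups)
print(write_timeout)
```

Setting the overrides property replaces its default value rather than
adding to it, so any default entries that should be kept need to be
repeated in the new value.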

--Chris Nauroth

[1] https://s.apache.org/kX8D
[2] https://s.apache.org/LzJd

On 4/28/16, 7:32 AM, "Gagan Brahmi" <gaganbrahmi@gmail.com> wrote:

>Hi Shaik,
>The error basically indicates that the NameNode crashed while waiting for
>the write and sync to complete on a quorum of JournalNodes. In your case
>at least 2 JournalNodes should complete the write and sync within the
>timeout period of 20 seconds, which does not seem to be the case.
>I would advise you to check the JournalNode logs; you should find
>something interesting in them, perhaps some reasons for the failures to
>complete the write and sync operation on the JournalNodes.
>Gagan Brahmi
>On Thu, Apr 28, 2016 at 4:32 AM, Shaik M <munna.hadoop@gmail.com> wrote:
>> Hi All,
>> I am running an 8-node HDP 2.3 Hadoop cluster (3 master + 5 DataNodes) with
>> Kerberos security.
>> The NameNode has HA enabled, and it is crashing at least once a day with a
>> "flush failed for required journal" exception. We don't have any network
>> issues between the nodes.
>> I have tried to find what is causing the issue, but I couldn't find a
>> proper resolution. Please help me to fix this issue.
>> Thank you,
>> Shaik
>> 2016-04-28 05:05:23,159 WARN  client.QuorumJournalManager
>> (QuorumCall.java:waitFor(134)) - Waited 18015 ms (timeout=20000 ms) for
>> response for sendEdits. Succeeded so far: []
>> 2016-04-28 05:05:23,483 INFO  BlockStateChange
>> (BlockManager.java:computeReplicationWorkForBlocks(1522)) - BLOCK*
>> neededReplications = 0, pendingReplications = 0.
>> 2016-04-28 05:05:24,160 WARN  client.QuorumJournalManager
>> (QuorumCall.java:waitFor(134)) - Waited 19016 ms (timeout=20000 ms) for
>> response for sendEdits. Succeeded so far: []
>> 2016-04-28 05:05:25,145 FATAL namenode.FSEditLog
>> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed
>> required journal (JournalAndStream(mgr=QJM to [,
>>,], stream=QuorumOutputStream
>> starting at txid 26198626))
>> java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to
>> respond.
>>         at
>>         at
>>         at
>>         at
>>         at
>>         at
>>         at
>>         at
>>         at
>>         at
>>         at
>>         at
>>         at
>>         at
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>         at
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
>> 2016-04-28 05:05:25,147 WARN  client.QuorumJournalManager
>> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream
>> at txid 26198626
>> 2016-04-28 05:05:25,150 INFO  util.ExitUtil
>>(ExitUtil.java:terminate(124)) -
>> Exiting with status 1
>> 2016-04-28 05:05:25,160 INFO  namenode.NameNode
>>(LogAdapter.java:info(47)) -
>To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>For additional commands, e-mail: user-help@hadoop.apache.org
