hadoop-user mailing list archives

From Shaik M <munna.had...@gmail.com>
Subject Re: NameNode Crashing with "flush failed for required journal" exception
Date Mon, 02 May 2016 03:07:33 GMT
Hi Chris,

After installing the "nscd" service on the Hadoop cluster, the NameNode has
been running stably without any downtime for the last three days. :)

Thank you for your help.

Regards,
Shaik



On 29 April 2016 at 11:43, Shaik M <munna.hadoop@gmail.com> wrote:

> Thank you for your suggestions.
>
> I found the following in the logs:
> "WARN  security.Groups (Groups.java:fetchGroupList(244)) - Potential
> performance problem: getGroups(user=hdfs) took 15915 milliseconds."
>
> First, I'll deploy the "nscd" service on all three JournalNodes and will
> update you accordingly.
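>
> As a quick sanity check (assuming shell access on the nodes), a one-liner
> like this should show how long a full user/group lookup for the hdfs user
> takes before and after nscd is in place:
>
>     time id hdfs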
>
> Thanks,
> Shaik
>
> On 29 April 2016 at 02:08, Chris Nauroth <cnauroth@hortonworks.com> wrote:
>
>> A problem I've seen a few times is that slow lookups of the hdfs user's
>> groups at the JournalNode introduce delays in handling the edit logging
>> RPC, which then times out at the NameNode side, ultimately causing an
>> abort and an HA failover.  If your environment is experiencing this, then
>> you'll see messages in the JournalNode logs about "Potential performance
>> problem: getGroups".  If this is happening, then there are several
>> potential fixes.
>>
>> 1. Ultimately, the root cause is a performance problem in the
>> infrastructure's ability to look up the groups for a user.  This warrants
>> investigation into whatever that infrastructure is.  (e.g., PAM/LDAP
>> integration with something like Active Directory is common in a lot of IT
>> shops.)  It's extremely helpful for the nodes of a Hadoop cluster to run
>> nscd (Name Service Cache Daemon) to improve performance of group lookups
>> and reduce load on the infrastructure in this kind of deployment.
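>>
>> As a rough sketch (assuming an RPM-based OS such as the RHEL/CentOS 6
>> systems HDP 2.3 commonly runs on; adjust for your distribution), installing
>> and enabling nscd might look like:
>>
>>     yum install -y nscd
>>     chkconfig nscd on
>>     service nscd start
>>
>> and in /etc/nscd.conf, check that caching is enabled for the relevant
>> databases, for example:
>>
>>     enable-cache            group   yes
>>     positive-time-to-live   group   3600
>>     enable-cache            passwd  yes
>>     positive-time-to-live   passwd  600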
>>
>> 2. A potential workaround is to use Hadoop's static group mapping feature
>> to define the hdfs user's list of groups in configuration.  This way, the
>> group lookup of the hdfs user performed by the JournalNode never hits the
>> group lookup infrastructure at all.  The downside is that managing group
>> memberships in Hadoop configuration files is much more cumbersome than
>> managing them externally.  For more information, see the documentation of
>> the configuration property hadoop.user.group.static.mapping.overrides in
>> core-default.xml. [1]
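>>
>> A minimal sketch for core-site.xml (the group list here is hypothetical;
>> substitute the hdfs user's real group memberships):
>>
>>     <property>
>>       <name>hadoop.user.group.static.mapping.overrides</name>
>>       <value>hdfs=hdfs,hadoop;</value>
>>     </property>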
>>
>> 3. Another potential workaround is to increase the timeouts allowed for
>> the JournalNode RPC calls.  I haven't had as much success with this
>> myself, but it's possible.  For more information on how to configure this,
>> see the documentation of the various dfs.qjournal.*.timeout settings in
>> hdfs-default.xml. [2]
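>>
>> For example, the write timeout that governs sendEdits defaults to 20
>> seconds (dfs.qjournal.write-txns.timeout.ms = 20000), which matches the
>> 20000 ms timeout in your log.  Raising it in hdfs-site.xml would look
>> something like:
>>
>>     <property>
>>       <name>dfs.qjournal.write-txns.timeout.ms</name>
>>       <value>60000</value>
>>     </property>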
>>
>> --Chris Nauroth
>>
>> [1] https://s.apache.org/kX8D
>> [2] https://s.apache.org/LzJd
>>
>>
>>
>> On 4/28/16, 7:32 AM, "Gagan Brahmi" <gaganbrahmi@gmail.com> wrote:
>>
>> >Hi Shaik,
>> >
>> >The error basically indicates that the NameNode crashed while waiting for
>> >the write and sync to happen on a quorum of JournalNodes. In your case, at
>> >least 2 of the 3 JournalNodes should complete the write and sync within
>> >the timeout period of 20 seconds, which does not seem to be the case.
>> >
>> >I would advise you to check the JournalNode logs; you should find
>> >something interesting in them, perhaps the reason the write and sync
>> >operations on the JournalNodes are failing to complete in time.
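>> >
>> >For example (the log location is an assumption; on HDP the JournalNode
>> >logs usually live under /var/log/hadoop/hdfs), something like this on
>> >each JournalNode can surface slow group lookups or sync warnings:
>> >
>> >    grep -iE "getGroups|slow|took" /var/log/hadoop/hdfs/*journalnode*.log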
>> >
>> >
>> >Regards,
>> >Gagan Brahmi
>> >
>> >On Thu, Apr 28, 2016 at 4:32 AM, Shaik M <munna.hadoop@gmail.com> wrote:
>> >> Hi All,
>> >>
>> >> I am running an 8-node HDP 2.3 Hadoop cluster (3 masters + 5 DataNodes)
>> >> with Kerberos security.
>> >>
>> >> The NameNode has HA enabled, and it is crashing at least once a day with
>> >> the "flush failed for required journal" exception. We don't have any
>> >> network issues between the nodes.
>> >>
>> >> I have tried to find out what is causing the issue, but I haven't been
>> >> able to find a proper resolution. Please help me fix this issue.
>> >>
>> >> Thank you,
>> >> Shaik
>> >>
>> >> 2016-04-28 05:05:23,159 WARN  client.QuorumJournalManager
>> >> (QuorumCall.java:waitFor(134)) - Waited 18015 ms (timeout=20000 ms) for a
>> >> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
>> >> 2016-04-28 05:05:23,483 INFO  BlockStateChange
>> >> (BlockManager.java:computeReplicationWorkForBlocks(1522)) - BLOCK*
>> >> neededReplications = 0, pendingReplications = 0.
>> >> 2016-04-28 05:05:24,160 WARN  client.QuorumJournalManager
>> >> (QuorumCall.java:waitFor(134)) - Waited 19016 ms (timeout=20000 ms) for a
>> >> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
>> >> 2016-04-28 05:05:25,145 FATAL namenode.FSEditLog
>> >> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed
>> >> for required journal (JournalAndStream(mgr=QJM to [10.192.149.187:8485,
>> >> 10.192.149.195:8485, 10.192.149.194:8485], stream=QuorumOutputStream
>> >> starting at txid 26198626))
>> >> java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to
>> >> respond.
>> >>         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
>> >>         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
>> >>         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
>> >>         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
>> >>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
>> >>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
>> >>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
>> >>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
>> >>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:647)
>> >>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3492)
>> >>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:787)
>> >>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
>> >>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>> >>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>> >>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
>> >>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
>> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >>         at javax.security.auth.Subject.doAs(Subject.java:415)
>> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>> >>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
>> >> 2016-04-28 05:05:25,147 WARN  client.QuorumJournalManager
>> >> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream
>> >> starting at txid 26198626
>> >> 2016-04-28 05:05:25,150 INFO  util.ExitUtil
>> >> (ExitUtil.java:terminate(124)) - Exiting with status 1
>> >> 2016-04-28 05:05:25,160 INFO  namenode.NameNode
>> >> (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
>> >>
>> >
>> >---------------------------------------------------------------------
>> >To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>> >For additional commands, e-mail: user-help@hadoop.apache.org
>> >
>> >
>>
>>
>
