hadoop-user mailing list archives

From Giridhar Addepalli <giridhar1...@gmail.com>
Subject Re: Regarding Quorum Journal protocol used in HDFS
Date Thu, 19 Jun 2014 04:50:21 GMT
To clarify my earlier message:

When the NameNode on n1 tried to finalize the in-progress log segment (on
instruction from the standby NameNode on n2, after the edit log roll period
had elapsed), the NameNode process on n1 terminated (*because it could not
get a quorum of responses*).
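
The sequence of events can be sketched as a toy model. This is only an illustration, not Hadoop code: the class and function names are hypothetical, and it assumes that a JournalNode which was down when a segment was started cannot acknowledge the finalization of that segment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JournalNode:
    """Hypothetical stand-in for a JournalNode process (not a Hadoop class)."""
    name: str
    up: bool = True
    open_segment: Optional[int] = None  # first txid of the segment being written

    def start_segment(self, txid: int) -> bool:
        # A node that is down cannot acknowledge the start of a segment.
        if not self.up:
            return False
        self.open_segment = txid
        return True

    def finalize_segment(self, txid: int) -> bool:
        # Assumption: only a running node that saw this segment being
        # started can acknowledge its finalization.
        if not self.up or self.open_segment != txid:
            return False
        self.open_segment = None
        return True


def majority(node_count: int) -> int:
    """Smallest number of acknowledgements that forms a majority."""
    return node_count // 2 + 1


def has_quorum(acks: List[bool]) -> bool:
    return sum(acks) >= majority(len(acks))


# JournalNode on n3 is down when the segment starting at txid 5 is created.
n1, n2, n3 = JournalNode("n1"), JournalNode("n2"), JournalNode("n3", up=False)
nodes = [n1, n2, n3]
start_acks = [jn.start_segment(5) for jn in nodes]

# Before the edit log roll: JournalNode on n3 is started, on n2 is stopped.
n3.up = True
n2.up = False
finalize_acks = [jn.finalize_segment(5) for jn in nodes]

print("start acks:", start_acks, "quorum:", has_quorum(start_acks))
print("finalize acks:", finalize_acks, "quorum:", has_quorum(finalize_acks))
```

Under that assumption, two JournalNode processes are running at finalize time, but only n1 can acknowledge the segment that starts at txid 5, which is fewer than the 2 acknowledgements a 3-node quorum needs.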

Thanks,
Giridhar.


On Wed, Jun 18, 2014 at 10:08 PM, Giridhar Addepalli <giridhar1202@gmail.com> wrote:

> Hi,
>
> We are trying to understand the Quorum Journal protocol (HDFS-3077).
>
> We came across a scenario in which the active NameNode terminated and the
> standby NameNode took over as the new active NameNode, but we could not
> understand why the active NameNode terminated in the first place.
>
> Scenario :
>
> We have 3 nodes ( n1, n2, n3 )
>
> n1 acts as Active NameNode, JournalNode
> n2 acts as Standby NameNode, JournalNode
> n3 acts as JournalNode
>
> The JournalNode process on n3 was down when
> segment edits_inprogress_0000000000000000005 was created.
>
> The JournalNode process was up on n1 and n2, so
> n1 and n2 have edits_inprogress_0000000000000000005 and n3 does not.
>
> Before the edit log roll happened, we started the JournalNode process on
> n3 and stopped the JournalNode process on n2.
>
> When the NameNode on n1 tried to finalize the in-progress log segment (on
> instruction from the standby NameNode on n2, after the edit log roll period
> had elapsed), the NameNode process on n1 terminated.
> The standby NameNode on n2 then took over as active.
> After this, the journal directory
> /var/lib/hadoop-hdfs/cache/hdfs/dfs/journal/sample-cluster/current
> contained the following on n1, n2, and n3:
>
> n1:
>
> -rw-r--r-- 1 hdfs hdfs 1.0M Jun 18 21:07 edits_0000000000000000005-0000000000000000006
> -rw-r--r-- 1 hdfs hdfs 1.0M Jun 18 21:07 edits_inprogress_0000000000000000007
>
> n2:
>
> -rw-r--r-- 1 hdfs hdfs 1.0M Jun 18 21:02 edits_inprogress_0000000000000000005
>
> n3:
>
> -rw-r--r-- 1 hdfs hdfs 1.0M Jun 18 21:07 edits_0000000000000000005-0000000000000000006
> -rw-r--r-- 1 hdfs hdfs 1.0M Jun 18 21:07 edits_inprogress_0000000000000000007
>
> Please help us understand why the NameNode process on n1 terminated even
> though 2 JournalNodes (n1 & n3) were running when n1 tried to finalize
> the log segment.
>
> Although we configured our 3-node cluster with automatic failover for the
> above scenario, we are planning to use only manual failover in our
> production cluster.
>
> Given this, the above scenario looks problematic because it would require
> manual intervention in our case.
>
> Is manual failover recommended when using QJM?
>
> Thanks,
>
> Giridhar.
>
