hadoop-hdfs-user mailing list archives

From Chandrashekhar Kotekar <shekhar.kote...@gmail.com>
Subject Re: What happens to data nodes when name node has failed for long time?
Date Mon, 15 Dec 2014 05:31:41 GMT
Hi Mark,

Thanks for giving detailed information about name node failure and the High
Availability feature.

Wish you all the best in your job search.

Thanks again...

Chandrash3khar Kotekar
Mobile - +91 8600011455

On Mon, Dec 15, 2014 at 6:29 AM, mark charts <mcharts@yahoo.com> wrote:
> "Prior to the Hadoop 2.x series, the NameNode was a single point of
> failure in an
> HDFS cluster — in other words, if the machine on which the single NameNode
> was configured became unavailable, the entire cluster would be unavailable
> until the NameNode could be restarted. This was bad news, especially in the
> case of unplanned outages, which could result in significant downtime if
> the
> cluster administrator weren’t available to restart the NameNode.
> The solution to this problem is addressed by the HDFS High Availability
> fea-
> ture. The idea is to run two NameNodes in the same cluster — one active
> NameNode and one hot standby NameNode. If the active NameNode crashes
> or needs to be stopped for planned maintenance, it can be quickly failed
> over
> to the hot standby NameNode, which now becomes the active NameNode.
> The key is to keep the standby node synchronized with the active node; this
> action is now accomplished by having both nodes access a shared NFS direc-
> tory. All namespace changes on the active node are logged in the shared
> directory. The standby node picks up those changes from the directory and
> applies them to its own namespace. In this way, the standby NameNode acts
> as a current backup of the active NameNode. The standby node also has cur-
> rent block location information, because DataNode heartbeats are routinely
> sent to both active and standby NameNodes.
> To ensure that only one NameNode is the “active” node at any given time, configure a fencing process for the shared storage directory; then, during a failover, if it appears that the failed NameNode still carries the active state, the configured fencing process prevents that node from accessing the shared directory and permits the newly active node (the former standby node) to complete the failover.
> The machines that will serve as the active and standby NameNodes in your High Availability cluster should have equivalent hardware. The shared NFS storage directory, which must be accessible to both active and standby NameNodes, is usually located on a separate machine and can be mounted on each NameNode machine. To prevent this directory from becoming a single point of failure, configure multiple network paths to the storage directory, and ensure that there’s redundancy in the storage itself. Use a dedicated network-attached storage (NAS) appliance to contain the shared storage directory."
>   *sic*
> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
> Rafael Coss, and Roman B. Melnyk.
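
For reference, here is a minimal hdfs-site.xml sketch of the NFS-based HA setup the quoted passage describes. The nameservice name ("mycluster"), the NameNode ids ("nn1", "nn2"), and the hostnames and paths are placeholder assumptions, not values taken from this thread:

    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>namenode1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>namenode2.example.com:8020</value>
    </property>
    <!-- Shared NFS directory holding the edit log; mounted on both NameNodes -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>file:///mnt/filer/ha-shared-edits</value>
    </property>
    <!-- Fencing: cut the old active off before the standby takes over -->
    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>sshfence</value>
    </property>
    <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/hdfs/.ssh/id_rsa</value>
    </property>

The sshfence method logs into the machine of the previously active NameNode and kills the NameNode process, which is one way to meet the "only one active node" requirement described above; a shell(...) command that revokes that node's access to the shared directory is a common alternative.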
> Ps. I am looking for work as a Hadoop Admin/Developer (I am an Electrical Engineer with an MSEE). A few months ago I successfully implemented a six-node cluster at work for productivity purposes (that's my claim to fame). I was laid off shortly afterwards; no correlation, I suspect. But I am in FL and willing to go anywhere to find contract or permanent work. If anyone knows of a position for a tenacious Hadoop engineer, I am interested.
> Thank you.
> Mark Charts
>   On Sunday, December 14, 2014 5:30 PM, daemeon reiydelle <daemeonr@gmail.com> wrote:
> I found the terminology of primary and secondary to be a bit confusing in
> describing operation after a failure scenario. Perhaps it is helpful to
> think that the Hadoop instance is guided to select a node as primary for
> normal operation. If that node fails, then the backup becomes the new
> primary. In analyzing traffic it appears that the restored node does not
> become primary again until the whole instance restarts. I myself would
> welcome clarification on this observed behavior.
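
On an HA cluster you can watch this behavior with the hdfs haadmin tool; the NameNode ids below ("nn1", "nn2") are placeholders from a typical HA configuration:

    # Ask each NameNode which role it currently holds (active or standby)
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # Fail back explicitly: with manual failover, a recovered NameNode
    # rejoins as standby and stays standby until an operator runs this
    # (or, with automatic failover configured, until the active fails)
    hdfs haadmin -failover nn1 nn2

In other words, promotion of the restored node is not automatic; it takes an explicit failover rather than a restart of the whole instance.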
> *.......*
> *“Life should not be a journey to the grave with the intention of arriving
> safely in a pretty and well preserved body, but rather to skid in broadside
> in a cloud of smoke, thoroughly used up, totally worn out, and loudly
> proclaiming “Wow! What a Ride!” - Hunter Thompson*
> *Daemeon C.M. Reiydelle*
> *USA (+1) 415.501.0198*
> *London (+44) (0) 20 8144 9872*
> On Fri, Dec 12, 2014 at 7:56 AM, Rich Haase <rhaase@pandora.com> wrote:
>   The remaining cluster services will continue to run.  That way, when the
> namenode (or other failed processes) is restored, the cluster will resume
> healthy operation.  This is part of Hadoop’s ability to handle network
> partition events.
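
A quick way to confirm this, assuming shell access to a worker node and an HDFS client:

    # On a worker node while the namenode is down: the DataNode JVM
    # stays up (jps lists the local Hadoop Java daemons)
    jps

    # After the namenode is back, confirm the DataNodes re-registered
    hdfs dfsadmin -report

The worker daemons simply keep retrying their connection to the namenode and re-register once it returns.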
>  *Rich Haase* | Sr. Software Engineer | Pandora
> m 303.887.1146 | rhaase@pandora.com
>   From: Chandrashekhar Kotekar <shekhar.kotekar@gmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Friday, December 12, 2014 at 3:57 AM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: What happens to data nodes when name node has failed for long
> time?
>   Hi,
>  What happens if the name node has crashed for more than one hour but the
> secondary name node, all the data nodes, the job tracker, and the task
> trackers are running fine? Do those daemon services also automatically shut
> down after some time? Or do those services keep running, waiting for the
> namenode to come back?
> Regards,
> Chandrash3khar Kotekar
> Mobile - +91 8600011455
