hadoop-common-user mailing list archives

From "dhruba Borthakur" <dhr...@yahoo-inc.com>
Subject RE: long write operations and data recovery
Date Fri, 29 Feb 2008 05:53:59 GMT
I agree with Joydeep. For batch processing, it is sufficient to make the
application not assume that HDFS is always up and active. However, for
real-time applications that are not batch-centric, it might not be
sufficient. There are a few things that HDFS could do to better handle
Namenode outages:

1. Make clients handle transient Namenode downtime. This requires that
Namenode restarts be fast, that clients tolerate long Namenode outages,
etc.
2. Design the HDFS Namenode as a pair: an active one and a passive one.
The active Namenode could continuously forward transactions to the
passive one. If the active Namenode fails, the passive one could take
over. This type of high availability would probably be necessary for
non-batch applications.
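
As a rough illustration of point 1, here is a minimal sketch of a client
that rides out a Namenode outage by retrying with exponential backoff. It
is not HDFS client code: NamenodeUnavailable, call_with_retry and all of
the timing parameters are hypothetical names chosen for this example, and
op stands in for whatever call the application makes against the Namenode.

```python
import time


class NamenodeUnavailable(Exception):
    """Hypothetical error; stands in for whatever a real client raises."""


def call_with_retry(op, max_total_wait=600.0, first_delay=1.0,
                    max_delay=60.0, sleep=time.sleep):
    """Run op(), retrying while the (assumed) Namenode is unreachable.

    The delay doubles after each failure, is capped at max_delay, and the
    total time spent sleeping never exceeds max_total_wait. Once that
    budget is used up, the last error is re-raised to the caller.
    """
    waited = 0.0
    delay = first_delay
    while True:
        try:
            return op()
        except NamenodeUnavailable:
            if waited >= max_total_wait:
                raise
            # Never sleep past the remaining budget.
            pause = min(delay, max_delay, max_total_wait - waited)
            sleep(pause)
            waited += pause
            delay *= 2
```

The sleep function is passed in so the backoff can be exercised without
real waiting. How large max_total_wait should be depends on how long a
Namenode restart takes in a given cluster, which this sketch does not try
to estimate.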

Thanks,
dhruba

-----Original Message-----
From: Joydeep Sen Sarma [mailto:jssarma@facebook.com] 
Sent: Thursday, February 28, 2008 6:06 PM
To: core-user@hadoop.apache.org
Subject: RE: long write operations and data recovery

We have had a lot of peace of mind by building a data pipeline that does
not assume that hdfs is always up and running. If the application is
primarily non real-time log processing - I would suggest
batch/incremental copies of data to hdfs that can catch up automatically
in case of failures/downtimes.

We have an rsync-like map-reduce job that monitors log directories and
keeps pulling new data in (and I suspect a lot of other users do similar
things as well). It might be a useful notion to generalize and put in
contrib.
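
To make the catch-up idea concrete, here is a minimal sketch of an
incremental copy that can be re-run safely after a failure. It is not the
map-reduce job described above: it is a single-process stand-in that
copies between two local directories, with the destination directory
playing the role of hdfs. The name sync_new_files and the rule of
comparing file sizes to detect new data are assumptions made for this
example.

```python
import shutil
from pathlib import Path


def sync_new_files(src_dir, dst_dir, copy=shutil.copy2):
    """Copy files from src_dir that are missing or have changed size.

    Each run looks only at what differs between the two directories, so
    a run that fails partway (for example because the destination is
    down) is repaired by simply running the function again later.
    Returns the names of the files copied during this run.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue
        target = dst / path.name
        if target.exists() and target.stat().st_size == path.stat().st_size:
            continue  # already up to date
        # Copy to a temporary name first, then rename, so a half-written
        # file is never mistaken for a finished one.
        tmp = dst / (path.name + ".tmp")
        copy(path, tmp)
        tmp.replace(target)
        copied.append(path.name)
    return copied
```

Running this on a schedule gives the behaviour described above: while the
destination is unavailable the runs fail and the logs pile up at the
source, and the first successful run afterwards pulls in everything that
was missed.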


-----Original Message-----
From: Steve Sapovits [mailto:ssapovits@invitemedia.com] 
Sent: Thursday, February 28, 2008 4:54 PM
To: core-user@hadoop.apache.org
Subject: Re: long write operations and data recovery


> How does replication affect this?  If there's at least one replicated
>  client still running, I assume that takes care of it?

Never mind -- I get this now after reading the docs again.

My remaining point-of-failure question concerns name nodes.  The docs
say manual intervention is still required if a name node goes down.
How is this typically managed in production environments?  It would
seem that even a short name node outage in a data-intensive environment
would lead to data loss (no name node to give the data to).

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com

