hadoop-common-user mailing list archives

From "Ted Dunning" <ted.dunn...@gmail.com>
Subject Re: Question about Hadoop
Date Sat, 14 Jun 2008 05:11:26 GMT
Hadoop programs are usually not used interactively, since what they excel
at is batch operations on very large collections of data.

It is quite reasonable to store resulting data in Hadoop and access those
results using Hadoop.  The cleanest way to do that is to have a presentation
layer web server that has all of the UI on it and use HTTP to access the
results file from Hadoop via the namenode's data access URL.  This works
well where the results are not particularly voluminous.
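
As a rough sketch of what the presentation layer might do, here is a small
Python helper that fetches a results file over HTTP and splits it into
key/value pairs.  The host name, the port (50070), the "/data" path prefix
and the tab-separated file layout are all assumptions for illustration;
check what your own namenode actually serves before relying on any of them.

```python
import urllib.parse
import urllib.request

# Assumed values: replace with whatever your namenode really exposes.
NAMENODE_HOST = "namenode.example.com"   # hypothetical host name
NAMENODE_HTTP_PORT = 50070               # assumed namenode web port
DATA_PREFIX = "/data"                    # assumed data access path prefix


def build_result_url(hdfs_path, host=NAMENODE_HOST,
                     port=NAMENODE_HTTP_PORT, prefix=DATA_PREFIX):
    """Build the HTTP URL for a file stored in HDFS."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be an absolute path")
    # quote() leaves "/" alone and escapes spaces and other unsafe characters.
    quoted = urllib.parse.quote(hdfs_path)
    return "http://%s:%d%s%s" % (host, port, prefix, quoted)


def fetch_results(hdfs_path, opener=urllib.request.urlopen, timeout=30):
    """Fetch a small results file and return a list of (key, value) pairs.

    Assumes one tab-separated key/value pair per line.  The opener argument
    exists only so the network call can be replaced in tests.
    """
    url = build_result_url(hdfs_path)
    with opener(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    rows = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition("\t")
        rows.append((key, value))
    return rows
```

The web server would call something like
fetch_results("/user/demo/output/part-00000") per request (the path is made
up), which is only sensible while the file stays small enough to read in
one go.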

For large quantities of data such as the output of a web crawl, it is
usually better to copy the output out of Hadoop and into a clustered system
that supports high speed querying of the data.  This clustered system might
be as simple as a redundant memcached or MySQL farm or as fancy as a sharded
and replicated farm of text retrieval engines running under Solr.  What
works for you will vary by what you need to do.
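
To make the copy-out step concrete, here is a minimal Python sketch of
loading job output into a sharded key/value store.  It assumes the output
has already been copied out of HDFS (for example with
hadoop fs -copyToLocal) and that each line is a tab-separated key and
value.  ShardedStore is a hypothetical in-memory stand-in for a real farm
of cache or database servers, not an actual client library.

```python
import zlib


class ShardedStore:
    """In-memory stand-in for a farm of cache or database servers."""

    def __init__(self, num_shards):
        if num_shards < 1:
            raise ValueError("need at least one shard")
        # One dict per shard; a real system would hold one client per server.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # crc32 is stable across processes, unlike Python's built-in hash()
        # for strings, so every loader and reader picks the same shard.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


def load_job_output(lines, store):
    """Load tab-separated key/value lines into the store.

    Returns the number of records loaded.  Blank lines and lines without a
    tab are skipped.
    """
    loaded = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # no tab found, so treat the line as malformed
        store.put(key, value)
        loaded += 1
    return loaded
```

With a real file you would pass an open file object as lines, e.g.
load_job_output(open("part-00000"), store).  Simple modulo sharding like
this reshuffles most keys when the number of shards changes, which is one
reason real farms tend to use something more elaborate.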

You should keep in mind that Hadoop was designed for very long MTBF (for a
cluster), but was not designed for zero-downtime operation.  At the very
least, you will occasionally want to upgrade the cluster software, and that
currently can't be done during normal operations.  Combining Hadoop (for
heavy-duty computations) with a separate persistence layer (for a
high-availability web service) is a good hybrid.

On Thu, Jun 12, 2008 at 9:53 PM, Chanchal James <chanch13@gmail.com> wrote:

> Thank you all for the responses.
> So in order to run a web-based application, I just need to put the part of
> the application that needs to make use of distributed computation in HDFS,
> and have the other web site related files access it via Hadoop streaming?
> Is that how Hadoop is used?
> Sorry the question may sound too silly.
> Thank you.
> On Thu, Jun 12, 2008 at 7:49 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
> > Once it is in HDFS, you already have backups (due to the replicated file
> > system).
> >
> > Your problems with deleting the dfs data directory are likely
> > configuration
> > problems combined with versioning of the data store (done to avoid
> > confusion, but usually causes confusion).  Once you get the configuration
> > and operational issues sorted out, you shouldn't lose any data.
> >
> > On Thu, Jun 12, 2008 at 10:15 AM, Chanchal James <chanch13@gmail.com>
> > wrote:
> >
> > >
> > > If I keep all data in HDFS, is there any way I can back it up regularly?
> > >
> > >
> >

