hadoop-mapreduce-user mailing list archives

From "Goldstone, Robin J." <goldsto...@llnl.gov>
Subject Re: Suitability of HDFS for live file store
Date Mon, 15 Oct 2012 21:35:08 GMT
If the goal is simply a cost-effective alternative to a SAN for storing large files, you might
want to take a look at Gluster. It is an open source, scale-out distributed filesystem that
can use local storage. It also has distributed metadata and a POSIX interface, and can
be accessed through a number of clients, including FUSE, NFS and CIFS. Supposedly you can
even run Hadoop on top of Gluster.

I hope I don't start any sort of flame war by mentioning Gluster on a Hadoop mailing list.
Note I have no vested interest in this particular solution, although I am in the process
of evaluating it myself.

From: Jay Vyas <jayunit100@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Monday, October 15, 2012 1:21 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Suitability of HDFS for live file store

Seems like a heavyweight solution unless you are actually processing the images?

Wow, no MapReduce, no streaming writes, and relatively small files. I'm surprised that you
are considering Hadoop at all.

I'm surprised there isn't a simpler solution that uses redundancy without all the
daemons and name nodes and task trackers and stuff.

Might make it kind of awkward as a normal file system.

On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <harsh@cloudera.com>
wrote:
Hey Matt,

What do you mean by 'real-time' though? While HDFS has pretty good
contiguous data read speeds (and you get N x replicas to read from),
if you're looking to "cache" frequently accessed files into memory
then HDFS does not natively have support for that. Otherwise, I agree
with Brock, seems like you could make it work with HDFS (sans
MapReduce - no need to run it if you don't need it).

The presence of NameNode audit logging will help your file access
analysis requirement.
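
As a rough illustration of the audit-log point above, here is a minimal sketch that counts file opens per path. It assumes audit lines in the usual key=value form (ugi, ip, cmd, src, dst, perm) tagged with "FSNamesystem.audit"; the exact layout depends on your Hadoop version and log4j configuration, and the function names and sample lines are made up for this example.

```python
import re
from collections import Counter

# Assumed field layout: whitespace-separated key=value pairs after "audit:".
# Paths containing whitespace would need a stricter parser than this.
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_audit_line(line):
    """Return the key=value fields of one audit line, or None if it is not one."""
    if "FSNamesystem.audit" not in line:
        return None
    _, _, rest = line.partition("audit:")
    return dict(PAIR.findall(rest))


def count_opens(lines):
    """Count how many times each source path was opened for reading."""
    counts = Counter()
    for line in lines:
        fields = parse_audit_line(line)
        if fields and fields.get("cmd") == "open" and "src" in fields:
            counts[fields["src"]] += 1
    return counts


# Hypothetical sample lines, not taken from a real cluster.
SAMPLE = [
    "2012-10-15 21:35:08,123 INFO FSNamesystem.audit: ugi=matt\tip=/10.0.0.5\tcmd=open\tsrc=/archive/a.tif\tdst=null\tperm=null",
    "2012-10-15 21:36:10,456 INFO FSNamesystem.audit: ugi=matt\tip=/10.0.0.5\tcmd=open\tsrc=/archive/a.tif\tdst=null\tperm=null",
    "2012-10-15 21:37:00,789 INFO FSNamesystem.audit: ugi=matt\tip=/10.0.0.6\tcmd=create\tsrc=/archive/b.tif\tdst=null\tperm=matt:users:rw-r--r--",
    "2012-10-15 21:38:00,001 INFO some.other.Logger: unrelated message",
]

if __name__ == "__main__":
    for path, n in count_opens(SAMPLE).most_common():
        print(path, n)
```

For a large archive the same counting could be done as a small MapReduce job over the audit logs, which matches the "file access analysis" use mentioned in the original question.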

On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <matt@deity.co.nz>
wrote:
> Hi,
>
> I am a new Hadoop user, and would really appreciate your opinions on whether
> Hadoop is the right tool for what I'm thinking of using it for.
>
> I am investigating options for scaling an archive of around 100 TB of image
> data. These images are typically TIFF files of around 50-100 MB each and need
> to be made available online in realtime. Access to the files will be
> sporadic and occasional, but writing the files will be a daily activity.
> Speed of write is not particularly important.
>
> Our previous solution was a monolithic, expensive - and very full - SAN so I
> am excited by Hadoop's distributed, extensible, redundant architecture.
>
> My concern is that a lot of the discussion of, and use cases for, Hadoop
> concern data processing with MapReduce and - from what I understand -
> using HDFS as input for MapReduce jobs. My other concern is the
> vague indication that it's not a 'real-time' system. We may be using
> MapReduce in small components of the application, but it will most likely be
> in file access analysis rather than any processing on the files themselves.
>
> In other words, what I really want is a distributed, resilient, scalable
> filesystem.
>
> Is Hadoop suitable if we just use this facility, or would I be misusing it
> and inviting grief?
>
> M
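
For a sense of scale, the figures in the question can be turned into a back-of-envelope estimate. This is only a sketch under stated assumptions: a replication factor of 3, a 64 MiB block size, and roughly 150 bytes of NameNode heap per file or block object (a commonly quoted rule of thumb, not a measured value). None of these numbers come from the thread; only the 100 TB archive size and 50-100 MB file sizes do.

```python
import math

TB = 10**12  # decimal terabyte
MB = 10**6   # decimal megabyte


def estimate(archive_bytes, file_bytes, replication=3,
             block_bytes=64 * 1024**2, bytes_per_object=150):
    """Rough HDFS sizing; every default here is an assumption, not a measurement."""
    files = archive_bytes // file_bytes
    # A file occupies at least one block; larger files span several.
    blocks_per_file = math.ceil(file_bytes / block_bytes)
    blocks = files * blocks_per_file
    # One namespace object per file plus one per block.
    objects = files + blocks
    return {
        "files": files,
        "blocks": blocks,
        "raw_bytes": archive_bytes * replication,
        "namenode_heap_bytes": objects * bytes_per_object,
    }


if __name__ == "__main__":
    for size_mb in (50, 100):
        result = estimate(100 * TB, size_mb * MB)
        print(size_mb, "MB files:", result)
```

Under these assumptions the archive is on the order of one to two million files and about 300 TB of raw disk, with NameNode metadata well under a gigabyte, so file count is unlikely to be the limiting factor at this size.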



--
Harsh J



--
Jay Vyas
MMSB/UCHC
