hadoop-user mailing list archives

From Matt Painter <m...@deity.co.nz>
Subject Re: Suitability of HDFS for live file store
Date Mon, 15 Oct 2012 20:17:49 GMT
Thanks guys; really appreciated.

I was deliberately vague about the notion of real-time because I didn't
know which metrics lead to Hadoop being considered a batch system, if
that makes sense!

Essentially, the speed of access to the files stored in HDFS needs to be
comparable to that of files read off a native file system, so that end
users can download them. While the bulk of the data on disk will be TIFF
files, we will also be including JPEG derivatives, which we intend to
display inline in a web-based application.

Our access patterns are typically sparse: we have millions of files, but
each file may be viewed only zero or one times over a year. Therefore,
the lack of native in-memory caching isn't much of an issue.
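
To make the sparse access pattern measurable, here is a minimal sketch of
counting per-file reads from the NameNode audit log that Harsh mentions
below. The sample lines and the `cmd=... src=...` field layout are
assumptions modelled on typical audit entries, and `count_opens` is a
hypothetical helper; check the layout against the real log before relying
on it.

```python
import re
from collections import Counter

# Assumed field layout: "... cmd=<operation> src=<path> ...", with fields
# separated by whitespace (tabs in the sample below).
AUDIT_RE = re.compile(r"cmd=(?P<cmd>\S+)\s+src=(?P<src>\S+)")


def count_opens(lines):
    """Count how many times each path appears in a cmd=open audit entry."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match and match.group("cmd") == "open":
            counts[match.group("src")] += 1
    return counts


# Illustrative sample lines, not taken from a real cluster.
sample = [
    "2012-10-15 20:17:49,000 INFO FSNamesystem.audit: ugi=web\tip=/10.0.0.5\tcmd=open\tsrc=/archive/a.tif\tdst=null\tperm=null",
    "2012-10-15 20:18:02,000 INFO FSNamesystem.audit: ugi=web\tip=/10.0.0.5\tcmd=create\tsrc=/archive/b.tif\tdst=null\tperm=null",
    "2012-10-15 20:19:10,000 INFO FSNamesystem.audit: ugi=web\tip=/10.0.0.6\tcmd=open\tsrc=/archive/a.tif\tdst=null\tperm=null",
]

opens = count_opens(sample)
for path, n in opens.most_common():
    print(path, n)
```

Files that never show up in the resulting counts over a year would be the
"zero views" case described above.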


On 16 October 2012 09:08, Harsh J <harsh@cloudera.com> wrote:

> Hey Matt,
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
> The presence of NameNode audit logging will help your file access
> analysis requirement.
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <matt@deity.co.nz> wrote:
> > Hi,
> >
> > I am a new Hadoop user, and would really appreciate your opinions on
> > whether Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100 TB of
> > image data. These images are typically TIFF files of around 50-100 MB
> > each and need to be made available online in realtime. Access to the
> > files will be sporadic and occasional, but writing the files will be a
> > daily activity. Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full - SAN,
> > so I am excited by Hadoop's distributed, extensible, redundant
> > architecture.
> >
> > My concern is that a lot of the discussion on and use cases for Hadoop
> > are about data processing with MapReduce and - from what I understand -
> > using HDFS as input for MapReduce jobs. My other concern is the vague
> > indication that it's not a 'real-time' system. We may be using MapReduce
> > in small components of the application, but it will most likely be in
> > file access analysis rather than any processing on the files themselves.
> >
> > In other words, what I really want is a distributed, resilient, scalable
> > filesystem.
> >
> > Is Hadoop suitable if we just use this facility, or would I be misusing
> > it and inviting grief?
> >
> > M
> --
> Harsh J

Matt Painter
+64 21 115 9378
