From: Matt Painter <matt@deity.co.nz>
Date: Tue, 16 Oct 2012 09:17:49 +1300
Subject: Re: Suitability of HDFS for live file store
To: user@hadoop.apache.org

Thanks, guys; really appreciated.

I was deliberately vague about the notion of real time because I didn't know which metrics make Hadoop count as a batch system - if that makes sense!

Essentially, the speed of access to files stored in HDFS needs to be comparable to reading files off a native filesystem, so that end users can download them directly.
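
To make "comparable to a native filesystem" concrete, the serving side would look something like the sketch below - just the stock Java FileSystem API; the class name and the way the path is handed in are my own invention, so treat it as a rough outline rather than a real app:

    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    /** Streams one archived image out of HDFS to, e.g., a servlet response. */
    public class HdfsImageReader {

        public static void streamImage(String hdfsPath, OutputStream out)
                throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FSDataInputStream in = fs.open(new Path(hdfsPath));
            try {
                // 64 KB buffer; 'false' leaves both streams open for the caller
                IOUtils.copyBytes(in, out, 65536, false);
            } finally {
                in.close();
            }
        }
    }

If a read like that is fast enough per file - and for 50-100 MB contiguous reads it should be close to disk speed, going by Harsh's comment below - then we're in business.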
While the bulk of the data on disk will be TIFF files, we will also be including JPEG derivatives that we intend to display inline in a web-based application.

We typically have sparse access patterns: we have millions of files, but each file may be viewed zero or one times over a year, so native in-memory caching isn't much of an issue.

M

On 16 October 2012 09:08, Harsh J <harsh@cloudera.com> wrote:
> Hey Matt,
>
> What do you mean by 'real-time', though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files in memory,
> HDFS has no native support for that. Otherwise, I agree with Brock:
> it seems like you could make it work with HDFS (sans MapReduce - no
> need to run it if you don't need it).
>
> NameNode audit logging will help with your file-access analysis
> requirement.
>
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <matt@deity.co.nz> wrote:
> > Hi,
> >
> > I am a new Hadoop user and would really appreciate your opinions on
> > whether Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100 TB of
> > image data. The images are typically TIFF files of around 50-100 MB
> > each and need to be made available online in real time. Access to the
> > files will be sporadic and occasional, but writing them will be a daily
> > activity. Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full -
> > SAN, so I am excited by Hadoop's distributed, extensible, redundant
> > architecture.
> >
> > My concern is that much of the discussion of, and many of the use
> > cases for, Hadoop revolve around data processing with MapReduce and -
> > from what I understand - using HDFS as the input for MapReduce jobs.
> > My other concern is the vague indication that it is not a 'real-time'
> > system. We may use MapReduce in small components of the application,
> > but most likely for file-access analysis rather than for any
> > processing of the files themselves.
> >
> > In other words, what I really want is a distributed, resilient,
> > scalable filesystem.
> >
> > Is Hadoop suitable if we use just this facility, or would I be
> > misusing it and inviting grief?
> >
> > M
>
> --
> Harsh J

--
Matt Painter
matt@deity.co.nz
+64 21 115 9378
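
P.S. Following up on the audit logging pointer: if I'm reading the stock log4j.properties right, it's just a logger switch on the NameNode - something like the lines below (the RFAAUDIT appender name and the log path are choices of my own, so this is a sketch, not gospel):

    hdfs.audit.logger=INFO,RFAAUDIT
    log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
    log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
    log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
    log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
    log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n

Each file access then lands as a single log line (ugi=, ip=, cmd=, src= and so on), which should be easy enough to post-process for our access analysis.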