From: "Goldstone, Robin J." <goldstone1@llnl.gov>
To: user@hadoop.apache.org
Date: Mon, 15 Oct 2012 21:35:08 +0000
Subject: Re: Suitability of HDFS for live file store

If the goal is simply an alternative to a SAN for cost-effective storage of large files, you might want to take a look at Gluster. It is an open-source, scale-out distributed filesystem that can utilize local storage. It also has distributed metadata and a POSIX interface, and can be accessed through a number of clients, including FUSE, NFS and CIFS. Supposedly you can even run Hadoop on top of Gluster.

I hope I don't start any sort of flame war by mentioning Gluster on a Hadoop mailing list. Note that I have no vested interest in this particular solution, although I am in the process of evaluating it myself.

From: Jay Vyas <jayunit100@gmail.com>
Reply-To: user@hadoop.apache.org
Date: Monday, October 15, 2012 1:21 PM
To: user@hadoop.apache.org
Subject: Re: Suitability of HDFS for live file store

Seems like a heavyweight solution unless you are actually processing the images?

Wow: no MapReduce, no streaming writes, and relatively small files. I'm surprised that you are considering Hadoop at all.

I'm surprised there isn't a simpler solution that provides redundancy without all the daemons and NameNodes and TaskTrackers and such.

It might make for a kind of awkward normal file system.
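(For scale, Jay's "relatively small files" point can be quantified with a back-of-the-envelope NameNode heap estimate. The numbers below are my assumptions, not from the thread: roughly 150 bytes of NameNode heap per namespace object, the old 64 MB default block size, and a 75 MB average file size.)

```python
# Back-of-the-envelope NameNode heap estimate for the archive
# discussed in this thread. Assumptions (mine, not from the
# thread): ~150 bytes of NameNode heap per namespace object,
# the 64 MB default block size, and 75 MB average file size.
TB = 1024 ** 4
MB = 1024 ** 2

archive_bytes = 100 * TB        # ~100 TB of TIFFs
avg_file_bytes = 75 * MB        # files are 50-100 MB each
block_bytes = 64 * MB           # default dfs.block.size

n_files = archive_bytes // avg_file_bytes
blocks_per_file = -(-avg_file_bytes // block_bytes)   # ceiling division
n_blocks = n_files * blocks_per_file

# The NameNode keeps every file and block object in heap;
# replication multiplies disk usage, not NameNode heap.
heap_mb = (n_files + n_blocks) * 150 / MB
print(f"{n_files:,} files, {n_blocks:,} blocks, ~{heap_mb:.0f} MB heap")
```

Under these assumptions that works out to roughly 1.4 million files and well under 1 GB of NameNode heap, so at this file size the "small files problem" should not bite.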
On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <harsh@cloudera.com> wrote:

Hey Matt,

What do you mean by 'real-time', though? While HDFS has pretty good contiguous data read speeds (and you get N x replicas to read from), if you're looking to "cache" frequently accessed files into memory, then HDFS does not natively have support for that. Otherwise, I agree with Brock: it seems like you could make it work with HDFS (sans MapReduce - no need to run it if you don't need it).

The presence of NameNode audit logging will help your file access analysis requirement.

On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <matt@deity.co.nz> wrote:
> Hi,
>
> I am a new Hadoop user, and would really appreciate your opinions on whether
> Hadoop is the right tool for what I'm thinking of using it for.
>
> I am investigating options for scaling an archive of around 100 TB of image
> data. These images are typically TIFF files of around 50-100 MB each and need
> to be made available online in real time. Access to the files will be
> sporadic and occasional, but writing the files will be a daily activity.
> Speed of write is not particularly important.
>
> Our previous solution was a monolithic, expensive - and very full - SAN, so I
> am excited by Hadoop's distributed, extensible, redundant architecture.
>
> My concern is that a lot of the discussion on and use cases for Hadoop is
> regarding data processing with MapReduce and - from what I understand -
> using HDFS as the input for MapReduce jobs. My other concern is the
> vague indication that it's not a 'real-time' system. We may be using
> MapReduce in small components of the application, but it will most likely be
> for file access analysis rather than any processing on the files themselves.
>
> In other words, what I really want is a distributed, resilient, scalable
> filesystem.
>
> Is Hadoop suitable if we just use this facility, or would I be misusing it
> and inviting grief?
>
> M

--
Harsh J

--
Jay Vyas
MMSB/UCHC
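Harsh's suggestion of NameNode audit logging for the file-access analysis can be sketched concretely. Assuming the tab-separated key=value audit line format written by FSNamesystem.audit (the exact fields vary by Hadoop version, and the sample line below is invented), a tiny parser that counts reads per file might look like this:

```python
from collections import Counter

# Hypothetical hdfs-audit.log line; real lines are tab-separated
# key=value fields, but the layout varies across Hadoop versions.
line = ("2012-10-15 14:35:08,123 INFO FSNamesystem.audit: "
        "allowed=true\tugi=matt (auth:SIMPLE)\tip=/10.1.2.3\t"
        "cmd=open\tsrc=/archive/img001.tif\tdst=null\tperm=null")

def parse_audit(line):
    """Pull the key=value fields out of one audit-log line."""
    _, _, rest = line.partition("FSNamesystem.audit: ")
    return dict(kv.split("=", 1) for kv in rest.split("\t"))

reads = Counter()
f = parse_audit(line)
if f.get("cmd") == "open":          # 'open' marks a read of src
    reads[f["src"]] += 1
print(reads.most_common(5))
```

Fed the whole log instead of one sample line, the same loop yields a per-file read histogram, which covers the access-analysis requirement without running MapReduce at all.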