From: "Goldstone, Robin J." <goldstone1@llnl.gov>
To: user@hadoop.apache.org
Date: Mon, 15 Oct 2012 21:35:08 +0000
Subject: Re: Suitability of HDFS for live file store

If the goal is simply an alternative to a SAN for cost-effective storage of large files, you might want to take a look at Gluster. It is an open-source, scale-out distributed filesystem that can utilize local storage. It also has distributed metadata and a POSIX interface, and can be accessed through a number of clients, including FUSE, NFS and CIFS. Supposedly you can even run Hadoop on top of Gluster.

I hope I don't start any sort of flame war by mentioning Gluster on a Hadoop mailing list. Note that I have no vested interest in this particular solution, although I am in the process of evaluating it myself.

From: Jay Vyas <jayunit100@gmail.com>
Reply-To: user@hadoop.apache.org
Date: Monday, October 15, 2012 1:21 PM
To: user@hadoop.apache.org
Subject: Re: Suitability of HDFS for live file store

Seems like a heavyweight solution unless you are actually processing the images?

Wow: no MapReduce, no streaming writes, and relatively small files. I'm surprised that you are considering Hadoop at all.

I'm surprised there isn't a simpler solution that provides redundancy without all the daemons and NameNodes and TaskTrackers and such.

It might make for a kind of awkward normal file system.
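(For scale, Jay's "relatively small files" point can be quantified with a back-of-the-envelope NameNode heap estimate. The numbers below are my assumptions, not from the thread: roughly 150 bytes of NameNode heap per namespace object, the old 64 MB default block size, and a 75 MB average file size.)

```python
# Back-of-the-envelope NameNode heap estimate for the archive
# discussed in this thread. Assumptions (mine, not from the
# thread): ~150 bytes of NameNode heap per namespace object,
# the 64 MB default block size, and 75 MB average file size.
TB = 1024 ** 4
MB = 1024 ** 2

archive_bytes = 100 * TB        # ~100 TB of TIFFs
avg_file_bytes = 75 * MB        # files are 50-100 MB each
block_bytes = 64 * MB           # default dfs.block.size

n_files = archive_bytes // avg_file_bytes
blocks_per_file = -(-avg_file_bytes // block_bytes)   # ceiling division
n_blocks = n_files * blocks_per_file

# The NameNode keeps every file and block object in heap;
# replication multiplies disk usage, not NameNode heap.
heap_mb = (n_files + n_blocks) * 150 / MB
print(f"{n_files:,} files, {n_blocks:,} blocks, ~{heap_mb:.0f} MB heap")
```

Under these assumptions that works out to roughly 1.4 million files and well under 1 GB of NameNode heap, so at this file size the "small files problem" should not bite.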
On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <harsh@cloudera.com> wrote:

Hey Matt,

What do you mean by 'real-time', though? While HDFS has pretty good contiguous data read speeds (and you get N x replicas to read from), if you're looking to "cache" frequently accessed files into memory, then HDFS does not natively have support for that. Otherwise, I agree with Brock: it seems like you could make it work with HDFS (sans MapReduce - no need to run it if you don't need it).

The presence of NameNode audit logging will help your file access analysis requirement.

On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <matt@deity.co.nz> wrote:
> Hi,
>
> I am a new Hadoop user, and would really appreciate your opinions on whether
> Hadoop is the right tool for what I'm thinking of using it for.
>
> I am investigating options for scaling an archive of around 100 TB of image
> data. These images are typically TIFF files of around 50-100 MB each and need
> to be made available online in real time. Access to the files will be
> sporadic and occasional, but writing the files will be a daily activity.
> Speed of write is not particularly important.
>
> Our previous solution was a monolithic, expensive - and very full - SAN, so I
> am excited by Hadoop's distributed, extensible, redundant architecture.
>
> My concern is that a lot of the discussion on and use cases for Hadoop is
> regarding data processing with MapReduce and - from what I understand -
> using HDFS as the input for MapReduce jobs. My other concern is the
> vague indication that it's not a 'real-time' system. We may be using
> MapReduce in small components of the application, but it will most likely be
> for file access analysis rather than any processing on the files themselves.
>
> In other words, what I really want is a distributed, resilient, scalable
> filesystem.
>
> Is Hadoop suitable if we just use this facility, or would I be misusing it
> and inviting grief?
>
> M

--
Harsh J

--
Jay Vyas
MMSB/UCHC
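Harsh's suggestion of NameNode audit logging for the file-access analysis can be sketched concretely. Assuming the tab-separated key=value audit line format written by FSNamesystem.audit (the exact fields vary by Hadoop version, and the sample line below is invented), a tiny parser that counts reads per file might look like this:

```python
from collections import Counter

# Hypothetical hdfs-audit.log line; real lines are tab-separated
# key=value fields, but the layout varies across Hadoop versions.
line = ("2012-10-15 14:35:08,123 INFO FSNamesystem.audit: "
        "allowed=true\tugi=matt (auth:SIMPLE)\tip=/10.1.2.3\t"
        "cmd=open\tsrc=/archive/img001.tif\tdst=null\tperm=null")

def parse_audit(line):
    """Pull the key=value fields out of one audit-log line."""
    _, _, rest = line.partition("FSNamesystem.audit: ")
    return dict(kv.split("=", 1) for kv in rest.split("\t"))

reads = Counter()
f = parse_audit(line)
if f.get("cmd") == "open":          # 'open' marks a read of src
    reads[f["src"]] += 1
print(reads.most_common(5))
```

Fed the whole log instead of one sample line, the same loop yields a per-file read histogram, which covers the access-analysis requirement without running MapReduce at all.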