From: Matt Painter <matt@deity.co.nz>
Date: Tue, 16 Oct 2012 09:17:49 +1300
Subject: Re: Suitability of HDFS for live file store
To: user@hadoop.apache.org

Thanks, guys; really appreciated.

I was deliberately vague about the notion of real time because I didn't know which metrics make Hadoop count as a batch system - if that makes sense!

Essentially, the speed of access to files stored in HDFS needs to be comparable to reading files off a native filesystem, so that end users can download them directly.
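
To make "comparable to a native filesystem" concrete, the serving side would look something like the sketch below - just the stock Java FileSystem API; the class name and the way the path is handed in are my own invention, so treat it as a rough outline rather than a real app:

    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    /** Streams one archived image out of HDFS to, e.g., a servlet response. */
    public class HdfsImageReader {

        public static void streamImage(String hdfsPath, OutputStream out)
                throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FSDataInputStream in = fs.open(new Path(hdfsPath));
            try {
                // 64 KB buffer; 'false' leaves both streams open for the caller
                IOUtils.copyBytes(in, out, 65536, false);
            } finally {
                in.close();
            }
        }
    }

If a read like that is fast enough per file - and for 50-100 MB contiguous reads it should be close to disk speed, going by Harsh's comment below - then we're in business.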
While the bulk of the data on disk will be TIFF files, we will also be including JPEG derivatives that we intend to display inline in a web-based application.

We typically have sparse access patterns: we have millions of files, but each file may be viewed zero or one times over a year, so native in-memory caching isn't much of an issue.

M

On 16 October 2012 09:08, Harsh J <harsh@cloudera.com> wrote:
> Hey Matt,
>
> What do you mean by 'real-time', though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files in memory,
> HDFS has no native support for that. Otherwise, I agree with Brock:
> it seems like you could make it work with HDFS (sans MapReduce - no
> need to run it if you don't need it).
>
> NameNode audit logging will help with your file-access analysis
> requirement.
>
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <matt@deity.co.nz> wrote:
> > Hi,
> >
> > I am a new Hadoop user and would really appreciate your opinions on
> > whether Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100 TB of
> > image data. The images are typically TIFF files of around 50-100 MB
> > each and need to be made available online in real time. Access to the
> > files will be sporadic and occasional, but writing them will be a daily
> > activity. Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full -
> > SAN, so I am excited by Hadoop's distributed, extensible, redundant
> > architecture.
> >
> > My concern is that much of the discussion of, and many of the use
> > cases for, Hadoop revolve around data processing with MapReduce and -
> > from what I understand - using HDFS as the input for MapReduce jobs.
> > My other concern is the vague indication that it is not a 'real-time'
> > system. We may use MapReduce in small components of the application,
> > but most likely for file-access analysis rather than for any
> > processing of the files themselves.
> >
> > In other words, what I really want is a distributed, resilient,
> > scalable filesystem.
> >
> > Is Hadoop suitable if we use just this facility, or would I be
> > misusing it and inviting grief?
> >
> > M
>
> --
> Harsh J

--
Matt Painter
matt@deity.co.nz
+64 21 115 9378
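
P.S. Following up on the audit logging pointer: if I'm reading the stock log4j.properties right, it's just a logger switch on the NameNode - something like the lines below (the RFAAUDIT appender name and the log path are choices of my own, so this is a sketch, not gospel):

    hdfs.audit.logger=INFO,RFAAUDIT
    log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
    log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
    log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
    log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
    log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n

Each file access then lands as a single log line (ugi=, ip=, cmd=, src= and so on), which should be easy enough to post-process for our access analysis.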