From: Jérôme Thièvre INA <jthievre@gmail.com>
Date: Fri, 7 Jan 2011 14:16:11 +0100
Subject: Re: How to manage large record in MapReduce
To: common-user@hadoop.apache.org

Hi Sonal,

thank you, I have just implemented a solution similar to yours (without
copying to a temp file as suggested in my initial post), and it seems to work.

Best Regards,

Jérôme

2011/1/7 Sonal Goyal

> Jerome,
>
> You can take a look at FileStreamInputFormat at
>
> https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input
>
> This provides an input stream per file. In our case, we are using the input
> stream to load data into the database directly. Maybe you can use this or a
> similar approach for working with your videos.
>
> HTH
>
> Thanks and Regards,
> Sonal
> Connect Hadoop with databases,
> Salesforce, FTP servers and others
> Nube Technologies
>
>
> On Thu, Jan 6, 2011 at 4:23 PM, Jérôme Thièvre wrote:
>
> > Hi,
> >
> > we are currently using Hadoop (version 0.20.2) to manage some web
> > archiving processes like fulltext indexing, and it works very well with
> > small records that contain HTML.
> > Now we would like to work with other types of web data, like videos. This
> > kind of data can be really large, and of course such records don't fit
> > in memory.
> >
> > Is it possible to manage a record whose content doesn't reside in memory
> > but on disk?
> > A possibility would be to implement a Writable that reads its content
> > from a DataInput but doesn't load it into memory; instead it would copy
> > that content to a temporary file on the local file system and allow its
> > content to be streamed through an InputStream (an InputStreamWritable).
> >
> > Has somebody tested a similar approach, and if not, do you think some
> > big problems could happen (that impact performance) with this method?
> >
> > Thanks,
> >
> > Jérôme Thièvre
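[Editor's note: the InputStreamWritable idea discussed in this thread can be sketched roughly as below. This is a hypothetical illustration, not code from hiho or from Jérôme's implementation; the class and method names are assumptions. In a real job the class would implement org.apache.hadoop.io.Writable, but only java.io types are used here so the sketch stands alone.]

```java
import java.io.*;
import java.nio.file.Files;

// Hypothetical sketch of the approach discussed in the thread: a record that,
// instead of holding its payload on the heap, spills it to a local temp file
// during deserialization and exposes it as an InputStream. In a real Hadoop
// job this class would implement org.apache.hadoop.io.Writable; here it only
// mirrors the Writable method signatures using plain java.io types.
class InputStreamWritable {

    private File spillFile;  // local copy of the record payload
    private long length;     // payload size in bytes

    // Wire format (an assumption for this sketch): a long length prefix,
    // followed by the raw payload bytes.
    public void write(DataOutput out) throws IOException {
        out.writeLong(length);
        try (InputStream in = getInputStream()) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    // Deserialization: stream the payload straight to a temp file on the
    // local filesystem rather than into a byte[] in memory, so arbitrarily
    // large records (e.g. videos) never have to fit on the heap.
    public void readFields(DataInput in) throws IOException {
        length = in.readLong();
        spillFile = Files.createTempFile("record", ".bin").toFile();
        spillFile.deleteOnExit();
        try (OutputStream out =
                 new BufferedOutputStream(new FileOutputStream(spillFile))) {
            byte[] buf = new byte[64 * 1024];
            long remaining = length;
            while (remaining > 0) {
                int chunk = (int) Math.min(buf.length, remaining);
                in.readFully(buf, 0, chunk);
                out.write(buf, 0, chunk);
                remaining -= chunk;
            }
        }
    }

    // A map task would read the record's content from this stream instead of
    // ever materializing it as a byte array.
    public InputStream getInputStream() throws IOException {
        return new BufferedInputStream(new FileInputStream(spillFile));
    }

    public long getLength() {
        return length;
    }
}
```

One caveat the thread hints at: each record is written to and read back from local disk once, so this trades heap pressure for extra local I/O, which is usually acceptable for large binary payloads.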