Date: Mon, 27 Aug 2007 08:31:47 -0700
Subject: Re: Using Map/Reduce without HDFS?
From: Ted Dunning
To: hadoop-user@lucene.apache.org

Yes. And unless you are a very unusual person, it is not all that rare for
more than one scan of the consolidated data to be required, especially
during development. Can you say "When can we have these new statistics for
Wed-Wed unique users"?

It is often also possible to merge the receiving of the new data with the
appending to a large file. The append-only nature of the writing makes this
very much more efficient than scanning a pile of old files. (A rough sketch
of this consolidation step follows the quoted thread below.)

On 8/27/07 4:16 AM, "mfc" wrote:

> Hi,
>
> One benefit of the pre-processing step is that the random I/O during
> pre-processing is only done on "new" data, i.e. it is incremental. So you
> only pay the random I/O cost once, when new data is added. This is better
> than having to pay the random I/O cost every time on all the data (old and
> new), as would be required if a map/reduce job were to run directly on the
> local file system.
>
> Thanks
>
>
> mfc wrote:
>>
>> Hi,
>>
>> I can see a benefit to this approach if it replaces random access of a
>> local file system with sequential access to large files in HDFS. We are
>> talking about physical disks, and seek time is expensive.
>>
>> But the random access of the local file system still happens; it just
>> gets moved to the pre-processing step.
>>
>> How about walking through the relative cost of this pre-processing step
>> (which still must do random access), and some approaches to how this
>> could be done? You mentioned cat | gzip (assuming parallel instances of
>> this); is that what you do?
>>
>> Thanks
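
For concreteness, here is a minimal shell sketch of the kind of incremental
consolidation step discussed above, along the lines of the cat | gzip
approach mfc mentions. The directory names, file layout, and cleanup policy
are assumptions for illustration only, not something taken from this thread,
and it assumes filenames contain no whitespace.

#!/bin/sh
# Sketch only: consolidate newly arrived small files into one large,
# sequentially written gzip file, paying the small-file random I/O once.

NEW_DIR=/data/incoming        # hypothetical staging area for new small files
BIG_DIR=/data/consolidated    # hypothetical home of the large append-only files
LIST=$(mktemp)

mkdir -p "$BIG_DIR"

# Snapshot the current batch, so files that arrive while this runs are
# left untouched for the next pass.
find "$NEW_DIR" -type f > "$LIST"
[ -s "$LIST" ] || exit 0

# The random I/O happens here, but only over the new files; the output is
# written sequentially. gzip members concatenate, so appending another
# compressed stream still yields a file that zcat can read end to end.
xargs cat < "$LIST" | gzip >> "$BIG_DIR/part-$(date +%Y%m%d).gz"

# Drop the small files only after the append succeeds, so their cost is
# paid exactly once.
xargs rm -f -- < "$LIST"
rm -f "$LIST"

Each run touches only the data that arrived since the previous run, and the
resulting large part files can then be scanned sequentially by the map/reduce
job, whether they stay on the local file system or get pushed into HDFS.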