Date: Mon, 27 Aug 2007 08:31:47 -0700
Subject: Re: Using Map/Reduce without HDFS?
From: Ted Dunning
To: hadoop-user@lucene.apache.org

Yes. And unless you are a very unusual person, it is not all that rare for
more than one scan of the consolidated data to be required, especially
during development. Can you say "When can we have these new statistics for
Wed-Wed unique users"?

It is often also possible to merge the receiving of the new data with the
appending to a large file. The append-only nature of the writing makes this
very much more efficient than scanning a pile of old files. (A rough sketch
of this consolidation step follows the quoted thread below.)

On 8/27/07 4:16 AM, "mfc" wrote:

> Hi,
>
> One benefit of the pre-processing step is that the random I/O during
> pre-processing is only done on "new" data, i.e. it is incremental. So you
> only pay the random I/O cost once, when new data is added. This is better
> than having to pay the random I/O cost every time on all the data (old and
> new), as would be required if a map/reduce job were to run directly on the
> local file system.
>
> Thanks
>
>
> mfc wrote:
>>
>> Hi,
>>
>> I can see a benefit to this approach if it replaces random access of a
>> local file system with sequential access to large files in HDFS. We are
>> talking about physical disks, and seek time is expensive.
>>
>> But the random access of the local file system still happens; it just
>> gets moved to the pre-processing step.
>>
>> How about walking through the relative cost of this pre-processing step
>> (which still must do random access), and some approaches to how this
>> could be done? You mentioned cat | gzip (assuming parallel instances of
>> this); is that what you do?
>>
>> Thanks
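
For concreteness, here is a minimal shell sketch of the kind of incremental
consolidation step discussed above, along the lines of the cat | gzip
approach mfc mentions. The directory names, file layout, and cleanup policy
are assumptions for illustration only, not something taken from this thread,
and it assumes filenames contain no whitespace.

#!/bin/sh
# Sketch only: consolidate newly arrived small files into one large,
# sequentially written gzip file, paying the small-file random I/O once.

NEW_DIR=/data/incoming        # hypothetical staging area for new small files
BIG_DIR=/data/consolidated    # hypothetical home of the large append-only files
LIST=$(mktemp)

mkdir -p "$BIG_DIR"

# Snapshot the current batch, so files that arrive while this runs are
# left untouched for the next pass.
find "$NEW_DIR" -type f > "$LIST"
[ -s "$LIST" ] || exit 0

# The random I/O happens here, but only over the new files; the output is
# written sequentially. gzip members concatenate, so appending another
# compressed stream still yields a file that zcat can read end to end.
xargs cat < "$LIST" | gzip >> "$BIG_DIR/part-$(date +%Y%m%d).gz"

# Drop the small files only after the append succeeds, so their cost is
# paid exactly once.
xargs rm -f -- < "$LIST"
rm -f "$LIST"

Each run touches only the data that arrived since the previous run, and the
resulting large part files can then be scanned sequentially by the map/reduce
job, whether they stay on the local file system or get pushed into HDFS.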