Reply-To: hadoop-user@lucene.apache.org
Message-ID: <4475E53F.9010807@apache.org>
Date: Thu, 25 May 2006 10:11:27 -0700
From: Doug Cutting <cutting@apache.org>
To: hadoop-user@lucene.apache.org
Subject: Re: Help with MapReduce
References: <4475CDC4.4070703@dragonflymc.com> <4475D201.1030006@apache.org>
 <4475D4BD.7010700@dragonflymc.com> <4475DA4D.9000707@apache.org>
 <4475E218.5000308@dragonflymc.com>
In-Reply-To: <4475E218.5000308@dragonflymc.com>

Dennis Kubes wrote:
> Ok.  This is a little different in that I need to start thinking about 
> my algorithms in terms of sequential passes and multiple jobs instead of 
> direct access.  That way I can use the input directories to get the data 
> that I need.  Couldn't I also do it through the MapRunnable interface 
> that creates a reader shared by an inner mapper class or is that hacking 
> the interfaces when I should be thinking about this in terms of sequential 
> processing?

You can do it however you like!  I don't know enough about your problem 
to say definitively which is the best approach.  We're working hard on 
Hadoop so that we can scalably stream data through MapReduce at 
megabytes/second per node.  So you might do some back-of-the-envelope 
calculations.  Figure at least 10ms per random access, so your maximum 
random access rate might be around 100/second per drive.  Figure a 
10MB/second transfer rate, so if each randomly accessed item is 100kB, 
the transfer adds another 10ms and your maximum random access rate 
drops to 50 items/drive/second.  Since these accesses go over the 
network, real performance will probably be much worse.  Also, MapFile 
requires a scan per entry, so you might really end up scanning 1MB per 
access, which would slow random accesses to about 10 
items/drive/second.  You might benchmark your random access performance 
to get a better estimate, then compare that to processing the whole 
collection through MapReduce.
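
The arithmetic above can be sketched in a few lines of Python.  This is 
only a rough model under the figures quoted in this message (10ms per 
seek, 10MB/second transfer); the function names, the 1MB = 1000kB 
convention, and the 100GB collection size in the last example are 
assumptions for illustration, not anything taken from Hadoop, and 
network overhead is ignored.

```python
def random_access_rate(seek_ms, transfer_mb_per_s, kb_per_access):
    """Estimated random accesses per second for a single drive.

    Each access costs one seek plus the time to transfer the bytes
    that are read (assuming 1 MB = 1000 kB).
    """
    transfer_ms = kb_per_access / transfer_mb_per_s  # kB / (MB/s) == ms
    return 1000.0 / (seek_ms + transfer_ms)


def full_scan_seconds(collection_mb, transfer_mb_per_s):
    """Time to stream the whole collection sequentially from one drive."""
    return collection_mb / transfer_mb_per_s


def break_even_items(collection_mb, transfer_mb_per_s, accesses_per_s):
    """Number of random accesses that fit in the time of one full scan.

    If you need more items than this, streaming the whole collection
    is the cheaper option under this model.
    """
    return accesses_per_s * full_scan_seconds(collection_mb, transfer_mb_per_s)


# Figures from the message: 10 ms seek, 10 MB/s transfer.
print(random_access_rate(10, 10, 0))      # seek only: 100.0 per second
print(random_access_rate(10, 10, 100))    # 100 kB items: 50.0 per second
print(random_access_rate(10, 10, 1000))   # 1 MB scanned per access: about 9

# Hypothetical 100 GB collection on one drive, 100 kB items.
rate = random_access_rate(10, 10, 100)
print(break_even_items(100_000, 10, rate))
```

Replacing the assumed seek time and transfer rate with numbers from an 
actual benchmark gives a more realistic break-even point between 
random access and a full sequential pass.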

Doug