hadoop-common-user mailing list archives

From Enis Soztutar <enis.soz.nu...@gmail.com>
Subject Re: "Moving Computation is Cheaper than Moving Data"
Date Fri, 24 Aug 2007 07:45:40 GMT
Hi,

You should not consider implementing distributed search over map/reduce.
The paradigm is not suitable for index searching. Lucene is already quite
efficient for millions of documents on a single node. As suggested, you can
have a look at Nutch's distributed search architecture. It is far from
perfect, but it is a good starting point.
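For illustration, here is roughly what that architecture boils down to: the index is split into shards, each served by one node, and a broker fans the query out and merges the top hits. A minimal single-JVM sketch of the merge step using the Lucene 2.x-era MultiSearcher (the shard paths and field names are hypothetical, and Nutch's real broker talks to remote search servers over RPC rather than opening local searchers):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class ShardBrokerSketch {
      public static void main(String[] args) throws Exception {
        // One searcher per index shard. In Nutch these live on separate
        // search servers reached over RPC, not inside one JVM.
        Searchable[] shards = new Searchable[] {
            new IndexSearcher("/indexes/shard0"),   // hypothetical paths
            new IndexSearcher("/indexes/shard1"),
        };
        // MultiSearcher merges hits across shards, renumbering doc ids.
        MultiSearcher broker = new MultiSearcher(shards);
        Query q = new QueryParser("content", new StandardAnalyzer())
            .parse("hadoop");
        Hits hits = broker.search(q);
        for (int i = 0; i < Math.min(10, hits.length()); i++) {
          System.out.println(hits.score(i) + "\t" + hits.doc(i).get("title"));
        }
        broker.close();
      }
    }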

Samuel LEMOINE wrote:
> Thanks so much, it helps me a lot. I'm actually quite lost with 
> Hadoop's mechanisms.
> The point of my study is to distribute the Lucene searching phase with 
> Hadoop...
> According to what I've understood, a way to distribute the search over 
> a big Lucene index would be to put this index on HDFS and to 
> implement the Lucene search job under the Mapper interface, am I right?
> But I'm stuck because of Lucene's searchable architecture... the 
> IndexReader takes as argument the whole path where the index is 
> located, so I don't see how to distribute it...
> Well, I guess this issue is quite different from the original subject of 
> this thread; maybe I should post a new message about it.
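One way around the single-path IndexReader, following the sharding idea above: rather than distributing one monolithic index, build several smaller shard indexes and give each node its own. A sketch with the Lucene 2.x-era IndexWriter, assuming a hypothetical loadDocuments() source and an "id" field to route on:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class ShardedIndexer {
      public static void main(String[] args) throws Exception {
        int numShards = 4;                        // hypothetical shard count
        IndexWriter[] shards = new IndexWriter[numShards];
        for (int i = 0; i < numShards; i++) {
          shards[i] = new IndexWriter("/indexes/shard" + i,
                                      new StandardAnalyzer(), true);
        }
        // Route each document to one shard, e.g. by hashing its id; every
        // shard then stays small enough for a single node to search.
        for (Document doc : loadDocuments()) {
          int shard = Math.abs(doc.get("id").hashCode()) % numShards;
          shards[shard].addDocument(doc);
        }
        for (IndexWriter w : shards) { w.optimize(); w.close(); }
      }

      static Iterable<Document> loadDocuments() {
        // hypothetical source of the documents to be indexed
        return java.util.Collections.<Document>emptyList();
      }
    }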
>
>
> Arun C Murthy wrote:
>> Samuel,
>>
>> Samuel LEMOINE wrote:
>>> Well, I don't get it... when you pass arguments to a map job, you 
>>> just give a key and a value, so how can Hadoop make the link between 
>>> those arguments and the data concerned? Really, your answer doesn't 
>>> help me at all, sorry ^^
>>>
>>
>> The input of a map-reduce job is a file or a bunch of files. These 
>> files are usually stored on HDFS, which splits a logical file up into 
>> physical blocks of a fixed, configurable size (128MB by default). Each 
>> block is replicated for reliability.
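To make the key/value question above concrete: the mapper's arguments are not supplied by the user at all; the InputFormat's RecordReader derives them from the split. With the stock TextInputFormat, the key is a line's byte offset in the file and the value is the line itself. A sketch against the old org.apache.hadoop.mapred API (helper names moved around between early releases, so treat this as indicative):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class LineLengthJob {
      // The framework calls map() once per record of this map's split:
      // key = byte offset of the line in the file, value = the line.
      public static class LineLengthMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, LongWritable, LongWritable> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<LongWritable, LongWritable> out,
                        Reporter reporter) throws IOException {
          out.collect(offset, new LongWritable(line.getLength()));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(LineLengthJob.class);
        conf.setJobName("line-length");
        conf.setMapperClass(LineLengthMapper.class);
        conf.setNumReduceTasks(0);                // map-only, for brevity
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

Run against a multi-block file on HDFS, this launches one such map per split, each handed only its own block's worth of lines.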
>>
>> The important point to note is that the HDFS and Map-Reduce 
>> clusters both run on the same hardware, i.e. it is a combined data 
>> and compute cluster.
>>
>> Now when you launch a job (with lots of maps and reduces), the input 
>> file-sets are split into FileSplits (logical splits; the user can 
>> control the splitting). The framework then schedules as many maps as 
>> there are splits, i.e. there is a one-to-one correspondence between 
>> maps and splits, and each map processes one input split.
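As a sketch of the user-controlled splitting just mentioned, an InputFormat can refuse to split its files, forcing one map per file rather than one per block (old mapred API again; the class name is made up):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // One split, and therefore one map, per input file: blocks of the
    // same file are no longer processed in parallel by separate maps.
    public class WholeFileTextInputFormat extends TextInputFormat {
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }

It would be registered on the job with conf.setInputFormat(WholeFileTextInputFormat.class).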
>>
>> The key idea is to try and *schedule* each map on the _datanode_ 
>> (i.e. one among the set of datanodes) which contains the actual block 
>> for the logical input-split that the map is supposed to process. This 
>> is what we refer to as 'data-locality'. Hence we move the computation 
>> (the actual map) to the data (the input split).
>>
>> This is feasible due to:
>> a) HDFS & Map-Reduce share the same physical cluster.
>> b) HDFS exposes (via the relevant APIs) the underlying block-locations 
>> where a file is physically stored on the file-system.
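Point (b) is the interface a scheduler, or any application, can query. In later Hadoop releases it looks roughly like this (the method names have shifted across versions, so treat it as a sketch rather than the exact API of the day):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus stat = fs.getFileStatus(new Path(args[0]));
        // One BlockLocation per block in the requested byte range;
        // getHosts() names the datanodes holding a replica. This is the
        // information the scheduler uses to place maps near their data.
        for (BlockLocation b : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
          System.out.println("offset " + b.getOffset()
              + " len " + b.getLength()
              + " hosts " + java.util.Arrays.toString(b.getHosts()));
        }
      }
    }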
>>
>> hth,
>> Arun
>>
>>
>>> Devaraj Das wrote:
>>>
>>>> That's the paradigm of Hadoop's Map-Reduce.
>>>>  
>>>>
>>>>> -----Original Message-----
>>>>> From: Samuel LEMOINE [mailto:samuel.lemoine@lingway.com]
>>>>> Sent: Thursday, August 23, 2007 2:48 PM
>>>>> To: hadoop-user@lucene.apache.org
>>>>> Subject: "Moving Computation is Cheaper than Moving Data"
>>>>>
>>>>> When I read the Hadoop documentation:
>>>>> The Hadoop Distributed File System: Architecture and Design
>>>>> (http://lucene.apache.org/hadoop/hdfs_design.html)
>>>>>
>>>>> a paragraph held my attention:
>>>>>
>>>>>
>>>>>       "Moving Computation is Cheaper than Moving Data"
>>>>>
>>>>> A computation requested by an application is much more efficient 
>>>>> if it is executed near the data it operates on. This is especially 
>>>>> true when the size of the data set is huge. This minimizes network 
>>>>> congestion and increases the overall throughput of the system. The 
>>>>> assumption is that it is often better to migrate the computation 
>>>>> closer to where the data is located rather than moving the data to 
>>>>> where the application is running. HDFS provides interfaces for 
>>>>> applications to move themselves closer to where the data is located.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I'd like to know how to do that, especially with the aim of 
>>>>> distributed Lucene search. Which Hadoop classes should I use 
>>>>> to do that?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Samuel
>>>>>
