hadoop-common-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Forcing all blocks to be present "locally"
Date Mon, 25 Sep 2006 21:05:46 GMT

I'm investigating how to implement map-reduce based searching in 
Nutch. Let me describe my current plan for this, and why I need to 
"localize" the data blocks.

A mapred search job would basically run map() in a never-ending loop, 
serving queries. Nutch stores its data in segments, each of which is a 
set of MapFile data plus the corresponding Lucene indexes, so the data 
consists of relatively few, very large files. Each node would get its 
own group of segments, so that the document collection is spread more or 
less evenly across the mapred nodes.
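
One way to picture the "more or less even" spread of segments across 
nodes is a greedy assignment by size. This is only a minimal sketch: the 
function name, the segment names and sizes, and the largest-first 
heuristic are all assumptions made for illustration, not something Nutch 
or Hadoop provides.

```python
import heapq


def assign_segments(segment_sizes, nodes):
    """Greedily give each segment to the currently least-loaded node.

    segment_sizes: dict of segment name -> size (hypothetical values).
    nodes: iterable of node names (hypothetical).
    Returns a dict of node name -> list of segment names.
    """
    # Heap of (bytes_assigned, node); the node name breaks ties deterministically.
    heap = [(0, node) for node in sorted(nodes)]
    heapq.heapify(heap)
    assignment = {node: [] for node in nodes}

    # Place the largest segments first so smaller ones can even out the load.
    ordered = sorted(segment_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        load, node = heapq.heappop(heap)
        assignment[node].append(name)
        heapq.heappush(heap, (load + size, node))
    return assignment
```

Because segments are few and very large, a simple heuristic like this 
can only approximate an even split; the result depends on how the 
segment sizes happen to combine.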

Experiments show that using this data directly from DFS is way too slow, 
so currently it always needs to be copied from DFS to local disks. This 
is a very expensive step, which uses up valuable (and limited) local 
disk space, and currently it has to be performed manually (which is even 
more expensive and error-prone).

I'm curious whether there is a way to avoid this copying when porting 
this code to run as a mapred job: a way to tell DFS to locate all blocks 
of such files and, if necessary, over-replicate them so that any node 
requesting this sort of access to a specific file would always find all 
of that file's blocks locally. This would hold until a "de-localize" 
request was made, upon which DFS would go back to the normal replication 
policy (and delete the spurious blocks).
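
To make the requested behaviour concrete, here is a toy model of the 
"localize" / "de-localize" bookkeeping. It is a sketch under stated 
assumptions only: the block-to-hosts map, the function names, and the 
block and host names are hypothetical, and nothing here corresponds to 
an existing DFS API.

```python
def missing_blocks(block_hosts, node):
    """Return the blocks of a file that have no replica on `node`.

    block_hosts: dict of block id -> set of hosts holding a replica
    (a hypothetical in-memory stand-in for DFS block metadata).
    """
    return [block for block, hosts in block_hosts.items() if node not in hosts]


def localize(block_hosts, node):
    """Over-replicate: add `node` as a holder of every block it lacks.

    Returns the list of blocks that had to be copied, so they can be
    dropped again later.
    """
    added = missing_blocks(block_hosts, node)
    for block in added:
        block_hosts[block].add(node)
    return added


def delocalize(block_hosts, node, added):
    """Remove the extra replicas that localize() created on `node`."""
    for block in added:
        block_hosts[block].discard(node)
```

In this model only the blocks that are not already on the node are 
copied, which is where the saving over a full local copy of the file 
would come from.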

I was looking at the new filecache code, but it seems geared towards 
handling many small files (such as config files, job jars, etc.), and it 
also seems to simply make full local copies of "cached" files.

Any suggestions are welcome ...

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
