hadoop-common-dev mailing list archives

From "Bryan A. P. Pendleton" ...@geekdom.net>
Subject Re: Forcing all blocks to be present "locally"
Date Mon, 25 Sep 2006 21:17:03 GMT
Would the "replication" parameter be sufficient for you? This will allow you
to push the system to make a copy of each block in a file on a higher set of
nodes, possibly equal to the number of nodes in your cluster. Of course,
this saves no space over local copying, but it does mean that you won't have
to do the copy manually, and local-access should be sped up.

Just use "hadoop dfs -setrep -R # /path/to/criticalfiles" where # = your
cluster size. This assumes you're running a DataNode on each node that you
want the copies made to (and, well, that the nodes doing lookups == the
nodes running datanodes, or else you'll end up with extra copies).
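
For example, on a hypothetical 10-node cluster whose normal replication
factor is 3 (both numbers are just illustrative), the round trip would look
roughly like this:

    hadoop dfs -setrep -R 10 /path/to/criticalfiles    (push a replica of every block onto all 10 nodes)
    hadoop dfs -setrep -R 3 /path/to/criticalfiles     (later, drop back to the normal replication)

When the replication factor is lowered again, the namenode should schedule
the surplus replicas for deletion on its own, which would roughly cover the
"de-localize" step you describe.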

On 9/25/06, Andrzej Bialecki <ab@getopt.org> wrote:
>
> Hi,
>
> I'm investigating how to implement a map-reduce based searching in
> Nutch. Let me describe my current plan regarding this, and why I need to
> "localize" the data blocks.
>
> A mapred search job would basically run map() in a never-ending loop,
> serving the queries. Each node would get its group of segments, to
> spread the document collection more or less evenly across mapred nodes.
> Nutch uses segments, which are a bunch of MapFile data, plus
> corresponding Lucene indexes. This data consists of relatively few, very
> large files.
>
> Experiments show that using this data directly from DFS is way too slow,
> so currently it always needs to be copied from DFS to local disks. This
> is a very expensive step, which uses up valuable (and limited) local
> disk space, and currently it has to be performed manually (which is even
> more expensive and error-prone).
>
> I'm curious whether there is a way to avoid this copying when porting this
> code to run as a mapred job - a way to tell DFS to locate all blocks of
> such files and, if necessary, over-replicate them so that any node
> requesting this sort of access to a specific file would always find all of
> the file's blocks locally. This would hold until a "de-localize" request
> was made, at which point DFS would go back to the normal replication
> policy (and delete the spurious blocks).
>
> I was looking at the new filecache code, but it seems geared towards
> handling many small files (such as config files, job jars, etc), and it
> also seems to simply make full local copies of "cached" files.
>
> Any suggestions are welcome ...
>
> --
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>


-- 
Bryan A. P. Pendleton
Ph: (877) geek-1-bp
