hadoop-common-dev mailing list archives

From "Bryan A. Pendleton" ...@geekdom.net>
Subject Re: silly question: why http for map output?
Date Thu, 01 Jun 2006 18:07:05 GMT
On 6/1/06, Stefan Groschupf <sg@media-style.com> wrote:
>
> The mapoutput files are not located in DFS, they are on the local disks of
> the mapper that creates them, avoiding the 3X replication overhead of DFS.
>
> Wasn't there an issue to allow defining replication on a file based
> level?



You *could* store map output at 1x replication using the current DFS. You
probably wouldn't want to, though: since the current mode of DFS is to chop
files up into blocks and distribute the blocks uniformly across all nodes,
you'd be copying the output of a map across all nodes. This means that a
*second* copy would need to be made (from each node holding a block in DFS
to the reducer node), doubling the number of times that each block has to be
transferred across the network. And if a single block gets lost (remember,
your 1x copy is distributed across all nodes, including the possibly
less-reliable ones, and there are no duplicates), then you have to
re-run the map.
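
To put rough numbers on that, here's a back-of-the-envelope sketch in Python.
It is only a toy model, not anything from the Hadoop code: the function names
are made up, it assumes every DFS block lands on a node other than the mapper,
and it assumes nodes fail independently with the same probability.

```python
def network_transfers(num_blocks: int, via_dfs: bool) -> int:
    """Count block transfers over the network for one map's output.

    Local disk + HTTP: each block goes mapper -> reducer (1 hop).
    DFS at 1x: each block goes mapper -> datanode -> reducer (2 hops),
    assuming the datanode is never the mapper's own node.
    """
    hops = 2 if via_dfs else 1
    return num_blocks * hops


def survival_probability(nodes_holding_output: int, node_failure_prob: float) -> float:
    """Chance that every node holding a piece of the output stays up.

    With no duplicates, losing any one of those nodes loses a block,
    which forces the map to be re-run.
    """
    return (1.0 - node_failure_prob) ** nodes_holding_output


if __name__ == "__main__":
    blocks = 10
    print("transfers, local disk:", network_transfers(blocks, via_dfs=False))
    print("transfers, DFS at 1x: ", network_transfers(blocks, via_dfs=True))
    # Local disk depends on one node; DFS at 1x depends on up to `blocks` nodes.
    print("survival, local disk: ", survival_probability(1, 0.01))
    print("survival, DFS at 1x:  ", survival_probability(blocks, 0.01))
```

Under those assumptions, 10 blocks cost 10 transfers from local disk versus 20
through DFS, and the output is more likely to be lost because it depends on
every node that holds a block instead of just one.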

Plus, right now there's nothing enforcing that tasktracker nodes will always
be running a datanode...

-- 
Bryan A. Pendleton
Ph: (877) geek-1-bp
