hadoop-hdfs-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Hadoop processing
Date Thu, 08 Nov 2012 15:03:17 GMT
To go back to the OP's initial position: two new nodes where the data hasn't yet been 'balanced'. 

First, that's a small window of time. 

But to answer your question... 

The JT will attempt to schedule work where the data is. If you're using 3x replication,
there are 3 nodes where each block resides, so you can calculate the odds of getting an open
slot on a node local to the data. 
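As a back-of-the-envelope illustration of those odds (this is not Hadoop code; the replication factor and slot probability are made-up numbers):

```python
# Toy estimate of data-locality odds: with 3-way replication, a block
# lives on 3 nodes. If each node independently has a free map slot with
# probability p, the chance that at least one replica-holding node can
# take the task is 1 - (1 - p)^3. All numbers here are illustrative.
def local_slot_odds(replication: int, p_free_slot: float) -> float:
    return 1 - (1 - p_free_slot) ** replication

# Even with only a 50% chance of a free slot per node, 3x replication
# gives the scheduler an 87.5% chance of finding a data-local slot.
print(local_slot_odds(3, 0.5))  # 0.875
```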

However, if there is an open slot on a node where the data is not located, the job will still
use that slot. You lose data locality, and that chunk of data will be processed on that node;
so yes, in that case the data is shipped to the node. If you look at the JobTracker web page
for the results of your job, you will see what percentage of the tasks ran data-local. Hadoop
is pretty good in that respect.

If you know that the processing time is a couple of orders of magnitude longer than the time
it takes to ship the data to a node, you can override the normal behavior and force
the processing to be done remotely. (We've done this, and there is a paper on it on InfoQ.)
[We were bored and didn't like the fact that our Ganglia maps were not all red. We are evil
in that way ;-) ] We really don't recommend doing this unless you are either insane or really
know what you are doing. 
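To make the "orders of magnitude" argument concrete (all figures below are illustrative assumptions, not measurements): shipping a 64 MB block over a 1 Gbps link takes about half a second, so if the map task chews on that block for minutes, the transfer cost is noise:

```python
# Illustrative back-of-the-envelope numbers, not measurements.
block_mb = 64
link_mbps = 1000 / 8                 # 1 Gbps link is about 125 MB/s
ship_seconds = block_mb / link_mbps  # about 0.5 s to move the block
process_seconds = 120                # suppose the mapper runs 2 minutes
overhead = ship_seconds / process_seconds
print(f"{overhead:.1%}")             # shipping adds well under 1% of runtime
```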



On Nov 8, 2012, at 8:49 AM, Jay Vyas <jayunit100@gmail.com> wrote:

> Hmm, this is interesting. I think that: 
> 1) For the map phase, Hadoop is smart enough to try to run mappers locally, but I think
you could force these DNs to actively participate in a map job by decreasing the size of
the input splits. That allows for many more mappers, some of which would be forced to run
against files that are not local; in this scenario, those DNs don't yet have any
local files on them that would be used for the input. 
> 2) For the reduce phase: since the reducers will be copying mapper outputs
from all over the cluster, one would expect that your new data nodes would naturally take part
in this portion of the job if the number of reducers is specified. 
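> A hedged sketch of that split-size idea (the property name is the Hadoop 1.x one and applies
to jobs using FileInputFormat; the values are purely illustrative):
> 
> ```xml
> <!-- mapred-site.xml (illustrative values): capping the split size forces
>      more, smaller input splits, hence more map tasks, some of which will
>      land on nodes that hold no local replica of the input. -->
> <property>
>   <name>mapred.max.split.size</name>
>   <value>33554432</value> <!-- 32 MB instead of the default 64 MB block -->
> </property>
> ```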
> On Thu, Nov 8, 2012 at 9:35 AM, Kartashov, Andy <Andy.Kartashov@mpac.ca> wrote:
> Hadoopers,
> “Hadoop ships the code to the data instead of sending the data to the code.”
> Say you added two DNs/TTs to the cluster. They have no data at this point, i.e. you have
not yet run the balancer.
> In view of the quoted statement above, will these two nodes not participate in a MapReduce
job until you have balanced some data onto them? Please kindly elaborate.
> Rgds,
> AK47
> NOTICE: This e-mail message and any attachments are confidential, subject to copyright
and may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are
not the intended recipient, please delete and contact the sender immediately. Please consider
the environment before printing this e-mail.
> -- 
> Jay Vyas
> http://jayunit100.blogspot.com
