hadoop-common-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject RE: Question about job distribution
Date Wed, 15 Jul 2009 06:27:17 GMT
I'm a bit confused. What do you mean by "query be distributed over all datanodes or just 1 node"?
If your data is small enough to fit in a single block (which Hadoop still replicates),
then only one map task will run (assuming the default input split).
If the data is spread across multiple blocks, you can still make it run on a single compute
node by setting your input split size large enough (yes, there are use cases for this, when
the whole data set has to be fed to a single mapper). Otherwise, the job will be scheduled on
multiple nodes, each getting one block / chunk (per the input split size set) of your actual
data. The nodes picked to run your tasks depend on data locality, to cut down on network transfer.
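
To make the arithmetic concrete, here is a rough Python sketch (not Hadoop code) of how the number of map tasks follows from file size, block size and the configured split bounds. It assumes the max(minSize, min(maxSize, blockSize)) rule found in newer FileInputFormat versions; the 2009-era API differs in detail, and the small slop Hadoop allows on the last split is ignored. The function names and the 64 MB block size are illustrative assumptions.

```python
import math

MB = 1024 * 1024

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Assumed rule: the split is one block unless the min/max bounds override it.
    return max(min_size, min(max_size, block_size))

def count_map_tasks(file_size, block_size, min_size=1, max_size=2**63 - 1):
    # One map task per split; an empty file produces no splits in this sketch.
    if file_size == 0:
        return 0
    split_size = compute_split_size(block_size, min_size, max_size)
    return math.ceil(file_size / split_size)

block = 64 * MB  # assumed block size, for illustration only

# Data fits in one block: a single task.
small = count_map_tasks(10 * MB, block)

# Data spans many blocks: one task per block-sized split.
large = count_map_tasks(1024 * MB, block)

# Same data, but the minimum split size is raised to cover the whole file:
# everything goes to one mapper.
forced_single = count_map_tasks(1024 * MB, block, min_size=1024 * MB)

print(small, large, forced_single)
```

Raising the minimum split size is the knob behind the "whole data to a single mapper" case; leaving it at the default gives roughly one task per block.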


-----Original Message-----
From: Divij Durve [mailto:divij.tech@gmail.com] 
Sent: Wednesday, July 15, 2009 2:32 AM
To: common-user@hadoop.apache.org; core-user@hadoop.apache.org
Subject: Question about job distribution

If I have a query that I would normally fire on a database and I want to fire
it against data loaded into multiple nodes on Hadoop, will the query be
distributed over all the datanodes so it returns results faster, or will it
just be sent to 1 node? If so, is there a way to get it to distribute the
query instead of sending it to 1 node?
