hadoop-common-user mailing list archives

From Andy Doddington <a...@doddington.net>
Subject Re: How does sqoop distribute its data evenly across HDFS?
Date Thu, 17 Mar 2011 09:41:02 GMT
OK, I understand that the balancer process can be run manually, but the Sqoop documentation
seems to imply that it does the balancing for you, based on the split key, as you note.

But what causes the various Sqoop data-import map tasks to write to different DataNodes? I.e.
what stops them all writing to the same node, in the worst pathological case?
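[For context, a minimal sketch of the split mechanism under discussion. This is not Sqoop source code; it only illustrates the idea that a numeric --split-by column is cut into roughly equal key ranges, one per map task, so each mapper imports its own slice and writes its output from whichever node it is scheduled on. The function name and rounding behaviour here are assumptions for illustration.]

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into num_mappers
    contiguous sub-ranges, in the spirit of Sqoop's integer splitter.
    Each (start, end) pair becomes one mapper's WHERE clause."""
    size = (hi - lo + 1) / num_mappers  # average keys per mapper
    splits = []
    start = lo
    for i in range(1, num_mappers + 1):
        end = lo + int(round(size * i)) - 1
        splits.append((start, min(end, hi)))
        start = end + 1
    return splits

# e.g. primary keys 1..1000 imported with 4 mappers:
for lo_b, hi_b in split_ranges(1, 1000, 4):
    print("WHERE id >= %d AND id <= %d" % (lo_b, hi_b))
```

Whether the resulting blocks land on different DataNodes is then up to HDFS placement, since each map task writes as a separate HDFS client.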


      Andy D

On 17 Mar 2011, at 00:28, Harsh J <qwertymaniac@gmail.com> wrote:

> There's a balancer available to re-balance DNs across the HDFS cluster
> in general. It is available in the $HADOOP_HOME/bin/ directory as
> start-balancer.sh
> But what I think sqoop implies is that your data gets balanced by way
> of the map tasks it runs for imports (using a provided split factor to
> divide the key range between maps), which should make them write
> chunks of data out to different DataNodes.
> I guess you could get more information on the Sqoop mailing list
> sqoop-user@cloudera.org,
> https://groups.google.com/a/cloudera.org/group/sqoop-user/topics
> On Thu, Mar 17, 2011 at 5:04 AM, BeThere <andy@doddington.net> wrote:
>> The sqoop documentation seems to imply that it uses the key information provided
>> to it on the command line to ensure that the SQL data is distributed evenly across
>> the DFS. However I cannot see any mechanism for achieving this explicitly, other
>> than relying on the implicit distribution provided by default by HDFS. Is this
>> correct, or are there methods on some API that allow me to manage the distribution
>> to ensure that it is balanced across all nodes in my cluster?
>> Thanks,
>>         Andy D
> -- 
> Harsh J
> http://harshj.com
