hadoop-common-user mailing list archives

From Guy le Mar <guyle...@gmail.com>
Subject Re: [sqoop-user] Re: How does sqoop distribute its data evenly across HDFS?
Date Thu, 17 Mar 2011 21:37:59 GMT
Hi Andy,

The Sqoop framework does expose the ability to influence where each
chunk of data goes, provided you're prepared to write your own Sqoop
extension. The DBInputFormat.DBInputSplit class (the class that
represents a chunk of data during an import) has a method named
getLocations(). You can override this to nominate the HDFS host names
you want each split's output to be written to.
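As a stand-alone sketch of the idea (not Sqoop's API): the value you'd
return from an overridden getLocations() is just an array of host names,
for example chosen round-robin from a pool of datanodes. The class and
host names below are made up for illustration.

```java
// Stand-alone sketch: choosing preferred HDFS hosts for each split.
// In a real Sqoop extension you would return this array from an
// overridden getLocations() in a DBInputFormat.DBInputSplit subclass;
// here only the host-selection logic is shown. Host names are made up.
public class SplitLocations {

    // Nominate one preferred host per split, round-robin over a fixed pool.
    static String[] locationsFor(int splitIndex, String[] pool) {
        return new String[] { pool[splitIndex % pool.length] };
    }

    public static void main(String[] args) {
        String[] pool = { "dn1.example.com", "dn2.example.com", "dn3.example.com" };
        for (int i = 0; i < 4; i++) {
            System.out.println("split " + i + " -> " + locationsFor(i, pool)[0]);
        }
    }
}
```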

Sqoop does not expose this feature via the command-line, so if you
wanted this ability in Sqoop, you'd need to raise a Jira or implement
it yourself and submit a patch.

If you're using Oracle as your RDBMS, the OraOop plugin
(www.quest.com/Ora-Oop/) to Sqoop does allow you to control the
location of each split via the (undocumented) command-line option
"-Doraoop.locations", which is a comma-separate list of host names.
(This feature was implement so that during performance testing of
OraOop we could investigate the effect of network IO capacity during
Sqoop imports.)
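For example (illustrative host names and connection details; since the
option is undocumented, check it against your OraOop version):

```shell
# Hypothetical invocation: nominate the datanodes for each split's output.
# Note the generic -D argument must come before the tool-specific options.
sqoop import \
  -Doraoop.locations=dn1.example.com,dn2.example.com,dn3.example.com,dn4.example.com \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table EMPLOYEES
```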

Guy le Mar
Quest Software

On 18 March 2011 04:37, Aaron Kimball <akimball83@gmail.com> wrote:
> Sqoop operates in parallel by spawning multiple map tasks that each read
> different rows of the database, based on the split key. Each map task
> submits a different SQL query which uses the split key column to get a
> subset of the rows you want to read.  Each map task then writes an output
> file to HDFS, covering its subset of the total rows. Hopefully, Sqoop's
> partitioning heuristic has resulted in each of these output files being of
> roughly the same size. If your row range is "lumpy" (e.g., you have a whole
> lot of rows with a column ID=0...10000, then a blank space, then a whole lot
> more rows where 5000000 < ID < 6000000), you'll see a bunch of output files
> where some may be empty, and one or two contain all the rows. If your row
> range is more uniform (e.g., the range of the ID column is more-or-less
> fully-occupied between its maximum and minimum values), you'll get much more
> even file-sizes.
> But assuming the number of rows read in by each map task is more or less
> the same, then the files will be distributed across the cluster using the
> underlying platform. In practice, Sqoop relies on MapReduce and HDFS to make
> things "just work out." For instance, by spawning 4 map tasks (Sqoop's
> default), it is likely that your cluster will have four separate nodes where
> there is a free task slot, and that these tasks will be allocated across
> those four different nodes. HDFS' replica placement algorithm is "one
> replica on the same node as the writer, and the other two replicas
> elsewhere" (assuming a single rack -- if you've configured a multiple rack
> topology, HDFS goes further and ensures allocation on at least two racks).
> So the map tasks will "probably" be spread onto four different nodes, and
> HDFS will "probably" put 3 replicas * 4 tasks' output on a reasonably
> diverse set of machines in the cluster. Note that HDFS block placement is
> actually on a per-block, not a per-file basis, so if each task is writing
> multiple blocks of output, the number of datanodes which are candidates for
> replica targets goes up substantially.
> In theory, you are right: pathological cases can occur, where all Sqoop
> tasks run serially on a single node, making that node hold replica #1 of
> each task's output. The HDFS namenode could then pick the same two foreign
> replica nodes for each of these four tasks' output and have only three nodes
> in the cluster hold all the data from an import job. But this is a very
> unlikely case, and not one worth worrying about in practice. If this
> happens, it's likely because the majority of your nodes are already in a
> disk-full condition, or are otherwise unavailable, and only a specific
> subset of the nodes are capable of actually receiving the writes. But in
> this regard, Sqoop is no different than any other custom MapReduce program
> you might write; it's not particularly more or less resilient to any
> pathological conditions of the underlying system which might arise.
> Hope that helps.
> Cheers,
> - Aaron
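The range arithmetic Aaron describes, and why a "lumpy" key distribution
produces lumpy output files, can be sketched like this (a stand-alone
illustration, not Sqoop's actual splitter code):

```java
// Sketch of integer split-key partitioning: divide [min, max] into
// numSplits contiguous ranges, each becoming one map task's WHERE clause.
// Splits cover equal key ranges, not equal row counts, which is why a
// sparse key range yields near-empty files. Names are illustrative.
public class IntegerSplits {

    // Returns numSplits + 1 boundary values evenly spread over [min, max].
    static long[] boundaries(long min, long max, int numSplits) {
        long[] b = new long[numSplits + 1];
        for (int i = 0; i <= numSplits; i++) {
            // Integer arithmetic; the final boundary lands exactly on max.
            b[i] = min + (max - min) * i / numSplits;
        }
        return b;
    }

    public static void main(String[] args) {
        // e.g. an ID column spanning 0..6000000, imported with 4 mappers:
        // each task queries one quarter of the key range, regardless of
        // how many rows actually fall inside it.
        long[] b = boundaries(0L, 6_000_000L, 4);
        for (int i = 0; i < 4; i++) {
            System.out.println("mapper " + i + ": " + b[i] + " <= ID < " + b[i + 1]);
        }
    }
}
```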
> On Thu, Mar 17, 2011 at 2:41 AM, Andy Doddington <andy@doddington.net>
> wrote:
>> Ok, I understand about the balancer process which can be run manually, but
>> the sqoop documentation seems to imply that it does balancing for you, based
>> on the split key, as you note.
>> But what causes the various sqoop data import map jobs to write to
>> different data nodes? I.e., what stops them all writing to the same node, in
>> the ultimate pathological case?
>> Thanks,
>>      Andy D
>> On 17 Mar 2011, at 00:28, Harsh J <qwertymaniac@gmail.com> wrote:
>> > There's a balancer available to re-balance DNs across the HDFS cluster
>> > in general. It is available in the $HADOOP_HOME/bin/ directory as
>> > start-balancer.sh
>> >
>> > But what I think sqoop implies is that your data is balanced due to
>> > the map jobs it runs for imports (using a provided split factor
>> > between maps), which should make it write chunks of data out to
>> > different DataNodes.
>> >
>> > I guess you could get more information on the Sqoop mailing list
>> > sqoop-user@cloudera.org,
>> > https://groups.google.com/a/cloudera.org/group/sqoop-user/topics
>> >
>> > On Thu, Mar 17, 2011 at 5:04 AM, BeThere <andy@doddington.net> wrote:
>> >> The sqoop documentation seems to imply that it uses the key information
>> >> provided to it on the command line to ensure that the SQL data is
>> >> distributed evenly across the DFS. However I cannot see any mechanism for
>> >> achieving this explicitly other than relying on the implicit distribution
>> >> provided by default by HDFS. Is this correct or are there methods on some
>> >> API that allow me to manage the distribution to ensure that it is balanced
>> >> across all nodes in my cluster?
>> >>
>> >> Thanks,
>> >>
>> >>         Andy D
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Harsh J
>> > http://harshj.com
