hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1502) Sqoop should run mysqldump in a mapper as opposed to a user-side process
Date Thu, 18 Feb 2010 01:11:28 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1502:
-------------------------------------

    Attachment: MAPREDUCE-1502.patch

The goal of this patch is to run mysqldump instances inside of map tasks on the cluster; previously,
Sqoop special-cased this behavior to use a single thread on the local node where the user's
Sqoop command was executed -- as opposed to using the actual destination cluster.

After this patch, mysqldump instances will run on the cluster itself, potentially eliminating
an extra node from the copy chain (if the client is not also an HDFS datanode). Furthermore,
multiple mysqldump instances can be used to import sections of the table in parallel.
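
To illustrate the "sections of the table" idea, here is a minimal sketch of how one map task could bound its slice of the table with mysqldump's {{--where}} option. The table name, split condition, connection flags, and overall structure are assumptions for illustration only, not the code in this patch:

{code}
// Hypothetical sketch: one map task dumping only its assigned slice of the table.
// The table name, split condition, and connection flags are placeholders.
import java.util.Arrays;
import java.util.List;

public class MysqldumpSliceSketch {
  public static void main(String[] args) throws Exception {
    String table = "employees";                          // table being imported (placeholder)
    String splitCondition = "id >= 0 AND id < 100000";   // this map task's range (placeholder)

    List<String> cmd = Arrays.asList(
        "mysqldump",
        "--compact", "--no-create-info",                 // emit data rows only
        "--where=" + splitCondition,                     // restrict to this task's slice
        "--host=dbserver", "--user=sqoop", "somedb", table);

    Process p = new ProcessBuilder(cmd).start();
    // A real mapper would stream p.getInputStream() line by line, re-encode
    // delimiters as needed, and write the records to HDFS; here we just wait.
    p.waitFor();
  }
}
{code}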

To accomplish this goal, DataDrivenImportJob was refactored into two classes: ImportJobBase,
which launches "some" import job based on MapReduce, and DataDrivenImportJob itself which
subclasses ImportJobBase to use DataDrivenDBInputFormat. I have introduced a new class called
MySQLDumpImportJob which similarly builds on the common logic of ImportJobBase to launch a
mysqldump-based job. This refactoring has the side-benefit of allowing easier dependency injection
for testing. The testutil package now contains classes which build on ImportJobBase to allow
this sort of testing to be done more easily (e.g., for TestImportJob).
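
As a rough sketch of the resulting structure (only the class names come from the patch; the hook method and its String return type are simplified stand-ins for the real API):

{code}
// Simplified sketch of the refactored structure; only the class names
// (ImportJobBase, DataDrivenImportJob, MySQLDumpImportJob, MySQLDumpInputFormat)
// come from the patch. Method names and String-based hooks are illustrative only.
abstract class ImportJobBase {
  /** In the real class this configures and submits a MapReduce import job; here it only prints. */
  void runImport(String tableName) {
    System.out.println("importing " + tableName + " via " + inputFormatName());
  }
  /** Hook that lets each concrete import job choose its InputFormat. */
  abstract String inputFormatName();
}

/** The existing JDBC-based parallel import, now a thin subclass. */
class DataDrivenImportJob extends ImportJobBase {
  @Override String inputFormatName() { return "DataDrivenDBInputFormat"; }
}

/** New: launches a mysqldump process inside each map task. */
class MySQLDumpImportJob extends ImportJobBase {
  @Override String inputFormatName() { return "MySQLDumpInputFormat"; }
}
{code}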

This patch looks bigger than it is: LocalMySQLManager was renamed to DirectMySQLManager (and
similarly for related classes), so these refactorings moved several hundred lines of code between
files but changed very little of the code itself.

A few extra test cases have been added to exercise parallel imports via mysqldump. It also
became apparent that mysqldump itself cannot select a subset of columns; if a column subset
is requested, the regular JDBC-based import is used instead. A test case for this fallback is included.
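
Conceptually the fallback is just a guard in front of the direct-import path; a minimal sketch, with hypothetical names, might look like:

{code}
// Illustrative sketch of the column-subset fallback; the class and method names
// here are hypothetical, not the patch's actual code.
class DirectImportGuardSketch {
  /** mysqldump always dumps whole rows, so a column subset forces the JDBC path. */
  static boolean canUseMysqldump(String[] requestedColumns) {
    return requestedColumns == null || requestedColumns.length == 0;
  }

  public static void main(String[] args) {
    String[] cols = { "id", "name" };   // user asked for a subset of columns
    if (canUseMysqldump(cols)) {
      System.out.println("using mysqldump (direct) import");
    } else {
      System.out.println("column subset requested; falling back to JDBC-based import");
    }
  }
}
{code}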

After getting this working in unit tests, I performed a test import of 1.5 GB of data on a
five-node Hadoop cluster running CDH2 (0.20.1+169.56). The data was stored in a single MySQL 5
instance using MyISAM tables. The MySQL database was on a separate hard drive from HDFS, but
on the same node as one of the Hadoop instances.

The following table lists the number of seconds required to perform an import under each of
three conditions (lower values are better).

The three import modes tried are:
* JDBC-based import (i.e., the existing parallel codepath)
* mysqldump; this has to parse the comma-delimited strings emitted by MySQL and re-encode
them with the requested delimiters before pushing to HDFS
* mysqldump with {{--mysql-delimiters}}; this skips the delimiter parsing / StringBuilder reassembly
entirely.

||num mappers||jdbc||mysqldump||mysqldump w/ {{--mysql-delimiters}}||
|1|50.56|101.91|45.42|
|2|45.52|66.13|45.53|
|5|45.63|45.58| |

JDBC import runs at about 90% of full speed with a single mapper, but is CPU-bound at that point.
With even two tasks, it reaches the point where Sqoop itself is no longer the bottleneck
(whether the bottleneck is then HDFS writes or MySQL reads, I don't know). Watching {{top}}, mysql
was using about 40% of a CPU and mysqldump itself about 25% of a CPU. Whether it's blocked on
reading the disk below that (will all 1.6 GB of the table be buffer-cached?) or on outbound
disk/network, I'm not sure.

Using mysqldump, copying from that process, parsing delimiters, and reassembling output records
is CPU-bound, as the middle (mysqldump) column shows. Somewhere between two and five parallel
copies, it reaches the speed limit.

Using {{--mysql-delimiters}} to skip the parsing step lets even a single process run at full speed.
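
For reference, the per-record work in the non-{{--mysql-delimiters}} case is roughly the following; this is a simplified sketch with hypothetical names, and the real mapper also has to handle quoting and escaping:

{code}
// Simplified sketch of the per-record re-encoding the mysqldump-based mapper does
// when the user asks for non-MySQL output delimiters. Names are hypothetical;
// the real code also handles quoting, escaping, and embedded delimiters.
class DelimiterReencodeSketch {
  /** Split a comma-delimited record and rebuild it with the target delimiter. */
  static String reencode(String mysqlRecord, char targetFieldDelim) {
    String[] fields = mysqlRecord.split(",");   // parse MySQL's comma delimiters
    StringBuilder sb = new StringBuilder();     // reassemble with the user's delimiter
    for (int i = 0; i < fields.length; i++) {
      if (i > 0) sb.append(targetFieldDelim);
      sb.append(fields[i]);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(reencode("1,Alice,Engineering", '\t'));  // 1<TAB>Alice<TAB>Engineering
  }
}
{code}

With {{--mysql-delimiters}}, this re-encoding is skipped entirely, which is consistent with the last column of the table matching the best observed times even at one mapper.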

A 45-second import of 1.5 GB translates to an import speed of approximately 34 MB/sec (1536 MB / 45 sec).

I believe this demonstrates acceptable performance. It's encouraging that a single mapper
process using mysqldump can achieve the maximum observed import speed, because it means that
down the line, scalable parallel imports from sharded databases will be achievable.

This also suggests that down the line I should improve the performance of the delimiter parsing
code (I know of some buffer copies that could be eliminated there).

This patch reorganizes a lot of code. I believe you'll need to run something like the following
to commit:

{code}svn add src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/manager/DirectMySQLManager.java
svn add src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/mapreduce/ImportJobBase.java
svn add src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/mapreduce/MySQLDumpImportJob.java
svn add src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/mapreduce/MySQLDumpInputFormat.java
svn add src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/mapreduce/MySQLDumpMapper.java
svn add src/contrib/sqoop/src/test/org/apache/hadoop/sqoop/manager/DirectMySQLTest.java
svn add src/contrib/sqoop/src/test/org/apache/hadoop/sqoop/testutil/InjectableConnManager.java
svn add src/contrib/sqoop/src/test/org/apache/hadoop/sqoop/testutil/InjectableManagerFactory.java

svn rm src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/manager/LocalMySQLManager.java
svn rm src/contrib/sqoop/src/test/org/apache/hadoop/sqoop/manager/LocalMySQLTest.java
{code}


> Sqoop should run mysqldump in a mapper as opposed to a user-side process
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1502
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1502
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1502.patch
>
>
> Sqoop currently runs mysqldump ("direct import mode") in the local user process with
> a single thread. Better system performance and reliability could be achieved by running
> this in a parallel set of mapper tasks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

