From: "Aaron Kimball (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Date: Thu, 18 Feb 2010 01:11:28 +0000 (UTC)
Subject: [jira] Updated: (MAPREDUCE-1502) Sqoop should run mysqldump in a mapper as opposed to a user-side process

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1502:
-------------------------------------

    Attachment: MAPREDUCE-1502.patch

The goal of this patch is to run mysqldump instances inside of map tasks on the cluster. Previously, Sqoop special-cased this behavior to use a single thread on the local node where the user's Sqoop command was executed, rather than on the actual destination cluster. After this patch, mysqldump instances run on the cluster itself, potentially eliminating an extra node from the copy chain (if the client is not also an HDFS datanode). Furthermore, multiple mysqldump instances can be used to import sections of the table in parallel.

To accomplish this, DataDrivenImportJob was refactored into two classes: ImportJobBase, which launches "some" MapReduce-based import job, and DataDrivenImportJob itself, which subclasses ImportJobBase to use DataDrivenDBInputFormat. I have introduced a new class called MySQLDumpImportJob which similarly builds on the common logic in ImportJobBase to launch a mysqldump-based job. This refactoring has the side benefit of allowing easier dependency injection for testing: the testutil package now contains classes which build on ImportJobBase to make this sort of testing easier (e.g., for TestImportJob).
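For reviewers skimming the patch, the relationship between these classes can be summarized with a minimal sketch. This is an illustration only, assuming a template-method style split between shared job setup and per-strategy input formats; the method names below (runImport, getInputFormatName) are hypothetical and not the exact signatures in the patch.

{code}
// Sketch of the refactored hierarchy. Only ImportJobBase, DataDrivenImportJob,
// MySQLDumpImportJob, DataDrivenDBInputFormat, MySQLDumpInputFormat, and
// MySQLDumpMapper are names from the patch; everything else is illustrative.
public abstract class ImportJobBase {

  /** Shared logic: configure output and submit the MapReduce job. */
  public void runImport(String tableName) {
    System.out.println("Importing " + tableName + " via " + getInputFormatName());
    // ... build the Job, set the input format named below, submit to the cluster
  }

  /** Each concrete import job decides how records are read from MySQL. */
  protected abstract String getInputFormatName();
}

/** JDBC path: parallel splits computed by DataDrivenDBInputFormat. */
class DataDrivenImportJob extends ImportJobBase {
  @Override
  protected String getInputFormatName() {
    return "DataDrivenDBInputFormat";
  }
}

/** Direct path: each map task runs mysqldump against its slice of the table. */
class MySQLDumpImportJob extends ImportJobBase {
  @Override
  protected String getInputFormatName() {
    return "MySQLDumpInputFormat"; // records are produced by MySQLDumpMapper
  }
}
{code}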
This patch looks a bit bigger than it is, because LocalMySQLManager was moved to DirectMySQLManager, etc.; these refactorings moved several hundred lines of code between files, but did not change most of it. A few extra test cases have been added to test parallel imports via mysqldump. The patch also makes it explicit that mysqldump itself cannot select a subset of columns; if you try to do this, the regular JDBC-based approach is used instead. A test case for this is included.

After getting this working in unit tests, I performed a test import of 1.5 GB of data on a five-node Hadoop cluster running CDH2 (0.20.1+169.56). The data was stored in a single MySQL 5 instance using MyISAM for the table layout. The MySQL database is on a separate hard drive from HDFS, but on the same node as one of the Hadoop instances. The table below lists the number of seconds required to perform the import under each of three conditions (lower is better). The three import modes tried are:

* JDBC-based import (i.e., the existing parallel codepath)
* mysqldump; this has to parse the comma-delimited strings emitted by mysql and re-encode them into different delimiters before pushing to HDFS
* mysqldump with {{--mysql-delimiters}}; this skips the delimiter parsing / StringBuilder reassembly entirely

||num mappers||jdbc||mysqldump||mysqldump w/ same delimiters||
|1|50.56|101.91|45.42|
|2|45.52|66.13|45.53|
|5|45.63|45.58| |

The JDBC import runs at 90% of full speed on a single node, but is CPU-bound at that point. With even two tasks, it reaches the point where Sqoop itself isn't the bottleneck (whether the bottleneck is HDFS writes or mysql reads, I don't know). Watching {{top}}, mysql was using about 40% of a CPU and mysqldump itself about 25% of a CPU, but whether it's blocked on reading the disk below that (will all 1.6 GB of the table be buffer cached?) or on outbound disk/network, I'm not sure.

Using mysqldump, copying from that process, parsing delimiters, and reassembling output records is a CPU-bound process, as the mysqldump column shows. It reaches the speed limit somewhere between 2 and 5 parallel copies. Using {{--mysql-delimiters}} in Sqoop to skip the parsing step reaches full speed with a single process. 45 seconds for the import translates to an import speed of approximately 34 MB/sec, which I believe demonstrates acceptable performance.

It's encouraging that a single mapper process using mysqldump can achieve the maximum observed import speed, because it means that down the line, scalable parallel imports from sharded databases will be achievable. It also suggests that I should improve the performance of the delimiter parsing code; I know of some buffer copies that could be eliminated there. This patch reorganizes a lot of code.
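To make the cost difference between the mysqldump column and the {{--mysql-delimiters}} column concrete, here is a rough sketch of the kind of delimiter re-encoding the mysqldump path has to do per record. It is a simplified illustration only: the class name DelimiterReencoder is hypothetical, and the quoting/escaping rules are assumptions rather than the exact logic in MySQLDumpMapper.

{code}
// Simplified sketch: re-emit a comma-delimited record from mysqldump using a
// different output field delimiter. Escaped quotes and enclosing-character
// stripping are ignored here; the real code handles more cases.
public final class DelimiterReencoder {

  private DelimiterReencoder() { }

  /** Copy the record, swapping top-level commas for the output delimiter. */
  public static String reencode(String record, char outputFieldDelim) {
    StringBuilder out = new StringBuilder(record.length());
    boolean inQuotes = false;
    for (int i = 0; i < record.length(); i++) {
      char c = record.charAt(i);
      if (c == '\'') {
        inQuotes = !inQuotes;          // toggle quoted-string state
        out.append(c);
      } else if (c == ',' && !inQuotes) {
        out.append(outputFieldDelim);  // field boundary: replace the delimiter
      } else {
        out.append(c);                 // ordinary character: copy through
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(reencode("1,'hello, world',3.14", '\t'));
    // prints: 1<TAB>'hello, world'<TAB>3.14
  }
}
{code}

Even this minimal version walks every byte of the record through a StringBuilder, which is consistent with the mysqldump column being CPU-bound; passing records through verbatim with {{--mysql-delimiters}} avoids that work entirely.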
I believe you'll need to run something like the following to commit:

{code}
svn add src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/manager/DirectMySQLManager.java
svn add src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/mapreduce/ImportJobBase.java
svn add src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/mapreduce/MySQLDumpImportJob.java
svn add src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/mapreduce/MySQLDumpInputFormat.java
svn add src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/mapreduce/MySQLDumpMapper.java
svn add src/contrib/sqoop/src/test/org/apache/hadoop/sqoop/manager/DirectMySQLTest.java
svn add src/contrib/sqoop/src/test/org/apache/hadoop/sqoop/testutil/InjectableConnManager.java
svn add src/contrib/sqoop/src/test/org/apache/hadoop/sqoop/testutil/InjectableManagerFactory.java
svn rm src/contrib/sqoop/src/java/org/apache/hadoop/sqoop/manager/LocalMySQLManager.java
svn rm src/contrib/sqoop/src/test/org/apache/hadoop/sqoop/manager/LocalMySQLTest.java
{code}

> Sqoop should run mysqldump in a mapper as opposed to a user-side process
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1502
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1502
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1502.patch
>
>
> Sqoop currently runs mysqldump ("direct import mode") in the local user process with a single thread. Better system performance and reliability could be achieved by running this in a parallel set of mapper tasks.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.