hadoop-common-dev mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-5815) Sqoop: A database import tool for Hadoop
Date Tue, 12 May 2009 17:55:45 GMT
Sqoop: A database import tool for Hadoop

                 Key: HADOOP-5815
                 URL: https://issues.apache.org/jira/browse/HADOOP-5815
             Project: Hadoop Core
          Issue Type: New Feature
            Reporter: Aaron Kimball
            Assignee: Aaron Kimball
         Attachments: HADOOP-5815.patch


Sqoop is a tool designed to help users import existing relational databases into their Hadoop
clusters. Sqoop uses JDBC to connect to a database, examine the schema of its tables, and auto-generate
the classes needed to import the data into HDFS. It then launches a MapReduce job that reads
the table from the database via DBInputFormat (the JDBC-based InputFormat). The table is written
to a set of files in HDFS. Both SequenceFile and text-based targets are supported.

Longer term, Sqoop will support automatic connectivity to Hive, with the ability to load data
files directly into the Hive warehouse directory, and also to inject the appropriate table
definition into the metastore.

Some more specifics:

Sqoop is a program implemented as a contrib module. Its frontend is invoked through "bin/hadoop
jar sqoop.jar ..." and allows you to connect to arbitrary JDBC databases and extract their
tables into files in HDFS. The underlying implementation uses the JDBC interface of HADOOP-2536
(DBInputFormat). The DBWritable implementation needed to extract a table is generated by this
tool, based on the types of the columns seen in the table. Sqoop uses JDBC to examine the
table specification and translate it to the appropriate Java types.
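As an illustration of that SQL-to-Java translation, a mapping from JDBC column types to generated
field types might look roughly like the sketch below. The method name `javaTypeFor` and the exact
mappings are assumptions for illustration, not Sqoop's actual code.

```java
import java.sql.Types;

// Illustrative sketch only: maps a java.sql.Types constant (as reported by
// JDBC metadata) to the Java type a generated field might use. The method
// name and the exact mappings are assumptions, not Sqoop's actual table.
public class TypeMapper {
    public static String javaTypeFor(int sqlType) {
        switch (sqlType) {
            case Types.INTEGER:
            case Types.SMALLINT:
                return "Integer";
            case Types.BIGINT:
                return "Long";
            case Types.VARCHAR:
            case Types.CHAR:
            case Types.LONGVARCHAR:
                return "String";
            case Types.DOUBLE:
            case Types.FLOAT:
                return "Double";
            case Types.TIMESTAMP:
                return "java.sql.Timestamp";
            case Types.DATE:
                return "java.sql.Date";
            default:
                return "String"; // conservative fallback for unrecognized types
        }
    }
}
```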

The generated classes are provided as .java files for the user to reuse. They are also compiled
into a jar and used to run a MapReduce task to perform the data import. This results in either
text files or SequenceFiles in HDFS. In the latter case, these Java classes are embedded
into the SequenceFiles as well.
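A generated class for an "employees" table might have roughly the following shape. This is only
an illustrative sketch: the real DBWritable interface comes from Hadoop (HADOOP-2536) and a
stand-in is declared here so the sketch compiles on its own, and the column set is an assumption.

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Stand-in for Hadoop's DBWritable interface (HADOOP-2536), declared here
// only so this sketch is self-contained.
interface DBWritable {
    void readFields(ResultSet rs) throws SQLException;
    void write(PreparedStatement stmt) throws SQLException;
}

// Illustrative shape of a class Sqoop might generate for an "employees"
// table; the field names here are assumptions, not generated output.
public class Employees implements DBWritable {
    public Integer employee_id;
    public String first_name;

    public void readFields(ResultSet rs) throws SQLException {
        // Populate fields from one row of the JDBC result set.
        this.employee_id = rs.getInt(1);
        this.first_name = rs.getString(2);
    }

    public void write(PreparedStatement stmt) throws SQLException {
        // Bind fields back onto a prepared statement (the inverse direction).
        stmt.setInt(1, this.employee_id);
        stmt.setString(2, this.first_name);
    }

    public String toString() {
        // A text-mode import would emit delimited rows like this.
        return employee_id + "," + first_name;
    }
}
```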

The program will extract a specific table from a database, or optionally, all tables. For
a table, it can read all columns, or just a subset. Since HADOOP-2536 requires that a sorting
key be specified for the import task, Sqoop will auto-detect the presence of a primary key
on a table and automatically use it as the sort order; the user can also manually specify
a sorting column.
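The sort-column decision described above can be sketched as follows. In practice the primary-key
columns would come from JDBC metadata (DatabaseMetaData.getPrimaryKeys); here they are passed in
directly so the logic can be shown without a live database, and all names are assumptions.

```java
import java.util.List;

// Illustrative sketch of choosing the sort column: an explicit user choice
// wins, otherwise fall back to an auto-detected primary key, otherwise fail.
public class OrderByResolver {
    public static String chooseOrderColumn(String userOrderBy, List<String> primaryKeyColumns) {
        if (userOrderBy != null) {
            return userOrderBy; // explicit --order-by wins
        }
        if (!primaryKeyColumns.isEmpty()) {
            return primaryKeyColumns.get(0); // auto-detected primary key
        }
        throw new IllegalArgumentException(
            "No primary key found; a sort column must be specified");
    }
}
```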

Example invocations:

To import an entire database:

hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company

(Requires that all tables have primary keys)

To select a single table:

hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company \
    --table employees

To select a subset of columns from a table:

hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company \
    --table employees --columns "employee_id,first_name,last_name,salary,start_date"

To explicitly set the sort column, import format, and import destination (the table will go
to /shared/imported_databases/employees):

hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company \
    --table employees --order-by employee_id --warehouse-dir /shared/imported_databases --as-sequencefile

Sqoop will automatically select the correct JDBC driver class name for HSQLDB and MySQL; this
can also be set explicitly, e.g.:

hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:postgresql://db.example.com/company \
    --driver org.postgresql.Driver --all-tables
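The driver auto-selection might be as simple as matching the connect string's scheme prefix.
The class names below are the standard MySQL and HSQLDB driver classes of that era, but the
resolver itself is an illustrative guess, not Sqoop's actual code.

```java
// Illustrative sketch of driver auto-selection by connect-string prefix.
public class DriverResolver {
    public static String driverClassFor(String connectString) {
        if (connectString.startsWith("jdbc:mysql:")) {
            return "com.mysql.jdbc.Driver";
        }
        if (connectString.startsWith("jdbc:hsqldb:")) {
            return "org.hsqldb.jdbcDriver";
        }
        return null; // unknown scheme: the user must pass --driver explicitly
    }
}
```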

Testing has been conducted with HSQLDB and MySQL. A set of unit tests covers a great deal
of Sqoop's functionality, and this tool has been used in practice at Cloudera and with a few
other early test users on "real" databases.

A readme file is included in the patch which contains documentation on how to use the tool.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
