hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-705) User-configurable quote and delimiter characters for Sqoop records and record reparsing
Date Fri, 03 Jul 2009 19:16:47 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727100#action_12727100

Aaron Kimball commented on MAPREDUCE-705:

I have implemented delimiter control for Sqoop. Users can now customize which characters are
used as delimiters between fields and records. Fields can also be enclosed in user-configurable
quote characters, which can be made mandatory (all fields always quoted) or optional (fields
only quoted when they contain a delimiter). An optional user-configurable escape character
allows users to escape quotes (and more escape characters) inside fields.

This modifies the toString() method of generated classes to use the proper quote, delimiter,
and escape characters for each record.

Sqoop-generated record types now also contain a parse() method which can read a record separated
by the specified field delimiters and terminated by the specified record-termination delimiter,
and can contain escapes and quotes as described above. The record will repopulate all its
fields by reading the record. For example, suppose a table was created with:

{{CREATE TABLE mytable (foo INT, bar VARCHAR(32), baz DATE);}}

Importing this table from the database into HDFS via Sqoop would generate a class named {{mytable}}
which can contain a single record consisting of {{foo}}, {{bar}}, and {{baz}} fields. The
text file emitted in HDFS may contain lines like:


The {{mytable.parse()}} method will read a line of this form and populate its {{foo}}, {{bar}}
and {{baz}} variables based on the three fields it can interpret.

The delimiters used when writing text to HDFS are controlled by these options:

Output line formatting options:
--fields-terminated-by (char)    Sets the field separator character
--lines-terminated-by (char)     Sets the end-of-line character
--optionally-enclosed-by (char)  Sets a field enclosing character
--enclosed-by (char)             Sets a required field enclosing char
--escaped-by (char)              Sets the escape character
--mysql-delimiters               Uses MySQL's default delimiter set
  fields: ,  lines: \n  escaped-by: \  optionally-enclosed-by: '

The {{(char)}} argument above can be specified either as a normal character (e.g., {{\-\-fields-terminated-by
,}}) or via an escape sequence. Arguments of the form {{\0xhhh}} will be interpreted as a
hexidecimal representation of a character with hex number _hhh_. Arguments of the form {{\0ooo}}
will be treated as an octal representation of a character represented by octal number _ooo_.
The special escapes {{\n}}, {{\r}}, {{\"}}, and {{\b}} as well as two backslashes (which JIRA
refuses to format...) act as they do inside Java strings. {{\0}} will be treated as {{NUL}}.
This will insert {{NUL}} characters between fields or lines ((if used for {{\-\-fields-terminated-by}}
or {{\-\-lines-terminated-by}}), or will disable enclosing/escaping if used for one of the
{{\-\-enclosed-by}}, {{\-\-optionally-enclosed-by}}, or {{\-\-escaped-by}} arguments. Enclosing
and escaping are disabled by default. For unambiguous parsing, both must be enabled, e.g.,
via {{\-\-mysql-delimiters}}.

Sqoop can also transcode one delimiter set into another. So if your data is currently enclosed
in a certain set of delimiters, you can use Sqoop to help you convert it to a new set of delimiters.

The options described above will set the output delimiters used by {{toString()}}. If none
of the following options are used, they also set the input delimiters used by {{parse()}}.
But the input delimiters can be overridden on an individual basis with the options:

Input parsing options:
--input-fields-terminated-by (char)    Sets the input field separator
--input-lines-terminated-by (char)     Sets the input end-of-line char
--input-optionally-enclosed-by (char)  Sets a field enclosing character
--input-enclosed-by (char)             Sets a required field encloser
--input-escaped-by (char)              Sets the input escape character

If you use these options, then the {{parse()}} method generated for your type will use these
delimiters, while the {{toString()}} method will use the delimiters set with the output options
above. You can then use the generated class in a mapper which calls {{parse()}} on the input
records of the old delimiter set, and emits {{Text}} objects generated with the output delimiter
set from the record's {{toString()}} method.

The default delimiters are {{,}} for fields, {{\n}} for records, no quote character, and no
escape character. Note that this can lead to ambiguous/unparsible records if you import database
records containing commas or newlines in the field data.

This patch includes Hive support for user-defined delimiters. It will print a warning if you
use {{\-\-escaped-by}}, {{\-\-enclosed-by}}, or {{\-\-optionally-enclosed-by}} since Hive
does not know how to parse these. It will pass the field and record terminators through to
Hive. If you do not set any delimiters and do use {{\-\-hive-import}}, the field delimiter
will be set to {{^A}} and the record delimiter will be set to {{\n}} to be consistent with
Hive's defaults.

This patch includes integrated {{mysqldump}} support for user-defined delimiters. If your
delimiters exactly match the delimiters used by {{mysqldump}}, then Sqoop will use a fast-path
that copies the data directly from {{mysqldump}}'s output into HDFS. Otherwise, Sqoop will
parse {{mysqldump}}'s output into fields and transcode them into the user-specified delimiter
set. This is about 50% slower (as measured on a 1.5 GB test dataset import on my machine).
For convenience, the {{\-\-mysql-delimiters}} argument will set all the output delimiters
to be consistent with {{mysqldump}}'s format.

The output of the {{\-\-mysql-delimiters}}-based test says:

{{INFO manager.LocalMySQLManager: Transferred 1.5747 GB in 124.7751 seconds (12.9231 MB/sec)}}

Whereas with the default delimiters, Sqoop reports:

{{INFO manager.LocalMySQLManager: Transferred 1.5328 GB in 181.3032 seconds (8.6571 MB/sec)}}

This patch is based after MAPREDUCE-685. That should be committed to trunk before this one.
This patch adds another file in {{testdata/hive/scripts}} so the number of release audit warnings
is expected to increase by 1.
I have run all the Sqoop unit tests (including the several new tests added by this patch)
and they pass on my machine. This includes the LocalMySQLTest which tests {{mysqldump}} compatibility,
but is not run by Hudson. As noted above, I also run an at-scale test importing 1.5 GB of
data from MySQL to HDFS.

> User-configurable quote and delimiter characters for Sqoop records and record reparsing
> ---------------------------------------------------------------------------------------
>                 Key: MAPREDUCE-705
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-705
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-705.patch
> Sqoop needs a mechanism for users to govern how fields are quoted and what delimiter
characters separate fields and records. With delimiters providing an unambiguous format, a
parse method can reconstitute the generated record data object from a text-based representation
of the same record.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message