From: "Hudson (JIRA)"
To: dev@sqoop.apache.org
Reply-To: dev@sqoop.apache.org
Date: Wed, 11 Jan 2017 12:43:58 +0000 (UTC)
Subject: [jira] [Commented] (SQOOP-3068) Enhance error (tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema) to suggest workaround (--map-column-java)

    [ https://issues.apache.org/jira/browse/SQOOP-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818208#comment-15818208 ]

Hudson commented on SQOOP-3068:
-------------------------------

SUCCESS: Integrated in Jenkins build Sqoop-hadoop20 #1085 (See [https://builds.apache.org/job/Sqoop-hadoop20/1085/])
SQOOP-3068: Enhance error (tool.ImportTool: Encountered IOException (maugli: [https://git-wip-us.apache.org/repos/asf?p=sqoop.git&a=commit&h=c0b9bc435396600b7b8798190c67ce77aa81a02c])
* (edit) src/java/org/apache/sqoop/mapreduce/ParquetJob.java
* (edit) src/test/org/apache/sqoop/tool/TestImportTool.java
* (add) src/java/org/apache/sqoop/avro/AvroSchemaMismatchException.java
* (edit) src/java/org/apache/sqoop/tool/ImportTool.java
* (add) src/test/org/apache/sqoop/util/ExpectedLogMessage.java
* (edit) src/test/com/cloudera/sqoop/hive/TestHiveImport.java


> Enhance error (tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema) to suggest workaround (--map-column-java)
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SQOOP-3068
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3068
>             Project: Sqoop
>          Issue Type: Improvement
>          Components: hive-integration
>    Affects Versions: 1.4.6
>            Reporter: Markus Kemper
>            Assignee: Szabolcs Vasas
>         Attachments: SQOOP-3068.patch, SQOOP-3068.patch, SQOOP-3068.patch, SQOOP-3068.patch
>
> Please consider enhancing the error to include more detail and to suggest the workaround (--map-column-java).
> Sqoop (import + --hive-import + --as-parquetfile) can fail due to a mismatch between the JSON (Avro) schema that Hive produces and the JSON (Avro) schema that Sqoop generates.
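
Condensed from the error output quoted in the reproduction steps further down, the mismatch comes down to two columns:

{noformat}
Hive-generated (expected): {"name":"c_date","type":["null","string"],...}, {"name":"c_timestamp","type":["null","string"],...}
Sqoop-generated (actual):  {"name":"c_date","type":["null","long"],...},   {"name":"c_timestamp","type":["null","long"],...}
{noformat}

Hive stores the DATE and TIMESTAMP columns of the Parquet table as string, while Sqoop maps them to long, so the schema comparison against the existing table fails.
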
> The test case below demonstrates how to reproduce the issue as well as how to work around it.
>
> SETUP (create parquet table with Sqoop import and Beeline CTAS)
> {noformat}
> STEP 01: Create MySQL Table
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "drop table t1"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "create table t1 (c_int int, c_date date, c_timestamp timestamp)"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "describe t1"
> ---------------------------------------------------------------------------------------------------------
> | Field       | Type      | Null | Key | Default           | Extra                       |
> ---------------------------------------------------------------------------------------------------------
> | c_int       | int(11)   | YES  |     | (null)            |                             |
> | c_date      | date      | YES  |     | (null)            |                             |
> | c_timestamp | timestamp | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
> ---------------------------------------------------------------------------------------------------------
>
> STEP 02: Insert and Select Row
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "insert into t1 values (1, current_date(), current_timestamp())"
> sqoop eval --connect $MYCONN --username $MYUSER --password $MYPSWD --query "select * from t1"
> --------------------------------------------------
> | c_int | c_date     | c_timestamp           |
> --------------------------------------------------
> | 1     | 2016-10-26 | 2016-10-26 14:30:33.0 |
> --------------------------------------------------
>
> STEP 03: Create Hive Tables
> beeline -u jdbc:hive2:// -e "use default; drop table t1"
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hcatalog-database default --hcatalog-table t1_text --create-hcatalog-table --hcatalog-storage-stanza 'stored as parquet' --num-mappers 1
> beeline -u jdbc:hive2:// -e "use default; create table t1 stored as parquet as select * from t1_text; show create table t1;"
> +----------------------------------------------------+--+
> |                   createtab_stmt                    |
> +----------------------------------------------------+--+
> | CREATE TABLE `t1`(                                  |
> |   `c_int` int,                                      |
> |   `c_date` string,                                  |
> |   `c_timestamp` string)                             |
> | ROW FORMAT SERDE                                    |
> |   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' |
> | STORED AS INPUTFORMAT                               |
> |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' |
> | OUTPUTFORMAT                                        |
> |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
> | LOCATION                                            |
> |   'hdfs://:8020/user/hive/warehouse/t1'             |
> | TBLPROPERTIES (                                     |
> |   'COLUMN_STATS_ACCURATE'='true',                   |
> |   'numFiles'='1',                                   |
> |   'numRows'='2',                                    |
> |   'rawDataSize'='6',                                |
> |   'totalSize'='605',                                |
> |   'transient_lastDdlTime'='1478298298')             |
> +----------------------------------------------------+--+
> {noformat}
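
Before moving on to the failure, it can help to confirm the schema Hive actually wrote for t1, since that is the "Expected schema" side of the error below. A minimal check; the describe call mirrors the beeline usage above, while the parquet-tools invocation and part-file name are assumptions (adjust the jar name and path to your cluster):

{noformat}
# Confirm that Hive typed c_date and c_timestamp as string in the Parquet table
beeline -u jdbc:hive2:// -e "use default; describe t1;"

# Optionally inspect the schema embedded in the Parquet file itself
# (file name below is illustrative)
hdfs dfs -ls /user/hive/warehouse/t1
hadoop jar parquet-tools.jar schema /user/hive/warehouse/t1/000000_0
{noformat}
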
>
> REPRODUCE ISSUE (import --hive-import append and overwrite)
> {noformat}
> STEP 01: Attempt --hive-import --append
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1 --append
> 16/11/02 20:54:23 WARN mapreduce.DataDrivenImportJob: Target Hive table 't1' exists! Sqoop will append data into the existing Hive table. Consider using --hive-overwrite, if you do NOT intend to do appending.
> 16/11/02 20:54:24 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema: {"type":"record","name":"t1","fields":[{"name":"c_int","type":["null","int"],"doc":"Converted from 'int'","default":null},{"name":"c_date","type":["null","string"],"doc":"Converted from 'string'","default":null},{"name":"c_timestamp","type":["null","string"],"doc":"Converted from 'string'","default":null}]}
> Actual schema: {"type":"record","name":"t1","doc":"Sqoop import of t1","fields":[{"name":"c_int","type":["null","int"],"default":null,"columnName":"c_int","sqlType":"4"},{"name":"c_date","type":["null","long"],"default":null,"columnName":"c_date","sqlType":"91"},{"name":"c_timestamp","type":["null","long"],"default":null,"columnName":"c_timestamp","sqlType":"93"}],"tableName":"t1"}
>
> STEP 02: Attempt --hive-import --hive-overwrite
> sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --num-mappers 1 --hive-overwrite
> 16/11/02 20:56:55 INFO hive.metastore: Connected to metastore.
> 16/11/02 20:56:56 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Expected schema: {"type":"record","name":"t1","fields":[{"name":"c_int","type":["null","int"],"doc":"Converted from 'int'","default":null},{"name":"c_date","type":["null","string"],"doc":"Converted from 'string'","default":null},{"name":"c_timestamp","type":["null","string"],"doc":"Converted from 'string'","default":null}]}
> Actual schema: {"type":"record","name":"t1","doc":"Sqoop import of t1","fields":[{"name":"c_int","type":["null","int"],"default":null,"columnName":"c_int","sqlType":"4"},{"name":"c_date","type":["null","long"],"default":null,"columnName":"c_date","sqlType":"91"},{"name":"c_timestamp","type":["null","long"],"default":null,"columnName":"c_timestamp","sqlType":"93"}],"tableName":"t1"}
> {noformat}
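
The workaround the summary points to is --map-column-java: override Sqoop's Java type mapping for the DATE and TIMESTAMP columns so that the generated Avro schema carries string fields matching the columns Hive created. A sketch of such a run, reusing the environment variables from the setup; the exact command and column mappings are an illustration inferred from the schemas above, not a transcript from the original report:

{noformat}
WORKAROUND (map the mismatched columns to String so the schemas agree)
sqoop import -Dmapreduce.map.log.level=DEBUG --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hive-import --hive-database default --hive-table t1 --as-parquetfile --map-column-java c_date=String,c_timestamp=String --num-mappers 1 --append
{noformat}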