sqoop-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ricardo Gaspar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-3240) Hive table Row count mismatch(more records in hadoop/hive than oracle) when importing data from oracle using sqoop import with direct mode
Date Tue, 07 May 2019 15:51:00 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834903#comment-16834903
] 

Ricardo Gaspar commented on SQOOP-3240:
---------------------------------------

Hi, I'm having a similar problem when using the direct mode against an Oracle database (based
on Oracle 12)  and Sqoop 1.4.6-cdh5.14.4.

In my case I'm not even specifying any query or where clause, just sqooping the full table.


[~infosuresh2k], did you find any solution/fix for this in the meantime? It's been a while
since this issue was posted. 
An obvious but not very efficient solution would be to set mappers to 1 (when no single PK
is available to use with the split-by argument).
 

> Hive table Row count mismatch(more records in hadoop/hive than oracle) when importing
data from oracle using sqoop import with direct mode
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SQOOP-3240
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3240
>             Project: Sqoop
>          Issue Type: Bug
>          Components: connectors/oracle
>    Affects Versions: 1.4.6
>         Environment: Sqoop 1.4.6-cdh5.8.2
> Oracle 11i
> oracle table Sqoop import with direct mode
>            Reporter: Suresh Venkatesan
>            Priority: Minor
>
> Hive/Hadoop row count was not matching on sqoop import of oracle table with direct mode,
it's perfectly matching without direct mode.
> See the below examples.
> 1. Row Count is varying based on fetch-size 10000 and direct mode.
> 2. Row count is not consistent, sometimes it is 630869 and sometimes 639924 etc.,
> 3. We are pulling data with direct mode for 1 year with monthly partition, data mismatch
happening for couple of months not for all months.
> 4. Hive/Hadoop table Count looks good, If we import the data from oracle without direct
mode.
> Actual row count in oracle table for year 2015 and month 12 is : 609924, we are getting
more records on sqoop import with direct mode : 630869. hive table count is matching with
oracle count on sqoop import *without direct* mode : 609924
> *1. sqoop import without direct mode, Row Count : 609924*
> sqoop import -Dmapredue.job.queuename=queue -Doraoop.import.partitions=oracle_table_partition_201512
--connect jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table oracle_db.oracle_table
--split-by split_column --hive-import --hive-table hive_db.hive_table --where "TO_DATE(oracle_timestamp_column)
BETWEEN TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')" --fields-terminated-by
'\001' --lines-terminated-by '\n' --as-textfile -m 4 
> *2. sqoop import with direct mode, Row Count : 630869*
> sqoop import -Dmapredue.job.queuename=queue -Doraoop.import.partitions=oracle_table_partition_201512
--connect jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table oracle_db.oracle_table
--split-by split_column --hive-import --hive-table hive_db.hive_table --where "TO_DATE(oracle_timestamp_column)
BETWEEN TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')" --fields-terminated-by
'\001' --lines-terminated-by '\n' --as-textfile -m 4 --direct



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message