sqoop-dev mailing list archives

From "Daniel Voros (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SQOOP-3267) Incremental import to HBase deletes only last version of column
Date Sun, 21 Jan 2018 12:17:00 GMT

    [ https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333491#comment-16333491 ]

Daniel Voros commented on SQOOP-3267:

[~vasas], [~maugli] thank you both for your replies!

I agree with you on keeping the history. I see two ways to do that.

Option A) is a sort-of workaround. We document that Sqoop deletes all previous versions of a
column when it is updated to NULL in the source table, and ask users to set KEEP_DELETED_CELLS
on the underlying HBase table if they still want to preserve history.
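
To illustrate why the number of deleted versions matters, here is a toy model of a single versioned
cell. It is only a sketch: the class and method names are hypothetical, it does not use the HBase
API, and it ignores tombstones and KEEP_DELETED_CELLS entirely.

```java
import java.util.TreeMap;

public class VersionedCellSketch {
    // timestamp -> value; the entry with the newest timestamp is the visible version
    private final TreeMap<Long, String> versions = new TreeMap<>();

    public void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    /** Returns the newest remaining value, or null if no version is left. */
    public String read() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    /** Removes only the newest version, so an older one becomes visible again. */
    public void deleteLatestVersion() {
        versions.pollLastEntry();
    }

    /** Removes every version of the cell. */
    public void deleteAllVersions() {
        versions.clear();
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch();
        cell.put(1L, "first import");
        cell.put(2L, "second import");

        cell.deleteLatestVersion();
        System.out.println(cell.read()); // the value from the first import resurfaces

        cell.deleteAllVersions();
        System.out.println(cell.read()); // nothing is left to read
    }
}
```

In this model, deleting only the latest version makes a stale value visible again, while deleting
all versions hides the column but also discards its history.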

Option B) is inserting an empty string for NULL values. In detail:
 - Whenever we're importing a NULL value (whether it's from a new row or an updated one)
we insert a special string (let's call it NULL_STRING). Note that it would be better to insert
NULL, but HBase lacks the notion of NULL.
 - The value of NULL_STRING is "" (empty string) by default but is configurable. (Probably
via the already existing {{--null-string}} argument.)
 - This behavior should NOT depend on the incremental mode ("append" or "lastmodified").
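
A minimal sketch of the substitution in Option B), assuming a row is available as a map from
column name to value. The class, the method {{toCellValues}} and the {{nullString}} parameter are
hypothetical names used for illustration only; this is not Sqoop's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NullStringSketch {
    // Assumed default, following the proposal above: the empty string.
    public static final String DEFAULT_NULL_STRING = "";

    /**
     * Maps one source row to the string values that would be written,
     * replacing every NULL with nullString instead of skipping or deleting the column.
     */
    public static Map<String, String> toCellValues(Map<String, Object> row, String nullString) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : row.entrySet()) {
            Object value = entry.getValue();
            cells.put(entry.getKey(), value == null ? nullString : value.toString());
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("name", null); // NULL in the source table

        // The same substitution applies to new and updated rows alike.
        System.out.println(toCellValues(row, DEFAULT_NULL_STRING));
    }
}
```

Because every column gets a value, a NULL becomes just another version of the cell and nothing
needs to be deleted.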

Two notes:
 - Option B) is similar to what was introduced in Phoenix in PHOENIX-1578. (When using the
STORE_NULLS=true table option, there's no way to tell a NULL value from an empty string in
 - I've also tested how Hive treats NULL values when storing a table in HBase. Empty (or deleted)
columns are displayed as NULL when reading the table, but updating a column to NULL in Hive
fails (see HIVE-3336).

> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>                 Key: SQOOP-3267
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3267
>             Project: Sqoop
>          Issue Type: Bug
>          Components: hbase-integration
>    Affects Versions: 1.4.7
>            Reporter: Daniel Voros
>            Assignee: Daniel Voros
>            Priority: Major
>         Attachments: SQOOP-3267.1.patch
> Deletes are supported since SQOOP-3149, but we're only deleting the last version of a column when the corresponding cell was set to NULL in the source table.
> This can lead to unexpected and misleading results if the row has been transferred multiple times, which can easily happen if it's being modified on the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a single Put per row as before. This could probably lead to a performance drop for wide tables (for which HBase is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be the expected behavior here?

This message was sent by Atlassian JIRA
