Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Sat, 8 Jul 2017 21:14:00 +0000 (UTC)
From: "Eugene Koifman (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.13077771.1496775592000.200835.1499548440028@Atlassian.JIRA>
In-Reply-To: <JIRA.13077771.1496775592000@Atlassian.JIRA>
References: <JIRA.13077771.1496775592000@Atlassian.JIRA> <JIRA.13077771.1496775592457@jira-lw-us.apache.org>
Subject: [jira] [Updated] (HIVE-16832) duplicate ROW__ID possible in multi
 insert into transactional table
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Sat, 08 Jul 2017 21:14:10 -0000


     [ https://issues.apache.org/jira/browse/HIVE-16832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16832:
----------------------------------
    Description: 
{noformat}
 create table AcidTablePart(a int, b int) partitioned by (p string) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
 create temporary table if not exists data1 (x int);
 insert into data1 values (1);
 from data1
   insert into AcidTablePart partition(p) select 0, 0, 'p' || x
   insert into AcidTablePart partition(p='p1') select 0, 1
{noformat}

Each branch of this multi-insert creates a row in partition p1/bucket0 with ROW__ID=(1,0,0).
The same can happen when running a SQL Merge (HIVE-10924) statement that has both Insert and Update clauses when the target table has _'transactional'='true','transactional_properties'='default'_ (see HIVE-14035).  This is because Merge is run internally as a multi-insert statement.

The solution relies on the statement ID introduced in HIVE-11030.  Each Insert clause of a multi-insert gets a unique ID.
ROW__ID.bucketId now becomes a bit-packed triplet (format version, bucketId, statementId).
(Since ORC stores field names in the data file, we can't rename ROW__ID.bucketId.)
This ensures that there are no collisions and retains the desired sort properties of ROW__ID.
In particular, _SortedDynPartitionOptimizer_ works without any changes, even in cases where there are fewer reducers than buckets.
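
A minimal sketch of how such a bit-packed triplet could be laid out in a single int.  The class name, method names and field widths here are assumptions made for illustration (they are not taken from this issue or its patches): bit 31 unused, 3 bits for the format version, 12 bits for bucketId, 4 unused bits, 12 bits for statementId.  Placing bucketId above statementId means plain integer comparison still orders rows by bucket first.

```java
/** Illustrative only: the layout below is an assumed one, not the committed Hive encoding. */
final class BucketPropertySketch {
    // Assumed layout, most significant bit first:
    // 1 unused (sign) | 3 version | 12 bucketId | 4 unused | 12 statementId
    private static final int VERSION_SHIFT = 28;
    private static final int BUCKET_SHIFT = 16;
    private static final int VERSION_MASK = 0x7;  // 3 bits
    private static final int FIELD_MASK = 0xFFF;  // 12 bits

    private BucketPropertySketch() {
    }

    /** Packs (format version, bucketId, statementId) into one non-negative int. */
    static int encode(int version, int bucketId, int statementId) {
        checkRange("version", version, VERSION_MASK);
        checkRange("bucketId", bucketId, FIELD_MASK);
        checkRange("statementId", statementId, FIELD_MASK);
        return (version << VERSION_SHIFT) | (bucketId << BUCKET_SHIFT) | statementId;
    }

    static int version(int encoded) {
        return (encoded >>> VERSION_SHIFT) & VERSION_MASK;
    }

    static int bucketId(int encoded) {
        return (encoded >>> BUCKET_SHIFT) & FIELD_MASK;
    }

    static int statementId(int encoded) {
        return encoded & FIELD_MASK;
    }

    // Rejects values that would not fit in the bits reserved for the field.
    private static void checkRange(String name, int value, int max) {
        if (value < 0 || value > max) {
            throw new IllegalArgumentException(name + " out of range: " + value);
        }
    }
}
```

Under this assumed layout, the two branches of the multi-insert above would write to bucket 0 with statement IDs 0 and 1, so encode(1, 0, 0) and encode(1, 0, 1) differ and the resulting ROW__ID values no longer collide, while bucketId(...) still recovers 0 for both.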


  was:
{noformat}
 create table AcidTablePart(a int, b int) partitioned by (p string) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
 create temporary table if not exists data1 (x int);
 insert into data1 values (1),(2),(1);
 from data1
   insert into AcidTablePart partition(p) select 0, 0, 'p' || x
   insert into AcidTablePart partition(p='p1') select 0, 1
{noformat}

Each branch of this multi-insert creates a row in partition p1/bucket0 with ROW__ID=(1,0,0).
The same can happen when running a SQL Merge (HIVE-10924) statement that has both Insert and Update clauses when the target table has _'transactional'='true','transactional_properties'='default'_ (see HIVE-14035).  This is because Merge is run internally as a multi-insert statement.

The solution relies on the statement ID introduced in HIVE-11030.  Each Insert clause of a multi-insert gets a unique ID.
ROW__ID.bucketId now becomes a bit-packed triplet (format version, bucketId, statementId).
(Since ORC stores field names in the data file, we can't rename ROW__ID.bucketId.)
This ensures that there are no collisions and retains the desired sort properties of ROW__ID.
In particular, _SortedDynPartitionOptimizer_ works without any changes, even in cases where there are fewer reducers than buckets.


> duplicate ROW__ID possible in multi insert into transactional table
> -------------------------------------------------------------------
>
>                 Key: HIVE-16832
>                 URL: https://issues.apache.org/jira/browse/HIVE-16832
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 2.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Critical
>         Attachments: HIVE-16832.01.patch, HIVE-16832.03.patch, HIVE-16832.04.patch, HIVE-16832.05.patch, HIVE-16832.06.patch, HIVE-16832.08.patch, HIVE-16832.09.patch, HIVE-16832.10.patch, HIVE-16832.11.patch, HIVE-16832.14.patch, HIVE-16832.15.patch, HIVE-16832.16.patch, HIVE-16832.17.patch, HIVE-16832.18.patch, HIVE-16832.19.patch, HIVE-16832.20.patch, HIVE-16832.20.patch
>
>
> {noformat}
>  create table AcidTablePart(a int, b int) partitioned by (p string) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');
>  create temporary table if not exists data1 (x int);
>  insert into data1 values (1);
>  from data1
>    insert into AcidTablePart partition(p) select 0, 0, 'p' || x
>    insert into AcidTablePart partition(p='p1') select 0, 1
> {noformat}
> Each branch of this multi-insert creates a row in partition p1/bucket0 with ROW__ID=(1,0,0).
> The same can happen when running a SQL Merge (HIVE-10924) statement that has both Insert and Update clauses when the target table has _'transactional'='true','transactional_properties'='default'_ (see HIVE-14035).  This is because Merge is run internally as a multi-insert statement.
> The solution relies on the statement ID introduced in HIVE-11030.  Each Insert clause of a multi-insert gets a unique ID.
> ROW__ID.bucketId now becomes a bit-packed triplet (format version, bucketId, statementId).
> (Since ORC stores field names in the data file, we can't rename ROW__ID.bucketId.)
> This ensures that there are no collisions and retains the desired sort properties of ROW__ID.
> In particular, _SortedDynPartitionOptimizer_ works without any changes, even in cases where there are fewer reducers than buckets.

