hive-dev mailing list archives

From "Sushanth Sowmyan (JIRA)" <>
Subject [jira] [Commented] (HIVE-8371) HCatStorer should fail by default when publishing to an existing partition
Date Tue, 07 Oct 2014 00:47:34 GMT


Sushanth Sowmyan commented on HIVE-8371:

The main goal of HIVE-6405 was to unify expected behaviour between hive and hcatalog, and
making the default for HCatStorer different from the default for hive defeats that purpose.
To that end, I disagree that it should fail by default, unless you are also saying that hive
should fail by default when inserting into a partition that already exists.

I fully see that data quality concerns call for immutability when jobs are not written
assuming idempotency, and that's why HIVE-6406 added a table-wide property to do exactly
that: the default append behaviour can currently be turned off table-wide by setting
"immutable"="true" as a table property. My suggestion would be to use that on tables with
jobs that you expect to hit this problem.

If your requirement is to have a job-level property that handles this, then sticking to
the "keeping in-sync with hive default behaviour" principle leads me to the following
behaviour:

a) If no special argument is provided, stick to defaults for the table - i.e., hive defaults,
and overridable by the "immutable" property, which also overrides the default behaviour for
b) org.apache.hive.hcatalog.pig.HCatStorer('partspec', '', ' -immutable') => ignore immutable
setting, disallow append.
c) org.apache.hive.hcatalog.pig.HCatStorer('partspec', '', ' -append') => ignore immutable
setting, allow append. 
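
To make the three cases concrete, here is a minimal Python sketch of the resolution order described in (a), (b) and (c). It is only an illustration of the proposal, not HCatalog code: the names `JobFlag` and `append_allowed` are made up, and the sketch assumes that an unset "immutable" property means the append default.

```python
from enum import Enum
from typing import Mapping, Optional


class JobFlag(Enum):
    """Hypothetical job-level HCatStorer options from the list above."""
    IMMUTABLE = "-immutable"  # option (b)
    APPEND = "-append"        # option (c)


def append_allowed(table_props: Mapping[str, str],
                   job_flag: Optional[JobFlag] = None) -> bool:
    """Return True if a job may append to a partition that already exists."""
    if job_flag is JobFlag.IMMUTABLE:
        # (b): ignore the table setting, disallow append.
        return False
    if job_flag is JobFlag.APPEND:
        # (c): ignore the table setting, allow append.
        return True
    # (a): no job-level argument, so the table's "immutable" property decides.
    # Assumption: an unset property means the hive default, which is append.
    return table_props.get("immutable", "false").strip().lower() != "true"
```

For example, `append_allowed({"immutable": "true"})` is False, while `append_allowed({"immutable": "true"}, JobFlag.APPEND)` is True, which shows how a job-level `-append` would bypass the table-level setting.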

Now, thinking more about this, if we were to have a job-level override, to be honest, I am
not comfortable with (c), since it's possible for an end user to write a pig script that ignores
the table-level immutability property if set, and have it cause data quality issues later,
even if a user tries to control it for that table using the "immutable" property. Thus, I
think we should not implement (c) in this case. I am okay with implementing (b) if you want
to have a safeguard default.

I would further say, btw, that I would also be okay with making the default value for the
"immutable" table property (i.e. the value it has if it isn't set) configurable on a
warehouse-wide level from hive-site.xml. That would also solve your problem without
requiring you to go set it for each table.
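
As a rough sketch of that fallback, assuming a site-level setting existed: the table property wins when it is set, otherwise the warehouse-wide default applies. The key name `hypothetical.warehouse.default.immutable` and the function `effective_immutable` are invented for illustration; no such hive-site.xml property is named in this discussion.

```python
from typing import Mapping

# Hypothetical hive-site.xml key, used only for this illustration.
WAREHOUSE_DEFAULT_KEY = "hypothetical.warehouse.default.immutable"


def effective_immutable(table_props: Mapping[str, str],
                        site_conf: Mapping[str, str]) -> bool:
    """Resolve whether a table should be treated as immutable."""
    if "immutable" in table_props:
        # An explicit table property always takes precedence.
        value = table_props["immutable"]
    else:
        # Otherwise fall back to the warehouse-wide default; if that is also
        # unset, assume "false" (append allowed).
        value = site_conf.get(WAREHOUSE_DEFAULT_KEY, "false")
    return value.strip().lower() == "true"
```

With the site default set to "true", every table without its own "immutable" property would reject appends, and a single table could still opt out by setting "immutable"="false".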

> HCatStorer should fail by default when publishing to an existing partition
> --------------------------------------------------------------------------
>                 Key: HIVE-8371
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1
>            Reporter: Thiruvel Thirumoolan
>            Assignee: Thiruvel Thirumoolan
>              Labels: hcatalog, partition
> In Hive 0.12 and before (or in previous HCatalog releases), HCatStorer would fail if the
> partition already existed (whether before launching the job or during commit, depending on the
> partitioning). HIVE-6406 changed that behavior and by default does an append. This causes
> data quality issues since a rerun (or duplicate run) won't fail (when it used to) and will
> just append to the partition.
> A preferable approach would be to leave HCatStorer behavior as is (fail during a duplicate
> publish) and support append through an option. Overwrite can also be implemented in a similar
> fashion. E.g.:
> store A into 'db.table' using org.apache.hive.hcatalog.pig.HCatStorer('partspec', '', ' -append');

This message was sent by Atlassian JIRA
