hive-dev mailing list archives

From "Sushanth Sowmyan (JIRA)" <>
Subject [jira] [Commented] (HIVE-7341) Support for Table replication across HCatalog instances
Date Wed, 13 Aug 2014 20:40:13 GMT


Sushanth Sowmyan commented on HIVE-7341:

Mithun, this is probably the biggest patch I'm going to +1 with nearly no comments to further
refine! :D

That said, I do have one minor comment:

+    LOG.warn("HiveStorageHandlers can't be instantiated on the client-side. " +
+        "Replication of StorageHandler-based tables is not supported at this time. " +
+        "Attempting to derive Input/OutputFormat settings from StorageHandler, on best effort: ");

That log line's reference to replication could confuse users who are simply using HCatTable and
specifying a storage handler, so I'd suggest leaving that middle line out. Also, I'd suggest
using the word "reliably", as in "HiveStorageHandlers can't be reliably instantiated on the
client-side", since I do expect that basic functionality, such as that used by create table, can
actually be used just fine for existing storage handlers. It's only if we start attempting
to do things like partition pushdown, or actually read data, that we'd get into issues.
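
For illustration, the suggested wording might look like the sketch below. The class and method
names (StorageHandlerWarning, message()) are hypothetical stand-ins and not part of the patch;
the real code would presumably pass this string to the existing LOG.warn(...) call.

```java
// Hypothetical holder for the reworded warning; not part of the HIVE-7341 patch.
class StorageHandlerWarning {

    // Builds the suggested text: "reliably" is added and the replication sentence is dropped.
    static String message() {
        return "HiveStorageHandlers can't be reliably instantiated on the client-side. "
             + "Attempting to derive Input/OutputFormat settings from StorageHandler, on best effort: ";
    }

    public static void main(String[] args) {
        System.out.println(message());
    }
}
```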

> Support for Table replication across HCatalog instances
> -------------------------------------------------------
>                 Key: HIVE-7341
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: HCatalog
>    Affects Versions: 0.13.1
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>             Fix For: 0.14.0
>         Attachments: HIVE-7341.1.patch, HIVE-7341.2.patch, HIVE-7341.3.patch, HIVE-7341.4.patch
> The HCatClient currently doesn't provide very much support for replicating HCatTable
definitions between 2 HCatalog Server (i.e. Hive metastore) instances. 
> Systems similar to Apache Falcon might find the need to replicate partition data between
2 clusters, and keep the HCatalog metadata in sync between the two. This poses a couple of problems:
> # The definition of the source table might change (in column schema, I/O formats, record-formats,
serde-parameters, etc.) The system will need a way to diff 2 tables and update the target-metastore
with the changes. E.g. 
> {code}
> targetTable.resolve( sourceTable, targetTable.diff(sourceTable) );
> hcatClient.updateTableSchema(dbName, tableName, targetTable);
> {code}
> # The current {{HCatClient.addPartitions()}} API requires that the partition's schema
be derived from the table's schema, thereby requiring that the table-schema be resolved *before*
partitions with the new schema are added to the table. This is problematic, because it introduces
race conditions when 2 partitions with differing column-schemas (e.g. right after a schema
change) are copied in parallel. This can be avoided if each HCatAddPartitionDesc kept track
of the partition's schema, in flight.
> # The source and target metastores might be running different/incompatible versions of Hive.
> The impending patch attempts to address these concerns (with some caveats).
> # {{HCatTable}} now has 
> ## a {{diff()}} method, to compare against another HCatTable instance
> ## a {{resolve(diff)}} method to copy over specified table-attributes from another HCatTable
> ## a serialize/deserialize mechanism (via {{HCatClient.serializeTable()}} and {{HCatClient.deserializeTable()}}),
so that HCatTable instances constructed in other class-loaders may be used for comparison
> # {{HCatPartition}} now provides finer-grained control over a Partition's column-schema,
StorageDescriptor settings, etc. This allows partitions to be copied completely from source,
with the ability to override specific properties if required (e.g. location).
> # {{HCatClient.updateTableSchema()}} can now update the entire table-definition, not
just the column schema.
> # I've cleaned up and removed most of the redundancy between the HCatTable, HCatCreateTableDesc
and HCatCreateTableDesc.Builder. The prior API failed to separate the table-attributes from
the add-table-operation's attributes. By providing fluent-interfaces in HCatTable, and composing
an HCatTable instance in HCatCreateTableDesc, the interfaces are cleaner(ish). The old setters
are deprecated, in favour of those in HCatTable. Likewise, HCatPartition and HCatAddPartitionDesc.
> I'll post a patch for trunk shortly.
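
As a rough, self-contained sketch of the diff-then-resolve idea in the description above:
the TableDef class, its fields, and the Attribute enum below are toy stand-ins invented for
illustration. They are not the HCatTable API, and the real diff() and resolve() signatures
may differ.

```java
import java.util.EnumSet;
import java.util.Objects;

// Toy stand-in for a table definition; names are assumptions, not the HCatalog API.
class TableDef {
    enum Attribute { COLUMNS, INPUT_FORMAT, OUTPUT_FORMAT }

    String columns;
    String inputFormat;
    String outputFormat;

    TableDef(String columns, String inputFormat, String outputFormat) {
        this.columns = columns;
        this.inputFormat = inputFormat;
        this.outputFormat = outputFormat;
    }

    // Returns the set of attributes on which this table differs from `other`.
    EnumSet<Attribute> diff(TableDef other) {
        EnumSet<Attribute> changed = EnumSet.noneOf(Attribute.class);
        if (!Objects.equals(columns, other.columns)) changed.add(Attribute.COLUMNS);
        if (!Objects.equals(inputFormat, other.inputFormat)) changed.add(Attribute.INPUT_FORMAT);
        if (!Objects.equals(outputFormat, other.outputFormat)) changed.add(Attribute.OUTPUT_FORMAT);
        return changed;
    }

    // Copies only the listed attributes from `source` into this table.
    TableDef resolve(TableDef source, EnumSet<Attribute> attributes) {
        for (Attribute a : attributes) {
            switch (a) {
                case COLUMNS: columns = source.columns; break;
                case INPUT_FORMAT: inputFormat = source.inputFormat; break;
                case OUTPUT_FORMAT: outputFormat = source.outputFormat; break;
            }
        }
        return this;
    }
}

class ReplicationSketch {
    public static void main(String[] args) {
        TableDef source = new TableDef("id int, name string", "in.Format", "out.Format");
        TableDef target = new TableDef("id int", "in.Format", "out.Format");

        // Mirror of targetTable.resolve(sourceTable, targetTable.diff(sourceTable)).
        EnumSet<TableDef.Attribute> changes = target.diff(source);
        target.resolve(source, changes);

        System.out.println("Applied: " + changes + ", remaining diff: " + target.diff(source));
    }
}
```

Returning the diff as a set of attribute names lets the caller inspect or filter the changes
before applying them, for example to keep a target-specific setting untouched.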

This message was sent by Atlassian JIRA
