hive-dev mailing list archives

From "Mithun Radhakrishnan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances
Date Wed, 13 Aug 2014 22:20:12 GMT

     [ https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mithun Radhakrishnan updated HIVE-7341:
---------------------------------------

    Status: Patch Available  (was: Open)

> Support for Table replication across HCatalog instances
> -------------------------------------------------------
>
>                 Key: HIVE-7341
>                 URL: https://issues.apache.org/jira/browse/HIVE-7341
>             Project: Hive
>          Issue Type: New Feature
>          Components: HCatalog
>    Affects Versions: 0.13.1
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>             Fix For: 0.14.0
>
>         Attachments: HIVE-7341.1.patch, HIVE-7341.2.patch, HIVE-7341.3.patch, HIVE-7341.4.patch, HIVE-7341.5.patch
>
>
> The HCatClient currently provides little support for replicating HCatTable definitions between two HCatalog Server (i.e. Hive metastore) instances.
> Systems like Apache Falcon may need to replicate partition data between two clusters and keep the HCatalog metadata in sync between them. This poses a few problems:
> # The definition of the source table might change (in column schema, I/O formats, record formats, serde parameters, etc.). The system needs a way to diff two tables and update the target metastore with the changes. E.g.:
> {code}
> targetTable.resolve( sourceTable, targetTable.diff(sourceTable) );
> hcatClient.updateTableSchema(dbName, tableName, targetTable);
> {code}
> # The current {{HCatClient.addPartitions()}} API requires that the partition's schema be derived from the table's schema, so the table schema must be resolved *before* partitions with the new schema are added to the table. This is problematic because it introduces race conditions when two partitions with differing column schemas (e.g. right after a schema change) are copied in parallel. This can be avoided if each {{HCatAddPartitionDesc}} keeps track of the partition's schema in flight.
> # The source and target metastores might be running different/incompatible versions of Hive.
> The upcoming patch attempts to address these concerns (with some caveats):
> # {{HCatTable}} now has:
> ## a {{diff()}} method, to compare against another HCatTable instance
> ## a {{resolve(diff)}} method to copy over specified table-attributes from another HCatTable
> ## a serialize/deserialize mechanism (via {{HCatClient.serializeTable()}} and {{HCatClient.deserializeTable()}}), so that HCatTable instances constructed in other class-loaders may be used for comparison
> # {{HCatPartition}} now provides finer-grained control over a partition's column schema, StorageDescriptor settings, etc. This allows partitions to be copied completely from the source, with the ability to override specific properties if required (e.g. location).
> # {{HCatClient.updateTableSchema()}} can now update the entire table definition, not just the column schema.
> # I've cleaned up and removed most of the redundancy between HCatTable, HCatCreateTableDesc and HCatCreateTableDesc.Builder. The prior API failed to separate the table's attributes from the attributes of the add-table operation. By providing fluent interfaces in HCatTable, and composing an HCatTable instance in HCatCreateTableDesc, the interfaces are cleaner(ish). The old setters are deprecated in favour of those in HCatTable. The same applies to HCatPartition and HCatAddPartitionDesc.
> I'll post a patch for trunk shortly.
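
Editor's illustration (not part of the original message): the diff-then-resolve flow described above could look roughly like the sketch below. This is a minimal model under stated assumptions, not the patch's code: `Table`, `Attribute`, and their fields are hypothetical stand-ins for HCatTable and its attributes, and the real signatures of `diff()` and `resolve()` may differ.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical model of the diff/resolve idea; this is NOT the HCatalog API.
class TableDiffSketch {

    // Table attributes that may drift between source and target (illustrative names).
    enum Attribute { COLUMNS, INPUT_FORMAT, OUTPUT_FORMAT, SERDE_PARAMETERS }

    // Hypothetical stand-in for HCatTable.
    static final class Table {
        List<String> columns;
        String inputFormat;
        String outputFormat;
        Map<String, String> serdeParameters = new LinkedHashMap<>();

        Table(List<String> columns, String inputFormat, String outputFormat) {
            this.columns = columns;
            this.inputFormat = inputFormat;
            this.outputFormat = outputFormat;
        }

        // Returns the attributes whose values differ between this table and `other`.
        Set<Attribute> diff(Table other) {
            Set<Attribute> changed = EnumSet.noneOf(Attribute.class);
            if (!Objects.equals(columns, other.columns)) changed.add(Attribute.COLUMNS);
            if (!Objects.equals(inputFormat, other.inputFormat)) changed.add(Attribute.INPUT_FORMAT);
            if (!Objects.equals(outputFormat, other.outputFormat)) changed.add(Attribute.OUTPUT_FORMAT);
            if (!Objects.equals(serdeParameters, other.serdeParameters)) changed.add(Attribute.SERDE_PARAMETERS);
            return changed;
        }

        // Copies only the listed attributes from `other` into this table.
        Table resolve(Table other, Set<Attribute> attributes) {
            for (Attribute attribute : attributes) {
                switch (attribute) {
                    case COLUMNS: columns = other.columns; break;
                    case INPUT_FORMAT: inputFormat = other.inputFormat; break;
                    case OUTPUT_FORMAT: outputFormat = other.outputFormat; break;
                    case SERDE_PARAMETERS:
                        serdeParameters = new LinkedHashMap<>(other.serdeParameters);
                        break;
                }
            }
            return this;
        }
    }

    public static void main(String[] args) {
        // The source table gained a column; the target still has the old schema.
        Table source = new Table(Arrays.asList("id:int", "name:string", "ts:bigint"), "rcfile-in", "rcfile-out");
        Table target = new Table(Arrays.asList("id:int", "name:string"), "rcfile-in", "rcfile-out");

        Set<Attribute> changes = target.diff(source);
        System.out.println("Attributes to update: " + changes);

        target.resolve(source, changes);
        System.out.println("In sync after resolve: " + target.diff(source).isEmpty());
    }
}
```

Returning the changed attributes as a set lets the caller decide which differences to propagate, for example skipping an attribute that should stay cluster-specific.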

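Editor's illustration (not part of the original message): the second problem, partitions with differing schemas copied in parallel, goes away when each add-partition descriptor carries its own column schema. The sketch below assumes a hypothetical `PartitionDesc` and an in-memory `Metastore`; neither is the real HCatAddPartitionDesc or metastore client.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model; NOT the HCatalog API.
class PartitionSchemaSketch {

    // Hypothetical descriptor that carries the partition's own column schema,
    // instead of deriving it from the table's schema at add time.
    static final class PartitionDesc {
        final String partitionKey;
        final List<String> columns;

        PartitionDesc(String partitionKey, List<String> columns) {
            this.partitionKey = partitionKey;
            this.columns = columns;
        }
    }

    // Hypothetical in-memory metastore keyed by partition.
    static final class Metastore {
        private final Map<String, List<String>> partitions = new ConcurrentHashMap<>();

        // Stores the schema shipped with the descriptor; no table-level schema is consulted.
        void addPartition(PartitionDesc desc) {
            partitions.put(desc.partitionKey, desc.columns);
        }

        List<String> columnsOf(String partitionKey) {
            return partitions.get(partitionKey);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Metastore target = new Metastore();

        // Two partitions written on either side of a schema change.
        PartitionDesc before = new PartitionDesc("dt=2014-08-12", Arrays.asList("id:int", "name:string"));
        PartitionDesc after = new PartitionDesc("dt=2014-08-13", Arrays.asList("id:int", "name:string", "ts:bigint"));

        // Copy both in parallel; ordering does not matter because each carries its schema.
        Thread first = new Thread(() -> target.addPartition(before));
        Thread second = new Thread(() -> target.addPartition(after));
        first.start();
        second.start();
        first.join();
        second.join();

        System.out.println(target.columnsOf("dt=2014-08-12"));
        System.out.println(target.columnsOf("dt=2014-08-13"));
    }
}
```

Because neither add depends on the table's current schema, the two copies need no ordering between themselves or relative to the table-schema update.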


--
This message was sent by Atlassian JIRA
(v6.2#6252)
