hive-issues mailing list archives

From "Sushanth Sowmyan (JIRA)" <>
Subject [jira] [Commented] (HIVE-13652) Import table change order of dynamic partitions
Date Mon, 02 May 2016 18:59:12 GMT


Sushanth Sowmyan commented on HIVE-13652:

Adding some general background info for anyone who wishes to work on this:

(Note: this is not specifically about Hive Export/Import, but about Hive managed table
partition creation in general, and the problem is that there isn't a "good" solution to this
that won't rub someone the wrong way.)

Given that the source can be any arbitrary table, even one created by a user outside of Hive,
deciding what "order" to retain is tricky, and it can even be difficult to know what "order" was used.
This is so, since the source can have partition year=2012, hour=18, and yet have a directory
that looks like any of the following:


Thus, we do not store the correlation between partition key-values in source and destination,
and the only thing we "know" is that a partition with a set of key-value-pairs is associated
with some data that we read. Thus, in the destination, irrespective of what the source said
about the dir name, we ignore it, and recreate a partition based only on key-value pair info,
and let Hive's default loading mechanism pick the location for us.
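
As a rough illustration of the point above (not Hive's actual code), the sketch below parses
partition directory names into key-value specs. The directory layouts and the helper name
parse_partition_path are made up for this example and are not taken from the original comment;
the point is only that differently ordered directories can describe the same partition, so the
key-value set is the only reliable identity.

```python
def parse_partition_path(relative_path):
    """Turn 'year=2012/hour=18' into {'year': '2012', 'hour': '18'}.

    Hypothetical helper for illustration; assumes every path component
    has the form key=value, which need not hold for external tables.
    """
    spec = {}
    for component in relative_path.strip("/").split("/"):
        key, _, value = component.partition("=")
        spec[key] = value
    return spec


# Two made-up source layouts for the same partition.
layout_a = parse_partition_path("year=2012/hour=18")
layout_b = parse_partition_path("hour=18/year=2012")

# Dict equality ignores key order, so both describe the same partition.
same_partition = layout_a == layout_b
print(same_partition)  # True
```

Once the directory name is reduced to a key-value set like this, nothing about the source's
directory order survives, which is why the destination layout is chosen independently.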


The underlying problem here is this: currently, the list of key-values is stored as a HashMap,
which is not ordered, and thus its iteration order is not guaranteed to be identical across JDKs
or OSes. This doesn't currently affect us, however, since the order is only relevant at the time
a partition is created, and as long as the metadata consistently points to the correct location,
Hive doesn't care.

Since we don't force an order, the order is whatever iteration order that HashMap happens to
produce for those values on that JDK version + OS. This means that as long as you don't
change JDK version + OS + the key-values, it is repeatably consistent. Change even one of those,
however, and you could easily wind up with a different order. This can even happen within Hive:
we've done "ALTER TABLE ADD PARTITION" for a while on a cluster, we upgrade the JDK, and
then we do another "ALTER TABLE ADD PARTITION", and it picks dd/mm instead of the mm/dd it
had been using until then. The same can happen if one machine is on Ubuntu and another on CentOS, etc.
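
To make the failure mode concrete, here is a small sketch (an illustration only, not Hive's
implementation). Python dicts preserve insertion order, so two dicts built in different orders
stand in for an unordered map whose iteration order differs between environments. The function
name partition_dir is hypothetical.

```python
def partition_dir(spec):
    """Build a directory name by walking the map in its iteration order."""
    return "/".join(f"{key}={value}" for key, value in spec.items())


# Same key-values, but two different iteration orders, standing in for
# what an unordered map might yield in two different environments.
env_one = {"mm": "05", "dd": "02"}
env_two = {"dd": "02", "mm": "05"}

print(partition_dir(env_one))  # mm=05/dd=02
print(partition_dir(env_two))  # dd=02/mm=05

# The metadata maps the key-value set to whatever location was chosen at
# creation time, so lookups stay correct even though the layouts differ.
locations = {frozenset(env_one.items()): partition_dir(env_one)}
lookup = locations[frozenset(env_two.items())]
print(lookup)  # mm=05/dd=02
```

The lookup succeeds regardless of which environment asks, which matches the observation that
the inconsistency is cosmetic as far as Hive itself is concerned.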

Some possible solutions:
a) We can force the order of key-values to follow the order of key occurrence in the metastore for
all "new" partitions ever created in Hive. The problem with this is that it might force additional
metastore calls to determine this order (adding load).
b) We can force alphabetical order of key-values for all "new" partitions ever created in
Hive. The problem with this is that we now get into a notion of what is alphabetical order
in what codepage (although that can still be deterministic). It's also possible that going
alphabetical will cause a pretty "dumb" ordering, where "dumb" in this case can mean (i) non-intuitive,
say day=23/market_id=45/month=4/year=2016, or (ii) bad in terms of skew, having a higher
frequency partition separation be a parent of a lower frequency one, resulting in a much larger
number of dirs created.
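
The two options above can be sketched as follows. This is a minimal illustration under stated
assumptions, not a proposed patch: the function names are hypothetical, the declaration order
in `declared` is assumed, and "alphabetical" here means Python's default string ordering. The
directory_count helper is a back-of-the-envelope model of the skew concern in (ii).

```python
def path_in_schema_order(spec, partition_columns):
    """Option (a): follow the order in which the table declares its keys."""
    return "/".join(f"{col}={spec[col]}" for col in partition_columns)


def path_in_alphabetical_order(spec):
    """Option (b): sort keys by name (Python's default string ordering)."""
    return "/".join(f"{key}={spec[key]}" for key in sorted(spec))


def directory_count(cardinalities):
    """Count directories in a fully populated partition tree.

    cardinalities lists the number of distinct values per level, top
    level first. Each level contributes the product of all
    cardinalities up to and including it.
    """
    total = 0
    dirs_at_level = 1
    for n in cardinalities:
        dirs_at_level *= n
        total += dirs_at_level
    return total


spec = {"year": "2016", "month": "4", "day": "23", "market_id": "45"}
declared = ["year", "month", "day", "market_id"]  # assumed declaration order

print(path_in_schema_order(spec, declared))
# year=2016/month=4/day=23/market_id=45
print(path_in_alphabetical_order(spec))
# day=23/market_id=45/month=4/year=2016

# Skew: one year holding 12 months, versus 12 months each holding one year.
print(directory_count([1, 12]))  # 13
print(directory_count([12, 1]))  # 24
```

With the low-cardinality key on top, the tree needs 13 directories; flipping the levels needs 24
for the same data, and the gap widens as more levels are added.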

Neither of these solves the original issue of export/import, because all we wind up doing here
is forcing order going forward, and not making sure to "retain" whatever existed. Also, if
a JDK/OS combination resulted in a different default for two different users for similar schema,
then by "standardizing" it going forward, we break convention for one of them, either way.

Even in the cases where export/import has currently been flipping a mm/dd/yyyy into a dd/mm/yyyy,
for example, if we standardize to fix it to retain the original order, we make it weird for a bunch
of users that have had a mm/dd/yyyy in place, and don't care about the order as long as it
is consistent across the table (a goal I'd argue they shouldn't have or care about, but nevertheless
one that might exist).

Other solutions that are possible:

a) Let a table specify that it cares about its default partition-naming-scheme, similar to
what hcat.dynamic.partitioning.custom.pattern does for HCat. The problem with this is that it
can introduce complexity to a warehouse if people use this feature extensively, i.e. it
actually does nothing for the data and perf in Hive, it's simply for usability with external tools,
and we run into a too-many-configs-why-was-this-feature-even-here scenario, but maybe we can
ignore that.

b) Change export/import to honour existing order in the case of managed tables (but ignore
order or customization for external tables, because we truly cannot determine what patterns
might be used for external tables). This does not help existing export/import cases, and
can settle on a different norm for a bunch of users, but does help a little going ahead.
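
A rough sketch of what these two ideas might look like, purely for discussion. All function
names are hypothetical. The `${key}` placeholder syntax is an assumption borrowed from Python's
string.Template and may not match what HCat accepts, and the sorted fallback for non-managed
tables is an arbitrary placeholder, not Hive's actual default.

```python
from string import Template


def path_from_pattern(pattern, spec):
    """Option (a): render a per-table naming pattern from key-values."""
    return Template(pattern).substitute(spec)


def key_order_from_path(relative_path):
    """Recover the key order from an existing key=value directory path."""
    return [c.partition("=")[0] for c in relative_path.strip("/").split("/")]


def import_path(spec, source_path, is_managed):
    """Option (b): reuse the source order only for managed tables."""
    if is_managed:
        order = key_order_from_path(source_path)
    else:
        # Placeholder fallback; external layouts cannot be trusted.
        order = sorted(spec)
    return "/".join(f"{k}={spec[k]}" for k in order)


spec = {"period_day": "02", "period_month": "05", "period_year": "2016"}
source = "period_year=2016/period_month=05/period_day=02"

print(import_path(spec, source, is_managed=True))
# period_year=2016/period_month=05/period_day=02
print(import_path(spec, source, is_managed=False))
# period_day=02/period_month=05/period_year=2016
print(path_from_pattern("${period_year}/${period_month}/${period_day}", spec))
# 2016/05/02
```

Note that option (b) as sketched relies on the source directories being in key=value form,
which is exactly the assumption that cannot be made for external tables.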

Sorry for the longer-than-intended ramble, but this problem has been known about for a while
and wasn't fixed because of these trade-offs, and I wanted to provide context.

> Import table change order of dynamic partitions
> -----------------------------------------------
>                 Key: HIVE-13652
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.2.1
>            Reporter: Lukas Waldmann
> Table with multiple dynamic partitions like year,month, day exported using "export table"
> command is imported (using "import table") such a way that order of partitions is changed
> to day, month, year.
> Export DB:  Hive 0.14
> Import DB:  Hive 1.2.1000.
> Tables created as:
> create table T1
> ( ... ) PARTITIONED BY (period_year string, period_month string, period_day string) STORED
> export command:
> export table t1 to 'path'
> import command:
> import table t1 from 'path'
> HDFS file structure on both original table location and export path keeps the original
> partition order ../year/month/day
> HDFS file structure after import is .../day/month/year

This message was sent by Atlassian JIRA