hive-dev mailing list archives

From "Mithun Radhakrishnan (JIRA)" <>
Subject [jira] [Commented] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions
Date Tue, 19 Aug 2014 03:43:19 GMT


Mithun Radhakrishnan commented on HIVE-7223:

Hello, Alan. Thanks so much for reviewing. I'm creating the Review Board request right now.
(It looks like a 2MB diff isn't helping.) I'll update this JIRA with the review-board request
as soon as it completes.

On the concerns you've raised:

> hive_metastore.thrift - do we need get_partitions_pspec_with_auth?
I didn't know if we needed this right off the bat. I figured we could add this later, if it
was missed.

> PartValEqWrapperLite.equals, is values and location all you need to check equality? Are containing
> db and table not important?
add_partitions_pspec_core() ensures that all Partitions derived from the PSpec belong to the
same DB/Table. (This keeps it consistent with add_partitions_core().) So a further check on
the DB/Table in PartValEqWrapperLite was moot.
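
As a rough illustration of that reasoning, an equality wrapper along these lines compares only partition values and location. {{PartitionKey}} is a hypothetical stand-in written for this sketch, not the PartValEqWrapperLite class from the patch, and it assumes the caller has already verified that every partition belongs to the same DB/Table:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for an equality wrapper; not the class from the patch.
final class PartitionKey {
    private final List<String> values; // partition-key values, e.g. ["20140601", "US"]
    private final String location;     // storage location; may be null

    PartitionKey(List<String> values, String location) {
        this.values = Collections.unmodifiableList(new ArrayList<>(values));
        this.location = location;
    }

    // DB and table are deliberately not compared: the caller is assumed to
    // have checked that all partitions belong to the same DB/Table.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof PartitionKey)) {
            return false;
        }
        PartitionKey that = (PartitionKey) other;
        return values.equals(that.values) && Objects.equals(location, that.location);
    }

    // Kept consistent with equals() so the wrapper works in hash-based sets.
    @Override
    public int hashCode() {
        return Objects.hash(values, location);
    }
}
```

Putting such keys into a HashSet is one way to spot two entries with identical values and location in a single request.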

> PartValEqWrapperLite.add_partitions_pspec_core, I'm wondering if you should give the caller
> an option to have it throw if the partitions already exists... why do you throw on duplicates
> in the list but not already exists?
{{ifNotExists}} kinda fills that gap: an exception is thrown if the partition being added
already exists (and {{ifNotExists == false}}). Dupes within the list are a sign of a user/programming
error, which is why I check this condition explicitly. That also aligns with {{add_partitions_core}}'s
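
The two checks described above can be sketched roughly as follows. This is not the metastore's code: {{AddPartitionChecks}}, {{partitionsToAdd}}, and the use of plain partition-name strings are assumptions made for illustration, and the sketch assumes duplicates within the request are rejected regardless of {{ifNotExists}}:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the add-time checks; not the metastore's actual code.
final class AddPartitionChecks {
    // Returns the partition names that should actually be created.
    // 'existing' stands in for whatever the metastore already holds.
    static List<String> partitionsToAdd(List<String> requested, Set<String> existing,
                                        boolean ifNotExists) {
        Set<String> seen = new HashSet<>();
        List<String> toAdd = new ArrayList<>();
        for (String name : requested) {
            // A duplicate inside the request itself is treated as a caller error.
            if (!seen.add(name)) {
                throw new IllegalArgumentException("Duplicate partition in request: " + name);
            }
            if (existing.contains(name)) {
                if (!ifNotExists) {
                    throw new IllegalStateException("Partition already exists: " + name);
                }
                continue; // ifNotExists == true: skip the existing partition quietly
            }
            toAdd.add(name);
        }
        return toAdd;
    }
}
```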

> Support generic PartitionSpecs in Metastore partition-functions
> ---------------------------------------------------------------
>                 Key: HIVE-7223
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog, Metastore
>    Affects Versions: 0.12.0, 0.13.0
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>         Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch
> Currently, the functions in the HiveMetaStore API that handle multiple partitions do
> so using List<Partition>. E.g.
> {code}
> public List<Partition> listPartitions(String db_name, String tbl_name, short max_parts);
> public List<Partition> listPartitionsByFilter(String db_name, String tbl_name, String filter, short max_parts);
> public int add_partitions(List<Partition> new_parts);
> {code}
> Partition objects are fairly heavyweight, since each Partition carries its own copy of
> a StorageDescriptor, partition-values, etc. Tables with tens of thousands of partitions take
> so long to have their partitions listed that the client times out with default hive.metastore.client.socket.timeout.
> There is the additional expense of serializing and deserializing metadata for large sets of
> partitions, w.r.t time and heap-space. Reducing the thrift traffic should help in this regard.
> In a date-partitioned table, all sub-partitions for a particular date are *likely* (but
> not guaranteed) to have:
> # The same base directory (e.g. {{/feeds/search/20140601/}})
> # Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}})
> # The same SerDe/StorageHandler/IOFormat classes
> # Sorting/Bucketing/SkewInfo settings
> In this “most likely” scenario (henceforth termed “normal”), it’s possible
> to represent the partition-list (for a date) in a more condensed form: a list of LighterPartition
> instances, all sharing a common StorageDescriptor whose location points to the root directory.

> We can go one better for the {{add_partitions()}} case: When adding all partitions for
> a given date, the “normal” case affords us the ability to specify the top-level date-directory,
> where sub-partitions can be inferred from the HDFS directory-path.
> These extensions are hard to introduce at the metastore-level, since partition-functions
> explicitly specify {{List<Partition>}} arguments. I wonder if a {{PartitionSpec}} interface
> might help:
> {code}
> public PartitionSpec listPartitions(db_name, tbl_name, max_parts) throws ... ; 
> public int add_partitions( PartitionSpec new_parts ) throws … ;
> {code}
> where the PartitionSpec looks like:
> {code}
> public interface PartitionSpec {
>         public List<Partition> getPartitions();
>         public List<String> getPartNames();
>         public Iterator<Partition> getPartitionIter();
>         public Iterator<String> getPartNameIter();
> }
> {code}
> For {{add_partitions()}}, an {{HDFSDirBasedPartitionSpec}} class could implement {{PartitionSpec}},
> store a top-level directory, and return Partition instances from sub-directory names, while
> storing a single StorageDescriptor for all of them.
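
A minimal sketch of that idea follows. Everything in it is an assumption for illustration: {{SketchStorageDescriptor}} and {{SketchPartition}} are simplified stand-ins for the Thrift-generated classes, {{DirBasedPartitionSpec}} is a hypothetical name, and the sub-directory names are passed in by the caller rather than listed from HDFS. Partitions are built only as the iterator is advanced:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified, hypothetical stand-ins; the real Partition and StorageDescriptor
// are Thrift-generated classes with many more fields.
final class SketchStorageDescriptor {
    final String location;
    final String serdeClass;

    SketchStorageDescriptor(String location, String serdeClass) {
        this.location = location;
        this.serdeClass = serdeClass;
    }
}

final class SketchPartition {
    final String value;               // sub-partition value, e.g. "US"
    final SketchStorageDescriptor sd; // descriptor derived from the shared one

    SketchPartition(String value, SketchStorageDescriptor sd) {
        this.value = value;
        this.sd = sd;
    }
}

final class DirBasedPartitionSpec {
    private final SketchStorageDescriptor sharedSd; // location = top-level directory
    private final List<String> subDirNames;         // assumed to be supplied by the caller

    DirBasedPartitionSpec(SketchStorageDescriptor sharedSd, List<String> subDirNames) {
        this.sharedSd = sharedSd;
        this.subDirNames = new ArrayList<>(subDirNames);
    }

    // Builds each partition only when the caller asks for it, so the full
    // partition list is never held in memory at once.
    Iterator<SketchPartition> getPartitionIter() {
        final Iterator<String> names = subDirNames.iterator();
        return new Iterator<SketchPartition>() {
            @Override
            public boolean hasNext() {
                return names.hasNext();
            }

            @Override
            public SketchPartition next() {
                String name = names.next();
                SketchStorageDescriptor sd = new SketchStorageDescriptor(
                        sharedSd.location + "/" + name, sharedSd.serdeClass);
                return new SketchPartition(name, sd);
            }
        };
    }
}
```
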
> Similarly, list_partitions() could return a List<PartitionSpec>, where each PartitionSpec
> corresponds to a set of partitions that can share a StorageDescriptor.
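
One way such grouping might work is sketched below. {{PartitionGrouping}} and {{PartInfo}} are hypothetical, and the grouping key (parent directory plus SerDe class) is only an assumption about which properties would need to match before partitions can share a descriptor:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the grouping key is an assumption, not the patch's logic.
final class PartitionGrouping {
    // Flattened view of a partition: its name, parent directory, and SerDe class.
    static final class PartInfo {
        final String name;
        final String parentDir;
        final String serdeClass;

        PartInfo(String name, String parentDir, String serdeClass) {
            this.name = name;
            this.parentDir = parentDir;
            this.serdeClass = serdeClass;
        }
    }

    // Groups partition names by (parent directory, SerDe class), preserving
    // first-seen order. Each group could then share one storage descriptor.
    static Map<String, List<String>> groupBySharedDescriptor(List<PartInfo> parts) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (PartInfo p : parts) {
            String key = p.parentDir + "|" + p.serdeClass;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(p.name);
        }
        return groups;
    }
}
```
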
> By exposing iterator semantics, neither the client nor the metastore need instantiate
> all partitions at once. That should help with memory requirements.
> In case no smart grouping is possible, we could just fall back on a {{DefaultPartitionSpec}}
> which composes {{List<Partition>}}, and is no worse than status quo.
> PartitionSpec abstracts away how a set of partitions may be represented. A tighter representation
> allows us to communicate metadata for a larger number of Partitions, with less Thrift traffic.
> Given that Thrift doesn’t support polymorphism, we’d have to implement the PartitionSpec
> as a Thrift Union of supported implementations. (We could convert from the Thrift PartitionSpec
> to the appropriate Java PartitionSpec sub-class.)
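
The union-to-subclass conversion could look something like the sketch below. It does not use Thrift at all: {{WirePartitionSpec}} is a hand-written stand-in for a union with exactly one arm set, and {{SketchSpec}}, {{ListSpec}}, {{SharedRootSpec}}, and {{SpecConverter}} are hypothetical names. A real Thrift-generated union would expose its set field differently:

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-in for a Thrift union: exactly one arm is set.
final class WirePartitionSpec {
    final List<String> locations; // "plain list" arm
    final String rootDir;         // "shared root directory" arm
    final List<String> subDirs;   // set together with rootDir

    private WirePartitionSpec(List<String> locations, String rootDir, List<String> subDirs) {
        this.locations = locations;
        this.rootDir = rootDir;
        this.subDirs = subDirs;
    }

    static WirePartitionSpec ofList(List<String> locations) {
        return new WirePartitionSpec(locations, null, null);
    }

    static WirePartitionSpec ofSharedRoot(String rootDir, List<String> subDirs) {
        return new WirePartitionSpec(null, rootDir, subDirs);
    }
}

// Hypothetical Java-side abstraction, reduced to partition locations.
interface SketchSpec {
    List<String> getLocations();
}

// Fallback: simply wraps the full list, no worse than the status quo.
final class ListSpec implements SketchSpec {
    private final List<String> locations;

    ListSpec(List<String> locations) {
        this.locations = locations;
    }

    @Override
    public List<String> getLocations() {
        return locations;
    }
}

// Condensed form: one root directory plus sub-directory names.
final class SharedRootSpec implements SketchSpec {
    private final String rootDir;
    private final List<String> subDirs;

    SharedRootSpec(String rootDir, List<String> subDirs) {
        this.rootDir = rootDir;
        this.subDirs = subDirs;
    }

    @Override
    public List<String> getLocations() {
        List<String> result = new ArrayList<>();
        for (String dir : subDirs) {
            result.add(rootDir + "/" + dir);
        }
        return result;
    }
}

final class SpecConverter {
    // Picks the Java implementation matching whichever union arm is set.
    static SketchSpec fromWire(WirePartitionSpec wire) {
        if (wire.rootDir != null) {
            return new SharedRootSpec(wire.rootDir, wire.subDirs);
        }
        if (wire.locations != null) {
            return new ListSpec(wire.locations);
        }
        throw new IllegalArgumentException("No union arm is set");
    }
}
```
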
> Thoughts?

This message was sent by Atlassian JIRA
