Date: Tue, 19 Aug 2014 22:15:20 +0000 (UTC)
From: "Mithun Radhakrishnan (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Subject: [jira] [Updated] (HIVE-7223) Support generic PartitionSpecs in Metastore partition-functions

     [ https://issues.apache.org/jira/browse/HIVE-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mithun Radhakrishnan updated HIVE-7223:
---------------------------------------
    Attachment: HIVE-7223.3.patch

Still struggling to get this on reviews.apache.org.

> Support generic PartitionSpecs in Metastore partition-functions
> ---------------------------------------------------------------
>
>                 Key: HIVE-7223
>                 URL: https://issues.apache.org/jira/browse/HIVE-7223
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog, Metastore
>    Affects Versions: 0.12.0, 0.13.0
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>         Attachments: HIVE-7223.1.patch, HIVE-7223.2.patch, HIVE-7223.3.patch
>
>
> Currently, the functions in the HiveMetaStore API that handle multiple partitions do so using List<Partition>. E.g.
> {code}
> public List<Partition> listPartitions(String db_name, String tbl_name, short max_parts);
> public List<Partition> listPartitionsByFilter(String db_name, String tbl_name, String filter, short max_parts);
> public int add_partitions(List<Partition> new_parts);
> {code}
> Partition objects are fairly heavyweight, since each Partition carries its own copy of a StorageDescriptor, partition-values, etc. Tables with tens of thousands of partitions take so long to have their partitions listed that the client times out with the default hive.metastore.client.socket.timeout. There is the additional expense of serializing and deserializing metadata for large sets of partitions, in both time and heap-space. Reducing the Thrift traffic should help in this regard.
> In a date-partitioned table, all sub-partitions for a particular date are *likely* (but not guaranteed) to have:
> # The same base directory (e.g. {{/feeds/search/20140601/}})
> # Similar directory structure (e.g. {{/feeds/search/20140601/[US,UK,IN]}})
> # The same SerDe/StorageHandler/IOFormat classes
> # The same Sorting/Bucketing/SkewInfo settings
> In this “most likely” scenario (henceforth termed “normal”), it’s possible to represent the partition-list (for a date) in a more condensed form: a list of LighterPartition instances, all sharing a common StorageDescriptor whose location points to the root directory.
> We can go one better for the {{add_partitions()}} case: When adding all partitions for a given date, the “normal” case lets us specify just the top-level date-directory, and the sub-partitions can be inferred from the HDFS directory-path.
> These extensions are hard to introduce at the metastore-level, since partition-functions explicitly specify {{List<Partition>}} arguments. I wonder if a {{PartitionSpec}} interface might help:
> {code}
> public PartitionSpec listPartitions(db_name, tbl_name, max_parts) throws ... ;
> public int add_partitions( PartitionSpec new_parts ) throws … ;
> {code}
> where the PartitionSpec looks like:
> {code}
> public interface PartitionSpec {
>     public List<Partition> getPartitions();
>     public List<String> getPartNames();
>     public Iterator<Partition> getPartitionIter();
>     public Iterator<String> getPartNameIter();
> }
> {code}
> For {{add_partitions()}}, an {{HDFSDirBasedPartitionSpec}} class could implement {{PartitionSpec}}, store a top-level directory, and return Partition instances from sub-directory names, while storing a single StorageDescriptor for all of them.
> Similarly, {{listPartitions()}} could return a List<PartitionSpec>, where each PartitionSpec corresponds to a set of partitions that can share a StorageDescriptor.
> By exposing iterator semantics, neither the client nor the metastore need instantiate all partitions at once. That should help with memory requirements.
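
A minimal Java sketch of the shared-StorageDescriptor idea described above, for illustration only; it is not taken from any of the attached patches. The {{StorageDescriptor}} and {{Partition}} classes below are stripped-down stand-ins for the Hive classes, {{SharedSdPartitionSpec}} is a hypothetical name, and the sub-directory names are passed in directly rather than listed from HDFS. Using the bare sub-directory name as the partition name is also a simplification.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for Hive's StorageDescriptor: only the fields this sketch needs.
final class StorageDescriptor {
    final String location;
    final String inputFormat;

    StorageDescriptor(String location, String inputFormat) {
        this.location = location;
        this.inputFormat = inputFormat;
    }
}

// Hypothetical stand-in for Hive's Partition.
final class Partition {
    final String name;
    final StorageDescriptor sd;

    Partition(String name, StorageDescriptor sd) {
        this.name = name;
        this.sd = sd;
    }
}

// The interface from the proposal, with generic parameters written out.
interface PartitionSpec {
    List<Partition> getPartitions();
    List<String> getPartNames();
    Iterator<Partition> getPartitionIter();
    Iterator<String> getPartNameIter();
}

// Hypothetical implementation: one root StorageDescriptor plus sub-directory names.
final class SharedSdPartitionSpec implements PartitionSpec {
    private final StorageDescriptor rootSd;
    private final List<String> subDirs;

    SharedSdPartitionSpec(StorageDescriptor rootSd, List<String> subDirs) {
        this.rootSd = rootSd;
        this.subDirs = Collections.unmodifiableList(new ArrayList<>(subDirs));
    }

    // Builds one Partition on demand; only the location differs from the root.
    private Partition materialize(String subDir) {
        StorageDescriptor sd =
            new StorageDescriptor(rootSd.location + "/" + subDir, rootSd.inputFormat);
        return new Partition(subDir, sd);
    }

    @Override public List<String> getPartNames() { return subDirs; }

    @Override public Iterator<String> getPartNameIter() { return subDirs.iterator(); }

    // Lazy: a Partition object exists only while the caller holds it.
    @Override public Iterator<Partition> getPartitionIter() {
        final Iterator<String> names = subDirs.iterator();
        return new Iterator<Partition>() {
            @Override public boolean hasNext() { return names.hasNext(); }
            @Override public Partition next() { return materialize(names.next()); }
        };
    }

    // Eager variant, for callers that really need the whole list at once.
    @Override public List<Partition> getPartitions() {
        List<Partition> all = new ArrayList<>();
        for (Iterator<Partition> it = getPartitionIter(); it.hasNext(); ) {
            all.add(it.next());
        }
        return all;
    }
}
```

With a root location of {{/feeds/search/20140601}} and sub-directories US, UK and IN, the spec holds one descriptor and three short strings. Full Partition objects are only built as the iterator is advanced.
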
> In case no smart grouping is possible, we could just fall back on a {{DefaultPartitionSpec}} which composes {{List<Partition>}}, and is no worse than the status quo.
> PartitionSpec abstracts away how a set of partitions may be represented. A tighter representation allows us to communicate metadata for a larger number of Partitions, with less Thrift traffic.
> Given that Thrift doesn’t support polymorphism, we’d have to implement the PartitionSpec as a Thrift Union of supported implementations. (We could convert from the Thrift PartitionSpec to the appropriate Java PartitionSpec sub-class.)
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)