hive-dev mailing list archives

From "Thiruvel Thirumoolan (JIRA)" <>
Subject [jira] [Updated] (HIVE-7604) Add Metastore API to fetch one or more partition names
Date Wed, 13 Aug 2014 19:40:12 GMT


Thiruvel Thirumoolan updated HIVE-7604:

    Attachment: Design_HIVE_7604.txt

Attaching a file that describes the API and the rationale behind it.

I have an alpha implementation which obtains distinct values of partition keys. To start with,
this is ORM-only, and its approach is very similar to (using the substring
and indexOf string functions). I tested this on a table with about a million partitions,
partitioned by 6 keys, using Oracle as the backend. It takes 2-4 seconds to obtain the unique values
of a partition key. I hope this gives a rough idea of the latency for large tables.
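
To illustrate the substring/indexOf idea mentioned above, here is a minimal client-side sketch in Java. It is not the alpha implementation (which runs against the backend database through the ORM layer); the class name PartitionNameSketch and its methods are hypothetical, and it assumes partition names of the form key1=val1/key2=val2 and ignores any escaping of special characters in values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only; not part of the Hive metastore API.
class PartitionNameSketch {

    // Returns the value of `key` in a name such as "dt=2014-08-13/region=us",
    // or null when the key does not occur in the name.
    static String valueOf(String partitionName, String key) {
        String prefix = key + "=";
        int start;
        if (partitionName.startsWith(prefix)) {
            start = prefix.length();
        } else {
            // Match "/key=" so that a key which is a suffix of another key is not picked up.
            int idx = partitionName.indexOf("/" + prefix);
            if (idx < 0) {
                return null;
            }
            start = idx + 1 + prefix.length();
        }
        int end = partitionName.indexOf('/', start);
        return end < 0 ? partitionName.substring(start)
                       : partitionName.substring(start, end);
    }

    // Collects the distinct values of `key` across all names, in sorted order.
    static Set<String> distinctValues(List<String> partitionNames, String key) {
        Set<String> values = new TreeSet<>();
        for (String name : partitionNames) {
            String value = valueOf(name, key);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "dt=2014-08-12/region=us",
                "dt=2014-08-12/region=eu",
                "dt=2014-08-13/region=us");
        System.out.println(distinctValues(names, "dt"));     // [2014-08-12, 2014-08-13]
        System.out.println(distinctValues(names, "region")); // [eu, us]
    }
}
```

Doing the same extraction inside the database avoids shipping every partition name to the client, which is the point of the proposed API.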

> Add Metastore API to fetch one or more partition names
> ------------------------------------------------------
>                 Key: HIVE-7604
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore
>            Reporter: Thiruvel Thirumoolan
>            Assignee: Thiruvel Thirumoolan
>             Fix For: 0.14.0
>         Attachments: Design_HIVE_7604.txt
> We need a new API in Metastore to address the following use cases. Both use cases arise
from having tables with hundreds of thousands or, in some cases, millions of partitions.
> 1. It should be quick and easy to obtain the distinct values of a partition key. E.g., obtain all
dates for which partitions are available. Tools and frameworks can use this programmatically
to find gaps in partitions before reprocessing them. Currently one has to run Hive queries
(JDBC or CLI) to obtain this information, which is unfriendly and heavyweight. For tables
that have a large number of partitions, the queries take a long time to run and also
require a large heap space.
> 2. Typically users would like to know the list of partitions available and would run
queries that involve only partition keys (select distinct partkey1 from table), or
obtain the latest date partition from a dimension table to join against another fact table
(select * from fact_table join select max(dt) from dimension_table). Those queries (metadata
only queries) can be pushed to the metastore and need not be run even locally in Hive. If the
queries can be converted into database queries, the clients can be lightweight and
need not fetch all partition names. The results can be obtained much faster with fewer resources.
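
As a rough sketch of what a tool could do once it has the distinct values of a date key (use case 1) or needs the latest date (use case 2), the Java snippet below finds missing days and the most recent day. The class PartitionGapSketch and its methods are hypothetical, and the sketch assumes the values are ISO dates (yyyy-MM-dd) with daily partitions expected.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, for illustration only; not part of the Hive metastore API.
class PartitionGapSketch {

    // Parses the date strings into a sorted set.
    static TreeSet<LocalDate> parse(Collection<String> dtValues) {
        TreeSet<LocalDate> dates = new TreeSet<>();
        for (String value : dtValues) {
            dates.add(LocalDate.parse(value)); // expects yyyy-MM-dd
        }
        return dates;
    }

    // Returns every day between the earliest and latest date that has no partition.
    static List<LocalDate> missingDates(Collection<String> dtValues) {
        TreeSet<LocalDate> present = parse(dtValues);
        List<LocalDate> missing = new ArrayList<>();
        if (present.isEmpty()) {
            return missing;
        }
        for (LocalDate d = present.first(); d.isBefore(present.last()); d = d.plusDays(1)) {
            if (!present.contains(d)) {
                missing.add(d);
            }
        }
        return missing;
    }

    // Returns the latest date, mirroring "select max(dt)"; null when there are no values.
    static LocalDate latest(Collection<String> dtValues) {
        TreeSet<LocalDate> present = parse(dtValues);
        return present.isEmpty() ? null : present.last();
    }

    public static void main(String[] args) {
        List<String> dts = Arrays.asList("2014-08-10", "2014-08-11", "2014-08-13");
        System.out.println(missingDates(dts)); // [2014-08-12]
        System.out.println(latest(dts));       // 2014-08-13
    }
}
```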

This message was sent by Atlassian JIRA
