hive-user mailing list archives

From Elliot West <tea...@gmail.com>
Subject Re: Iterating over partitions using the metastore API
Date Thu, 04 Aug 2016 15:16:32 GMT
Thanks for your reply. I hadn't considered driving it from a list of
partition names.

To avoid the N+1 reads I am considering reading in batches like so:

   - Sorting the names
   - Taking every nth name (where n is the batch size) to use as a batch
   boundary.
   - Building a filter derived from boundary_name[n-1] and boundary_name[n].
   - Then selecting the batch using the filter and
   IMSC.listPartitionsByFilter(...)
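
The steps above could be sketched roughly as follows. This is only an
illustration under several assumptions that are not from the thread: the
table has a single string partition key (called "dt" here purely as an
example), the key values have already been extracted from the partition
names, the values contain no double quotes, and Java's String ordering
agrees with the ordering the metastore applies when evaluating the filter.
The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: derives batch boundaries and filter strings from
// the values of a single string partition key.
final class PartitionBatchFilters {

    // Sorts the values and keeps every batchSize-th one; each kept value
    // is the inclusive lower bound of one batch.
    public static List<String> boundaries(List<String> values, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i += batchSize) {
            result.add(sorted.get(i));
        }
        return result;
    }

    // Builds one filter per batch: key >= lower and key < next lower.
    // The last batch has no upper bound.
    public static List<String> filters(String key, List<String> boundaries) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < boundaries.size(); i++) {
            StringBuilder filter = new StringBuilder();
            filter.append(key).append(" >= \"").append(boundaries.get(i)).append('"');
            if (i + 1 < boundaries.size()) {
                filter.append(" and ").append(key)
                      .append(" < \"").append(boundaries.get(i + 1)).append('"');
            }
            result.add(filter.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("e", "a", "d", "b", "c");
        for (String filter : filters("dt", boundaries(values, 2))) {
            System.out.println(filter);
        }
    }
}
```

Each resulting string would then be passed to the filter-based listing call
on the metastore client, one batch at a time, so that only one batch of
Partition objects is held in memory at once.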

A drawback to this approach is that filters only support string key types
IIRC.

Thanks,

Elliot.

On 4 August 2016 at 13:15, Furcy Pin <furcy.pin@flaminem.com> wrote:

> Hi Elliot,
>
> I guess you can use IMetaStoreClient.listPartitionNames instead, and
> then use IMetaStoreClient.getPartition for each partition.
> This might be slow though, as you will have to make 10 000 calls to get
> them.
>
> Another option I'd consider is connecting directly to the Hive metastore.
> This requires a little more configuration (granting your process
> read-only access to the metastore), and might make your implementation
> dependent on the metastore's underlying implementation (mysql, postgres,
> derby), unless you use an ORM to query it.
> Anyway, you could ask the metastore directly via JDBC for all the
> partitions, and get java.sql.ResultSet that can be iterated over.
>
> Regards,
>
> Furcy
>
>
> On Thu, Aug 4, 2016 at 1:29 PM, Elliot West <teabot@gmail.com> wrote:
>
>> Hello,
>>
>> I have a process that needs to iterate over all of the partitions in a
>> table using the metastore API. The process should not need to know about the
>> structure or meaning of the partition key values (i.e. whether they are
>> dates, numbers, country names etc), or be required to know the existing
>> range of partition values. Note that the process only needs to know about
>> one partition at any given time.
>>
>> Currently I am naively using the IMetaStoreClient.listPartitions(String,
>> String, short) method to retrieve all partitions but clearly this is not
>> scalable for tables with many tens of thousands of partitions. I'm finding that even
>> with relatively large heaps I'm running into OOM exceptions when the
>> metastore API is building the List<Partition> return value. I've
>> experimented with using IMetaStoreClient.listPartitionSpecs(String,
>> String, int) but this too seems to have high memory requirements.
>>
>> Can anyone suggest how I can better iterate over partitions in a manner
>> that is more considerate of memory usage?
>>
>> Thanks,
>>
>> Elliot.
>>
>>
>
