hive-dev mailing list archives

From "Chao (JIRA)" <>
Subject [jira] [Resolved] (HIVE-7493) Enhance HiveReduceFunction's row clustering
Date Tue, 12 Aug 2014 19:27:13 GMT


Chao resolved HIVE-7493.

    Resolution: Fixed

> Enhance HiveReduceFunction's row clustering
> -------------------------------------------
>                 Key: HIVE-7493
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Chao
> HiveReduceFunction is backed by Hive's ExecReducer, whose reduce function takes an input
> in the form of <key, value list>. However, HiveReduceFunction's input is an iterator
> over a set of <key, value> pairs. To reuse Hive's ExecReducer, we need to "stage and
> cluster" the input rows by key, and then feed the <key, value list> to ExecReducer's
> reduce method. There are several problems with the current approach:
> 1. Unbounded memory usage.
> 2. Memory inefficiency: the input has to be cached until all of it is consumed.
> 3. This functionality seems generic enough to belong in Spark itself.
> Thus, we'd like to check:
> 1. Whether Spark can provide a different version of PairFlatMapFunction, where the input
> to the call method is an iterator over tuples of <key, iterator<value>>. Something
> like this:
> {code}
>   public Iterable<Tuple2<BytesWritable, BytesWritable>> call(
>       Iterator<Tuple2<BytesWritable, Iterator<BytesWritable>>> it);
> {code}
> 2. If the above effort fails, we need to enhance our row clustering mechanism so that it
> has bounded memory usage and is able to spill if needed.
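
For illustration only, the <key, iterator<value>> shape requested above can be sketched as a lazy wrapper over the <key, value> iterator. This is not the fix committed for HIVE-7493 and not an existing Hive or Spark API: the class name KeyClusteringIterator is hypothetical, java.util.Map.Entry stands in for Spark's Tuple2, and the sketch assumes that rows with equal keys arrive adjacent to each other (for example because the input is already sorted by key). Under that assumption it holds only one row in memory at a time, so nothing has to be staged or spilled.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Objects;

/**
 * Hypothetical helper (not part of Hive or Spark): lazily turns an iterator of
 * (key, value) rows into an iterator of (key, iterator-of-values) groups.
 * Assumption: rows with equal keys are adjacent in the input.
 */
final class KeyClusteringIterator<K, V> implements Iterator<Map.Entry<K, Iterator<V>>> {
    private final Iterator<Map.Entry<K, V>> rows;
    private Map.Entry<K, V> lookahead; // next unconsumed row, or null at end of input
    private K currentKey;              // key of the group most recently handed out
    private boolean groupOpen;         // true once a group has been handed out
    private int generation;            // invalidates value iterators of older groups

    KeyClusteringIterator(Iterator<Map.Entry<K, V>> rows) {
        this.rows = rows;
        advance();
    }

    private void advance() {
        lookahead = rows.hasNext() ? rows.next() : null;
    }

    private boolean inCurrentGroup() {
        return groupOpen && lookahead != null
            && Objects.equals(lookahead.getKey(), currentKey);
    }

    @Override
    public boolean hasNext() {
        // Discard values of the previous group that the caller did not read.
        while (inCurrentGroup()) {
            advance();
        }
        groupOpen = false;
        return lookahead != null;
    }

    @Override
    public Map.Entry<K, Iterator<V>> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        currentKey = lookahead.getKey();
        groupOpen = true;
        final int myGeneration = ++generation;
        Iterator<V> values = new Iterator<V>() {
            @Override
            public boolean hasNext() {
                // A value iterator is only valid while its group is the current one.
                return myGeneration == generation && inCurrentGroup();
            }

            @Override
            public V next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                V value = lookahead.getValue();
                advance();
                return value;
            }
        };
        return new SimpleImmutableEntry<K, Iterator<V>>(currentKey, values);
    }
}
```

Note the trade-off in this sketch: calling hasNext() or next() on the outer iterator skips any values of the previous group that were not read, so each group's values must be consumed before moving on. If equal keys are not guaranteed to be adjacent, this approach does not apply and a clustering step with bounded memory and spilling, as in item 2, is still needed.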

