crunch-dev mailing list archives

From "Xavier (JIRA)" <>
Subject [jira] [Updated] (CRUNCH-642) Enable numReducers option for methods in Distinct
Date Wed, 12 Apr 2017 09:15:41 GMT


Xavier updated CRUNCH-642:
    Attachment: CRUNCH-642-Enable-GroupingOptions-for-Distinct-operations.patch

Hey [~joshwills],

I noticed my change introduces a major bug when running the distinct operation with a non-memory
My apologies for this mistake. Attached is an additional patch that fixes it by
passing along the GroupingOptions object instead of a numReducers integer. This
is more flexible and should prevent similar bugs. I also added tests (both unit
and integration tests) to verify that the fix works.
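
To illustrate the design choice, here is a minimal plain-Java sketch. It is not
the code in the attached patch and does not use the Crunch API: the Options
class is a hypothetical stand-in for GroupingOptions, the lists and sets stand
in for PCollections, and all names are made up for this example. The point is
that the integer overload only delegates, so there is one code path to test.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

class DistinctSketch {

    /** Hypothetical stand-in for GroupingOptions; only the reducer count is modeled. */
    static final class Options {
        final int numReducers;

        Options(int numReducers) {
            if (numReducers <= 0) {
                throw new IllegalArgumentException("numReducers must be positive");
            }
            this.numReducers = numReducers;
        }
    }

    /** Core variant: hash-partition the input, then deduplicate within each partition. */
    static <S> List<Set<S>> distinct(List<S> input, Options options) {
        List<Set<S>> partitions = new ArrayList<>();
        for (int i = 0; i < options.numReducers; i++) {
            partitions.add(new LinkedHashSet<>());
        }
        for (S value : input) {
            // Equal values share a hash code, so they always land in the same partition.
            int p = Math.floorMod(Objects.hashCode(value), options.numReducers);
            partitions.get(p).add(value);
        }
        return partitions;
    }

    /** Convenience overload: wraps the integer and delegates to the core variant. */
    static <S> List<Set<S>> distinct(List<S> input, int numReducers) {
        return distinct(input, new Options(numReducers));
    }

    /** Total number of distinct values across all partitions. */
    static <S> int countDistinct(List<Set<S>> partitions) {
        int total = 0;
        for (Set<S> partition : partitions) {
            total += partition.size();
        }
        return total;
    }
}
```

Calling DistinctSketch.distinct(values, 2) and DistinctSketch.distinct(values,
new DistinctSketch.Options(2)) goes through the same logic, which is the
property the options-based approach is meant to give.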

> Enable numReducers option for methods in Distinct
> -------------------------------------------------
>                 Key: CRUNCH-642
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.14.0
>            Reporter: Xavier
>            Assignee: Josh Wills
>            Priority: Trivial
>         Attachments: CRUNCH-642-Enable-GroupingOptions-for-Distinct-operations.patch,
> The {{groupByKey}} invocation in the {{Distinct}} class currently uses the default (recommended)
> number of reducers without providing an option to override this:
> {code}
> public static <S> PCollection<S> distinct(PCollection<S> input, int flushEvery) {
>   Preconditions.checkArgument(flushEvery > 0);
>   PType<S> pt = input.getPType();
>   PTypeFamily ptf = pt.getFamily();
>   return input
>       .parallelDo("pre-distinct", new PreDistinctFn<S>(flushEvery, pt), ptf.tableOf(pt,
>       .groupByKey()
>       .parallelDo("post-distinct", new PostDistinctFn<S>(), pt);
> }
> {code}
> Would it be possible to enhance this method so that the number of reducers can be
> customized, either explicitly or via a {{GroupingOptions}} object?

This message was sent by Atlassian JIRA
