crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-485) groupByKey on Spark incorrect if key is Avro record with defined sort order
Date Thu, 08 Jan 2015 18:09:35 GMT


Josh Wills commented on CRUNCH-485:

No, nothing else is needed from you. I need to write a test to verify that the patch works on
the kind of schemas you described, which should be pretty straightforward. If you don't want
to wait for 0.12, you can cut your own version against 0.12.0-SNAPSHOT after I check it in.
Thanks again, Tycho!

> groupByKey on Spark incorrect if key is Avro record with defined sort order
> ---------------------------------------------------------------------------
>                 Key: CRUNCH-485
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Tycho Lamerigts
>            Assignee: Josh Wills
>         Attachments: CRUNCH-485.patch
> GroupByKey on Spark is incorrect if the key type is an Avro record with a defined sort
> order (
> Instead, it serializes the entire Avro record to a binary blob (byte array) and groups
> identical blobs. This is wrong: keys that are equal under the record's sort order can still
> serialize to different blobs and so end up in separate groups. By contrast, groupByKey on
> MapReduce works as expected, so it does take Avro's sort order into account.
> The culprit is probably the following code from org.apache.crunch.impl.spark.collect.PGroupedTableImpl#getJavaRDDLikeInternal
> {code}
> groupedRDD = parentRDD.map(new PairMapFunction(ptype.getOutputMapFn(), runtime.getRuntimeContext()))
>           .mapToPair(new MapOutputFunction(keySerde, valueSerde))
>           .groupByKey(numPartitions);
> {code}
> where MapOutputFunction simply converts the entire key object to a binary blob, without
> taking sort order into account.
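
To make the difference described above concrete, here is a minimal sketch in plain Java. It does not use the Avro, Spark, or Crunch APIs, and it is not the code from the attached patch. The `Key` class, its `note` field (standing in for an Avro field declared with "order": "ignore"), `toBytes`, and `SORT_ORDER` are all hypothetical names chosen for illustration. The sketch only contrasts grouping by identical serialized bytes with grouping by a sort-order-aware comparison.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupingSketch {

    // Hypothetical key type: `id` takes part in ordering, while `note`
    // stands in for a field that the sort order is meant to ignore.
    static final class Key {
        final int id;
        final String note;

        Key(int id, String note) {
            this.id = id;
            this.note = note;
        }

        // Stand-in for binary serialization: every field, ignored or not,
        // ends up in the bytes.
        byte[] toBytes() {
            byte[] n = note.getBytes(StandardCharsets.UTF_8);
            return ByteBuffer.allocate(4 + n.length).putInt(id).put(n).array();
        }
    }

    // Sort-order-aware comparison: only `id` counts (assumption for this sketch).
    static final Comparator<Key> SORT_ORDER = Comparator.comparingInt(k -> k.id);

    // Groups keys whose serialized bytes are identical.
    static Map<ByteBuffer, List<Key>> groupByBytes(List<Key> keys) {
        Map<ByteBuffer, List<Key>> groups = new HashMap<>();
        for (Key k : keys) {
            // ByteBuffer compares by content, unlike a raw byte[].
            groups.computeIfAbsent(ByteBuffer.wrap(k.toBytes()), b -> new ArrayList<>()).add(k);
        }
        return groups;
    }

    // Groups keys that compare as equal under SORT_ORDER.
    static Map<Key, List<Key>> groupBySortOrder(List<Key> keys) {
        Map<Key, List<Key>> groups = new TreeMap<>(SORT_ORDER);
        for (Key k : keys) {
            groups.computeIfAbsent(k, x -> new ArrayList<>()).add(k);
        }
        return groups;
    }
}
```

With the keys (1, "a"), (1, "b") and (2, "a"), `groupByBytes` produces three groups because the two keys with id 1 serialize differently, while `groupBySortOrder` produces two groups, which is the result the report expects from groupByKey.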
