crunch-dev mailing list archives

From "Tycho Lamerigts (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CRUNCH-485) groupByKey on Spark incorrect if key is Avro record with defined sort order
Date Tue, 06 Jan 2015 14:20:34 GMT
Tycho Lamerigts created CRUNCH-485:
--------------------------------------

             Summary: groupByKey on Spark incorrect if key is Avro record with defined sort order
                 Key: CRUNCH-485
                 URL: https://issues.apache.org/jira/browse/CRUNCH-485
             Project: Crunch
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.11.0
            Reporter: Tycho Lamerigts
            Assignee: Josh Wills


groupByKey on Spark produces incorrect results if the key type is an Avro record with a defined
sort order (http://avro.apache.org/docs/1.7.7/spec.html#order).

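Instead of honoring that order, it serializes the entire Avro record to a binary blob (byte array)
and groups identical blobs. This is wrong: a field that the schema excludes from ordering (for
example, one marked "order": "ignore") still ends up in the blob, so keys that compare as equal
under the schema can land in different groups. By contrast, groupByKey on MapReduce works as
expected, so it does take Avro's sort order into account.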
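
To illustrate the difference, here is a minimal, self-contained sketch that does not use Crunch,
Spark, or Avro. All names in it (DemoKey, GroupingSketch, SCHEMA_ORDER) are hypothetical stand-ins:
DemoKey plays the role of an Avro record whose "note" field is assumed to be excluded from the sort
order, toBytes() plays the role of full binary serialization, and SCHEMA_ORDER plays the role of the
schema-defined comparison. It only shows why grouping identical blobs and grouping by a comparator
can disagree; it is not the actual Crunch code path.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for an Avro record key; "note" is assumed to be
// a field that the schema excludes from ordering.
final class DemoKey {
    final int id;
    final String note;

    DemoKey(int id, String note) {
        this.id = id;
        this.note = note;
    }

    // Stand-in for full binary serialization: every field goes into the blob,
    // including the one that the ordering is supposed to ignore.
    ByteBuffer toBytes() {
        byte[] noteBytes = note.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + noteBytes.length);
        buf.putInt(id);
        buf.put(noteBytes);
        buf.flip();
        return buf;
    }
}

final class GroupingSketch {
    // Stand-in for the schema-defined order: only "id" takes part.
    static final Comparator<DemoKey> SCHEMA_ORDER = Comparator.comparingInt(k -> k.id);

    // Groups keys whose serialized blobs are byte-for-byte identical
    // (the behavior described for the Spark path).
    static Map<ByteBuffer, List<String>> groupByBytes(List<DemoKey> keys) {
        Map<ByteBuffer, List<String>> groups = new HashMap<>();
        for (DemoKey k : keys) {
            groups.computeIfAbsent(k.toBytes(), b -> new ArrayList<>()).add(k.note);
        }
        return groups;
    }

    // Groups keys that compare as equal under the comparator
    // (the behavior described for the MapReduce path).
    static Map<DemoKey, List<String>> groupBySchemaOrder(List<DemoKey> keys) {
        Map<DemoKey, List<String>> groups = new TreeMap<>(SCHEMA_ORDER);
        for (DemoKey k : keys) {
            groups.computeIfAbsent(k, x -> new ArrayList<>()).add(k.note);
        }
        return groups;
    }

    static List<DemoKey> sampleKeys() {
        return Arrays.asList(new DemoKey(1, "a"), new DemoKey(1, "b"), new DemoKey(2, "c"));
    }

    public static void main(String[] args) {
        List<DemoKey> keys = sampleKeys();
        System.out.println("byte-blob groups:    " + groupByBytes(keys).size());       // prints 3
        System.out.println("schema-order groups: " + groupBySchemaOrder(keys).size()); // prints 2
    }
}
```

The two keys with id 1 differ only in the ignored field, so the comparator-based grouping merges
them while the blob-based grouping keeps them apart.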

The culprit is probably the following code from org.apache.crunch.impl.spark.collect.PGroupedTableImpl#getJavaRDDLikeInternal:

{code}
groupedRDD = parentRDD.map(new PairMapFunction(ptype.getOutputMapFn(), runtime.getRuntimeContext()))
          .mapToPair(new MapOutputFunction(keySerde, valueSerde))
          .groupByKey(numPartitions);
{code}

where MapOutputFunction simply converts the entire key object to a binary blob, without taking
sort order into account.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
