spark-user mailing list archives

From Will Briggs <>
Subject Re: Spark distinct() returns incorrect results for some types?
Date Thu, 11 Jun 2015 19:21:07 GMT
To be fair, this is a long-standing issue due to optimizations for object reuse in the Hadoop
API, and isn't necessarily a failing in Spark; see this blog post
from 2011 documenting a similar issue.

On June 11, 2015, at 3:17 PM, Sean Owen <> wrote:

Yep you need to use a transformation of the raw value; use toString for example. 
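Sean's fix can be sketched without Spark or Hadoop at all. In the snippet below, a reused StringBuilder stands in for the single Text instance that Hadoop's record reader recycles; the class name and id values are made up for illustration. Keeping references to the reused buffer aliases one object, while calling toString() materializes an immutable copy per record:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class ReuseDemo {
    public static void main(String[] args) {
        // One mutable buffer, reused for every record -- a stand-in for
        // the single Text instance a Hadoop RecordReader recycles.
        StringBuilder buffer = new StringBuilder();

        List<CharSequence> rawRefs = new ArrayList<>(); // keeps refs to the buffer
        List<String> copies = new ArrayList<>();        // materializes each value

        for (String id : new String[] {"a", "b", "c"}) {
            buffer.setLength(0);
            buffer.append(id);             // buffer is overwritten in place
            rawRefs.add(buffer);           // WRONG: all entries alias one object
            copies.add(buffer.toString()); // RIGHT: immutable copy per record
        }

        System.out.println(new HashSet<>(rawRefs).size()); // 1 -- aliases collapse
        System.out.println(new HashSet<>(copies).size());  // 3 -- the true count
    }
}
```

In Spark terms this corresponds to mapping each Text to a String (e.g. with toString) before calling distinct(), so the RDD holds independent immutable values rather than references into a reused buffer.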

On Thu, Jun 11, 2015, 8:54 PM Crystal Xing <> wrote:

That is a little scary.
So you mean that, in general, we shouldn't use Hadoop's Writable types as keys in an RDD?

Zheng zheng

On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen <> wrote:

Guess: it has something to do with the Text object being reused by Hadoop? You can't in general
keep around refs to them since they change. So you may have a bunch of copies of one object
at the end that become just one in each partition. 
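The per-partition collapse described here, ending up with "just one in each partition", can be simulated in plain Java. Each simulated task reuses its own buffer (like each task's record reader reusing one Text), so per-partition deduplication leaves exactly one value per partition; the partition counts and id values below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class PartitionDedupeDemo {
    public static void main(String[] args) {
        int numPartitions = 4;
        int recordsPerPartition = 5;
        int total = 0;

        for (int p = 0; p < numPartitions; p++) {
            // Each task has its own reused buffer, like a Hadoop record reader.
            StringBuilder buffer = new StringBuilder();
            List<CharSequence> partition = new ArrayList<>();
            for (int i = 0; i < recordsPerPartition; i++) {
                buffer.setLength(0);
                buffer.append("id-").append(p).append('-').append(i);
                partition.add(buffer); // every slot aliases the one buffer
            }
            // Per-partition dedupe collapses all the aliases to one entry.
            total += new HashSet<>(partition).size();
        }

        // 4 partitions x 5 records = 20 distinct ids, but only 4 survive:
        // one per partition, i.e. one per task -- matching the symptom
        // of getting exactly as many values as tasks.
        System.out.println(total); // prints 4
    }
}
```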

On Thu, Jun 11, 2015, 8:36 PM Crystal Xing <> wrote:

I load a list of ids from a text file using NLineInputFormat, and when I call distinct(), it
returns an incorrect number.

 JavaRDD<Text> idListData = jvc
                .hadoopFile(idList, NLineInputFormat.class,
                        LongWritable.class, Text.class)
                .values()
                .distinct();

I should have 7000K distinct values; however, it only returns 7000 values, which is the same
as the number of tasks. The type I am using is Text.

However, if I switch to using String instead of Text, it works correctly.

I think the Text class should have correct implementations of equals() and hashCode(),
since it is a Hadoop class.

Does anyone have a clue what is going on?

I am using Spark 1.2.

Zheng zheng
