spark-user mailing list archives

From 尹绪森 <yinxu...@gmail.com>
Subject Re: Non-deterministic behavior in spark
Date Fri, 24 Jan 2014 12:40:15 GMT
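Is there any non-deterministic code in your filter, such as
Random.nextInt()? If so, the program is no longer idempotent: each run
(or each recomputation of a partition) can keep a different set of records.
You should give the random number generator a fixed seed.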
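
To illustrate the point about seeding, here is a minimal sketch in plain Python rather than Spark. The function name sampled_count and the parameters keep_fraction and seed are hypothetical, made up for this example; the original job's filter is not shown in the thread, so this only demonstrates the general idea of a random filter becoming repeatable once it is seeded.

```python
import random


def sampled_count(records, keep_fraction, seed=None):
    """Count the records kept by a random filter.

    With seed=None the generator is seeded from OS entropy, so two calls
    can return different counts. With a fixed seed the same input in the
    same order always yields the same count.
    """
    rng = random.Random(seed)  # private generator, independent of global state
    return sum(1 for _ in records if rng.random() < keep_fraction)


if __name__ == "__main__":
    records = range(100_000)

    # Same seed and same input order: the two counts agree.
    first = sampled_count(records, 0.4, seed=42)
    second = sampled_count(records, 0.4, seed=42)
    print(first == second)

    # No seed: the counts are free to differ from run to run.
    print(sampled_count(records, 0.4), sampled_count(records, 0.4))
```

On a cluster the seed would also have to be stable for each partition, so that a recomputed partition makes the same decisions; that detail is left out of this sketch.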


2014/1/24 Ognen Duzlevski <ognen@nengoiksvelzud.com>

> Hello,
>
> (Sorry for the sensationalist title) :)
>
> If I run Spark on files from S3 and do basic transformation like:
>
> textfile()
> filter
> groupByKey
> count
>
> I get one number (e.g. 40,000).
>
> If I do the same on the same files from HDFS, the number spat out is
> completely different (VERY different - something like 13,000).
>
> What would one do in a situation like this? How do I even go about
> figuring out what the problem is? This is run on a cluster of 15 instances
> on Amazon.
>
> Thanks,
> Ognen
>



-- 
Best Regards
-----------------------------------
Xusen Yin    尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and
Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/
