spark-issues mailing list archives

From "Lu Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24300) generateLDAData in ml.cluster.LDASuite didn't set seed correctly
Date Fri, 25 May 2018 22:54:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491359#comment-16491359 ]

Lu Wang commented on SPARK-24300:
---------------------------------

I will fix this issue.

> generateLDAData in ml.cluster.LDASuite didn't set seed correctly
> ----------------------------------------------------------------
>
>                 Key: SPARK-24300
>                 URL: https://issues.apache.org/jira/browse/SPARK-24300
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Xiangrui Meng
>            Assignee: Lu Wang
>            Priority: Minor
>
> [https://github.com/apache/spark/blob/0d63eb8888d17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala]
>  
> generateLDAData uses the same RNG in all partitions to generate random data. This causes
> either duplicate rows in cluster mode or nondeterministic behavior in local mode:
> {code:java}
> scala> val rng = new java.util.Random(10)
> rng: java.util.Random = java.util.Random@78c5ef58
> scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) }.collect().mkString("\n")
> res12: String =
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
> {code}
> We should create one RNG per partition to make it safe.
>  
> cc: [~lu.DB] [~josephkb]
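
One way to read "one RNG per partition" is sketched below. This is only an illustration, not the patch that was applied to LDASuite: the object name PerPartitionRng, the helper randomRows, and the choice of seed + partitionIndex as the per-partition seed are all assumptions made for the example. The RNG is constructed inside mapPartitionsWithIndex, so no RNG instance is captured by the closure or shared between tasks.

```scala
import java.util.Random

import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of Spark or of LDASuite.
object PerPartitionRng {

  // Generates numRows rows of 10 random digits each, using one RNG per partition.
  // The per-partition seed (seed + partitionIndex) is an assumption for this sketch.
  def randomRows(
      spark: SparkSession,
      numRows: Int,
      numPartitions: Int,
      seed: Long): Array[Seq[Int]] = {
    spark.sparkContext
      .parallelize(1 to numRows, numPartitions)
      .mapPartitionsWithIndex { (partitionIndex, rows) =>
        // Created inside the task, so each partition owns its RNG.
        val rng = new Random(seed + partitionIndex)
        rows.map(_ => Seq.fill(10)(rng.nextInt(10)))
      }
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("per-partition-rng")
      .getOrCreate()
    try {
      val first = randomRows(spark, numRows = 10, numPartitions = 4, seed = 10L)
      val second = randomRows(spark, numRows = 10, numPartitions = 4, seed = 10L)
      // Same seed and same partitioning should reproduce the same rows.
      assert(first.sameElements(second))
      println(first.mkString("\n"))
    } finally {
      spark.stop()
    }
  }
}
```

With a fixed seed and a fixed number of partitions, repeated runs should produce the same rows, because each partition's sequence depends only on its own index and the base seed.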



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

