Date: Wed, 23 Sep 2015 22:32:04 +0000 (UTC)
From: "Joseph K. Bradley (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Updated] (SPARK-10086) Flaky StreamingKMeans test in PySpark

    [ https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-10086:
--------------------------------------
    Description:

Here's a report on investigating test failures in StreamingKMeans in PySpark. (See Jenkins links below.)

The failing test trains a StreamingKMeans model on a DStream with 2 batches and then predicts on those same 2 batches. It fails here: [https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test with variants training on: (1) both of the original batches, (2) just the first batch, (3) just the second batch, and (4) neither batch. Here is the code, which avoids Streaming altogether in order to identify which batches were actually processed:

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
    stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
    return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:

{code}
####################### EXPECTED
[0, 1, 1]
[1, 0, 1]
####################### Skip batch 0
[1, 0, 0]
[0, 1, 0]
####################### Skip batch 1
[0, 1, 1]
[1, 0, 1]
####################### Skip both batches (This is what we see in the test failures.)
[0, 1, 1]
[0, 0, 0]
{code}

Skipping both batches reproduces the failure exactly. There is no randomness in the StreamingKMeans algorithm here (the initial centers are fixed, not randomized), so the flaky failures point to the streaming test sometimes updating the model on neither batch, not to the algorithm itself.

CC: [~tdas] [~freeman-lab] [~mengxr]

Failure message:

{code}
======================================================================
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1147, in test_trainOn_predictOn
    self._eventually(condition, catch_assertions=True)
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 123, in _eventually
    raise lastValue
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 114, in _eventually
    lastValue = condition()
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1144, in condition
    self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]

First differing element 1:
[0, 0, 0]
[1, 0, 1]

- [[0, 1, 1], [0, 0, 0]]
?              ^^^^

+ [[0, 1, 1], [1, 0, 1]]
?              +++    ^

----------------------------------------------------------------------
Ran 62 tests in 164.188s
{code}
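
As a cross-check on the results table above, the expected labels can be reproduced with no Spark at all. Below is a plain-Python sketch of the decayFactor=0.0 update step (each center is replaced by the mean of the points assigned to it in the batch; an empty cluster keeps its old center). It is illustrative only: it ignores MLlib's per-cluster weight bookkeeping, so it is meant to confirm just the EXPECTED row and the "skip both batches" row, not to model every variant:

{code}
# Plain-Python sanity check of the decayFactor=0.0 update (1-D points, k=2).
# NOTE: illustrative sketch, not the MLlib implementation; it ignores
# MLlib's cluster-weight bookkeeping.

def assign(p, centers):
    # Index of the nearest center.
    return min(range(len(centers)), key=lambda i: abs(p - centers[i]))

def update(centers, batch):
    # With decayFactor=0.0, each center becomes the mean of the points
    # assigned to it in this batch; an empty cluster keeps its old center.
    buckets = [[] for _ in centers]
    for p in batch:
        buckets[assign(p, centers)].append(p)
    return [sum(b) / len(b) if b else c for c, b in zip(centers, buckets)]

batches = [[-0.5, 0.6, 0.8], [0.2, -0.1, 0.3]]

# Trained on both batches (the EXPECTED case):
centers = [0.0, 1.0]
for batch in batches:
    centers = update(centers, batch)
print([assign(p, centers) for p in batches[0]])  # [0, 1, 1]
print([assign(p, centers) for p in batches[1]])  # [1, 0, 1]

# Trained on neither batch (what the failing test observes):
centers = [0.0, 1.0]
print([assign(p, centers) for p in batches[0]])  # [0, 1, 1]
print([assign(p, centers) for p in batches[1]])  # [0, 0, 0]
{code}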
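For context, the failing test exercises the same model through the streaming API, roughly as sketched below. This is a simplified, hypothetical reconstruction, not the test's actual fixture code (which lives in python/pyspark/mllib/tests.py with its _eventually retry helper); names like {{ssc}}, {{train_stream}}, and {{predict_results}} are assumptions for illustration:

{code}
# Rough sketch of what test_trainOn_predictOn does via the streaming API.
# Illustrative only; see python/pyspark/mllib/tests.py for the real test.
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans

ssc = StreamingContext(sc, 1.0)  # 1-second batch interval (assumed)

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Separate queue streams for training and prediction.
train_stream = ssc.queueStream([sc.parallelize(b, 1) for b in batches])
predict_stream = ssc.queueStream([sc.parallelize(b, 1) for b in batches])

stkm.trainOn(train_stream)  # updates the model once per batch

predict_results = []
stkm.predictOn(predict_stream).foreachRDD(
    lambda rdd: predict_results.append(rdd.collect()))

ssc.start()
# The test then polls predict_results until it equals
# [[0, 1, 1], [1, 0, 1]] or a timeout expires.  If the model is never
# updated before the predictions are made, the observed value is
# [[0, 1, 1], [0, 0, 0]], matching the failure message above.
{code}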

> Flaky StreamingKMeans test in PySpark
> -------------------------------------
>
>                 Key: SPARK-10086
>                 URL: https://issues.apache.org/jira/browse/SPARK-10086
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark, Streaming, Tests
>    Affects Versions: 1.5.0
>            Reporter: Joseph K. Bradley