Date: Wed, 23 Sep 2015 22:32:04 +0000 (UTC)
From: "Joseph K. Bradley (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Updated] (SPARK-10086) Flaky StreamingKMeans test in PySpark

    [ https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-10086:
--------------------------------------
    Description:

Here's a report on investigating test failures in StreamingKMeans in PySpark. (See Jenkins links below.)

The failing test trains a StreamingKMeans model on a DStream with 2 batches and then predicts on those same 2 batches. It fails here: [https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144]

I recreated the same test with variants training on: (1) both of the original batches, (2) just the first batch, (3) just the second batch, and (4) neither batch. Here is the code, which avoids Streaming altogether in order to identify which batches were actually processed:

{code}
from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
batches = [sc.parallelize(batch) for batch in batches]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Train
def update(rdd):
    stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)

# Remove one or both of these lines to test skipping batches.
update(batches[0])
update(batches[1])

# Test
def predict(rdd):
    return stkm._model.predict(rdd)

predict(batches[0]).collect()
predict(batches[1]).collect()
{code}

*Results*:

{code}
####################### EXPECTED
[0, 1, 1]
[1, 0, 1]
####################### Skip batch 0
[1, 0, 0]
[0, 1, 0]
####################### Skip batch 1
[0, 1, 1]
[1, 0, 1]
####################### Skip both batches (This is what we see in the test failures.)
[0, 1, 1]
[0, 0, 0]
{code}

Skipping both batches reproduces the failure exactly. There is no randomness in the StreamingKMeans algorithm here (the initial centers are fixed, not randomized), so the flaky failures point to the streaming test sometimes updating the model on neither batch, not to the algorithm itself.

CC: [~tdas] [~freeman-lab] [~mengxr]

Failure message:

{code}
======================================================================
FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest)
Test that prediction happens on the updated model.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1147, in test_trainOn_predictOn
    self._eventually(condition, catch_assertions=True)
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 123, in _eventually
    raise lastValue
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 114, in _eventually
    lastValue = condition()
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", line 1144, in condition
    self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]])
AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]]

First differing element 1:
[0, 0, 0]
[1, 0, 1]

- [[0, 1, 1], [0, 0, 0]]
?              ^^^^

+ [[0, 1, 1], [1, 0, 1]]
?              +++    ^

----------------------------------------------------------------------
Ran 62 tests in 164.188s
{code}
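
As a cross-check on the results table above, the expected labels can be reproduced with no Spark at all. Below is a plain-Python sketch of the decayFactor=0.0 update step (each center is replaced by the mean of the points assigned to it in the batch; an empty cluster keeps its old center). It is illustrative only: it ignores MLlib's per-cluster weight bookkeeping, so it is meant to confirm just the EXPECTED row and the "skip both batches" row, not to model every variant:

{code}
# Plain-Python sanity check of the decayFactor=0.0 update (1-D points, k=2).
# NOTE: illustrative sketch, not the MLlib implementation; it ignores
# MLlib's cluster-weight bookkeeping.

def assign(p, centers):
    # Index of the nearest center.
    return min(range(len(centers)), key=lambda i: abs(p - centers[i]))

def update(centers, batch):
    # With decayFactor=0.0, each center becomes the mean of the points
    # assigned to it in this batch; an empty cluster keeps its old center.
    buckets = [[] for _ in centers]
    for p in batch:
        buckets[assign(p, centers)].append(p)
    return [sum(b) / len(b) if b else c for c, b in zip(centers, buckets)]

batches = [[-0.5, 0.6, 0.8], [0.2, -0.1, 0.3]]

# Trained on both batches (the EXPECTED case):
centers = [0.0, 1.0]
for batch in batches:
    centers = update(centers, batch)
print([assign(p, centers) for p in batches[0]])  # [0, 1, 1]
print([assign(p, centers) for p in batches[1]])  # [1, 0, 1]

# Trained on neither batch (what the failing test observes):
centers = [0.0, 1.0]
print([assign(p, centers) for p in batches[0]])  # [0, 1, 1]
print([assign(p, centers) for p in batches[1]])  # [0, 0, 0]
{code}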
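For context, the failing test exercises the same model through the streaming API, roughly as sketched below. This is a simplified, hypothetical reconstruction, not the test's actual fixture code (which lives in python/pyspark/mllib/tests.py with its _eventually retry helper); names like {{ssc}}, {{train_stream}}, and {{predict_results}} are assumptions for illustration:

{code}
# Rough sketch of what test_trainOn_predictOn does via the streaming API.
# Illustrative only; see python/pyspark/mllib/tests.py for the real test.
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans

ssc = StreamingContext(sc, 1.0)  # 1-second batch interval (assumed)

batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]

stkm = StreamingKMeans(decayFactor=0.0, k=2)
stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])

# Separate queue streams for training and prediction.
train_stream = ssc.queueStream([sc.parallelize(b, 1) for b in batches])
predict_stream = ssc.queueStream([sc.parallelize(b, 1) for b in batches])

stkm.trainOn(train_stream)  # updates the model once per batch

predict_results = []
stkm.predictOn(predict_stream).foreachRDD(
    lambda rdd: predict_results.append(rdd.collect()))

ssc.start()
# The test then polls predict_results until it equals
# [[0, 1, 1], [1, 0, 1]] or a timeout expires.  If the model is never
# updated before the predictions are made, the observed value is
# [[0, 1, 1], [0, 0, 0]], matching the failure message above.
{code}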

> Flaky StreamingKMeans test in PySpark
> -------------------------------------
>
>                 Key: SPARK-10086
>                 URL: https://issues.apache.org/jira/browse/SPARK-10086
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark, Streaming, Tests
>    Affects Versions: 1.5.0
>            Reporter: Joseph K. Bradley