spark-issues mailing list archives

From "Hunter Kelly (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-12026) ChiSqTest gets slower and slower over time when number of features is large
Date Fri, 27 Nov 2015 14:58:11 GMT

    [ https://issues.apache.org/jira/browse/SPARK-12026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029929#comment-15029929 ]

Hunter Kelly edited comment on SPARK-12026 at 11/27/15 2:57 PM:
----------------------------------------------------------------

Here's the bumpf from the details button on the UI:

org.apache.spark.rdd.RDD.countByValue(RDD.scala:1156)
org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredFeatures(ChiSqTest.scala:117)
org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:178)
org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:183)
org.apache.spark.mllib.stat.Statistics.chiSqTest(Statistics.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
clojure.lang.Reflector.invokeStaticMethod(Reflector.java:207)
bow.optimized$chi_square_exploration.doInvoke(optimized.clj:99)
clojure.lang.RestFn.applyTo(RestFn.java:137)
clojure.core$apply.invoke(core.clj:630)
util.cli$_main.doInvoke(cli.clj:63)
clojure.lang.RestFn.applyTo(RestFn.java:139)
util.cli.main(Unknown Source)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Basically it seems to run 2 stages per batch (each batch of 1000 is its own job, I believe).
The first countByValue is the one that gets slower; a second countByValue on the same line
runs in a few seconds (as opposed to minutes).

I'll attach some screenshots from the SparkUI.


was (Author: retnuh):
Here's the bumpf from the details button on the UI:

org.apache.spark.rdd.RDD.countByValue(RDD.scala:1156)
org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredFeatures(ChiSqTest.scala:117)
org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:178)
org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:183)
org.apache.spark.mllib.stat.Statistics.chiSqTest(Statistics.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
clojure.lang.Reflector.invokeStaticMethod(Reflector.java:207)
bow.optimized$chi_square_exploration.doInvoke(optimized.clj:99)
clojure.lang.RestFn.applyTo(RestFn.java:137)
clojure.core$apply.invoke(core.clj:630)
util.cli$_main.doInvoke(cli.clj:63)
clojure.lang.RestFn.applyTo(RestFn.java:139)
util.cli.main(Unknown Source)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Basically it seems to run 2 stages per batch (each batch of 1000 is its own job, I believe).
The first countByValue is the one that gets slower; a second countByValue on the same line
runs in a few seconds (as opposed to minutes).

I have saved the HTML from one of the long runs that shows the times of all the stages.  I
need to scrub the URLs before I post it though; gimme a few minutes.

> ChiSqTest gets slower and slower over time when number of features is large
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-12026
>                 URL: https://issues.apache.org/jira/browse/SPARK-12026
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.5.2
>            Reporter: Hunter Kelly
>              Labels: mllib, stats
>         Attachments: First Stages.png, Latest Stages.png
>
>
> I've been running a ChiSqTest to pick features for feature reduction.  My understanding
> is that internally it creates jobs to run on batches of 1000 features at a time.
> I was under the impression that the features are treated as independent, but this does
> not appear to be the case.  When the number of features is large (160k in my case), each batch
> gets slower and slower.  As an example, running on 25 m3.2xlarges on Amazon EMR, it started
> at just over 1 minute per batch.  By the end, batches were taking over 30 minutes each.
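
For readers trying to picture the batching described above, here is a minimal plain-Python sketch. It is not Spark's code: the function name `batched_pair_counts` and the `(label, features)` row layout are made up for illustration, and the only details taken from this report are the batch width of 1000 and the fact that each batch ends in a countByValue. It shows the expectation behind the report: if columns are handled independently, every batch should cost roughly the same.

```python
from collections import Counter

BATCH_SIZE = 1000  # batch width as described in the issue; everything else is assumed


def batched_pair_counts(rows, batch_size=BATCH_SIZE):
    """Yield (start_col, end_col, counts) for each batch of feature columns.

    rows   -- list of (label, features) pairs; every features list has the same length
    counts -- Counter mapping (col, feature_value, label) -> number of occurrences,
              a local stand-in for calling countByValue on such triples
    """
    if not rows:
        return
    num_cols = len(rows[0][1])
    for start in range(0, num_cols, batch_size):
        end = min(start + batch_size, num_cols)
        counts = Counter()
        for label, features in rows:
            # Index straight into this batch's columns, so the per-row work
            # depends only on the batch width and not on where the batch starts.
            for col in range(start, end):
                counts[(col, features[col], label)] += 1
        yield start, end, counts


if __name__ == "__main__":
    demo_rows = [
        (1.0, [0.0, 1.0, 0.0]),
        (0.0, [0.0, 0.0, 1.0]),
        (1.0, [0.0, 1.0, 1.0]),
    ]
    for start, end, counts in batched_pair_counts(demo_rows, batch_size=2):
        print(start, end, dict(counts))
```

In a sketch like this, a batch starting at column 150000 does the same amount of work as the batch starting at column 0, which is why a steady slowdown from 1 minute to 30 minutes per batch looks like a bug and not an inherent cost.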



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


