spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data
Date Mon, 13 Mar 2017 09:02:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907038#comment-15907038 ]

Sean Owen commented on SPARK-18608:
-----------------------------------

Is the point here that the .ml implementation can check the storage level of its input and
pass that information internally to the .mllib implementation it delegates to? Yes, I think
that's the most feasible answer to this particular problem, as long as it doesn't change a
public API. The .mllib implementations would need some internal mechanism for receiving this
information about the parent dataset's storage level, where applicable.
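
As a rough illustration of that shape (a minimal sketch only; the method names and the
handlePersistence parameter here are hypothetical, not the actual internal API):

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

// .ml side: the Dataset's real storage level is visible here, so decide here...
def fit(dataset: Dataset[_]): Long = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  val instances = dataset.select("label", "features").rdd.map {
    case Row(label: Double, features: Vector) => (label, features)
  }
  run(instances, handlePersistence)
}

// ...and pass the decision down, so the .mllib side never consults rdd.getStorageLevel.
def run(instances: RDD[(Double, Vector)], handlePersistence: Boolean): Long = {
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    instances.count()  // stand-in for the actual training loop
  } finally {
    if (handlePersistence) instances.unpersist()
  }
}
{code}

Whether MEMORY_AND_DISK is the right level is a separate choice; the point is only that the
decision is made once, against dataset.storageLevel, and then threaded through internally.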

It does raise the more general question of whether these implementations can meaningfully
decide about caching at all, and whether they should try, rather than just warn. More
generally, it's hard for any library function to reason about whether to persist its input,
or even to reason about when a data structure can be unpersisted. Those are much bigger,
separate questions, but they are why this type of question keeps popping up and is hard to
solve.
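
For comparison, the "just warn" option amounts to a single check like the sketch below (the
function name and message text are illustrative, not what any Spark algorithm actually logs):

{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Leave persistence entirely to the caller; only flag the likely performance problem.
def warnIfUncached(dataset: Dataset[_]): Unit = {
  if (dataset.storageLevel == StorageLevel.NONE) {
    println("WARN: input dataset is not cached, which may hurt performance")
  }
}
{code}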

> Spark ML algorithms that check RDD cache level for internal caching double-cache data
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-18608
>                 URL: https://issues.apache.org/jira/browse/SPARK-18608
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence internally. They check whether the input dataset is cached, and if not they cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. This will actually always be true, since even if the dataset itself is cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached twice, which is wasteful.
> To see this:
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input {{Dataset}}, but now we can, so the checks should be migrated to use {{dataset.storageLevel}}.
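
For illustration, continuing the spark-shell session quoted above, the old and the migrated
checks come out as follows (the variable names are just for the demo, not from the patch):

{code}
scala> // old check: always true, so the algorithm caches again even though df is persisted
scala> val handlePersistenceOld = df.rdd.getStorageLevel == StorageLevel.NONE
handlePersistenceOld: Boolean = true

scala> // migrated check: sees that df is already cached, so no second copy is made
scala> val handlePersistenceNew = df.storageLevel == StorageLevel.NONE
handlePersistenceNew: Boolean = false
{code}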



