spark-issues mailing list archives

From "Felix Cheung (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads
Date Sat, 05 Nov 2016 04:11:58 GMT

    [ https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15638546#comment-15638546 ]

Felix Cheung commented on SPARK-12757:
--------------------------------------

I'm seeing the same warning on the latest master when running a pipeline with a GBTClassifier:

{code}
WARN Executor: 1 block locks were not released by TID = 7:
[rdd_28_0]
{code}

To reproduce, take the code sample from the ML programming guide:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a GBT model.
val gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and GBT in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
{code}

> Use reference counting to prevent blocks from being evicted during reads
> ------------------------------------------------------------------------
>
>                 Key: SPARK-12757
>                 URL: https://issues.apache.org/jira/browse/SPARK-12757
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>             Fix For: 2.0.0
>
>
> As a pre-requisite to off-heap caching of blocks, we need a mechanism to prevent pages / blocks from being evicted while they are being read. With on-heap objects, evicting a block while it is being read merely leads to memory-accounting problems (because we assume that an evicted block is a candidate for garbage collection, which will not be true during a read), but with off-heap memory this will lead to either data corruption or segmentation faults.
> To address this, we should add a reference-counting mechanism to track which blocks/pages are being read in order to prevent them from being evicted prematurely. I propose to do this in two phases: first, add a safe, conservative approach in which all BlockManager.get*() calls implicitly increment the reference count of blocks and where tasks' references are automatically freed upon task completion. This will be correct but may have adverse performance impacts because it will prevent legitimate block evictions. In phase two, we should incrementally add release() calls in order to fix the eviction of unreferenced blocks. The latter change may need to touch many different components, which is why I propose to do it separately in order to make the changes easier to reason about and review.
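
The phase-one scheme quoted above (every get implicitly pins a block; all of a task's pins are dropped at task completion) can be sketched roughly as follows. This is an illustrative assumption, not Spark's actual BlockInfoManager implementation: the class and method names (BlockRefCounter, acquire, releaseAllForTask, isEvictable) are hypothetical.

{code}
import scala.collection.mutable

// Hypothetical sketch of per-task block pinning. Not Spark's real API.
class BlockRefCounter {
  // (taskId, blockId) -> number of pins that task holds on that block.
  private val pins = mutable.Map[(Long, String), Int]().withDefaultValue(0)

  // Called implicitly by every get*(): increment the task's pin count.
  def acquire(taskId: Long, blockId: String): Unit = synchronized {
    pins((taskId, blockId)) += 1
  }

  // Phase two would add explicit release() calls when a read finishes.
  def release(taskId: Long, blockId: String): Unit = synchronized {
    val k = (taskId, blockId)
    if (pins(k) > 1) pins(k) -= 1 else pins.remove(k)
  }

  // Run at task completion: drop all of the task's remaining pins and
  // return the block ids that were still pinned (the "locks not released"
  // the executor warns about).
  def releaseAllForTask(taskId: Long): Seq[String] = synchronized {
    val held = pins.keys.toSeq.filter(_._1 == taskId)
    held.foreach(pins.remove)
    held.map(_._2)
  }

  // A block is an eviction candidate only when no task holds a pin on it.
  def isEvictable(blockId: String): Boolean = synchronized {
    !pins.keys.exists(_._2 == blockId)
  }
}
{code}

Under this model, a warning like the one above would mean task 7 reached completion while still holding a pin on rdd_28_0, so the lock was reclaimed by the task-completion cleanup rather than by an explicit release().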



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
