spark-reviews mailing list archives

From WeichenXu123 <...@git.apache.org>
Subject [GitHub] spark pull request #20313: [SPARK-22974][ML] Attach attributes to output col...
Date Mon, 02 Apr 2018 09:48:42 GMT
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20313#discussion_r178517391
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -264,7 +265,9 @@ class CountVectorizerModel(
     
           Vectors.sparse(dictBr.value.size, effectiveCounts)
         }
    -    dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
    +    val attrs = vocabulary.map(_ => new NumericAttribute).asInstanceOf[Array[Attribute]]
    --- End diff --
    
    These attributes carry no useful statistics; they only allocate a large array (one entry
per vocabulary term). I think they should be generated lazily, e.g., only when a following
transformer actually needs them.
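
    A minimal sketch of what lazy generation could look like. It is plain Scala and uses
hypothetical stand-in types (`NumericAttr`, `LazyAttrs`), not Spark's actual attribute classes
or the code in this pull request. The vector size stays cheap to query, and the per-term array
is only built on first access:

```scala
// Hypothetical stand-in for an ML attribute; NOT Spark's NumericAttribute.
final case class NumericAttr(index: Int)

// Hypothetical holder that defers building the per-term attribute array.
final class LazyAttrs(vocabulary: Array[String]) {
  // Flag used only to show when the allocation actually happens.
  var built: Boolean = false

  // Cheap: available without allocating any attribute objects.
  def size: Int = vocabulary.length

  // A `lazy val` runs its initializer once, on first access.
  lazy val attrs: Array[NumericAttr] = {
    built = true
    Array.tabulate(vocabulary.length)(i => NumericAttr(i))
  }
}

object LazyAttrsDemo {
  def main(args: Array[String]): Unit = {
    val holder = new LazyAttrs(Array("spark", "ml", "vector"))
    println(holder.size)         // 3
    println(holder.built)        // false: nothing allocated yet
    println(holder.attrs.length) // 3: first access builds the array
    println(holder.built)        // true
  }
}
```

    With this shape, a pipeline whose later stages never read the attributes never pays for
the array, while a stage that does need them gets the array built on first access and reused
afterwards.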

