spark-commits mailing list archives

From jkbrad...@apache.org
Subject spark git commit: [SPARK-9062] [ML] Change output type of Tokenizer to Array(String, true)
Date Fri, 17 Jul 2015 20:43:24 GMT
Repository: spark
Updated Branches:
  refs/heads/master f9a82a884 -> 806c579f4


[SPARK-9062] [ML] Change output type of Tokenizer to Array(String, true)

jira: https://issues.apache.org/jira/browse/SPARK-9062

Currently the output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec
and other transformers, since their input type is Array(String, true). A Seq[String] returned from a UDF
is treated as Array(String, true) by default.
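The incompatibility arises because pipeline schema validation compares DataTypes for exact equality, so a mismatched containsNull flag alone fails the check. A minimal sketch of that behavior in plain Scala, using a simplified stand-in case class rather than the real org.apache.spark.sql.types.ArrayType:

```scala
// Simplified stand-in for Spark SQL's ArrayType; the real class also carries
// an elementType of DataType rather than String.
case class ArrayType(elementType: String, containsNull: Boolean)

// Tokenizer's declared output before and after this commit.
val tokenizerOutBefore = ArrayType("string", containsNull = false)
val tokenizerOutAfter  = ArrayType("string", containsNull = true)

// Word2Vec (and other downstream transformers) expect this input type.
val word2VecInput = ArrayType("string", containsNull = true)

// Schema validation checks exact equality, so the flags must match:
assert(tokenizerOutBefore != word2VecInput) // before the fix: rejected
assert(tokenizerOutAfter == word2VecInput)  // after the fix: accepted
```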

I'm not sure what's the recommended way for Tokenizer to handle the null value in the input.
Any suggestion will be welcome.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #7414 from hhbyyh/tokenizer and squashes the following commits:

c01bd7a [Yuhao Yang] change output type of tokenizer


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/806c579f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/806c579f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/806c579f

Branch: refs/heads/master
Commit: 806c579f43ce66ac1398200cbc773fa3b69b5cb6
Parents: f9a82a8
Author: Yuhao Yang <hhbyyh@gmail.com>
Authored: Fri Jul 17 13:43:19 2015 -0700
Committer: Joseph K. Bradley <joseph@databricks.com>
Committed: Fri Jul 17 13:43:19 2015 -0700

----------------------------------------------------------------------
 mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/806c579f/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
index 5f9f57a..0b3af47 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
@@ -42,7 +42,7 @@ class Tokenizer(override val uid: String) extends UnaryTransformer[String, Seq[S
     require(inputType == StringType, s"Input type must be string type but got $inputType.")
   }
 
-  override protected def outputDataType: DataType = new ArrayType(StringType, false)
+  override protected def outputDataType: DataType = new ArrayType(StringType, true)
 
   override def copy(extra: ParamMap): Tokenizer = defaultCopy(extra)
 }
@@ -113,7 +113,7 @@ class RegexTokenizer(override val uid: String)
     require(inputType == StringType, s"Input type must be string type but got $inputType.")
   }
 
-  override protected def outputDataType: DataType = new ArrayType(StringType, false)
+  override protected def outputDataType: DataType = new ArrayType(StringType, true)
 
   override def copy(extra: ParamMap): RegexTokenizer = defaultCopy(extra)
 }


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org

