spark-commits mailing list archives

From sro...@apache.org
Subject spark git commit: [SPARK-18374][ML] Incorrect words in StopWords/english.txt
Date Tue, 06 Dec 2016 21:12:29 GMT
Repository: spark
Updated Branches:
  refs/heads/master 1ef6b296d -> fac5b75b7


[SPARK-18374][ML] Incorrect words in StopWords/english.txt

## What changes were proposed in this pull request?

Currently the English stop words list in MLlib contains only the fragments left after removing
all apostrophes, so "wouldn't" becomes "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer
don't split on apostrophes or quotes, so those fragments never match the tokens they actually produce.

This patch adds the original contracted forms to the stop words list to match the behavior of
Tokenizer and StopWordsRemover. It also removes "won" from the list.

See the JIRA for more discussion: https://issues.apache.org/jira/browse/SPARK-18374
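
The mismatch described above can be illustrated with a minimal plain-Scala sketch. It does not use the Spark API: `tokenize` and `removeStopWords` are hypothetical stand-ins that only approximate the default Tokenizer (lowercase, split on whitespace) and StopWordsRemover, and `oldList` and `newList` are small illustrative excerpts, not the full english.txt.

```scala
// Simplified stand-ins for Spark ML's Tokenizer and StopWordsRemover.
// These helpers are assumptions for illustration, not the real Spark API.
object StopWordsSketch {
  // Lowercase and split on whitespace; apostrophes stay inside tokens.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Drop every token that appears in the stop word set.
  def removeStopWords(tokens: Seq[String], stopWords: Set[String]): Seq[String] =
    tokens.filterNot(stopWords.contains)

  // Illustrative excerpts only: fragments before the patch, full forms after it.
  val oldList: Set[String] = Set("i", "it", "wouldn", "won", "d")
  val newList: Set[String] = Set("i", "it", "wouldn't", "won't")

  def main(args: Array[String]): Unit = {
    val tokens = tokenize("I wouldn't say it won")
    // Old list: "wouldn't" survives because only "wouldn" is listed,
    // while the ordinary word "won" is dropped.
    println(removeStopWords(tokens, oldList))
    // New list: "wouldn't" is removed and "won" is kept.
    println(removeStopWords(tokens, newList))
  }
}
```

With the old excerpt the contraction passes through unfiltered and the verb "won" is lost; with the new excerpt the behavior is reversed, which is the effect this patch aims for.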

## How was this patch tested?
Existing unit tests.

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #16103 from hhbyyh/addstopwords.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fac5b75b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fac5b75b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fac5b75b

Branch: refs/heads/master
Commit: fac5b75b74b2d76b6314c69be3c769f1f321688c
Parents: 1ef6b29
Author: Yuhao <yuhao.yang@intel.com>
Authored: Wed Dec 7 05:12:24 2016 +0800
Committer: Sean Owen <sowen@cloudera.com>
Committed: Wed Dec 7 05:12:24 2016 +0800

----------------------------------------------------------------------
 .../spark/ml/feature/stopwords/english.txt      | 80 +++++++++++++-------
 .../ml/feature/StopWordsRemoverSuite.scala      |  2 +-
 2 files changed, 55 insertions(+), 27 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/fac5b75b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt
----------------------------------------------------------------------
diff --git a/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt
index d075cc0..d6094d7 100644
--- a/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt
+++ b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt
@@ -125,29 +125,57 @@ just
 don
 should
 now
-d
-ll
-m
-o
-re
-ve
-y
-ain
-aren
-couldn
-didn
-doesn
-hadn
-hasn
-haven
-isn
-ma
-mightn
-mustn
-needn
-shan
-shouldn
-wasn
-weren
-won
-wouldn
+i'll
+you'll
+he'll
+she'll
+we'll
+they'll
+i'd
+you'd
+he'd
+she'd
+we'd
+they'd
+i'm
+you're
+he's
+she's
+it's
+we're
+they're
+i've
+we've
+you've
+they've
+isn't
+aren't
+wasn't
+weren't
+haven't
+hasn't
+hadn't
+don't
+doesn't
+didn't
+won't
+wouldn't
+shan't
+shouldn't
+mustn't
+can't
+couldn't
+cannot
+could
+here's
+how's
+let's
+ought
+that's
+there's
+what's
+when's
+where's
+who's
+why's
+would
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/spark/blob/fac5b75b/mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
----------------------------------------------------------------------
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
index 957cf58..5262b14 100755
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
@@ -45,7 +45,7 @@ class StopWordsRemoverSuite
       .setOutputCol("filtered")
     val dataSet = Seq(
       (Seq("test", "test"), Seq("test", "test")),
-      (Seq("a", "b", "c", "d"), Seq("b", "c")),
+      (Seq("a", "b", "c", "d"), Seq("b", "c", "d")),
       (Seq("a", "the", "an"), Seq()),
       (Seq("A", "The", "AN"), Seq()),
       (Seq(null), Seq(null)),



