From: donald@apache.org
To: commits@predictionio.incubator.apache.org
Reply-To: dev@predictionio.incubator.apache.org
Date: Thu, 04 May 2017 18:25:24 -0000
Message-Id: <30c87322d1d84970bfd81528ac2192c3@git.apache.org>
In-Reply-To: <1aa41639140d4ad29616a401795c751e@git.apache.org>
References: <1aa41639140d4ad29616a401795c751e@git.apache.org>
Subject: [2/9] incubator-predictionio-template-text-classifier git commit: Filter out stop words from vectorization

Filter out stop words from vectorization

As per the discussion in
https://github.com/apache/incubator-predictionio-template-text-classifier/pull/8,
we implement a filter for stop words. The stop words are passed to the
constructor of TFHasher and filtered out of the text during vectorization.

Project: http://git-wip-us.apache.org/repos/asf/incubator-predictionio-template-text-classifier/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-predictionio-template-text-classifier/commit/1a316143
Tree: http://git-wip-us.apache.org/repos/asf/incubator-predictionio-template-text-classifier/tree/1a316143
Diff: http://git-wip-us.apache.org/repos/asf/incubator-predictionio-template-text-classifier/diff/1a316143

Branch: refs/heads/master
Commit: 1a316143f169bc7804604d0914b380381dfb9fa1
Parents: 7bff411
Author: Natu Lauchande
Authored: Mon Dec 5 17:36:04 2016 +0200
Committer: Natu Lauchande
Committed: Tue Dec 6 04:04:47 2016 +0200

----------------------------------------------------------------------
 src/main/scala/Preparator.scala | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-predictionio-template-text-classifier/blob/1a316143/src/main/scala/Preparator.scala
----------------------------------------------------------------------
diff --git a/src/main/scala/Preparator.scala b/src/main/scala/Preparator.scala
index c8b35d0..1681acc 100644
--- a/src/main/scala/Preparator.scala
+++ b/src/main/scala/Preparator.scala
@@ -26,7 +26,7 @@ class Preparator(pp: PreparatorParams)
 
   def prepare(sc: SparkContext, td: TrainingData): PreparedData = {
 
-    val tfHasher = new TFHasher(pp.numFeatures, pp.nGram)
+    val tfHasher = new TFHasher(pp.numFeatures, pp.nGram, td.stopWords)
 
     // Convert trainingdata's observation text into TF vector
     // and then fit a IDF model
@@ -57,7 +57,8 @@ class Preparator(pp: PreparatorParams)
 
 class TFHasher(
   val numFeatures: Int,
-  val nGram: Int
+  val nGram: Int,
+  val stopWords:Set[String]
 ) extends Serializable {
 
   private val hasher = new HashingTF(numFeatures = numFeatures)
@@ -65,6 +66,7 @@ class TFHasher(
   /** Hashing function: Text -> term frequency vector. */
   def hashTF(text: String): Vector = {
     val newList : Array[String] = text.split(" ")
+    .filterNot(stopWords.contains(_))
     .sliding(nGram)
     .map(_.mkString)
     .toArray
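
For illustration only, the token pipeline that hashTF builds after this change can be tried without Spark. The sketch below is not part of the commit: the object name StopWordNGrams and the method name nGrams are hypothetical, and the final HashingTF step is left out. It assumes plain Scala collections and copies the split / filterNot / sliding / mkString chain from the diff.

```scala
// Hypothetical standalone helper mirroring the collection calls in hashTF.
// The names StopWordNGrams and nGrams are illustrative; they are not in the template.
object StopWordNGrams {
  def nGrams(text: String, nGram: Int, stopWords: Set[String]): Array[String] =
    text.split(" ")
      .filterNot(stopWords.contains(_)) // drop exact, case-sensitive stop-word matches
      .sliding(nGram)                   // windows of nGram consecutive remaining tokens
      .map(_.mkString)                  // join each window with no separator
      .toArray

  def main(args: Array[String]): Unit = {
    // Prints: quickbrown, brownfox
    println(nGrams("the quick brown fox", 2, Set("the")).mkString(", "))
  }
}
```

Because the filter runs before sliding, an n-gram can join two words that were separated by a stop word in the original text. Set.contains is an exact match, so "The" would not be removed by a stop-word set containing only "the".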