From: mengxr
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
Subject: [GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer
Message-Id: <20150224071443.DA532E03E8@git1-us-west.apache.org>
Date: Tue, 24 Feb 2015 07:14:43 +0000 (UTC)

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4504#discussion_r25232691

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala ---
    @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String, Seq[String], Tokenizer] {

       override protected def outputDataType: DataType = new ArrayType(StringType, false)
     }
    +
    +
    +/**
    + * :: AlphaComponent ::
    + * A regex-based tokenizer that extracts tokens using a regular expression.
    + * Optional parameters include lower-case standardization, a minimum token
    + * length, and an array of stop words to remove from the results.
    + */
    +@AlphaComponent
    +class RegexTokenizer extends UnaryTransformer[String, Seq[String], RegexTokenizer] {
    +
    +  val lowerCase = new BooleanParam(this,
    +    "lowerCase",
    +    "enable case folding to lower case",
    +    Some(true))
    +  def setLowercase(value: Boolean) = set(lowerCase, value)
    +  def getLowercase: Boolean = get(lowerCase)
    +
    +  val minLength = new IntParam(this,
    +    "minLength",
    +    "minimum token length (exclusive)",
    +    Some(0))
    +  def setMinLength(value: Int) = set(minLength, value)
    +  def getMinLength: Int = get(minLength)
    +
    +  val regEx = new Param(this,
    +    "regEx",
    +    "RegEx used for tokenizing",
    +    Some("\\p{L}+|[^\\p{L}\\s]+".r))
    --- End diff --

    Using `Regex` as the param type is not Java/Python friendly. We should use a plain string.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
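[Editor's note] The reviewer's suggestion (carry the pattern as a plain `String` param and compile it to a `Regex` only on the Scala side, so Java and Python callers can set it) could be sketched roughly as below. This is a standalone illustration, not the actual Spark `Param` API; the object and method names are hypothetical.

```scala
// Hypothetical sketch: the param value crosses the Java/Python boundary as a
// plain string, and is compiled to a scala.util.matching.Regex only where
// the tokenization actually happens.
object RegexParamSketch {
  // Same default pattern as in the diff, but stored as a String, not a Regex.
  val defaultPattern: String = "\\p{L}+|[^\\p{L}\\s]+"

  // Compile lazily at use time; callers in any language just pass a string.
  def tokenize(text: String, pattern: String = defaultPattern): Seq[String] =
    pattern.r.findAllIn(text).toSeq

  def main(args: Array[String]): Unit = {
    // Letters group together; punctuation runs become their own tokens.
    println(tokenize("Hello, world!"))  // prints List-like: Hello, ",", world, "!"
  }
}
```

A string-typed param also serializes cleanly (e.g. in ML pipeline persistence), which a compiled `Regex` object would not.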