Date: Thu, 5 May 2016 14:21:18 -0500 (CDT)
From: Daniel Bigham
To: java-user@lucene.apache.org
Subject: Re: StopFilterFactory with french_stop.txt

For the time being I seem to be able to do this by using a custom TokenFilterFactory class as follows. If there is a better approach, or if this approach seems flawed, let me know. Thanks.

package com.wolfram.textsearch;

import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ar.ArabicAnalyzer;
import org.apache.lucene.analysis.bg.BulgarianAnalyzer;
import org.apache.lucene.analysis.ca.CatalanAnalyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.ckb.SoraniAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.cz.CzechAnalyzer;
import org.apache.lucene.analysis.el.GreekAnalyzer;
import org.apache.lucene.analysis.eu.BasqueAnalyzer;
import org.apache.lucene.analysis.fa.PersianAnalyzer;
import org.apache.lucene.analysis.ga.IrishAnalyzer;
import org.apache.lucene.analysis.gl.GalicianAnalyzer;
import org.apache.lucene.analysis.hi.HindiAnalyzer;
import org.apache.lucene.analysis.hy.ArmenianAnalyzer;
import org.apache.lucene.analysis.id.IndonesianAnalyzer;
import org.apache.lucene.analysis.lt.LithuanianAnalyzer;
import org.apache.lucene.analysis.lv.LatvianAnalyzer;
import org.apache.lucene.analysis.ro.RomanianAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.th.ThaiAnalyzer;
import org.apache.lucene.analysis.tr.TurkishAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.TokenFilterFactory;
import org.apache.lucene.analysis.util.WordlistLoader;
import org.apache.lucene.util.IOUtils;

public class MultiLanguageStopWordFilterFactory extends TokenFilterFactory {

    String language = "English";
    private CharArraySet stopWords;
    private final static String DEFAULT_STOPWORD_FILE = "stopwords.txt";

    public MultiLanguageStopWordFilterFactory(Map<String, String> args) throws IOException {
        super(args);
        language = get(args, "language");
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }

        // 0 = load "stopwords.txt" from the analyzer's package,
        // 1 = load a Snowball-format list from the snowball package,
        // 2 = use StandardAnalyzer's built-in English stop set.
        int stopwordStyle = 0;
        String commentChar = "#";
        Class<?> analyzerClass = null;
        String stopwordFile = DEFAULT_STOPWORD_FILE;

        switch (language) {
            case "Arabic": analyzerClass = ArabicAnalyzer.class; break;
            case "Bulgarian": analyzerClass = BulgarianAnalyzer.class; break;
            case "Catalan": analyzerClass = CatalanAnalyzer.class; break;
            case "Chinese": analyzerClass = CJKAnalyzer.class; break;
            case "Japanese": analyzerClass = CJKAnalyzer.class; break;
            case "Korean": analyzerClass = CJKAnalyzer.class; break;
            case "KurdishCentral": analyzerClass = SoraniAnalyzer.class; break;
            case "Czech": analyzerClass = CzechAnalyzer.class; break;
            case "Danish": stopwordStyle = 1; stopwordFile = "danish_stop.txt"; break;
            case "German": stopwordStyle = 1; stopwordFile = "german_stop.txt"; break;
            case "Greek": analyzerClass = GreekAnalyzer.class; break;
            case "English": stopwordStyle = 2; break;
            case "Spanish": stopwordStyle = 1; stopwordFile = "spanish_stop.txt"; break;
            case "Basque": analyzerClass = BasqueAnalyzer.class; break;
            case "Persian": analyzerClass = PersianAnalyzer.class; break;
            case "Finnish": stopwordStyle = 1; stopwordFile = "finnish_stop.txt"; break;
            case "French": stopwordStyle = 1; stopwordFile = "french_stop.txt"; break;
            case "GaelicIrish": analyzerClass = IrishAnalyzer.class; break;
            case "Galician": analyzerClass = GalicianAnalyzer.class; break;
            case "Hindi": analyzerClass = HindiAnalyzer.class; break;
            case "Hungarian": stopwordStyle = 1; stopwordFile = "hungarian_stop.txt"; break;
            case "Armenian": analyzerClass = ArmenianAnalyzer.class; break;
            case "Indonesian": analyzerClass = IndonesianAnalyzer.class; break;
            case "Italian": stopwordStyle = 1; stopwordFile = "italian_stop.txt"; break;
            case "Lithuanian": analyzerClass = LithuanianAnalyzer.class; break;
            case "Latvian": analyzerClass = LatvianAnalyzer.class; break;
            case "Dutch": stopwordStyle = 1; stopwordFile = "dutch_stop.txt"; break;
            case "Norwegian": stopwordStyle = 1; stopwordFile = "norwegian_stop.txt"; break;
            case "Portuguese": stopwordStyle = 1; stopwordFile = "portuguese_stop.txt"; break;
            case "Romanian": analyzerClass = RomanianAnalyzer.class; break;
            case "Russian": stopwordStyle = 1; stopwordFile = "russian_stop.txt"; break;
            case "Swedish": stopwordStyle = 1; stopwordFile = "swedish_stop.txt"; break;
            case "Thai": analyzerClass = ThaiAnalyzer.class; break;
            case "Turkish": analyzerClass = TurkishAnalyzer.class; break;
        }

        if (stopwordStyle == 0) {
            stopWords = loadStopwordSet(false, analyzerClass, stopwordFile, commentChar);
        } else if (stopwordStyle == 1) {
            stopWords = WordlistLoader.getSnowballWordSet(
                IOUtils.getDecodingReader(SnowballFilter.class, stopwordFile, StandardCharsets.UTF_8));
        } else if (stopwordStyle == 2) {
            stopWords = StandardAnalyzer.STOP_WORDS_SET;
        }
    }

    /**
     * Load a stop word set.
     *
     * @param ignoreCase whether the resulting set should ignore case.
     * @param aClass the associated analyzer; the resource is resolved relative to its package.
     * @param resource the stopword file.
     * @param comment the character used in the file to indicate a comment.
     *
     * @return a set of stopwords.
     *
     * @throws IOException if the resource cannot be read.
     */
    static CharArraySet loadStopwordSet(boolean ignoreCase, final Class<?> aClass,
            final String resource, final String comment) throws IOException {
        Reader reader = null;
        try {
            reader = IOUtils.getDecodingReader(aClass.getResourceAsStream(resource), StandardCharsets.UTF_8);
            return WordlistLoader.getWordSet(reader, comment, new CharArraySet(16, ignoreCase));
        } finally {
            IOUtils.close(reader);
        }
    }

    @Override
    public TokenStream create(TokenStream input) {
        StopFilter stopFilter = new StopFilter(input, stopWords);
        return stopFilter;
    }
}
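In case it helps, this is roughly how the factory gets wired up on my end. It is only a minimal sketch: the standard tokenizer and lower-case filter are placeholders for whatever your real chain uses, and "French" is just the value my "language" parameter expects.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;

// Build an analyzer whose stop filter is driven by the "language" parameter.
Analyzer analyzer = CustomAnalyzer.builder()
    .withTokenizer(StandardTokenizerFactory.class)
    .addTokenFilter(LowerCaseFilterFactory.class)
    .addTokenFilter(MultiLanguageStopWordFilterFactory.class, "language", "French")
    .build();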
----- On May 5, 2016, at 2:02 PM, danielb wrote:

> I'd like to use CustomAnalyzer to create an analyzer that is much like
> the FrenchAnalyzer. In doing that, I'm using StopFilterFactory, but I'm
> unsure how to point it at "french_stop.txt", i.e. what FrenchAnalyzer is
> using here:
>
>     public final class FrenchAnalyzer extends StopwordAnalyzerBase {
>         public final static String DEFAULT_STOPWORD_FILE = "french_stop.txt";
>         ...
>
> The typical use of StopFilterFactory:
>
>     .addTokenFilter(StopFilterFactory.class, "ignoreCase", "false",
>         "words", "french_stop.txt", "format", "wordset")
>
> But this looks for a file "french_stop.txt" and can't find it
> (presumably it's looking in a completely different location from
> FrenchAnalyzer).
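And to double-check that the factory actually picks up the right list, I've been dumping the token stream like this (the field name and sample text are arbitrary):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Print the surviving tokens; French stop words such as "le" and "est" should be dropped.
try (TokenStream ts = analyzer.tokenStream("body", "le chat est sur la table")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term.toString());
    }
    ts.end();
}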