Date: Sun, 23 Dec 2012 11:56:52 -0500
Subject: WordDelimiterFilter Question (lucene 4.0)
From: Jeremy Long
To: java-user@lucene.apache.org

Hello,

I'm having an issue creating a custom analyzer utilizing the WordDelimiterFilter. I'm attempting to create an index of information gleaned from JAR manifest files. So if I have "spring-framework" I need the following tokens indexed: "spring" "springframework" "framework" "spring-framework". My understanding is that the WordDelimiterFilter is perfect for this. However, when I introduce the filter to the analyzer I don't seem to get any documents indexed correctly.

Here is the analyzer:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopAnalyzer;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
    import org.apache.lucene.util.Version;

    public class FieldAnalyzer extends Analyzer {

        private Version version = null;

        public FieldAnalyzer(Version version) {
            this.version = version;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new WhitespaceTokenizer(version, reader);
            TokenStream stream = source;

            stream = new WordDelimiterFilter(stream,
                    WordDelimiterFilter.CATENATE_WORDS
                    & WordDelimiterFilter.GENERATE_WORD_PARTS
                    & WordDelimiterFilter.PRESERVE_ORIGINAL
                    & WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
                    & WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE,
                    null);

            stream = new LowerCaseFilter(version, stream);
            stream = new StopFilter(version, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

            return new TokenStreamComponents(source, stream);
        }
    }

    //-------------------------------------------------

Performing a very simple test results in zero documents found:

    Analyzer analyzer = new FieldAnalyzer(Version.LUCENE_40);
    Directory index = new RAMDirectory();

    String text = "spring-framework";
    String field = "field";

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter w = new IndexWriter(index, config);
    Document doc = new Document();
    doc.add(new TextField(field, text, Field.Store.YES));
    w.addDocument(doc);
    w.close();

    String querystr = "spring-framework";
    Query q = new AnalyzingQueryParser(Version.LUCENE_40, field, analyzer).parse(querystr);

    int hitsPerPage = 10;
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get(field));
    }

Any idea what I've done wrong? If I comment out the addition of WordDelimiterFilter, the search works.

Thanks in advance,

Jeremy
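
One thing that may be worth checking in the analyzer above is how the configuration flags are combined. The sketch below does not use Lucene at all; it mirrors five of the WordDelimiterFilter flags as local constants, and their numeric values are an assumption about Lucene 4.0 (the only property the example relies on is that each flag is a distinct single bit). The class name FlagCombinationDemo and its helper methods are hypothetical. Under that assumption, joining the flags with & yields 0, meaning no options are enabled, whereas | keeps every bit set.

```java
// Hypothetical stand-in for the WordDelimiterFilter flag constants.
// The numeric values are assumed, not read from Lucene; the demo only
// depends on each flag being a distinct power of two.
final class FlagCombinationDemo {
    static final int GENERATE_WORD_PARTS = 1;
    static final int CATENATE_WORDS = 4;
    static final int PRESERVE_ORIGINAL = 32;
    static final int SPLIT_ON_CASE_CHANGE = 64;
    static final int STEM_ENGLISH_POSSESSIVE = 256;

    // Combines the flags the way the analyzer above does (bitwise AND).
    static int combinedWithAnd() {
        return CATENATE_WORDS
                & GENERATE_WORD_PARTS
                & PRESERVE_ORIGINAL
                & SPLIT_ON_CASE_CHANGE
                & STEM_ENGLISH_POSSESSIVE;
    }

    // Combines the same flags with bitwise OR, so every bit stays set.
    static int combinedWithOr() {
        return CATENATE_WORDS
                | GENERATE_WORD_PARTS
                | PRESERVE_ORIGINAL
                | SPLIT_ON_CASE_CHANGE
                | STEM_ENGLISH_POSSESSIVE;
    }

    // Returns true when the given single-bit flag is present in flags.
    static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }

    public static void main(String[] args) {
        System.out.println("AND: " + combinedWithAnd());
        System.out.println("OR:  " + combinedWithOr());
        System.out.println("PRESERVE_ORIGINAL set with AND? "
                + isSet(combinedWithAnd(), PRESERVE_ORIGINAL));
        System.out.println("PRESERVE_ORIGINAL set with OR?  "
                + isSet(combinedWithOr(), PRESERVE_ORIGINAL));
    }
}
```

If the real constants behave the same way, the filter in FieldAnalyzer would be receiving a configuration of 0, which might explain why neither the word parts nor the original token reach the index. Swapping each & for | in the WordDelimiterFilter constructor call would be the corresponding experiment to try.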