Subject: TokenFilter state question
From: Jeremy Long
To: java-user@lucene.apache.org
Date: Wed, 26 Dec 2012 09:08:53 -0500

Hello,

I'm still trying to figure out some of the nuances of Lucene, and I have run into a small issue. I have created my own custom analyzer which uses the WhitespaceTokenizer and chains together the LowerCaseFilter, StopFilter, and my own custom filter (below). I am using this analyzer when searching (i.e. it is the analyzer used in a QueryParser). The custom analyzer's purpose is to add tokens by concatenating the previous word with the current word, so that given "Spring Framework Core" the resulting tokens would be "Spring SpringFramework Framework FrameworkCore Core".

My problem is that when my query text is "Spring Framework Core" I end up with left-over state in my TokenPairConcatenatingFilter (previousWord is a member field). So if I re-use my query parser on a subsequent search for "Apache Struts", I end up with the token stream "CoreApache Apache ApacheStruts Struts". The initial "Core" was left-over state.
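For reference, a quick way to see the tokens the analyzer produces is something like the following (just a sketch; the "product" field name and the sample text are only for illustration, and java.io.StringReader would need to be imported):

Analyzer analyzer = new SearchFieldAnalyzer(Version.LUCENE_40);
//dump each term the analyzer emits for the sample text
TokenStream ts = analyzer.tokenStream("product", new StringReader("Spring Framework Core"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString());
}
ts.end();
ts.close();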
The left-over state from the initial query appears to arise because my initial loop, which should collect all of the tokens from the underlying stream, only collects a single token. So the processing is: we collect the token "spring", write "spring" out to the stream, and move it to previousWord. Next, we are at the end of the stream and have no more words in the list, so the filter returns false. The filter is then called again and "framework" is collected... and so on until the end of the query's tokens is reached; however, "core" is left in the previousWord field. The filter would work correctly, with no state left over, if all of the tokens were collected at the beginning (i.e. on the first call to incrementToken).

Can anyone explain why all of the tokens are not collected, and/or suggest a workaround so that when QueryParser.parse("field:(Spring Framework Core)") is called no residual state is left in my token filter? I have two hack solutions: 1) don't reuse the analyzer/QueryParser for subsequent queries, or 2) build in a reset mechanism to clear the previousWord field (a sketch of the reset approach follows the filter code below). I don't like either solution and was hoping someone on the list might have a suggestion as to what I've done wrong or some feature of Lucene I've missed.

The code is below. Thanks in advance,

Jeremy

//----------------------------------------
// TokenPairConcatenatingFilter

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Takes a TokenStream and adds additional tokens by concatenating pairs of words.
 * <p>
 * Example: "Spring Framework Core" -> "Spring SpringFramework Framework FrameworkCore Core".
 *
 * @author Jeremy Long (jeremy.long@gmail.com)
 */
public final class TokenPairConcatenatingFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private String previousWord = null;
    private LinkedList<String> words = null;

    public TokenPairConcatenatingFilter(TokenStream stream) {
        super(stream);
        words = new LinkedList<String>();
    }

    /**
     * Increments the underlying TokenStream and sets the CharTermAttribute to
     * construct an expanded set of tokens by concatenating tokens with the
     * previous token.
     *
     * @return whether or not we have hit the end of the TokenStream
     * @throws IOException is thrown when an IOException occurs
     */
    @Override
    public boolean incrementToken() throws IOException {
        //collect all the terms into the words collection
        while (input.incrementToken()) {
            String word = new String(termAtt.buffer(), 0, termAtt.length());
            words.add(word);
        }

        //if we have a previousWord - write it out as its own token concatenated
        //with the current word (if one is available)
        if (previousWord != null && words.size() > 0) {
            String word = words.getFirst();
            clearAttributes();
            termAtt.append(previousWord).append(word);
            posIncAtt.setPositionIncrement(0);
            previousWord = null;
            return true;
        }
        //if we have words, write the next one out as a single token
        if (words.size() > 0) {
            String word = words.removeFirst();
            clearAttributes();
            termAtt.append(word);
            previousWord = word;
            return true;
        }
        return false;
    }
}
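The reset mechanism I listed above as hack solution 2 would be roughly the following, added to TokenPairConcatenatingFilter (a sketch only; this is the approach I would rather avoid):

    //hack solution 2 (sketch): clear the leftover state whenever the
    //stream is reset for reuse on a new query
    @Override
    public void reset() throws IOException {
        super.reset();
        previousWord = null;
        words.clear();
    }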
//----------------------------------------
// SearchFieldAnalyzer

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.util.Version;

/**
 * @author Jeremy Long (jeremy.long@gmail.com)
 */
public class SearchFieldAnalyzer extends Analyzer {

    private Version version = null;

    public SearchFieldAnalyzer(Version version) {
        this.version = version;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(version, reader);
        TokenStream stream = source;
        stream = new LowerCaseFilter(version, stream);
        stream = new TokenPairConcatenatingFilter(stream);
        stream = new StopFilter(version, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(source, stream);
    }
}

//----------------------------------------
// The following is a unit test to exercise the above classes and to show the issue:

import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import java.util.Map;
import java.util.HashMap;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import java.io.IOException;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;
import static org.junit.Assert.*;

/**
 * @author Jeremy Long (jeremy.long@gmail.com)
 */
public class FieldAnalyzerTest {

    public FieldAnalyzerTest() {
    }

    @BeforeClass
    public static void setUpClass() throws Exception {
    }

    @AfterClass
    public static void tearDownClass() throws Exception {
    }

    @Before
    public void setUp() {
    }

    @After
    public void tearDown() {
    }

    @Test
    public void testAnalyzers() throws Exception {
        Analyzer analyzer = new FieldAnalyzer(Version.LUCENE_40);
        Directory index = new RAMDirectory();

        String field1 = "product";
        String text1 = "springframework";
        String field2 = "vendor";
        String text2 = "springsource";
        createIndex(analyzer, index, field1, text1, field2, text2);

        //Analyzer searchingAnalyzer = new SearchFieldAnalyzer(Version.LUCENE_40);
        String querystr = "product:(Spring Framework Core) vendor:(SpringSource)";
        Map<String, Analyzer> fieldAnalyzers = new HashMap<String, Analyzer>();
        fieldAnalyzers.put("product", new SearchFieldAnalyzer(Version.LUCENE_40));
        fieldAnalyzers.put("vendor", new SearchFieldAnalyzer(Version.LUCENE_40));
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(Version.LUCENE_40), fieldAnalyzers);

        QueryParser parser = new QueryParser(Version.LUCENE_40, field1, wrapper);
        Query q = parser.parse(querystr);
        System.out.println(q.toString());

        querystr = "product:(Apache Struts) vendor:(Apache)";
        q = parser.parse(querystr);
        System.out.println(q.toString());

        int hitsPerPage = 10;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        assertEquals("Did not find 1 document", 1, hits.length);
    }

    private void createIndex(Analyzer analyzer, Directory index, String field1, String text1,
            String field2, String text2) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        addDoc(w, field1, text1, field2, text2);
        w.close();
    }

    private static void addDoc(IndexWriter w, String field1, String text1, String field2,
            String text2) throws IOException {
        Document doc = new Document();
        doc.add(new TextField(field1, text1, Field.Store.YES));
        doc.add(new TextField(field2, text2, Field.Store.YES));
        w.addDocument(doc);
    }
}
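Hack solution 1 would amount to building fresh per-field analyzers and a new QueryParser for the second query in the test above instead of reusing them; something like the following sketch (the "fresh*" names are just for illustration):

//hack solution 1 (sketch): new per-field analyzers and a new QueryParser
//for every query, so no filter state can carry over between parses
Map<String, Analyzer> freshAnalyzers = new HashMap<String, Analyzer>();
freshAnalyzers.put("product", new SearchFieldAnalyzer(Version.LUCENE_40));
freshAnalyzers.put("vendor", new SearchFieldAnalyzer(Version.LUCENE_40));
PerFieldAnalyzerWrapper freshWrapper = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(Version.LUCENE_40), freshAnalyzers);
QueryParser freshParser = new QueryParser(Version.LUCENE_40, "product", freshWrapper);
Query q2 = freshParser.parse("product:(Apache Struts) vendor:(Apache)");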
//-------------------------------------
// The following is my "FieldAnalyzer" used in the above test case.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.util.Version;

/**
 * @author Jeremy Long (jeremy.long@gmail.com)
 */
public class FieldAnalyzer extends Analyzer {

    private Version version = null;

    public FieldAnalyzer(Version version) {
        this.version = version;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(version, reader);
        TokenStream stream = source;
        stream = new WordDelimiterFilter(stream,
                WordDelimiterFilter.CATENATE_WORDS
                | WordDelimiterFilter.GENERATE_WORD_PARTS
                | WordDelimiterFilter.GENERATE_NUMBER_PARTS
                | WordDelimiterFilter.PRESERVE_ORIGINAL
                | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
                | WordDelimiterFilter.SPLIT_ON_NUMERICS
                | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE, null);
        stream = new LowerCaseFilter(version, stream);
        //stream = new ConcatenateFilter(stream);
        stream = new StopFilter(version, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(source, stream);
    }
}