Date: Tue, 5 Sep 2006 14:06:13 -0700 (PDT)
From: Philip Brown <pmb@us.ibm.com>
To: java-user@lucene.apache.org
Subject: Re: Phrase search using quotes -- special Tokenizer


Here's a little sample program (borrowed some code from Erick Erickson :)). 
Whether I add as TOKENIZED or UN_TOKENIZED seems to make no difference in
the output.  Is this what you'd expect?

- Philip

package com.test;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class Test2 {
    private PerFieldAnalyzerWrapper analyzer = null;
    private RAMDirectory idx = null;

    // StandardAnalyzer by default; KeywordAnalyzer for the "keyword" field.
    private Analyzer getAnalyzer() {
        if (analyzer == null) {
            analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            analyzer.addAnalyzer("keyword", new KeywordAnalyzer());
        }
        return analyzer;
    }

    // Builds an in-memory index of two documents that differ only in
    // booleanField. Swap UN_TOKENIZED for TOKENIZED to compare the two.
    private void makeTestIndex() throws Exception {
        idx = new RAMDirectory();
        IndexWriter writer = new IndexWriter(idx, getAnalyzer(), true);

        Document doc = new Document();
        doc.add(new Field("keyword", "hello world",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("booleanField", "false",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new Field("keyword", "hello world",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("booleanField", "true",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);

        System.out.println(writer.docCount());
        writer.optimize();
        writer.close();
    }

    // Parses the query against the "keyword" field, runs it, and reports
    // whether the hit count matches expectedHits.
    private void doSearch(String query, int expectedHits) throws Exception {
        try {
            QueryParser qp = new QueryParser("keyword", getAnalyzer());
            IndexSearcher srch = new IndexSearcher(idx);
            Query tmp = qp.parse(query);
            // Print the parsed form of the query
            System.out.println("Parsed form is '" + tmp.toString() + "'");
            Hits hits = srch.search(tmp);

            String msg;
            if (hits.length() == expectedHits) {
                msg = "Test passed ";
            } else {
                msg = "************TEST FAILED************ ";
            }
            System.out.println(msg + "Expected " + expectedHits
                    + " hits, got " + hits.length() + " hits");
            srch.close();
        } catch (IOException e) {
            System.out.println("Caught IOException");
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        try {
            Test2 test = new Test2();
            test.makeTestIndex();
            test.doSearch("Hello World", 0);
            test.doSearch("hello world", 0);
            test.doSearch("hello", 0);
            test.doSearch("world", 0);

            test.doSearch("\"Hello World\"", 0);
            test.doSearch("\"hello world\"", 2);
            test.doSearch("\"hello world\" +booleanField:false", 1);
            test.doSearch("\"hello world\" +booleanField:true", 1);
        } catch (Exception e) {
            // Print the full trace; getMessage() alone can be null
            e.printStackTrace();
        }
    }
}
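
As a side illustration of the expectations in main(): the reply quoted
below says the quotes tell QueryParser to hand the text to the analyzer as
one chunk. Here is a minimal sketch of just that chunking step in plain
Java. It is a toy model, not Lucene's QueryParser; the class
QuoteChunkSketch and the method chunks() are made up for illustration, and
real query syntax (field prefixes, +/-, escapes) is ignored.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteChunkSketch {
    // Toy stand-in for the parser's first step (an assumption, not Lucene
    // code): split on whitespace, except inside double quotes.
    static List<String> chunks(String query) {
        List<String> out = new ArrayList<String>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '"') {
                // Quote characters only delimit a chunk; they are not kept
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                // Whitespace outside quotes ends the current chunk
                if (cur.length() > 0) {
                    out.add(cur.toString());
                    cur.setLength(0);
                }
            } else {
                cur.append(c);
            }
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(chunks("hello world"));     // prints [hello, world]
        System.out.println(chunks("\"hello world\"")); // prints [hello world]
    }
}
```

With a keyword-style analyzer each chunk would become one term as-is, which
would be consistent with the unquoted searches above expecting 0 hits and
the quoted lowercase one expecting 2.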


Chris Hostetter wrote:
> 
> 
> : So, if I do as you suggest below (using PerFieldAnalyzerWrapper with
> : StandardAnalyzer) then I still need to enclose in quotes the phrases
> : (keywords with spaces) when I issue the search, and they are only
> : returned
> 
> Yes, quotes will be necessary to tell the QueryParser "this
> is one chunk of text, pass it to the analyzer whole" - but that's so you
> can get the "complex" part of the problem you described... recognizing
> that "my brown-cow" and "red fox" should be matched as separate values
> instead of trying to find one big value containing "my brown-cow red fox"
> 
> : in the results if the case is identical to how it was added?  (This
> : seems to be what I observe anyway.  And whether I add as TOKENIZED or
> : UN_TOKENIZED seems to have no effect.)
> 
> 1) whether case matters is determined entirely by your analyzer; if it
>    produces different tokens for "Blue" and "BLUE" then case matters
> 2) use TOKENIZED or your Analyzer will be completely irrelevant
> 3) if you observe something working differently than you expect, post the
>    code -- we're way past the point of being able to offer you any
>    meaningful help without seeing a self-contained example of what you want
>    to see work.
> 
> 
> 
> -Hoss
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
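
Point 1 in the quoted reply (case sensitivity is decided by the analyzer)
can be sketched in plain Java. This is a rough model under stated
assumptions: the ToyAnalyzer interface and both implementations are
hypothetical stand-ins, not Lucene's Analyzer API. One keeps the input as a
single token with its case intact, the other splits on whitespace and
lowercases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AnalyzerCaseSketch {
    // Hypothetical mini-interface; Lucene's real Analyzer API is different.
    interface ToyAnalyzer {
        List<String> tokens(String text);
    }

    // Keyword-style: the whole input is one token, case untouched.
    static final ToyAnalyzer KEYWORD = new ToyAnalyzer() {
        public List<String> tokens(String text) {
            List<String> out = new ArrayList<String>();
            out.add(text);
            return out;
        }
    };

    // Crude lowercasing analyzer: split on whitespace, lowercase each token.
    static final ToyAnalyzer LOWERCASING = new ToyAnalyzer() {
        public List<String> tokens(String text) {
            List<String> out = new ArrayList<String>();
            for (String t : text.trim().split("\\s+")) {
                if (t.length() > 0) {
                    out.add(t.toLowerCase(Locale.ROOT));
                }
            }
            return out;
        }
    };

    // Case "matters" under an analyzer when the two inputs yield
    // different token lists.
    static boolean sameTokens(ToyAnalyzer a, String x, String y) {
        return a.tokens(x).equals(a.tokens(y));
    }

    public static void main(String[] args) {
        System.out.println(sameTokens(KEYWORD, "Blue", "BLUE"));     // prints false
        System.out.println(sameTokens(LOWERCASING, "Blue", "BLUE")); // prints true
    }
}
```

Under the keyword-style analyzer "Blue" and "BLUE" stay distinct tokens, so
case matters; under the lowercasing one they collapse to the same token.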

-- 
View this message in context: http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6160316
Sent from the Lucene - Java Users forum at Nabble.com.

