lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Search Expansion - one step closer ... !
Date Sun, 04 Apr 2004 17:42:45 GMT
On Apr 4, 2004, at 12:28 PM, hgadm@cswebmail.com wrote:
> Hi Eric, all,

with a 'k' :))

> Several of my terms are in fact keyphrases with 2 or
> more words separated by whitespaces, e.g. 'host
> defense'.

You've not told us how you are indexing.  What field type are you 
using?  From your description it seems you want the text analyzed, 
since it may contain special characters.

These are the types of decisions that really matter when using Lucene.  
My first hunch is that you need a domain-aware analyzer that, when it 
sees "host defense", "Host-Defense", or "Host_DEFENSE", tokenizes each 
of them as "host defense".
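
A minimal sketch of that normalization idea, in plain Java rather than as a real Lucene Analyzer. The class name KeyphraseNormalizer and its methods are hypothetical, and the splitting rule (lowercase, then break on anything that is not a letter or digit) is only an assumption about what "domain-aware" would mean for this data:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: shows the normalization only.
public class KeyphraseNormalizer {

    // Lowercase the text, then split on any run of characters that are
    // neither letters nor digits (whitespace, '-', '_', punctuation).
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {   // a leading separator yields an empty first piece
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Join the tokens with single spaces to get one canonical spelling.
    public static String normalize(String text) {
        return String.join(" ", tokenize(text));
    }

    public static void main(String[] args) {
        for (String s : new String[] {"host defense", "Host-Defense", "Host_DEFENSE"}) {
            System.out.println(s + " -> " + normalize(s));
        }
    }
}
```

The same rule would have to be applied both at index time and at query time for the variants to match each other.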

Or perhaps you need an analyzer that does a floating window of two 
words and bi-grams them into single tokens?
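
As a rough illustration of the two-word window, again in plain Java and not as a Lucene TokenFilter: the class name BigramWindow and the choice of "_" as the joining character are assumptions made for this sketch only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: slides a window of two
// words over an already-tokenized text and emits each pair as one token.
public class BigramWindow {

    public static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        // Stop one short of the end so every window holds exactly two words.
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + "_" + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("innate", "host", "defense", "mechanism");
        System.out.println(bigrams(tokens));
    }
}
```

With tokens like these in the index, a two-word keyphrase could be looked up as a single term, at the cost of a larger index.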

I don't really have any quick and easy answers for you - from what I 
gather, you're asking for domain-specific common sense in the analysis 
process, and Lucene itself makes this possible but does not give it to 
you for free.

You could, perhaps, take an easier way out and run text through an 
Analyzer as you build up your query, without using QueryParser.  Look, 
again, at my AnalysisDemo code in the java.net article.... just pull 
what you need from there to process a TokenStream out of an Analyzer.

	Erik

> They are obviously not handled properly during the
> construction of the boolean query because 'host
> defense' is not found though it is in the field.
> Replacing the whitespace in between the words by an
> underscore ('host_defense', which is recognised by the
> query parser and yields similar results to double
> quoting, e.g. "host defense") did not retrieve it
> either ...
>
> I had to convert to lowercase before sending to this
> function because - unlike in the QueryParser call - no
> analyzer is used at the moment.
> Indexing was done with StandardAnalyzer so I would
> prefer using an analyser at search as well.
> The terms are well formed because they are taken from a
> domain ontology but there could be inconsistencies in
> spelling between what is in the ontology and what is in
> the field, e.g. 'host-defense', which would need
> equivalent handling to 'host defense'. Guess this will
> be dealt with by the analyser - but where do I put it
> within the current code (see below) with boolean query
> generation ?
>
> Any hints ?
> Anyway - thanks a lot so far !
>
> Holger
>
>
> Code follows:
>
>     public String[] doSearchBQ(String index_path, String[] myquery) {
>       // does query processing without QueryParser but by
>       // constructing a boolean query
>       try {
>         Searcher searcher = new IndexSearcher(index_path);
>         Analyzer analyzer = new StandardAnalyzer();
>
>         BooleanQuery query = new BooleanQuery();
>
>         // for each term to add:
>         for (int j = 0; j < myquery.length; j++) {
>           query.add(new TermQuery(new Term("subject", myquery[j])), false, false);
>         }
>
>         Hits hits = searcher.search(query);
>
>         lucene_out = new String[hits.length()];
>         for (int i = 0; i < hits.length(); i++) {
>           Document doc = hits.doc(i);
>           String name = doc.get("filename");
>           lucene_out[i] = name + "|" + doc.get("subject") + "|" + doc.get("message");
>         }
>         searcher.close();
>
>       } catch (Exception e) {
>         System.out.println(" caught a " + e.getClass() +
>                            "\n with message: " + e.getMessage());
>       }
>       return lucene_out;
>     }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org