lucene-java-user mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: how to control terms to be highlighted?
Date Fri, 02 Dec 2005 12:34:17 GMT
Hi Harini, 
I updated QueryTermExtractor in Subversion last night
to support your requirement.

The JUnit test is also updated with a field-specific
example.
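
For anyone reading along without the source to hand, the field filter
boils down to something like the sketch below. It is only an illustration
and not the code committed to Subversion: FieldTerm and FieldFilterSketch
are made-up stand-ins rather than Lucene classes, and the convention that
an empty field name means "no filter" is borrowed from the attached file
quoted further down.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a (field, text) query term; not a Lucene class.
final class FieldTerm {
    final String field;
    final String text;

    FieldTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

final class FieldFilterSketch {
    // Returns the term texts a highlighter should look for.
    // An empty fieldName keeps terms from every field (assumed convention).
    static Set<String> termsToHighlight(List<FieldTerm> queryTerms, String fieldName) {
        Set<String> texts = new LinkedHashSet<String>();
        for (FieldTerm t : queryTerms) {
            if (fieldName.isEmpty() || t.field.equals(fieldName)) {
                texts.add(t.text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // A cut-down version of the query from this thread.
        List<FieldTerm> terms = new ArrayList<FieldTerm>();
        terms.add(new FieldTerm("DocumentType", "news"));
        terms.add(new FieldTerm("CompanyId", "10"));
        terms.add(new FieldTerm("Content", "outsource"));
        terms.add(new FieldTerm("Content", "downsize"));

        // Only the Content terms survive the filter.
        System.out.println(termsToHighlight(terms, "Content"));
        // With no field name, every term is kept.
        System.out.println(termsToHighlight(terms, ""));
    }
}
```

With the filter set to "Content", terms such as 'news' and '10' never
reach the highlighter, which is the behaviour asked for in this thread.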

Cheers,
Mark


--- Harini Raghavan <harini.raghavan@insideview.com> wrote:

> Hi Chris,
> 
> Can we pass one query object to the searcher and a
> different one to the highlighter? I am not sure about that.
> In any case, based on Mark's suggestion I modified the
> QueryTermExtractor class to filter the query terms by
> fieldName. Attached is the modified file.
> 
> Thanks,
> Harini
> 
> 
> 
> Chris Hostetter wrote:
> 
> >I don't know what your application is, and I have no experience with the
> >Highlighter code, so forgive me if this is a silly suggestion:
> >
> >It looks like you are building a query up programmatically, which
> >contains some words to search on, and some other stuff that's mainly
> >being used to "filter" the results (I'll avoid my usual rant about
> >people underutilizing Filters).  So why not pass the Highlighter just
> >the portion of the Query that you actually want to contribute to the
> >highlighting?  In this query...
> >
> >: >> +DocumentType:news
> >: >> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
> >: >> +FilingDate:[20041201 TO 20051201]
> >: >> +(Content:"cost saving" Content:"cost savings" Content:outsource
> >: >>   Content:outsources Content:downsize Content:downsizes
> >: >>   Content:restructuring Content:restructure)
> >
> >...just give the highlighter...
> >
> >    (Content:"cost saving" Content:"cost savings"
> >     Content:outsource
> >     Content:outsources Content:downsize
> >     Content:downsizes
> >     Content:restructuring Content:restructure)
> >
> >
> >: Date: Thu, 01 Dec 2005 10:38:41 +0530
> >: From: Harini Raghavan <harini.raghavan@insideview.com>
> >: Reply-To: java-user@lucene.apache.org
> >: To: java-user@lucene.apache.org
> >: Subject: Re: how to control terms to be highlighted?
> >:
> >: Hi Mark,
> >:
> >: It would be great if you could make this change and send the
> >: QueryTermExtractor class. I am invoking the QueryScorer(Query)
> >: constructor. Should I use QueryScorer(Query query, IndexReader reader,
> >: String fieldName) instead for this to work?
> >:
> >: Thanks,
> >: Harini
> >:
> >: mark harwood wrote:
> >:
> >: >>Is there any way to restrict the highlighter to
> >: >>highlight only the values
> >: >>mentioned for the field 'Content'?
> >: >>
> >: >
> >: >The problem lies in the QueryTermExtractor class
> >: >which is typically used to provide the Highlighter
> >: >with the list of strings to identify in the text. It
> >: >currently has no filter for fieldname - you could add
> >: >this without too much effort.
> >: >
> >: >I could make this modification but it may change the
> >: >behaviour of existing applications - currently the
> >: >QueryTermExtractor method that takes a fieldname only
> >: >uses that fieldname to derive IDF weightings; the
> >: >proposed change would also have the effect of
> >: >filtering out any query terms that weren't for this
> >: >field.
> >: >Would this change be a problem for anyone?
> >: >
> >: >Cheers,
> >: >Mark
> >: >
> >: >--- Harini Raghavan <harini.raghavan@insideview.com> wrote:
> >: >
> >: >>Hi,
> >: >>
> >: >>I have a requirement to highlight search keywords in
> >: >>the results and display the matching fragment of the
> >: >>text with the results. I am using the Hits highlighting
> >: >>mentioned in Lucene in Action.
> >: >>
> >: >>Here is the search query (BooleanQuery) I am passing
> >: >>to the IndexSearcher and QueryScorer:
> >: >> +DocumentType:news
> >: >> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
> >: >> +FilingDate:[20041201 TO 20051201]
> >: >> +(Content:"cost saving" Content:"cost savings" Content:outsource
> >: >>   Content:outsources Content:downsize Content:downsizes
> >: >>   Content:restructuring Content:restructure)
> >: >>
> >: >>My requirement is to highlight only the keywords for the
> >: >>'Content' field, but the highlighter API is also
> >: >>highlighting words like 'news', '10', '40' etc.
> >: >>Is there any way to restrict the highlighter to
> >: >>highlight only the values mentioned for the field 'Content'?
> >: >>
> >: >>Thanks,
> >: >>Harini
> >: >>
> >: >>---------------------------------------------------------------------
> >: >>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >: >>For additional commands, e-mail: java-user-help@lucene.apache.org
> >: >>
=== message truncated ===
> package org.apache.lucene.search.highlight;
> /**
>  * Copyright 2002-2004 The Apache Software Foundation
>  *
>  * Licensed under the Apache License, Version 2.0 (the "License");
>  * you may not use this file except in compliance with the License.
>  * You may obtain a copy of the License at
>  *
>  *     http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
> 
> import java.io.IOException;
> import java.util.Collection;
> import java.util.HashSet;
> import java.util.Iterator;
> 
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.BooleanClause;
> import org.apache.lucene.search.BooleanQuery;
> import org.apache.lucene.search.PhraseQuery;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.search.spans.SpanNearQuery;
> 
> /**
>  * Utility class used to extract the terms used in a query, plus any weights.
>  * This class will not find terms for MultiTermQuery, RangeQuery and PrefixQuery
>  * classes, so the caller must pass a rewritten query (see Query.rewrite) to
>  * obtain a list of expanded terms.
>  */
> public final class QueryTermExtractor
> {
> 
> 	/**
> 	 * Extracts all term texts of a given Query into an array of WeightedTerms
> 	 *
> 	 * @param query      Query to extract term texts from
> 	 * @return an array of the terms used in a query, plus their weights.
> 	 */
> 	public static final WeightedTerm[] getTerms(Query query)
> 	{
> 		return getTerms(query,false,"");
> 	}
> 
> 	/**
> 	 * Extracts all term texts of a given Query into an array of WeightedTerms
> 	 *
> 	 * @param query      Query to extract term texts from
> 	 * @param reader     used to compute IDF which can be used to a) score selected
> 	 *                   fragments better b) use graded highlights eg changing
> 	 *                   intensity of font color
> 	 * @param fieldName  the field on which Inverse Document Frequency (IDF)
> 	 *                   calculations are based
> 	 * @return an array of the terms used in a query, plus their weights.
> 	 */
> 	public static final WeightedTerm[] getIdfWeightedTerms(Query query, IndexReader reader, String fieldName)
> 	{
> 		WeightedTerm[] terms=getTerms(query,false,fieldName);
> 		int totalNumDocs=reader.numDocs();
> 		for (int i = 0; i < terms.length; i++)
> 		{
> 			try
> 			{
> 				int docFreq=reader.docFreq(new Term(fieldName,terms[i].term));
> 				//IDF algorithm taken from DefaultSimilarity class
> 				float idf=(float)(Math.log((float)totalNumDocs/(double)(docFreq+1)) + 1.0);
> 				terms[i].weight*=idf;
> 			}
> 			catch (IOException e)
> 			{
> 				//ignore
> 			}
> 		}
> 		return terms;
> 	}
> 
> 	/**
> 	 * Extracts all term texts of a given Query into an array of WeightedTerms
> 	 *
> 	 * @param query      Query to extract term texts from
> 	 * @param prohibited <code>true</code> to extract "prohibited" terms, too
> 	 * @param fieldName  field to restrict the extracted terms to, or "" for all fields
> 	 * @return an array of the terms used in a query, plus their weights.
> 	 */
> 	public static final WeightedTerm[] getTerms(Query query, boolean prohibited, String fieldName)
> 	{
> 		HashSet terms=new HashSet();
> 		getTerms(query,terms,prohibited,fieldName);
> 		return (WeightedTerm[]) terms.toArray(new WeightedTerm[0]);
> 	}
> 
> 	private static final void getTerms(Query query, HashSet terms,boolean prohibited, String fieldName)
> 	{
> 		if (query instanceof BooleanQuery)
> 			getTermsFromBooleanQuery((BooleanQuery) query, terms, prohibited, fieldName);
> 		else if (query instanceof PhraseQuery)
> 			getTermsFromPhraseQuery((PhraseQuery) query, terms, fieldName);
> 		else if (query instanceof TermQuery)
> 			getTermsFromTermQuery((TermQuery) query, terms, fieldName);
> 		else if (query instanceof SpanNearQuery)
> 			getTermsFromSpanNearQuery((SpanNearQuery) query, terms, fieldName);
> 	}
> 
> 	private static final void getTermsFromBooleanQuery(BooleanQuery query, HashSet terms, boolean prohibited, String fieldName)
> 	{
> 		BooleanClause[] queryClauses = query.getClauses();
> 		int i;
> 
> 		for (i = 0; i < queryClauses.length; i++)
> 		{
> 			if (prohibited || !queryClauses[i].prohibited)
> 				getTerms(queryClauses[i].query, terms, prohibited, fieldName);
> 		}
> 	}
> 
> 	private static final void getTermsFromPhraseQuery(PhraseQuery query, HashSet terms, String fieldName)
> 	{
> 		Term[] queryTerms = query.getTerms();
> 		int i;
> 		String field;
> 
> 		for (i = 0; i < queryTerms.length; i++)
> 		{
> 			if(fieldName.equals(""))
> 				terms.add(new WeightedTerm(query.getBoost(),queryTerms[i].text()));
> 			else {
> 				field = queryTerms[i].field();
> 				if(field.equals(fieldName))
> 					terms.add(new WeightedTerm(query.getBoost(),queryTerms[i].text()));
> 			}
> 		}
> 	}
> 
> 	private static final void getTermsFromTermQuery(TermQuery query, HashSet terms, String fieldName)
> 	{
> 		String field = query.getTerm().field();
> 		if(fieldName.equals(""))
> 			terms.add(new WeightedTerm(query.getBoost(),query.getTerm().text()));
> 		else if(field.equals(fieldName)) {
> 			terms.add(new WeightedTerm(query.getBoost(),query.getTerm().text()));
> 		}
> 	}
> 
> 
=== message truncated ===
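
The IDF weighting in the quoted getIdfWeightedTerms loop reduces to
idf = ln(numDocs / (docFreq + 1)) + 1, multiplied into each term's
weight. A minimal standalone sketch of that arithmetic follows; IdfSketch
is a hypothetical helper (not part of Lucene) and the document counts are
made up purely for illustration.

```java
// Hypothetical helper reproducing the arithmetic from the quoted loop;
// not a Lucene class.
final class IdfSketch {
    // ln(numDocs / (docFreq + 1)) + 1, narrowed to float as in the quoted code.
    static float idf(int numDocs, int docFreq) {
        return (float) (Math.log((float) numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        float boost = 1.0f;       // query boost, the starting weight of a term
        int numDocs = 100;        // made-up index size
        int rareDocFreq = 1;      // made-up: term occurs in 1 document
        int commonDocFreq = 49;   // made-up: term occurs in 49 documents

        // Rarer terms end up with the larger weight.
        System.out.println("rare:   " + boost * idf(numDocs, rareDocFreq));
        System.out.println("common: " + boost * idf(numDocs, commonDocFreq));
    }
}
```

Because the document frequency is looked up under the given field name,
a term that belongs to a different field can pick up a misleading weight,
which is one more reason to filter by field before weighting.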