lucene-java-user mailing list archives

From Harini Raghavan <harini.ragha...@insideview.com>
Subject Re: how to control terms to be highlighted?
Date Mon, 05 Dec 2005 11:28:13 GMT
Hi,

I was able to use the Highlighter API to extract the text where the 
keywords occur. However, I am facing another related problem. My 
application downloads the news items to the local server. The indexer 
API parses these HTML files, extracts the content, and stores it in 
the index. The parser extracts all the text in the HTML page, including 
the title, headings, etc. So when the highlighter is run on this 
content, instead of highlighting the keywords in the main content, it 
just shows the title or words found at the beginning of the page.

For example, for the article at
http://biz.yahoo.com/rb/051130/apple.html , the highlighted text is 
something like the following:

  Options Order Book Symbol Lookup Reuters Apple may launch Intel 
  laptops: analyst Wednesday November 30, 10:24 am ET NEW

My requirement is to extract the best fragment/sentence from the news 
article where the keywords appear (similar to Google) and display it 
below the search result. However, the text extracted above is not 
really the best fragment; it seems to be simply the first fragment that 
contains the keywords. Has anyone implemented this kind of 
functionality?

-Harini
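
One way to get a snippet closer to the "best" fragment, rather than the
first one that matches, is to score every candidate fragment and keep the
highest-scoring one. The following is a minimal plain-Java sketch of that
idea only; it is not the Lucene Highlighter implementation. The class name
BestFragmentSketch, the fixed-size word window, and the scoring rule (number
of distinct query terms inside the window) are all assumptions made for
illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: picks the word window containing the most distinct query terms. */
final class BestFragmentSketch {

    /**
     * Returns the window of up to windowSize consecutive words that contains the
     * most distinct query terms. Ties go to the earliest window. Returns "" when
     * no window contains any query term.
     */
    static String bestFragment(String text, Set<String> queryTerms, int windowSize) {
        String[] words = text.trim().split("\\s+");
        int bestStart = -1;
        int bestScore = 0;
        for (int start = 0; start < words.length; start++) {
            int end = Math.min(words.length, start + windowSize);
            Set<String> seen = new HashSet<>();
            for (int i = start; i < end; i++) {
                // Lower-case and strip punctuation so "laptops:" matches "laptops".
                String normalized = words[i].toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
                if (queryTerms.contains(normalized)) {
                    seen.add(normalized);
                }
            }
            if (seen.size() > bestScore) {
                bestScore = seen.size();
                bestStart = start;
            }
        }
        if (bestStart < 0) {
            return "";
        }
        int end = Math.min(words.length, bestStart + windowSize);
        return String.join(" ", Arrays.copyOfRange(words, bestStart, end));
    }

    public static void main(String[] args) {
        // Hypothetical page text: navigation words first, article body second.
        String text = "Symbol Lookup Reuters Apple may launch laptops. "
                + "Analysts expect Apple to launch Intel laptops early next year.";
        Set<String> terms = new HashSet<>(Arrays.asList("apple", "intel", "laptops"));
        System.out.println(bestFragment(text, terms, 6));
    }
}
```

Because the score counts distinct terms, a window from the article body that
mentions all of the keywords beats a title or navigation window that only
mentions some of them, even though the latter appears earlier in the page.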



Harini Raghavan wrote:

> Hi Chris,
>
> Can we pass a different query object for searching and a different one 
> to the highlighter? I am not sure of that.
> In any case, based on Mark's suggestion, I modified the 
> QueryTermExtractor class to filter the query terms by fieldName.
> The modified file is attached.
>
> Thanks,
> Harini
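
The fieldName filter mentioned above boils down to a simple rule: a query
term is handed to the highlighter only if it belongs to the requested field,
and an empty field name keeps every term. The sketch below shows that rule
in plain Java, without Lucene. FieldFilterSketch, FieldTerm and
termsToHighlight are hypothetical names used only for illustration;
FieldTerm is a stand-in for a query term, not Lucene's Term class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: filters query terms by field before highlighting. */
final class FieldFilterSketch {

    /** Hypothetical stand-in for a (field, text) query term. */
    static final class FieldTerm {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    /**
     * Returns the text of every term whose field equals fieldName.
     * An empty fieldName disables the filter and keeps all terms.
     */
    static List<String> termsToHighlight(List<FieldTerm> queryTerms, String fieldName) {
        List<String> result = new ArrayList<>();
        for (FieldTerm term : queryTerms) {
            if (fieldName.isEmpty() || term.field.equals(fieldName)) {
                result.add(term.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<FieldTerm> queryTerms = Arrays.asList(
                new FieldTerm("DocumentType", "news"),
                new FieldTerm("CompanyId", "10"),
                new FieldTerm("Content", "outsource"),
                new FieldTerm("Content", "downsize"));
        // Only the Content terms remain; 'news' and '10' are no longer highlighted.
        System.out.println(termsToHighlight(queryTerms, "Content"));
    }
}
```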
>
>
>
> Chris Hostetter wrote:
>
>> I don't know what your application is, and I have no experience with the
>> Highlighter code, so forgive me if this is a silly suggestion:
>>
>> It looks like you are building a query up programmatically, which
>> contains some words to search on, and some other stuff that's mainly
>> being used to "filter" the results (I'll avoid my usual rant about
>> people underutilizing Filters).  So why not pass the Highlighter just
>> the portion of the Query that you actually want to contribute to the
>> highlighting?  In this query...
>>
>> : >> +DocumentType:news
>> : >> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
>> : >> +FilingDate:[20041201 TO 20051201]
>> : >> +(Content:"cost saving" Content:"cost savings"
>> : >>   Content:outsource Content:outsources
>> : >>   Content:downsize Content:downsizes
>> : >>   Content:restructuring Content:restructure)
>>
>> ...just give the highlighter...
>>
>>    (Content:"cost saving" Content:"cost savings"
>>     Content:outsource
>>     Content:outsources Content:downsize
>>     Content:downsizes
>>     Content:restructuring Content:restructure)
>>
>>
>> : Date: Thu, 01 Dec 2005 10:38:41 +0530
>> : From: Harini Raghavan <harini.raghavan@insideview.com>
>> : Reply-To: java-user@lucene.apache.org
>> : To: java-user@lucene.apache.org
>> : Subject: Re: how to control terms to be highlighted?
>> :
>> : Hi Mark,
>> :
>> : It would be great if you can make this change and send the
>> : QueryTermExtractor class. I am invoking the QueryScorer(Query)
>> : constructor. Should I use QueryScorer(Query query, IndexReader reader,
>> : String fieldName) instead for this to work?
>> :
>> : Thanks,
>> : Harini
>> :
>> : mark harwood wrote:
>> :
>> : >>Is there any way to restrict the highlighter to
>> : >>highlight only the values
>> : >>mentioned for the field 'Content'?
>> : >
>> : >The problem lies in the QueryTermExtractor class,
>> : >which is typically used to provide the Highlighter
>> : >with the list of strings to identify in the text. It
>> : >currently has no filter for fieldname - you could add
>> : >this without too much effort.
>> : >
>> : >I could make this modification, but it may change the
>> : >behaviour of existing applications. Currently the
>> : >QueryTermExtractor method that takes a fieldname only
>> : >uses that fieldname to derive IDF weightings; the
>> : >proposed change would also have the effect of
>> : >filtering out any query terms that weren't for this
>> : >field.
>> : >Would this change be a problem for anyone?
>> : >
>> : >Cheers,
>> : >Mark
>> : >
>> : >--- Harini Raghavan <harini.raghavan@insideview.com>
>> : >wrote:
>> : >
>> : >
>> : >
>> : >>Hi,
>> : >>
>> : >>I have a requirement to highlight search keywords in the results
>> : >>and display the matching fragment of the text with the results.
>> : >>I am using the Hits highlighting described in Lucene in Action.
>> : >>
>> : >>Here is the search query (a BooleanQuery) I am passing to the
>> : >>IndexSearcher and QueryScorer:
>> : >> +DocumentType:news
>> : >> +(CompanyId:10 CompanyId:20 CompanyId:30 CompanyId:40)
>> : >> +FilingDate:[20041201 TO 20051201]
>> : >> +(Content:"cost saving" Content:"cost savings"
>> : >>   Content:outsource Content:outsources
>> : >>   Content:downsize Content:downsizes
>> : >>   Content:restructuring Content:restructure)
>> : >>
>> : >>My requirement is to highlight only the keywords for the
>> : >>'Content' field, but the highlighter API is also highlighting
>> : >>words like 'news', '10', '40', etc.
>> : >>Is there any way to restrict the highlighter to highlight only
>> : >>the values mentioned for the field 'Content'?
>> : >>
>> : >>Thanks,
>> : >>Harini
>> : >>
>> : >>---------------------------------------------------------------------
>> : >>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> : >>For additional commands, e-mail: java-user-help@lucene.apache.org
>> : >
>> :
>>
>>
>>
>> -Hoss
>>
>>
>------------------------------------------------------------------------
>
>package org.apache.lucene.search.highlight;
>/**
> * Copyright 2002-2004 The Apache Software Foundation
> *
> * Licensed under the Apache License, Version 2.0 (the "License");
> * you may not use this file except in compliance with the License.
> * You may obtain a copy of the License at
> *
> *     http://www.apache.org/licenses/LICENSE-2.0
> *
> * Unless required by applicable law or agreed to in writing, software
> * distributed under the License is distributed on an "AS IS" BASIS,
> * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> * See the License for the specific language governing permissions and
> * limitations under the License.
> */
>
>import java.io.IOException;
>import java.util.Collection;
>import java.util.HashSet;
>import java.util.Iterator;
>
>import org.apache.lucene.index.IndexReader;
>import org.apache.lucene.index.Term;
>import org.apache.lucene.search.BooleanClause;
>import org.apache.lucene.search.BooleanQuery;
>import org.apache.lucene.search.PhraseQuery;
>import org.apache.lucene.search.Query;
>import org.apache.lucene.search.TermQuery;
>import org.apache.lucene.search.spans.SpanNearQuery;
>
>/**
> * Utility class used to extract the terms used in a query, plus any weights.
> * This class will not find terms for MultiTermQuery, RangeQuery and PrefixQuery classes
> * so the caller must pass a rewritten query (see Query.rewrite) to obtain a list of
> * expanded terms.
> *
> */
>public final class QueryTermExtractor
>{
>
>	/**
>	 * Extracts all terms texts of a given Query into an array of WeightedTerms
>	 *
>	 * @param query      Query to extract term texts from
>	 * @return an array of the terms used in a query, plus their weights.
>	 */
>	public static final WeightedTerm[] getTerms(Query query)
>	{
>		return getTerms(query,false,"");
>	}
>
>	/**
>	 * Extracts all terms texts of a given Query into an array of WeightedTerms
>	 *
>	 * @param query      Query to extract term texts from
>	 * @param reader used to compute IDF which can be used to a) score selected fragments better
>	 * b) use graded highlights eg changing intensity of font color
>	 * @param fieldName the field on which Inverse Document Frequency (IDF) calculations are based
>	 * @return an array of the terms used in a query, plus their weights.
>	 */
>	public static final WeightedTerm[] getIdfWeightedTerms(Query query, IndexReader reader, String fieldName)
>	{
>	    WeightedTerm[] terms=getTerms(query,false,fieldName);
>	    int totalNumDocs=reader.numDocs();
>	    for (int i = 0; i < terms.length; i++)
>        {
>	        try
>            {
>                int docFreq=reader.docFreq(new Term(fieldName,terms[i].term));
>                //IDF algorithm taken from DefaultSimilarity class
>                float idf=(float)(Math.log((float)totalNumDocs/(double)(docFreq+1)) + 1.0);
>                terms[i].weight*=idf;
>            }
>	        catch (IOException e)
>            {
>	            //ignore
>            }
>        }
>		return terms;
>	}
>
>	/**
>	 * Extracts all terms texts of a given Query into an array of WeightedTerms
>	 *
>	 * @param query      Query to extract term texts from
>	 * @param prohibited <code>true</code> to extract "prohibited" terms, too
>	 * @param fieldName  if non-empty, only terms in this field are extracted; pass "" to extract terms from all fields
>	 * @return an array of the terms used in a query, plus their weights.
>	 */
>	public static final WeightedTerm[] getTerms(Query query, boolean prohibited, String fieldName)
>	{
>		HashSet terms=new HashSet();
>		getTerms(query,terms,prohibited,fieldName);
>		return (WeightedTerm[]) terms.toArray(new WeightedTerm[0]);
>	}
>
>	private static final void getTerms(Query query, HashSet terms,boolean prohibited, String fieldName)
>	{
>		if (query instanceof BooleanQuery)
>			getTermsFromBooleanQuery((BooleanQuery) query, terms, prohibited, fieldName);
>		else
>			if (query instanceof PhraseQuery)
>				getTermsFromPhraseQuery((PhraseQuery) query, terms, fieldName);
>			else
>				if (query instanceof TermQuery)
>					getTermsFromTermQuery((TermQuery) query, terms, fieldName);
>				else
>		        if(query instanceof SpanNearQuery)
>		            getTermsFromSpanNearQuery((SpanNearQuery) query, terms, fieldName);
>	}
>
>	private static final void getTermsFromBooleanQuery(BooleanQuery query, HashSet terms, boolean prohibited, String fieldName)
>	{
>		BooleanClause[] queryClauses = query.getClauses();
>		int i;
>
>		for (i = 0; i < queryClauses.length; i++)
>		{
>			if (prohibited || !queryClauses[i].prohibited)
>				getTerms(queryClauses[i].query, terms, prohibited, fieldName);
>		}
>	}
>
>	private static final void getTermsFromPhraseQuery(PhraseQuery query, HashSet terms, String fieldName)
>	{
>		Term[] queryTerms = query.getTerms();
>		int i;
>		String field;
>
>		for (i = 0; i < queryTerms.length; i++)
>		{
>			if(fieldName.equals(""))
>				terms.add(new WeightedTerm(query.getBoost(),queryTerms[i].text()));
>			else {
>				field = queryTerms[i].field();
>				if(field.equals(fieldName))
>					terms.add(new WeightedTerm(query.getBoost(),queryTerms[i].text()));
>			}
>		}
>	}
>
>	private static final void getTermsFromTermQuery(TermQuery query, HashSet terms, String fieldName)
>	{
>		String field = query.getTerm().field();
>		if(fieldName.equals(""))
>			terms.add(new WeightedTerm(query.getBoost(),query.getTerm().text()));
>		else if(field.equals(fieldName)) {
>			terms.add(new WeightedTerm(query.getBoost(),query.getTerm().text()));
>		}
>	}
>
>    private static final void getTermsFromSpanNearQuery(SpanNearQuery query, HashSet terms, String fieldName){
>
>        Collection queryTerms = query.getTerms();
>
>        for(Iterator iterator = queryTerms.iterator(); iterator.hasNext();){
>
>            // break it out for debugging.
>
>            Term term = (Term) iterator.next();
>
>            String text = term.text();
>
>			String field = term.field();
>
>			if(fieldName.equals(""))
>				terms.add(new WeightedTerm(query.getBoost(), text));
>			else if(field.equals(fieldName)) {
>        	    terms.add(new WeightedTerm(query.getBoost(), text));
>			}
>
>        }
>
>    }
>
>}
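
The IDF weighting in getIdfWeightedTerms above multiplies each term's weight
by ln(numDocs / (docFreq + 1)) + 1, so terms that occur in few documents
receive a larger weight than common ones. A small standalone sketch of that
arithmetic follows; IdfWeightSketch and its idf method are hypothetical
names for illustration, and the document counts are made-up inputs.

```java
/** Illustrative only: the IDF formula used in the attached class, in isolation. */
final class IdfWeightSketch {

    /** Computes ln(totalNumDocs / (docFreq + 1)) + 1 as a float. */
    static float idf(int totalNumDocs, int docFreq) {
        return (float) (Math.log((double) totalNumDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        int totalNumDocs = 1000; // made-up index size
        // A rare term (in 9 documents) versus a term found in almost every document.
        System.out.println("rare term idf:   " + idf(totalNumDocs, 9));
        System.out.println("common term idf: " + idf(totalNumDocs, 999));
    }
}
```

With these inputs the rare term gets ln(100) + 1 and the near-universal term
gets ln(1) + 1 = 1, which is why rarer keywords contribute more to a
fragment's score.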
>
>
>  
>
>------------------------------------------------------------------------
>
