From: "none none" (Reply-To: korfut@lycos.com)
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Date: Tue, 09 Apr 2002 12:22:50 -0700
Subject: HighLighting Service

Hi all,

I am working on the term-highlighting functionality for Lucene. I followed the suggestions of Maik Schreiber step by step (http://www.iq-computing.de/lucene/highlight.htm) and implemented them with some changes: in his write-up the highlighting was based only on the summary field, while my version reads the whole document (from a cache) with the SelfBufferedStream method into a string, which is then passed to the highlight method.
Some problems show up here:

1. It doesn't work with all Query types, e.g. WildcardQuery, FuzzyQuery, PrefixQuery, PhraseQuery.
2. The response time is not constant. If the documents to highlight are big files (2-4 MB), the average response time per query is about 20 sec for 10 documents of 2 MB each, while for small files it is about 0.6 sec for 10 documents of 20 KB each.

What can we do? Any suggestions?

Some tips:

1. The document must be plain text to get a good result. This means there are two options: either build a text version of the document at runtime (with big documents this is another handicap for response time), or keep a cache of all documents in plain-text form.
2. The highlighting process produces a highlighted version of the entire document, while it would be better to produce just one portion, or two or three. In that case we can save time and resources by cutting the iteration short once we are done.
3. I think we should incorporate this feature into Lucene. Right now, to make this work you have to change some code in the Lucene package, so staying up to date requires changing those parts of the code every time (if they are still there!), because it depends strictly on the Lucene core package.

I attach my version of LuceneTools.java and the code used by the servlet:

    ...
    String brief = "";
    String url = doc.get("url");
    // get the cached plain-text version of the document to highlight
    StringBuffer sb = new StringBuffer();
    StringBuffer sblower = new StringBuffer();
    FileInputStream fis = new FileInputStream(url);
    byte[] b = new byte[1024];
    int effective;
    while ((effective = fis.read(b)) != -1)
    {
        // convert only the bytes actually read on this pass
        String s = new String(b, 0, effective);
        sb.append(s);
        sblower.append(s.toLowerCase());
    }
    fis.close();
    try
    {
        brief = LuceneTools.highlightTerms(sb.toString(), sblower.toString(),
                                           highLighter, query, analyzer);
    }
    catch (Exception e) { e.printStackTrace(); }
    out.println(searchUI.getSearchItem(score, doctitle, url, "..." + brief + "..."));
    ...
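For tip 2, one possible sketch is to cut a bounded window around the first match and pass only that snippet to the highlighter, instead of the whole 2-4 MB document. The class and method names below (SnippetUtil, makeSnippet) are just illustrative, not part of Lucene:

```java
// Hypothetical helper: extract a window of at most maxLength characters
// centred on the first occurrence of a term, so only this snippet has to
// be highlighted instead of the entire document text.
public final class SnippetUtil
{
    private SnippetUtil() {}

    public static String makeSnippet(String text, String term, int maxLength)
    {
        // case-insensitive search for the first occurrence of the term
        int start = text.toLowerCase().indexOf(term.toLowerCase());
        if (start < 0)
            // no match: fall back to the head of the document
            return text.substring(0, Math.min(maxLength, text.length()));

        // centre the window on the match, clamped to the text bounds
        int delta = Math.max(0, (maxLength - term.length()) / 2);
        int from = Math.max(0, start - delta);
        int to = Math.min(text.length(), start + term.length() + delta);
        return text.substring(from, to);
    }

    public static void main(String[] args)
    {
        String doc = "aaaa lucene bbbb";
        System.out.println(makeSnippet(doc, "Lucene", 10));
    }
}
```

Since the iteration stops as soon as the window is built, the cost no longer grows with document size once the first match is found.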
I hope someone can give me some tips so that I can complete this functionality.

Thanks, bye.

[Attachment: LuceneTools.java]

/*
    Lucene-Highlighting - Lucene utilities to highlight terms in texts
    Copyright (C) 2001 Maik Schreiber

    This library is free software; you can redistribute it and/or modify it
    under the terms of the GNU Lesser General Public License as published by
    the Free Software Foundation; either version 2.1 of the License, or (at
    your option) any later version.

    This library is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser
    General Public License for more details.

    You should have received a copy of the GNU Lesser General Public License
    along with this library; if not, write to the Free Software Foundation,
    Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
package org.apache.lucene.demo.highlight;

import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;

/**
 * Contains miscellaneous utility methods for use with Lucene.
 *
 * @version $Id: LuceneTools.java,v 1.5 2001/10/16 07:25:55 mickey Exp $
 * @author Maik Schreiber (mailto: bZ@iq-computing.de)
 */
public final class LuceneTools
{
    private static int MAX_LENGTH = 300;
    private static int MAX_HL_TOKEN = 3;

    /** LuceneTools must not be instantiated directly. */
    private LuceneTools() {}

    /**
     * Highlights a text in accordance to a given query.
     *
     * @param text          text to highlight terms in
     * @param lowerCaseText lower-cased copy of the text (used for phrase matching)
     * @param highlighter   TermHighlighter to use to highlight terms in the text
     * @param query         Query which contains the terms to be highlighted in the text
     * @param analyzer      Analyzer used to construct the Query
     *
     * @return highlighted text
     */
    public static final String highlightTerms(String text, String lowerCaseText,
        TermHighlighter highlighter, Query query, Analyzer analyzer) throws IOException
    {
        long st = System.currentTimeMillis();
        StringBuffer newText = new StringBuffer();
        TokenStream stream = null;
        String term;
        boolean phrase = query instanceof PhraseQuery;
        String qry = query.toString("contents");
        if (phrase)
            term = qry.toLowerCase().substring(1, qry.length() - 1); // strip the quotes
        else
            term = qry;
        System.out.println("Query Type:" + query.getClass());
        try
        {
            int cursorPos = 0;
            int textLength = text.length();
            if (phrase)
            {
                System.out.println("Text Length:" + textLength);
                System.out.println("Query:" + term);
                int termLen = term.length();
                System.out.println("Query Length:" + termLen);
                int startOff = lowerCaseText.indexOf(term, cursorPos);
                if (startOff < 0)
                    startOff = 0; // phrase not found: fall back to the start of the text
                System.out.println("startOff:" + startOff);
                int endOff = startOff + termLen;
                System.out.println("endOff:" + endOff);
                int delta = (MAX_LENGTH - termLen) / 2;
                System.out.println("delta:" + delta);
                int from;
                int to;
                int over = 0;
                if (delta > startOff)
                {
                    from = 0;
                    over = delta - startOff; // window hits the left edge; carry the rest right
                    System.out.println("from:" + from + " over:" + over);
                }
                else
                {
                    from = startOff - delta;
                    System.out.println("from:" + from);
                }
                if (endOff + delta + over > textLength)
                {
                    to = textLength;
                    int miss = (endOff + delta + over) - to; // shortfall on the right edge
                    from = Math.max(0, from - miss);         // take it on the left instead
                    System.out.println("to:" + to + " miss:" + miss);
                }
                else
                {
                    to = endOff + delta + over;
                    System.out.println("to:" + to);
                }
                newText.append(text.substring(from, startOff)
                    + highlighter.highlightTerm(text.substring(startOff, endOff))
                    + text.substring(endOff, to));
            }
            else
            {
                HashSet terms = new HashSet();
                org.apache.lucene.analysis.Token token;
                String tokenText;
                int startOffset;
                int endOffset;
                int lastEndOffset = 0;
                getTerms(query, terms, false);
                stream = analyzer.tokenStream(new StringReader(text));
                while ((token = stream.next()) != null)
                {
                    if (terms.contains(token.termText()))
                    {
                        startOffset = token.startOffset();
                        endOffset = token.endOffset();
                        tokenText = text.substring(startOffset, endOffset);
                        int termLen = tokenText.length();
                        System.out.println("Term Length:" + termLen);
                        System.out.println("Query:" + tokenText);
                        int delta = (MAX_LENGTH - termLen) / 2;
                        System.out.println("delta:" + delta);
                        int from;
                        int to;
                        int over = 0;
                        if (delta > startOffset)
                        {
                            from = 0;
                            over = delta - startOffset;
                            System.out.println("from:" + from + " over:" + over);
                        }
                        else
                        {
                            from = startOffset - delta;
                            System.out.println("from:" + from);
                        }
                        if (endOffset + delta + over > textLength)
                        {
                            to = textLength;
                            int miss = (endOffset + delta + over) - to;
                            from = Math.max(0, from - miss);
                            System.out.println("to:" + to + " miss:" + miss);
                        }
                        else
                        {
                            to = endOffset + delta + over;
                            System.out.println("to:" + to);
                        }
                        newText.append(text.substring(from, startOffset)
                            + highlighter.highlightTerm(tokenText)
                            + text.substring(endOffset, to));
                        break; // only the first matching token is excerpted
                    }
                }
            }
            System.out.println("In Memory HighLighting Time:"
                + (System.currentTimeMillis() - st));
            return newText.toString().trim();
        }
        finally
        {
            if (stream != null)
            {
                try { stream.close(); } catch (Exception e) {}
            }
        }
    }

    /* original highlighting loop (highlights every occurrence, up to MAX_HL_TOKEN):
    getTerms(query, terms, false);
    stream = analyzer.tokenStream(new StringReader(text));
    while ((token = stream.next()) != null)
    {
        startOffset = token.startOffset();
        endOffset = token.endOffset();
        tokenText = text.substring(startOffset, endOffset);

        // append text between end of last token (or beginning of text) and start of current token
        if (startOffset > lastEndOffset)
            newText.append(text.substring(lastEndOffset, startOffset));

        // does query contain current token?
        if (terms.contains(token.termText()))
        {
            tok++;
            newText.append(highlighter.highlightTerm(tokenText));
        }
        else
            newText.append(tokenText);
        lastEndOffset = endOffset;
        if (tok > MAX_HL_TOKEN)
            break;
    }
    if (lastEndOffset < text.length())
        newText.append(text.substring(lastEndOffset));
    */

    /**
     * Extracts all term texts of a given Query. Term texts will be returned in lower-case.
     *
     * @param query      Query to extract term texts from
     * @param terms      HashSet where extracted term texts should be put into (Elements: String)
     * @param prohibited true to extract "prohibited" terms, too
     */
    public static final void getTerms(Query query, HashSet terms, boolean prohibited)
        throws IOException
    {
        if (query instanceof BooleanQuery)
            getTermsFromBooleanQuery((BooleanQuery) query, terms, prohibited);
        else if (query instanceof PhraseQuery)
            getTermsFromPhraseQuery((PhraseQuery) query, terms);
        else if (query instanceof TermQuery)
            getTermsFromTermQuery((TermQuery) query, terms);
        else if (query instanceof PrefixQuery)
            getTermsFromPrefixQuery((PrefixQuery) query, terms, prohibited);
        else if (query instanceof RangeQuery)
            getTermsFromRangeQuery((RangeQuery) query, terms, prohibited);
        else if (query instanceof MultiTermQuery)
            getTermsFromMultiTermQuery((MultiTermQuery) query, terms, prohibited);
    }

    /**
     * Extracts all term texts of a given BooleanQuery. Term texts will be returned in lower-case.
     *
     * @param query      BooleanQuery to extract term texts from
     * @param terms      HashSet where extracted term texts should be put into (Elements: String)
     * @param prohibited true to extract "prohibited" terms, too
     */
    private static final void getTermsFromBooleanQuery(BooleanQuery query, HashSet terms,
        boolean prohibited) throws IOException
    {
        BooleanClause[] queryClauses = query.getClauses();

        for (int i = 0; i < queryClauses.length; i++)
        {
            if (prohibited || !queryClauses[i].prohibited)
                getTerms(queryClauses[i].query, terms, prohibited);
        }
    }

    /**
     * Extracts all term texts of a given PhraseQuery.
     * Term texts will be returned in lower-case.
     *
     * @param query PhraseQuery to extract term texts from
     * @param terms HashSet where extracted term texts should be put into (Elements: String)
     */
    private static final void getTermsFromPhraseQuery(PhraseQuery query, HashSet terms)
    {
        // strip the surrounding quotes; use the same string for both offsets
        String phrase = query.toString("contents");
        terms.add(phrase.toLowerCase().substring(1, phrase.length() - 1));
        /*
        Term[] queryTerms = query.getTerms();
        for (int i = 0; i < queryTerms.length; i++)
            terms.add(getTermsFromTerm(queryTerms[i]));
        */
    }

    /**
     * Extracts all term texts of a given TermQuery. Term texts will be returned in lower-case.
     *
     * @param query TermQuery to extract term texts from
     * @param terms HashSet where extracted term texts should be put into (Elements: String)
     */
    private static final void getTermsFromTermQuery(TermQuery query, HashSet terms)
    {
        terms.add(getTermsFromTerm(query.getTerm()));
    }

    /**
     * Extracts all term texts of a given MultiTermQuery. Term texts will be returned in lower-case.
     *
     * @param query      MultiTermQuery to extract term texts from
     * @param terms      HashSet where extracted term texts should be put into (Elements: String)
     * @param prohibited true to extract "prohibited" terms, too
     */
    private static final void getTermsFromMultiTermQuery(MultiTermQuery query, HashSet terms,
        boolean prohibited) throws IOException
    {
        getTerms(query.getQuery(), terms, prohibited);
    }

    /**
     * Extracts all term texts of a given PrefixQuery. Term texts will be returned in lower-case.
     *
     * @param query      PrefixQuery to extract term texts from
     * @param terms      HashSet where extracted term texts should be put into (Elements: String)
     * @param prohibited true to extract "prohibited" terms, too
     */
    private static final void getTermsFromPrefixQuery(PrefixQuery query, HashSet terms,
        boolean prohibited) throws IOException
    {
        getTerms(query.getQuery(), terms, prohibited);
    }

    /**
     * Extracts all term texts of a given RangeQuery. Term texts will be returned in lower-case.
     *
     * @param query      RangeQuery to extract term texts from
     * @param terms      HashSet where extracted term texts should be put into (Elements: String)
     * @param prohibited true to extract "prohibited" terms, too
     */
    private static final void getTermsFromRangeQuery(RangeQuery query, HashSet terms,
        boolean prohibited) throws IOException
    {
        getTerms(query.getQuery(), terms, prohibited);
    }

    /**
     * Extracts the term of a given Term. The term will be returned in lower-case.
     *
     * @param term Term to extract term from
     *
     * @return the Term's term text
     */
    private static final String getTermsFromTerm(Term term)
    {
        return term.text().toLowerCase();
    }
}