From: "none none" (Reply-To: korfut@lycos.com)
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Date: Tue, 09 Apr 2002 12:22:50 -0700
Subject: HighLighting Service

Hi all,

I am working on the term-highlighting functionality for Lucene. I followed the suggestions of Maik Schreiber step by step (http://www.iq-computing.de/lucene/highlight.htm) and implemented them with some changes: in his write-up the highlighting was based only on the summary field, while my version reads the whole document (from a cache) with the SelfBufferedStream method into a string, which is then passed to the highlight method.
Some problems show up here:

1. It doesn't work with all Query types, e.g. WildcardQuery, FuzzyQuery, PrefixQuery, PhraseQuery.
2. The response time is not constant. If the documents to highlight are big files (2-4 MB), the average response time per query is about 20 sec for 10 documents of 2 MB each, while for small files it is about 0.6 sec for 10 documents of 20 KB each.

What can we do? Any suggestions?

Some tips:

1. The document must be plain text to get a good result. This means there are two options: either build a text version of the document at runtime (with big documents this is another handicap for response time), or keep a cache of all documents in plain-text form.
2. The highlighting process produces a highlighted version of the entire document, while it would be better to produce just one portion, or two or three. In that case we can save time and resources by cutting the iteration short once we are done.
3. I think we should incorporate this feature into Lucene. Right now, to make this work you have to change some code in the Lucene package, so staying up to date requires changing those parts of the code every time (if they are still there!), because it depends strictly on the Lucene core package.

I attach my version of LuceneTools.java and the code used by the servlet:

    ...
    String brief = "";
    String url = doc.get("url");
    // get the cached plain-text version of the document to highlight
    StringBuffer sb = new StringBuffer();
    StringBuffer sblower = new StringBuffer();
    FileInputStream fis = new FileInputStream(url);
    byte[] b = new byte[1024];
    int effective;
    while ((effective = fis.read(b)) != -1)
    {
        // convert only the bytes actually read on this pass
        String s = new String(b, 0, effective);
        sb.append(s);
        sblower.append(s.toLowerCase());
    }
    fis.close();
    try
    {
        brief = LuceneTools.highlightTerms(sb.toString(), sblower.toString(),
                                           highLighter, query, analyzer);
    }
    catch (Exception e) { e.printStackTrace(); }
    out.println(searchUI.getSearchItem(score, doctitle, url, "..." + brief + "..."));
    ...
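For tip 2, one possible sketch is to cut a bounded window around the first match and pass only that snippet to the highlighter, instead of the whole 2-4 MB document. The class and method names below (SnippetUtil, makeSnippet) are just illustrative, not part of Lucene:

```java
// Hypothetical helper: extract a window of at most maxLength characters
// centred on the first occurrence of a term, so only this snippet has to
// be highlighted instead of the entire document text.
public final class SnippetUtil
{
    private SnippetUtil() {}

    public static String makeSnippet(String text, String term, int maxLength)
    {
        // case-insensitive search for the first occurrence of the term
        int start = text.toLowerCase().indexOf(term.toLowerCase());
        if (start < 0)
            // no match: fall back to the head of the document
            return text.substring(0, Math.min(maxLength, text.length()));

        // centre the window on the match, clamped to the text bounds
        int delta = Math.max(0, (maxLength - term.length()) / 2);
        int from = Math.max(0, start - delta);
        int to = Math.min(text.length(), start + term.length() + delta);
        return text.substring(from, to);
    }

    public static void main(String[] args)
    {
        String doc = "aaaa lucene bbbb";
        System.out.println(makeSnippet(doc, "Lucene", 10));
    }
}
```

Since the iteration stops as soon as the window is built, the cost no longer grows with document size once the first match is found.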
I hope someone can give me some tips so that I can complete this functionality.

Thanks, bye.

[Attachment: LuceneTools.java]

/*
    Lucene-Highlighting - Lucene utilities to highlight terms in texts
    Copyright (C) 2001 Maik Schreiber

    This library is free software; you can redistribute it and/or modify it
    under the terms of the GNU Lesser General Public License as published by
    the Free Software Foundation; either version 2.1 of the License, or (at
    your option) any later version.

    This library is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser
    General Public License for more details.

    You should have received a copy of the GNU Lesser General Public License
    along with this library; if not, write to the Free Software Foundation,
    Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
package org.apache.lucene.demo.highlight;

import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;

/**
 * Contains miscellaneous utility methods for use with Lucene.
 *
 * @version $Id: LuceneTools.java,v 1.5 2001/10/16 07:25:55 mickey Exp $
 * @author Maik Schreiber (mailto: bZ@iq-computing.de)
 */
public final class LuceneTools
{
    private static int MAX_LENGTH = 300;
    private static int MAX_HL_TOKEN = 3;

    /** LuceneTools must not be instantiated directly. */
    private LuceneTools() {}

    /**
     * Highlights a text in accordance to a given query.
     *
     * @param text          text to highlight terms in
     * @param lowerCaseText lower-cased copy of the text (used for phrase matching)
     * @param highlighter   TermHighlighter to use to highlight terms in the text
     * @param query         Query which contains the terms to be highlighted in the text
     * @param analyzer      Analyzer used to construct the Query
     *
     * @return highlighted text
     */
    public static final String highlightTerms(String text, String lowerCaseText,
        TermHighlighter highlighter, Query query, Analyzer analyzer) throws IOException
    {
        long st = System.currentTimeMillis();
        StringBuffer newText = new StringBuffer();
        TokenStream stream = null;
        String term;
        boolean phrase = query instanceof PhraseQuery;
        String qry = query.toString("contents");
        if (phrase)
            term = qry.toLowerCase().substring(1, qry.length() - 1); // strip the quotes
        else
            term = qry;
        System.out.println("Query Type:" + query.getClass());
        try
        {
            int cursorPos = 0;
            int textLength = text.length();
            if (phrase)
            {
                System.out.println("Text Length:" + textLength);
                System.out.println("Query:" + term);
                int termLen = term.length();
                System.out.println("Query Length:" + termLen);
                int startOff = lowerCaseText.indexOf(term, cursorPos);
                if (startOff < 0)
                    startOff = 0; // phrase not found: fall back to the start of the text
                System.out.println("startOff:" + startOff);
                int endOff = startOff + termLen;
                System.out.println("endOff:" + endOff);
                int delta = (MAX_LENGTH - termLen) / 2;
                System.out.println("delta:" + delta);
                int from;
                int to;
                int over = 0;
                if (delta > startOff)
                {
                    from = 0;
                    over = delta - startOff; // window hits the left edge; carry the rest right
                    System.out.println("from:" + from + " over:" + over);
                }
                else
                {
                    from = startOff - delta;
                    System.out.println("from:" + from);
                }
                if (endOff + delta + over > textLength)
                {
                    to = textLength;
                    int miss = (endOff + delta + over) - to; // shortfall on the right edge
                    from = Math.max(0, from - miss);         // take it on the left instead
                    System.out.println("to:" + to + " miss:" + miss);
                }
                else
                {
                    to = endOff + delta + over;
                    System.out.println("to:" + to);
                }
                newText.append(text.substring(from, startOff)
                    + highlighter.highlightTerm(text.substring(startOff, endOff))
                    + text.substring(endOff, to));
            }
            else
            {
                HashSet terms = new HashSet();
                org.apache.lucene.analysis.Token token;
                String tokenText;
                int startOffset;
                int endOffset;
                int lastEndOffset = 0;
                getTerms(query, terms, false);
                stream = analyzer.tokenStream(new StringReader(text));
                while ((token = stream.next()) != null)
                {
                    if (terms.contains(token.termText()))
                    {
                        startOffset = token.startOffset();
                        endOffset = token.endOffset();
                        tokenText = text.substring(startOffset, endOffset);
                        int termLen = tokenText.length();
                        System.out.println("Term Length:" + termLen);
                        System.out.println("Query:" + tokenText);
                        int delta = (MAX_LENGTH - termLen) / 2;
                        System.out.println("delta:" + delta);
                        int from;
                        int to;
                        int over = 0;
                        if (delta > startOffset)
                        {
                            from = 0;
                            over = delta - startOffset;
                            System.out.println("from:" + from + " over:" + over);
                        }
                        else
                        {
                            from = startOffset - delta;
                            System.out.println("from:" + from);
                        }
                        if (endOffset + delta + over > textLength)
                        {
                            to = textLength;
                            int miss = (endOffset + delta + over) - to;
                            from = Math.max(0, from - miss);
                            System.out.println("to:" + to + " miss:" + miss);
                        }
                        else
                        {
                            to = endOffset + delta + over;
                            System.out.println("to:" + to);
                        }
                        newText.append(text.substring(from, startOffset)
                            + highlighter.highlightTerm(tokenText)
                            + text.substring(endOffset, to));
                        break; // only the first matching token is excerpted
                    }
                }
            }
            System.out.println("In Memory HighLighting Time:"
                + (System.currentTimeMillis() - st));
            return newText.toString().trim();
        }
        finally
        {
            if (stream != null)
            {
                try { stream.close(); } catch (Exception e) {}
            }
        }
    }

    /* original highlighting loop (highlights every occurrence, up to MAX_HL_TOKEN):
    getTerms(query, terms, false);
    stream = analyzer.tokenStream(new StringReader(text));
    while ((token = stream.next()) != null)
    {
        startOffset = token.startOffset();
        endOffset = token.endOffset();
        tokenText = text.substring(startOffset, endOffset);

        // append text between end of last token (or beginning of text) and start of current token
        if (startOffset > lastEndOffset)
            newText.append(text.substring(lastEndOffset, startOffset));

        // does query contain current token?
        if (terms.contains(token.termText()))
        {
            tok++;
            newText.append(highlighter.highlightTerm(tokenText));
        }
        else
            newText.append(tokenText);
        lastEndOffset = endOffset;
        if (tok > MAX_HL_TOKEN)
            break;
    }
    if (lastEndOffset < text.length())
        newText.append(text.substring(lastEndOffset));
    */

    /**
     * Extracts all term texts of a given Query. Term texts will be returned in lower-case.
     *
     * @param query      Query to extract term texts from
     * @param terms      HashSet where extracted term texts should be put into (Elements: String)
     * @param prohibited true to extract "prohibited" terms, too
     */
    public static final void getTerms(Query query, HashSet terms, boolean prohibited)
        throws IOException
    {
        if (query instanceof BooleanQuery)
            getTermsFromBooleanQuery((BooleanQuery) query, terms, prohibited);
        else if (query instanceof PhraseQuery)
            getTermsFromPhraseQuery((PhraseQuery) query, terms);
        else if (query instanceof TermQuery)
            getTermsFromTermQuery((TermQuery) query, terms);
        else if (query instanceof PrefixQuery)
            getTermsFromPrefixQuery((PrefixQuery) query, terms, prohibited);
        else if (query instanceof RangeQuery)
            getTermsFromRangeQuery((RangeQuery) query, terms, prohibited);
        else if (query instanceof MultiTermQuery)
            getTermsFromMultiTermQuery((MultiTermQuery) query, terms, prohibited);
    }

    /**
     * Extracts all term texts of a given BooleanQuery. Term texts will be returned in lower-case.
     *
     * @param query      BooleanQuery to extract term texts from
     * @param terms      HashSet where extracted term texts should be put into (Elements: String)
     * @param prohibited true to extract "prohibited" terms, too
     */
    private static final void getTermsFromBooleanQuery(BooleanQuery query, HashSet terms,
        boolean prohibited) throws IOException
    {
        BooleanClause[] queryClauses = query.getClauses();

        for (int i = 0; i < queryClauses.length; i++)
        {
            if (prohibited || !queryClauses[i].prohibited)
                getTerms(queryClauses[i].query, terms, prohibited);
        }
    }

    /**
     * Extracts all term texts of a given PhraseQuery.
     * Term texts will be returned in lower-case.
     *
     * @param query PhraseQuery to extract term texts from
     * @param terms HashSet where extracted term texts should be put into (Elements: String)
     */
    private static final void getTermsFromPhraseQuery(PhraseQuery query, HashSet terms)
    {
        // strip the surrounding quotes; use the same string for both offsets
        String phrase = query.toString("contents");
        terms.add(phrase.toLowerCase().substring(1, phrase.length() - 1));
        /*
        Term[] queryTerms = query.getTerms();
        for (int i = 0; i < queryTerms.length; i++)
            terms.add(getTermsFromTerm(queryTerms[i]));
        */
    }

    /**
     * Extracts all term texts of a given TermQuery. Term texts will be returned in lower-case.
     *
     * @param query TermQuery to extract term texts from
     * @param terms HashSet where extracted term texts should be put into (Elements: String)
     */
    private static final void getTermsFromTermQuery(TermQuery query, HashSet terms)
    {
        terms.add(getTermsFromTerm(query.getTerm()));
    }

    /**
     * Extracts all term texts of a given MultiTermQuery. Term texts will be returned in lower-case.
     *
     * @param query      MultiTermQuery to extract term texts from
     * @param terms      HashSet where extracted term texts should be put into (Elements: String)
     * @param prohibited true to extract "prohibited" terms, too
     */
    private static final void getTermsFromMultiTermQuery(MultiTermQuery query, HashSet terms,
        boolean prohibited) throws IOException
    {
        getTerms(query.getQuery(), terms, prohibited);
    }

    /**
     * Extracts all term texts of a given PrefixQuery. Term texts will be returned in lower-case.
     *
     * @param query      PrefixQuery to extract term texts from
     * @param terms      HashSet where extracted term texts should be put into (Elements: String)
     * @param prohibited true to extract "prohibited" terms, too
     */
    private static final void getTermsFromPrefixQuery(PrefixQuery query, HashSet terms,
        boolean prohibited) throws IOException
    {
        getTerms(query.getQuery(), terms, prohibited);
    }

    /**
     * Extracts all term texts of a given RangeQuery. Term texts will be returned in lower-case.
     *
     * @param query      RangeQuery to extract term texts from
     * @param terms      HashSet where extracted term texts should be put into (Elements: String)
     * @param prohibited true to extract "prohibited" terms, too
     */
    private static final void getTermsFromRangeQuery(RangeQuery query, HashSet terms,
        boolean prohibited) throws IOException
    {
        getTerms(query.getQuery(), terms, prohibited);
    }

    /**
     * Extracts the term of a given Term. The term will be returned in lower-case.
     *
     * @param term Term to extract term from
     *
     * @return the Term's term text
     */
    private static final String getTermsFromTerm(Term term)
    {
        return term.text().toLowerCase();
    }
}