lucene-java-user mailing list archives

From Wulf Berschin <bersc...@dosco.de>
Subject Re: PDF Highlighting using PDF Highlight File
Date Thu, 12 May 2011 14:47:17 GMT
Well, AFAIS the Lucene Highlighters do not offer this functionality via 
their API, but they easily could.

I think support for highlighting documents would be a very welcome 
feature. Highlighting HTML documents is already possible with the 
org.apache.solr.analysis.HTMLStripCharFilter and a NullFragmenter, but 
there seems to be nothing for highlighting PDF files...

As a starting point I extracted the class below from 
org.apache.lucene.search.highlight.Highlighter; it just returns the 
Tokens contributing to the hit.

Using the returned tokens, a PDF highlight file could easily be generated, 
and voilà.

-- Wulf

package org.apache.lucene.search.highlight;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;


public class HighlightTokensExtractor
{
   private Scorer fragmentScorer = null;

   public HighlightTokensExtractor(Scorer fragmentScorer)
   {
     this.fragmentScorer = fragmentScorer;
   }

   public final List<Token> getTokens(TokenStream tokenStream, String text,
       boolean mergeContiguousFragments, int maxNumFragments)
       throws IOException, InvalidTokenOffsetsException
   {
     List<Token> result = new ArrayList<Token>();
     TermAttribute termAtt = tokenStream.addAttribute(TermAttribute.class);
     OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
     tokenStream.addAttribute(PositionIncrementAttribute.class);
     tokenStream.reset();

     // dummy text fragment
     TextFragment currentFrag = new TextFragment("", 0, 0);
     TokenStream newStream = fragmentScorer.init(tokenStream);
     if (newStream != null) {
       tokenStream = newStream;
     }
     fragmentScorer.startFragment(currentFrag);

     try {

       TokenGroup tokenGroup = new TokenGroup(tokenStream);

       while (tokenStream.incrementToken()) {
         if ((offsetAtt.endOffset() > text.length())
             || (offsetAtt.startOffset() > text.length())) {
           throw new InvalidTokenOffsetsException("Token " + termAtt.term()
               + " exceeds length of provided text sized " + text.length());
         }
         if ((tokenGroup.numTokens > 0) && (tokenGroup.isDistinct())) {

           if (tokenGroup.getTotalScore() > 0) {
             // debug output: offsets of the matching token group
             System.out.println(tokenGroup.matchStartOffset + " "
                 + tokenGroup.matchEndOffset);
             result.add((Token) tokenGroup.getToken(tokenGroup.getNumTokens() - 1));
           }
           tokenGroup.clear();

         }
         tokenGroup.addToken(fragmentScorer.getTokenScore());

       }

       if (tokenGroup.numTokens > 0) {

         if (tokenGroup.getTotalScore() > 0) {
           // debug output: offsets of the last matching token group
           System.out.println(tokenGroup.matchStartOffset + " "
               + tokenGroup.matchEndOffset);
           result.add((Token) tokenGroup.getToken(tokenGroup.getNumTokens() - 1));
         }
       }

       return result;

     }
     finally {
       if (tokenStream != null) {
         try {
           tokenStream.close();
         }
         catch (Exception e) {
           // ignore: a failure while closing the stream must not hide the result
         }
       }
     }
   }

}
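
To illustrate the last step, here is a minimal sketch of how the start and end offsets of the returned tokens might be turned into a highlight file. It deliberately has no Lucene dependency. The class HighlightFileSketch and its methods toLoc and toHighlightFile are hypothetical names, not part of Lucene or PDFBox. The element layout (XML, Body, Highlight, loc with pg, pos, len) is written as I recall it from the Adobe highlight file format and should be checked against the specification linked in the quoted message. I also assume that pos is relative to the start of the page, that the indexed text is the exact concatenation of the per-page texts, and that no hit crosses a page boundary.

```java
import java.util.List;

public class HighlightFileSketch {

    /** One span to highlight: zero-based page index, page-relative offset, length. */
    public static final class Loc {
        public final int page;
        public final int pos;
        public final int len;

        public Loc(int page, int pos, int len) {
            this.page = page;
            this.pos = pos;
            this.len = len;
        }
    }

    /**
     * Maps a span given as offsets into the concatenated document text to a
     * page-relative location. pageLengths[i] is the character count of page i
     * (assumption: the indexed text is exactly the pages joined without separators).
     */
    public static Loc toLoc(int[] pageLengths, int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException("invalid span " + startOffset + "-" + endOffset);
        }
        int pageStart = 0;
        for (int page = 0; page < pageLengths.length; page++) {
            int pageEnd = pageStart + pageLengths[page];
            if (startOffset < pageEnd) {
                return new Loc(page, startOffset - pageStart, endOffset - startOffset);
            }
            pageStart = pageEnd;
        }
        throw new IllegalArgumentException("offset " + startOffset + " is beyond the last page");
    }

    /** Serializes the locations; the element layout is an assumption, see the note above. */
    public static String toHighlightFile(List<Loc> locs) {
        StringBuilder sb = new StringBuilder();
        sb.append("<XML>\n<Body units=characters version=2>\n<Highlight>\n");
        for (Loc loc : locs) {
            sb.append("<loc pg=").append(loc.page)
              .append(" pos=").append(loc.pos)
              .append(" len=").append(loc.len).append(">\n");
        }
        sb.append("</Highlight>\n</Body>\n</XML>\n");
        return sb.toString();
    }
}
```

For each Token from getTokens one would call toLoc(pageLengths, token.startOffset(), token.endOffset()) and pass the collected list to toHighlightFile. The page lengths have to come from whatever extracted the PDF text page by page at indexing time.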



On 10.05.2011 12:32, Wulf Berschin wrote:
> Hi all,
>
> in our Lucene 3.0.3-based web application when a user clicks on a hit
> link the targeted PDF should be opened in the browser with highlighted
> hits.
>
> For this purpose using the Acrobat Highlight File (Parameter xml, see
> http://www.pdfbox.org/userguide/highlighting.html and
> http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf)
> seems most reasonable to me.
>
> Since the positions to highlight are given by (page and) character
> offsets and Lucene uses offsets as well, I think it could be easy (for
> more Lucene-skilled people than me) to create a Highlighter which
> produces this highlight file.
>
> Does such a Highlighter already exist in the Lucene world?
>
> If not, could someone please point me in the right direction (e.g. where
> to hook into the existing (fast vector?) highlighter just to extract the
> offsets)?
>
> BTW: Luke gave me the impression that Term Vectors are only stored when
> the field content is stored as well. Is that true?
>
> Wulf



