lucene-dev mailing list archives

From "Mark Harwood (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-644) Contrib: another highlighter approach
Date Wed, 02 Aug 2006 13:34:15 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-644?page=comments#action_12425230 ] 
            
Mark Harwood commented on LUCENE-644:
-------------------------------------

Many thanks for the client code, Ronnie. I have tried it with my index and have reproduced
the speed-up.
I'm keen to integrate any code that offers a speed-up, ideally in such a way that we
have one highlighter plus JUnit test rig that works with indexes that have TermPositionVectors
and also with those that don't. I suspect this will involve merging bits of our code. There are a
lot of test cases with strange analyzers that need to be considered, which is why I'm keen
to have one codebase.

I'm going on two weeks' holiday (vacation) shortly, so I haven't got much time to look
at this right now, but I plan to when I get back.

After a quick look I haven't yet identified the difference between your approach and mine
that accounts for the speed-up. One likely factor is that your code only considers the offset
positions of tokens that are actually query terms. That may be something I could retrofit into
TokenSources, so that it produces TokenStreams containing only the tokens that matter to the
highlighter.
I suspect there are other benefits to be had from your code too, which I'll have to
consider when I have more time.
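
As a rough illustration of the filtering idea in the paragraph above, the sketch below keeps only
the offsets of tokens that are query terms. It is not code from either highlighter and it does not
use the Lucene API: the TermOffset class and the keepQueryTerms method are hypothetical stand-ins
for whatever TokenSources would actually do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class QueryTermOffsetFilter {

    /** Hypothetical stand-in for a token with character offsets; not a Lucene class. */
    static final class TermOffset {
        final String term;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset

        TermOffset(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Keeps only the offsets whose term appears in the query, preserving input order. */
    static List<TermOffset> keepQueryTerms(List<TermOffset> all, Set<String> queryTerms) {
        List<TermOffset> kept = new ArrayList<TermOffset>();
        for (TermOffset t : all) {
            if (queryTerms.contains(t.term)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<TermOffset> all = Arrays.asList(
                new TermOffset("the", 0, 3),
                new TermOffset("quick", 4, 9),
                new TermOffset("fox", 10, 13),
                new TermOffset("quick", 14, 19));
        Set<String> query = new HashSet<String>(Arrays.asList("quick"));

        // Only the two "quick" tokens survive; everything else is never handed on.
        for (TermOffset t : keepQueryTerms(all, query)) {
            System.out.println(t.term + " [" + t.start + ", " + t.end + ")");
        }
    }
}
```

The point of the sketch is only that the downstream work scales with the number of query-term hits
rather than with the total number of tokens in the field.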

Thanks again for this,

Cheers
Mark

> Contrib: another highlighter approach
> -------------------------------------
>
>                 Key: LUCENE-644
>                 URL: http://issues.apache.org/jira/browse/LUCENE-644
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Ronnie Kolehmainen
>            Priority: Minor
>         Attachments: FulltextHighlighter.java, FulltextHighlighterTest.java, svn-diff.patch
>
>
> Mark Harwood's highlighter package is a great contribution to Lucene, and I've used it a lot.
> However, when you have *large* documents (fields), highlighting can be quite time-consuming
> if you increase the number of bytes to analyze with setMaxDocBytesToAnalyze(int). The default
> value of 50k is often too low for indexed PDFs and similar documents, which results in empty
> highlight strings.
> This is an alternative approach that uses only term position vectors to build fragment info
> objects. A StringReader can then read the relevant fragments and skip() between them, which
> is a lot faster. This method also uses the *entire* field to find the best fragments,
> so you are always guaranteed to get a highlight snippet.
> Because this method only works with fields that have term positions stored, you can check
> whether it works for a particular field using the following code (taken from TokenSources.java):
>         TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
>         if (tfv != null && tfv instanceof TermPositionVector)
>         {
>           // use FulltextHighlighter
>         }
>         else
>         {
>           // use standard Highlighter
>         }
> Someone else might find this useful so I'm posting the code here.
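
The read-and-skip idea described in the quoted issue can be sketched roughly as follows. This is
not the attached FulltextHighlighter code and it makes no Lucene calls: the FragmentReader class
and its readFragments method are hypothetical, and the fragment ranges are passed in by hand,
whereas in the approach described they would be derived from the term position vectors. It only
shows a java.io.Reader pulling out selected character ranges and skipping the text in between.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class FragmentReader {

    /**
     * Reads only the [start, end) character ranges from the reader and skips the rest.
     * Assumes the ranges are sorted by start offset and do not overlap.
     */
    static List<String> readFragments(Reader reader, int[][] ranges) throws IOException {
        List<String> fragments = new ArrayList<String>();
        long pos = 0; // number of characters consumed so far

        for (int[] range : ranges) {
            // Skip the text between the previous fragment and this one.
            long toSkip = range[0] - pos;
            while (toSkip > 0) {
                long skipped = reader.skip(toSkip);
                if (skipped <= 0) {
                    break; // end of input reached before the fragment start
                }
                toSkip -= skipped;
                pos += skipped;
            }

            // Read the fragment itself; read() may return fewer chars than requested.
            char[] buf = new char[range[1] - range[0]];
            int filled = 0;
            while (filled < buf.length) {
                int n = reader.read(buf, filled, buf.length - filled);
                if (n < 0) {
                    break; // end of input inside the fragment
                }
                filled += n;
            }
            pos += filled;
            fragments.add(new String(buf, 0, filled));
        }
        return fragments;
    }

    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("0123456789abcdefghij");
        int[][] ranges = { { 2, 5 }, { 10, 13 } };
        System.out.println(readFragments(reader, ranges)); // prints [234, abc]
    }
}
```

A range that runs past the end of the text simply yields a shorter (possibly empty) fragment in
this sketch; a real highlighter would need to decide how to treat that case.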
