lucene-dev mailing list archives

From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Mon, 23 Mar 2009 22:43:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688448#action_12688448 ]

Mark Miller commented on LUCENE-1522:
-------------------------------------

{quote}But that's really quite a serious problem; it's the kind that
immediately erodes user's trust. Though if this user had used
SpanScorer it would have been fixed (right?).{quote}

Right - my point was more that it was a common complaint and has been solved in one way or
another for a long time. Even back when that post occurred, there was a highlighter in JIRA that
worked with phrase queries, I think. There have been at least one or two besides the SpanScorer.

{quote}Is there any reason not to use SpanScorer (vs QueryScorer)?{quote}

It is slower when working with position-sensitive clauses - because it actually does some
work. For non-position-sensitive terms, it's the same speed as the standard. Makes sense to
me to always use it, but if you don't care and want every term highlighted, why pay the price
I guess...
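
To make the trade-off concrete, here is a toy sketch in plain Java with no Lucene dependency. The class and method names are made up for illustration; this is not the QueryScorer or SpanScorer code, just the difference between highlighting every matching term and highlighting only terms that actually occur as the phrase:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy illustration only; not Lucene's QueryScorer or SpanScorer. */
public class PhraseHighlightSketch {

    /** Highlights every token that equals any phrase term, ignoring positions. */
    public static String highlightTerms(String text, String phrase) {
        List<String> terms = Arrays.asList(phrase.split(" "));
        List<String> out = new ArrayList<>();
        for (String token : text.split(" ")) {
            out.add(terms.contains(token) ? "<b>" + token + "</b>" : token);
        }
        return String.join(" ", out);
    }

    /** Highlights only runs of adjacent tokens that match the whole phrase in order. */
    public static String highlightPhrase(String text, String phrase) {
        String[] terms = phrase.split(" ");
        String[] tokens = text.split(" ");
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (matchesAt(tokens, i, terms)) {
                // The whole phrase matched here: emit it as one highlighted unit.
                out.add("<b>" + String.join(" ", terms) + "</b>");
                i += terms.length;
            } else {
                out.add(tokens[i]);
                i++;
            }
        }
        return String.join(" ", out);
    }

    /** Returns true if the phrase terms appear consecutively starting at index start. */
    static boolean matchesAt(String[] tokens, int start, String[] terms) {
        if (start + terms.length > tokens.length) {
            return false;
        }
        for (int j = 0; j < terms.length; j++) {
            if (!tokens[start + j].equals(terms[j])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "w1 w3 w2 w3 w1 w2";
        System.out.println(highlightTerms(text, "w1 w2"));
        System.out.println(highlightPhrase(text, "w1 w2"));
    }
}
```

For the text "w1 w3 w2 w3 w1 w2" and the phrase "w1 w2", the first method marks all four term occurrences, while the second marks only the final adjacent pair. The extra position check is the kind of work the position-sensitive path has to pay for.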

{quote}
Well... I'd still like to explore some way to better integrate w/ core
(just don't have enough time, but maybe if I keep talking about it
here, someone else will get the itch + time).
{quote}

Right - don't get me wrong - I was just getting thoughts in my head down. These types of brain
dumps you higher-level guys do definitely lead to work getting done - the SpanScorer came directly
from these types of discussions, and quite a bit later - the original discussion happened
before my time.

{quote}
Well this is open source after all. Things get "naturally
prioritized".

    A lot of the sweat that is given has been fragmented by the 3 or 4 alternate highlighters.

Yeah also another common theme in open-source development, though it's
in good company: evolution and capitalism share the same "flaw".
{quote}

Right. I suppose I was just suggesting that something more practical might make more sense
(more musing than suggesting). And practical in terms of how much activity we have seen in
the highlighter area (fairly low, and not usually to the extent needed to get something committed
and in use).

And the split work on the highlighters is fine - but if we had the right highlighter base,
more work could have been concentrated on the highlighter that's most used. Not really a complaint,
but an idea for the future. If we can get something better going, perhaps we can get to the point
where people work with the current implementation rather than creating a new one every time.

> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token streams (general
> token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch). The idea
> was inherited from my previous project with my colleague and LUCENE-644. This approach needs
> highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams.
> This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses query boost to score fragments (currently ignores idf, but supporting it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

