lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Mon, 23 Mar 2009 22:07:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688439#action_12688439 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

bq. I think you are reading more into that than I see - that guy is just frustrated that PhraseQueries don't highlight correctly

But that's really quite a serious problem; it's the kind that
immediately erodes users' trust.  Though if this user had used
SpanScorer it would have been fixed (right?).

Is there any reason not to use SpanScorer (vs QueryScorer)?

The "final inch" (search UI) is exceptionally important!

bq. When users see the PhraseQuery look right, I haven't seen any other repeated complaints really.

OK.

bq. And I think we have positional solved fairly well with the current API - it's just too darn slow.

Well... I'd still like to explore some way to better integrate w/ core
(just don't have enough time, but maybe if I keep talking about it
here, someone else will get the itch + time ;).

I think an IndexReader impl around loaded TermVectors can get us OK
performance (no re-analysis nor linear scan of resynthesized
TokenStream).
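
As a rough illustration of that idea (a sketch only, not Lucene code): if
the start/end character offsets of each term are already loaded, the
highlighter can tag matches straight from those offsets, with no
re-analysis of the text. The class name, the Map standing in for a loaded
term vector, and the <b> tags are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: highlight from precomputed offsets, no re-tokenizing.
class OffsetHighlightSketch {

    // termVector is an assumed stand-in for a loaded term vector:
    // term -> list of {startOffset, endOffset} pairs (end exclusive).
    static String highlight(String text,
                            Map<String, List<int[]>> termVector,
                            Collection<String> queryTerms) {
        // Look up offsets per query term instead of scanning the text.
        List<int[]> hits = new ArrayList<int[]>();
        for (String term : queryTerms) {
            List<int[]> offsets = termVector.get(term);
            if (offsets != null) {
                hits.addAll(offsets);
            }
        }
        // Emit tags in text order.
        hits.sort((a, b) -> Integer.compare(a[0], b[0]));

        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] hit : hits) {
            if (hit[0] < last) {
                continue; // skip a hit that overlaps one already emitted
            }
            out.append(text, last, hit[0])
               .append("<b>").append(text, hit[0], hit[1]).append("</b>");
            last = hit[1];
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<int[]>> tv = new HashMap<String, List<int[]>>();
        tv.put("quick", Collections.singletonList(new int[] {4, 9}));
        tv.put("fox", Collections.singletonList(new int[] {16, 19}));
        // prints: the <b>quick</b> brown <b>fox</b>
        System.out.println(
            highlight("the quick brown fox", tv, Arrays.asList("quick", "fox")));
    }
}
```

The work here scales with the number of matching offsets rather than the
length of the document, which is the property being argued for above.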

bq. Not that I am against things being sweet and perfect, and getting exact matches, but there has been lots of talk in the past about integrating the highlighter into core and making things really fast and efficient - and generally it comes down to what work actually gets done (and all this stuff ends up at the hard end of the pool).

Well this is open source after all.  Things get "naturally
prioritized".

bq. A lot of the sweat that is given has been fragmented by the 3 or 4 alternate highlighters.

Yeah, that's another common theme in open-source development, though it's
in good company: evolution and capitalism share the same "flaw".


> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token streams (general
> token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch).
> The idea was inherited from my previous project with my colleague and from LUCENE-644. This
> approach requires highlighted fields to be indexed with TermVector.WITH_POSITIONS_OFFSETS,
> but it is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses query boost to score fragments (currently ignores idf, but supporting it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers
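
The phrase-unit highlighting with slop listed under the features can be
sketched as below. This is a simplified illustration and not the code in
the patch: the class and method names are hypothetical, tokens are split
on single spaces, and slop is treated as the number of extra positions
allowed between in-order phrase terms (real Lucene slop is an edit
distance that also allows reordering). Adjacent matched tokens share one
tag, which reproduces the two outputs in the {noformat} sample.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of phrase-unit highlighting with a simplified slop.
class SlopPhraseHighlightSketch {

    // First index >= from at which tokens[i] equals term, or -1.
    static int indexOf(String[] tokens, String term, int from) {
        for (int i = from; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                return i;
            }
        }
        return -1;
    }

    // Assumes a non-empty phrase and single-space-separated text.
    static String highlight(String text, List<String> phrase, int slop) {
        String[] tokens = text.split(" ");
        boolean[] marked = new boolean[tokens.length];
        int n = phrase.size();

        for (int start = 0; start < tokens.length; start++) {
            if (!tokens[start].equals(phrase.get(0))) {
                continue;
            }
            // Greedily find each following phrase term after the previous one.
            int[] positions = new int[n];
            positions[0] = start;
            boolean found = true;
            for (int i = 1; i < n && found; i++) {
                positions[i] = indexOf(tokens, phrase.get(i), positions[i - 1] + 1);
                found = positions[i] >= 0;
            }
            if (!found) {
                continue;
            }
            // Extra positions between first and last term beyond an exact phrase.
            int gap = positions[n - 1] - start - (n - 1);
            if (gap <= slop) {
                for (int p : positions) {
                    marked[p] = true;
                }
            }
        }

        // Wrap each run of adjacent marked tokens in a single tag pair.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            if (marked[i] && (i == 0 || !marked[i - 1])) {
                out.append("<b>");
            }
            out.append(tokens[i]);
            if (marked[i] && (i == tokens.length - 1 || !marked[i + 1])) {
                out.append("</b>");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("w1", "w2");
        // prints: <b>w1 w2</b>
        System.out.println(highlight("w1 w2", phrase, 0));
        // prints: <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
        System.out.println(highlight("w1 w3 w2 w3 w1 w2", phrase, 1));
    }
}
```

With slop 0 the second text highlights only the trailing "w1 w2", since
the first w1 and w2 are one position apart.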

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

