lucene-dev mailing list archives

From "Karel Tejnora (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-663) New feature rich higlighter for Lucene.
Date Wed, 23 Aug 2006 00:07:15 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-663?page=comments#action_12429848 ] 
            
Karel Tejnora commented on LUCENE-663:
--------------------------------------

Hi,
Yes, as I wrote in the code (which keeps the original author's name), I borrowed small parts of the code from this contribution:
http://issues.apache.org/jira/browse/LUCENE-644?page=all
It has a small bug when a term is at or near the end of a field. To fix it, change these lines:
321: sb.append(cbuf, 0, EOF ? skip : (surround - skippedChars));
276: int readed = reader.read(cbuf, 0, nextStart - pos);
278: sb.append(cbuf, 0, readed);
I also borrowed from WildcardTermEnum.
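
The idea behind the changes to lines 276 and 278 can be sketched in isolation: near the end of a field, Reader.read may return fewer characters than requested (or -1), so only the characters actually read should be appended. The class and method names below are hypothetical and are not taken from the LUCENE-644 code; this is only an illustration of the pattern.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ShortReadSketch {
    /** Reads at most 'wanted' chars and returns only what was actually read. */
    static String readUpTo(Reader reader, int wanted) {
        char[] cbuf = new char[wanted];
        StringBuilder sb = new StringBuilder();
        try {
            // May return fewer chars than requested, or -1 at end of stream.
            int read = reader.read(cbuf, 0, wanted);
            if (read > 0) {
                // Append only the chars that were read, not the whole buffer.
                sb.append(cbuf, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The term sits at the end of the field: 10 chars requested, 4 available.
        System.out.println(readUpTo(new StringReader("beer"), 10));
    }
}
```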

Motivation: I was unable to find a highlighter with good performance and proper phrase highlighting
(at the beginning I needed only phrases with slop 0).

For the query "karel drinks beer"~4 on the text
karel drinks a lot of czech beer. Beer is his life.
this highlighter produces:
<PREFIX>karel</SUFFIX> <PREFIX>drinks</SUFFIX> a lot of czech <PREFIX>beer</SUFFIX>. Beer is his life.

I started by implementing a stack for phrase queries and ended up with this. It is still not final:
fuzzy, span, scoring and coloring still need to be done.
By 'coloring' I mean:
<PREFIX>karel</SUFFIX> <PREFIX>drinks</SUFFIX> a <PREFIX1>lot</SUFFIX1> of <PREFIX1>czech</SUFFIX1> <PREFIX>beer</SUFFIX>. Beer is his life.

For a wildcard query: BMW* -> <PREFIX>BMW</SUFFIX><PREFIX1>ED</SUFFIX1>
etc.
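
A minimal sketch of the 'coloring' idea, independent of Lucene and of the attached code: phrase terms get one pair of markers, and the terms lying between them (the slop) get a second pair. The class name, the greedy in-order matching, and the stop-word set (used here so that "a" and "of" stay unmarked, as in the example) are all assumptions made for illustration; the sketch does not check the slop limit.

```java
import java.util.HashSet;
import java.util.Set;

public class ColoringSketch {
    /** Marks phrase terms with PREFIX/SUFFIX and in-between terms with PREFIX1/SUFFIX1. */
    static String color(String[] tokens, String[] phrase, Set<String> stopWords) {
        // Positions of the phrase terms, matched in order (greedy, first occurrence).
        int[] hits = new int[phrase.length];
        int next = 0;
        for (int i = 0; i < tokens.length && next < phrase.length; i++) {
            if (tokens[i].equalsIgnoreCase(phrase[next])) {
                hits[next++] = i;
            }
        }
        if (phrase.length == 0 || next < phrase.length) {
            return String.join(" ", tokens); // phrase not found: nothing to mark
        }
        Set<Integer> hitSet = new HashSet<>();
        for (int h : hits) {
            hitSet.add(h);
        }
        int first = hits[0];
        int last = hits[hits.length - 1];
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            if (hitSet.contains(i)) {
                sb.append("<PREFIX>").append(tokens[i]).append("</SUFFIX>");
            } else if (i > first && i < last && !stopWords.contains(tokens[i].toLowerCase())) {
                // A non-stop-word between phrase terms: part of the slop.
                sb.append("<PREFIX1>").append(tokens[i]).append("</SUFFIX1>");
            } else {
                sb.append(tokens[i]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"karel", "drinks", "a", "lot", "of", "czech", "beer", "Beer", "is", "his", "life"};
        String[] phrase = {"karel", "drinks", "beer"};
        Set<String> stop = new HashSet<>();
        stop.add("a");
        stop.add("of");
        System.out.println(color(tokens, phrase, stop));
    }
}
```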

So the user can see why a document matches the query.

Usage is maybe more straightforward:

Construct a highlighter where all passed fields are highlighted using TermPositionVector
(where there is no TPV, null is returned):

FulltextHighlighter highlighter = new FulltextHighlighter(reader,query,prefix,suffix);

OR 
Construct a highlighter where all highlighted fields are processed with an Analyzer:

FulltextHighlighter highlighter = new FulltextHighlighter(analyzer,query,prefix,suffix);

Construct a highlighter where the choice between Analyzer and TermVector is autodetected:
FulltextHighlighter highlighter = new FulltextHighlighter(reader, analyzer,query,prefix,suffix);

And when iterating over hits:
String highlightedText = highlighter.highlight(luceneDocumentID, luceneDocument, fieldName);  // uses the TPV

OR
String highlightedText = highlighter.highlight(luceneDocument, fieldName);  // uses the analyzer; if TPV usage is forced, an assertion fires

It has some options:
setAnalyzerUnstable(boolean analyzerUnstable): set it to false (default true) if you know that
t(n).startOffset() < t(n+1).startOffset() holds for consecutive tokens
setMaxFragments(int i): maximum number of fragments
setSurround(int surround);
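
The condition for setAnalyzerUnstable(false) can be checked with a few lines. This is a sketch only: the class and method names are hypothetical, and the start offsets are passed as a plain int array rather than read from Lucene Token objects.

```java
public class OffsetOrderSketch {
    /** Returns true if every start offset is strictly smaller than the next one. */
    static boolean startOffsetsStrictlyIncreasing(int[] startOffsets) {
        for (int i = 1; i < startOffsets.length; i++) {
            if (startOffsets[i - 1] >= startOffsets[i]) {
                return false; // e.g. two tokens emitted at the same offset
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(startOffsetsStrictlyIncreasing(new int[] {0, 6, 13}));
        System.out.println(startOffsetsStrictlyIncreasing(new int[] {0, 6, 6}));
    }
}
```

An analyzer that emits several tokens at the same offset (for instance, injected synonyms) would fail this check, which is presumably the case the default of true guards against.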

a) b) I don't know; maybe it will be faster or lighter, or neither, but I started because
none of the contributed or filed highlighters gives 'nice' results.
I use a lot of queries that search for names, like "James Bond" OR "Sean Connery", and this gives me
a nicer view of why the document matches my query.

:-) Or I don't know how to use google

> New feature rich higlighter for Lucene.
> ---------------------------------------
>
>                 Key: LUCENE-663
>                 URL: http://issues.apache.org/jira/browse/LUCENE-663
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Karel Tejnora
>         Attachments: lucene-hlt-src.jar
>
>
> Well, I refactored (took) some code from two previous highlighters.
> This highlighter:
> + uses TermPositionVector where available
> + uses Analyzer if no TermPositionVector is found or if it is forced to use it
> + supports all Lucene queries (Term, Phrase with slops, Prefix, Wildcard, Range) except Fuzzy Query (can be implemented easily)
> - has no support for scoring (yet)
> - uses the same prefix, postfix for accepted terms (yet)
> ? It's written in Java5
> In the next release I'd like to add support for Fuzzy, "coloring" (e.g. a different color for terms between phrase terms (slops)), and scoring of fragments
> It's Apache licensed - I hope so :-) I put a license statement in every file

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

