lucene-dev mailing list archives

From "Yonik Seeley (JIRA)"
Subject [jira] Commented: (LUCENE-627) highlighter problems with overlapping tokens
Date Thu, 13 Jul 2006 22:25:31 GMT

Yonik Seeley commented on LUCENE-627:

>>The original token stream is a valid one though right?
> I don't think so, see below...

Ah, right... I constructed the wrong one first.  I wanted pod and ipod in the same position...
so the token stream looks like "i" ("pod"|"ipod") "foo".
Now this token-stream is correct, I believe, but the same problem happens.

A work-around is to swap the order in which the "pod" and "ipod" tokens appear, but it seems
like any such workaround should be put into the highlighter rather than left external to it.
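
To make the swap concrete, here is a minimal standalone sketch of the external work-around. It does not use Lucene classes: Tok is a hypothetical stand-in for a token (term, offsets, position increment), and reorder is a hypothetical helper. Within each run of tokens that share a position it puts the token with the earliest start offset first, so "ipod" (0,4) is emitted before "pod" (1,4). Treat it as an illustration under those assumptions, not as a fix for the highlighter itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SamePositionReorder {
    /** Hypothetical stand-in for a token: term text, offsets, position increment. */
    static final class Tok {
        final String term;
        final int start;
        final int end;
        final int posInc;

        Tok(String term, int start, int end, int posInc) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "[" + start + "," + end + ")+" + posInc;
        }
    }

    /** Orders tokens that share a position by start offset, widest first on ties. */
    static List<Tok> reorder(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < in.size()) {
            // A group is one token plus any following tokens with increment 0.
            int j = i + 1;
            while (j < in.size() && in.get(j).posInc == 0) {
                j++;
            }
            List<Tok> group = new ArrayList<>(in.subList(i, j));
            int groupInc = group.get(0).posInc;
            group.sort(Comparator.comparingInt((Tok t) -> t.start)
                    .thenComparingInt(t -> -t.end));
            // The first token keeps the group's increment; the rest stay at 0.
            for (int k = 0; k < group.size(); k++) {
                Tok t = group.get(k);
                out.add(new Tok(t.term, t.start, t.end, k == 0 ? groupInc : 0));
            }
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> in = new ArrayList<>();
        in.add(new Tok("i", 0, 1, 1));
        in.add(new Tok("pod", 1, 4, 1));
        in.add(new Tok("ipod", 0, 4, 0)); // same position as "pod"
        in.add(new Tok("foo", 5, 8, 1));
        System.out.println(reorder(in));
    }
}
```

The position increment is carried over to whichever token ends up first in its group, so the positions seen by a consumer of the stream are unchanged; only the order inside the group differs.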

  public void testOverlapAnalyzer2() throws Exception
  {
    String s = "iPod foo";
    // the token stream for the string above:
    TokenStream ts = new TokenStream() {
      Iterator iter;
      {
        // instance initializer: build the fixed token list once
        List lst = new ArrayList();
        Token t;
        t = new Token("i",0,1);
        lst.add(t);
        t = new Token("pod",1,4);
        lst.add(t);
        t = new Token("ipod",0,4);
        t.setPositionIncrement(0);   // pod and ipod occupy the same token position.
        lst.add(t);
        t = new Token("foo",5,8);
        lst.add(t);
        iter = lst.iterator();
      }
      public Token next() throws IOException {
        return iter.hasNext() ? (Token) iter.next() : null;
      }
    };

    String srchkey = "foo";

    QueryParser parser=new QueryParser("text",new WhitespaceAnalyzer());
    Query query = parser.parse(srchkey);

    Highlighter highlighter = new Highlighter(new QueryScorer(query));

    // Get 3 best fragments and separate with a "..."
    String result = highlighter.getBestFragments(ts, s, 3, "...");
    String expectedResult="iPod <B>foo</B>";
    assertEquals(expectedResult, result);
  }

> highlighter problems with overlapping tokens
> --------------------------------------------
>          Key: LUCENE-627
>          URL:
>      Project: Lucene - Java
>         Type: Bug

>   Components: Other
>     Versions: 2.0.1
>     Reporter: Yonik Seeley

> The Lucene highlighter has problems when overlapping tokens are generated.
> For example, if analysis of iPod generates the tokens "i", "pod", "ipod" (with pod and ipod in the same position),
> then the highlighter will output this as iipod, regardless of whether any of those tokens are highlighted.
> Discovered via
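
One way to see how a doubled leading "i" like the one reported could arise is the simplified model below. It is not the Lucene Highlighter's code. It assumes, purely for illustration, that source text is re-emitted one token group at a time and that a token joins the current group only when its start offset falls before the group's end offset. Tok, rebuild, and the class name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapGroupingModel {
    /** Hypothetical stand-in for a token: term text plus character offsets. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Re-emits the source text group by group, in token stream order. */
    static String rebuild(String text, List<Tok> tokens) {
        StringBuilder sb = new StringBuilder();
        int lastEnd = 0;   // end offset of the text emitted so far
        int gStart = -1;   // current group offsets; -1 means no open group
        int gEnd = -1;
        for (int i = 0; i <= tokens.size(); i++) {
            Tok t = i < tokens.size() ? tokens.get(i) : null;
            if (t != null && gStart >= 0 && t.start < gEnd) {
                // Overlaps the open group: widen the group, emit nothing yet.
                gStart = Math.min(gStart, t.start);
                gEnd = Math.max(gEnd, t.end);
                continue;
            }
            if (gStart >= 0) {
                // Flush: gap text first, then the group's full span.
                if (gStart > lastEnd) {
                    sb.append(text, lastEnd, gStart);
                }
                sb.append(text, gStart, gEnd);
                lastEnd = Math.max(lastEnd, gEnd);
            }
            if (t != null) {
                gStart = t.start;
                gEnd = t.end;
            }
        }
        if (lastEnd < text.length()) {
            sb.append(text.substring(lastEnd));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Tok> original = new ArrayList<>();
        original.add(new Tok("i", 0, 1));
        original.add(new Tok("pod", 1, 4));
        original.add(new Tok("ipod", 0, 4));
        original.add(new Tok("foo", 5, 8));

        List<Tok> swapped = new ArrayList<>();
        swapped.add(new Tok("i", 0, 1));
        swapped.add(new Tok("ipod", 0, 4));
        swapped.add(new Tok("pod", 1, 4));
        swapped.add(new Tok("foo", 5, 8));

        System.out.println(rebuild("iPod foo", original)); // iiPod foo
        System.out.println(rebuild("iPod foo", swapped));  // iPod foo
    }
}
```

Under this model "i" is flushed on its own before "ipod" arrives and stretches the next group back to offset 0, so the first character is emitted twice. With "ipod" ahead of "pod", all three tokens fall into one group and the text comes out once.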
