lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <>
Subject [jira] Commented: (SOLR-2051) analysis.jsp is incorrect for protWords etc
Date Mon, 16 Aug 2010 22:16:17 GMT


Uwe Schindler commented on SOLR-2051:

After a discussion with Robert, I also think that a Tap would be an elegant and less intrusive
approach (from the TokenStream's point of view). The whole thing would simply create the Tokenizer,
wrap the tap filter around it, then add the next filter in the chain, add the tap again, and
so on.

The filter simply calls input.incrementToken() and then prints the current attributes. It
can also hold a local "pos" field that is updated with positionIncrement to get the formatting
right. The code to re-sort tokens when negative position increments occur is useless, as Lucene
no longer allows negative position increments (from what I know). The whole JSP would need
no cached lists of tokens, no iterators, no array copies, no copyTo(). It just builds a TokenStream
and consumes it. The Tap filter can also be wrapped around a generic (non-TokenizerChain) Lucene
Analyzer. The main code would simply do "while (ts.incrementToken())" - nothing more. All
printout is done in the filters added between each chain step (or after the generic Lucene
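
To make the idea concrete, here is a rough sketch of the tap approach. It deliberately does
not use the real Lucene classes: Attrs, SketchStream, WhitespaceSketchTokenizer,
LowerCaseSketchFilter, TapFilter and TapDemo are all hypothetical stand-ins so that the example
runs on its own. In Lucene the tap would presumably extend TokenFilter and read the shared
attributes instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the attribute state shared along a Lucene chain.
final class Attrs {
    String term;
    int positionIncrement = 1;
}

// Hypothetical stand-in for TokenStream: every stage shares one Attrs instance.
abstract class SketchStream {
    final Attrs attrs;
    SketchStream(Attrs attrs) { this.attrs = attrs; }
    abstract boolean incrementToken();
}

// Toy tokenizer: splits on whitespace, one position per token.
final class WhitespaceSketchTokenizer extends SketchStream {
    private final Iterator<String> words;
    WhitespaceSketchTokenizer(String text) {
        super(new Attrs());
        this.words = Arrays.asList(text.trim().split("\\s+")).iterator();
    }
    @Override boolean incrementToken() {
        if (!words.hasNext()) return false;
        attrs.term = words.next();
        attrs.positionIncrement = 1;
        return true;
    }
}

// Toy filter standing in for "the next filter in the chain".
final class LowerCaseSketchFilter extends SketchStream {
    private final SketchStream input;
    LowerCaseSketchFilter(SketchStream input) { super(input.attrs); this.input = input; }
    @Override boolean incrementToken() {
        if (!input.incrementToken()) return false;
        attrs.term = attrs.term.toLowerCase(Locale.ROOT);
        return true;
    }
}

// The tap: pulls one token from its input, records the current attributes,
// and passes the token on unchanged. "pos" accumulates positionIncrement.
final class TapFilter extends SketchStream {
    private final SketchStream input;
    private final String stage;
    private final List<String> out;
    private int pos = 0;
    TapFilter(SketchStream input, String stage, List<String> out) {
        super(input.attrs);
        this.input = input;
        this.stage = stage;
        this.out = out;
    }
    @Override boolean incrementToken() {
        if (!input.incrementToken()) return false;
        pos += attrs.positionIncrement;
        out.add(stage + " pos=" + pos + " term=" + attrs.term);
        return true;
    }
}

final class TapDemo {
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        SketchStream ts = new WhitespaceSketchTokenizer(text);
        ts = new TapFilter(ts, "tokenizer", out);   // tap after the tokenizer
        ts = new LowerCaseSketchFilter(ts);
        ts = new TapFilter(ts, "lowercase", out);   // tap after the filter
        while (ts.incrementToken()) { /* all output happens in the taps */ }
        return out;
    }
}
```

Because the chain is pull-based, the rows from the taps come out interleaved per token rather
than grouped per stage, so a page that wants one table per stage would probably give each tap
its own output buffer.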

> analysis.jsp is incorrect for protWords etc
> -------------------------------------------
>                 Key: SOLR-2051
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: web gui
>    Affects Versions: 3.1, 4.0
>            Reporter: Robert Muir
>         Attachments: SOLR-2051.patch, SOLR-2051.patch
> Analysis.jsp gives incorrect results if you use "protwords.txt" or "stemdict.txt"
> or the like.
> This is because this is now implemented with KeywordAttribute (so you can easily override
> any stemmer etc).
> For example, if your schema had "foobars" in protwords.txt, analysis.jsp would show it
> being stemmed to "foobar", even though this doesn't actually happen.
> The problem is that this jsp is downconverting the entire tokenstream to Token in between
> processing, so it silently discards KeywordAttribute and you get the wrong result.
> Note: this issue isn't about *displaying* other attributes such as KeywordAttribute (which
> would be a new feature). It's about not throwing them away so that the analysis actually represents
> what happens.
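
The failure mode described in the issue can be illustrated with a small toy model. None of
this is Lucene code: Tok, KeywordLossDemo, and the "strip a trailing s" stemmer are hypothetical
stand-ins, and the boolean field only mimics what KeywordAttribute carries. The point is just
that a copy which keeps only the term between stages loses the protected-word flag before the
stemmer runs.

```java
import java.util.Set;

// Hypothetical token: a term plus a flag standing in for KeywordAttribute.
final class Tok {
    final String term;
    final boolean keyword;
    Tok(String term, boolean keyword) { this.term = term; this.keyword = keyword; }
}

final class KeywordLossDemo {
    // Stage 1: flag terms listed in the (toy) protected-words set.
    static Tok markProtected(Tok t, Set<String> protWords) {
        return new Tok(t.term, t.keyword || protWords.contains(t.term));
    }

    // Stage 2: toy stemmer that strips a trailing "s" unless the token is flagged.
    static Tok stem(Tok t) {
        if (t.keyword || !t.term.endsWith("s")) return t;
        return new Tok(t.term.substring(0, t.term.length() - 1), false);
    }

    // Lossy copy between stages: keeps the term, drops the flag.
    static Tok downconvert(Tok t) {
        return new Tok(t.term, false);
    }

    // lossy=true mimics copying between stages; lossy=false keeps the flag intact.
    static String analyze(String word, Set<String> protWords, boolean lossy) {
        Tok t = markProtected(new Tok(word, false), protWords);
        if (lossy) t = downconvert(t);
        return stem(t).term;
    }
}
```

With "foobars" in the protected set, the lossless path leaves the term alone, while the lossy
path stems it to "foobar", which matches the wrong display reported for analysis.jsp.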

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
