lucene-dev mailing list archives

From "DM Smith (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1350) SnowballFilter resets the payload
Date Tue, 05 Aug 2008 17:14:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619961#action_12619961
] 

DM Smith commented on LUCENE-1350:
----------------------------------

The non-reuse interface is deprecated. LUCENE-1333 deals with cleaning that up and applying
reuse in all of Lucene. So far, reuse has only been partially applied to core. This results in
sub-optimal performance with filter chains that mix reuse and non-reuse inputs and filters.

So LUCENE-1333 updates SnowballFilter to use next(Token).

The TokenStream documentation states that only producers invoke clear().

To me, it is not clear-cut what a producer or a consumer actually is. Obviously, input streams
are producers. Some filters generate multiple tokens as a replacement for the current one
(e.g. NGram, stemming, ...). To me, these are producers.

If the rule of thumb is that Filters are consumers, merely changing their token's term, then
there are lots of places that need to be changed. I noticed that SnowballFilter's methodology
was fairly common:
Token token = input.next();
...
String newTerm = ....;
...
return new Token(newTerm, token.startOffset(), token.endOffset(), token.type());

In migrating this to the reuse pattern, I saw new Token(...) as a producer pattern, and to
maintain the equivalent behavior, clear() needed to be called:
public Token next(Token token)
{
token = input.next(token);
...
String newTerm = ....;
...
token.clear(); // do most of the initialization that new Token does
token.setTermBuffer(newTerm); // new method introduced in LUCENE-1333
return token;
}

I don't know why the following pattern was not originally used (some filters do this) or why
you didn't migrate to this:
Token token = input.next();
...
String newTerm = ....;
...
token.setTermText(newTerm);
return token;

This would be faster than cloning and would preserve all fields.
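
To make the difference concrete, here is a minimal, self-contained sketch. It does not use the
real Lucene classes: SketchToken, replaceTermWithClear and replaceTermInPlace are hypothetical
names invented for illustration, and the behavior of clear() here (resetting term and payload,
leaving offsets and type alone) is an assumption about what Token.clear() does, not a copy of it.

```java
import java.util.Arrays;

final class PayloadSketch {

    // Hypothetical stand-in for a token; NOT org.apache.lucene.analysis.Token.
    static final class SketchToken {
        String term;
        final int startOffset;
        final int endOffset;
        byte[] payload; // null means "no payload"

        SketchToken(String term, int startOffset, int endOffset, byte[] payload) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.payload = payload;
        }

        // Assumed behavior: reset term and payload, keep offsets.
        void clear() {
            term = "";
            payload = null;
        }
    }

    // "Producer" style: clear() first, then set the new term.
    // Anything clear() resets (here, the payload) is lost.
    static SketchToken replaceTermWithClear(SketchToken token, String newTerm) {
        token.clear();
        token.term = newTerm;
        return token;
    }

    // "Consumer" style: only the term changes; every other field survives.
    static SketchToken replaceTermInPlace(SketchToken token, String newTerm) {
        token.term = newTerm;
        return token;
    }

    public static void main(String[] args) {
        SketchToken a = new SketchToken("running", 0, 7, new byte[] {42});
        SketchToken b = new SketchToken("running", 0, 7, new byte[] {42});

        replaceTermWithClear(a, "run");
        replaceTermInPlace(b, "run");

        System.out.println("with clear(): payload = " + Arrays.toString(a.payload));
        System.out.println("in place:     payload = " + Arrays.toString(b.payload));
    }
}
```

Under these assumptions, the in-place variant is the only one of the two that hands the payload
on to the next filter in the chain, which is the behavior LUCENE-1350 asks for.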



> SnowballFilter resets the payload
> ---------------------------------
>
>                 Key: LUCENE-1350
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1350
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis, contrib/*
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>         Attachments: LUCENE-1350.patch
>
>
> Passing tokens with payloads through SnowballFilter results in tokens with no payloads.
> A workaround for this is to apply stemming first and only then run whatever logic creates
> the payload, but this is not always convenient.
> Patch to follow that preserves the payload.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

