lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1333) Token implementation needs improvements
Date Sun, 10 Aug 2008 10:31:44 GMT


Michael McCandless commented on LUCENE-1333:

But I would still like to clarify what the TokenStream can assume. I think
a TokenStream cannot assume anything about the token it gets as input, and,
once it has returned a token, it cannot assume anything about how that token
is used. So why should it not expect to be passed the token it just returned?

The upshot of all of this: producers don't care which token they reuse.

I agree -- technically speaking, whenever a Token is returned from a source/filter's next(Token)
method, *anything* is allowed to happen to it (including any & all changes, and subsequent
reuse in future calls to next(Token)), and so the current pattern will run correctly as long as all
sources & filters are implemented correctly. This is the contract of the reuse API.
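To make that contract concrete, here is a minimal self-contained sketch (mock Token and TokenSource types, not the real Lucene classes) of a filter that plays by the rules: it hands its reusable token downstream, mutates whatever Token comes back in place, and returns it -- all of which the contract permits.

```java
// Hedged sketch of the next(Token) reuse contract; Token and TokenSource
// are simplified stand-ins, not Lucene's real API.
class Token {
    char[] buf = new char[16];
    int len;
    void set(String s) {
        if (buf.length < s.length()) buf = new char[s.length()];
        s.getChars(0, s.length(), buf, 0);
        len = s.length();
    }
    String term() { return new String(buf, 0, len); }
}

interface TokenSource {
    // May fill in and return the Token it was handed, or any other Token;
    // returns null when the stream is exhausted.
    Token next(Token reusable);
}

class LowerCaseFilterSketch implements TokenSource {
    private final TokenSource input;
    LowerCaseFilterSketch(TokenSource input) { this.input = input; }

    public Token next(Token reusable) {
        Token t = input.next(reusable);  // may or may not be `reusable` itself
        if (t == null) return null;
        for (int i = 0; i < t.len; i++)  // mutate in place -- allowed by the contract
            t.buf[i] = Character.toLowerCase(t.buf[i]);
        return t;
    }
}

public class FilterDemo {
    public static void main(String[] args) {
        // A tiny source that reuses the caller-supplied Token on every call.
        TokenSource src = new TokenSource() {
            private final String[] words = {"Foo", "BAR"};
            private int i;
            public Token next(Token reusable) {
                if (i == words.length) return null;
                reusable.set(words[i++]);
                return reusable;
            }
        };
        TokenSource filtered = new LowerCaseFilterSketch(src);
        Token reusable = new Token();
        StringBuilder out = new StringBuilder();
        for (Token t = filtered.next(reusable); t != null; t = filtered.next(reusable))
            out.append(t.term()).append(' ');
        System.out.println(out.toString().trim());
    }
}
```

Because the filter modifies the buffer in place rather than allocating, the whole chain runs without creating a Token per term -- which is the point of the reuse API.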

It's just that it looks spooky, when you are consuming tokens, not to create & reuse
your own reusable token.  I think it's also possible (though I'm not sure) that the JRE can compile/run
the "single reusable token" pattern more efficiently, since you are making many method calls
with a constant (for the lifetime of the for loop) single argument, but this is pure speculation
on my part...
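For illustration, a sketch of the two consumption styles under discussion (again with mock types, not the real Lucene API): feeding back whatever token the stream just returned is legal under the contract, but allocating one reusable Token up front and passing that same instance on every call is the clearer pattern.

```java
// Hedged sketch contrasting the two consumer-side patterns;
// Token and Stream are simplified stand-ins, not Lucene's real API.
class Token {
    char[] buf = new char[16];
    int len;
    void set(String s) {
        if (buf.length < s.length()) buf = new char[s.length()];
        s.getChars(0, s.length(), buf, 0);
        len = s.length();
    }
    String term() { return new String(buf, 0, len); }
}

class Stream {
    private final String[] words;
    private int pos;
    Stream(String text) { words = text.split("\\s+"); }
    Token next(Token reusable) {
        if (pos == words.length) return null;
        reusable.set(words[pos++]);
        return reusable;
    }
}

public class ConsumePatterns {
    public static void main(String[] args) {
        // Preferred "single reusable token" pattern: one Token for the
        // lifetime of the loop, passed as a constant argument each call.
        Stream s1 = new Stream("x y z");
        Token reusable = new Token();
        StringBuilder a = new StringBuilder();
        for (Token t = s1.next(reusable); t != null; t = s1.next(reusable))
            a.append(t.term());

        // Legal but "spooky" pattern: pass back whatever the stream
        // just returned on the previous call.
        Stream s2 = new Stream("x y z");
        Token t = new Token();
        StringBuilder b = new StringBuilder();
        while ((t = s2.next(t)) != null)
            b.append(t.term());

        System.out.println(a + " " + b);
    }
}
```

Both loops produce identical output; the difference is purely readability, plus the speculative JIT benefit of the constant argument noted above.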

I think from a code-smell standpoint I'd still like to use the "single reuse" pattern where
applicable.  DM, I'll make this change & post a new patch.

> Token implementation needs improvements
> ---------------------------------------
>                 Key: LUCENE-1333
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch,
LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch,
LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch,
LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch,
LUCENE-1333.patch, LUCENE-1333a.txt
> This was discussed in the thread (not sure which place is best to reference so here are
> or to see it all at once:
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the
> 2. Deprecations are incomplete. The constructors that take String as an argument and
the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can
be over-allocated. A less aggressive algorithm would be better. In the thread, the Python
example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded
now rather than waiting for 3.0. As it stands, filter chains that alternate between char[]
and String are sub-optimal. Currently, it is used in core by Query classes. The rest are in
contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token). next() should be deprecated, so that reuse
is maximized, and descendant classes should be rewritten to override next(Token).
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors
that took a Token. This would simplify the use of the two together.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

