lucene-dev mailing list archives

From "Renaud Delbru (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource
Date Mon, 21 Jan 2013 15:28:13 GMT


Renaud Delbru commented on LUCENE-4642:

> Personally: I think we should remove Tokenizer(AttributeSource): it bloats the APIs and causes
> ctor explosion.

Why not the opposite? I.e., remove Tokenizer(AttributeFactory) and keep Tokenizer(AttributeSource),
since AttributeFactory is a nested class of AttributeSource. Limiting the API to
AttributeFactory alone would restrict it unnecessarily, imho.

Our use case is to create "advanced token streams", where one "parent token stream"
can have multiple "child token streams". The parent token stream shares its attribute
source with the child token streams for performance reasons. Emulating this behaviour by
copying the attributes from stream to stream is really inefficient (our throughput
is divided by at least 3).
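
To make the difference between sharing and copying concrete, here is a minimal sketch. It does not use the Lucene API: TokenState, LowerCaseChild and SharingSketch are hypothetical stand-ins, with TokenState playing the role of an AttributeSource. It only shows where the per-token copies come from, not how large their cost is in a real analysis chain.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an AttributeSource: one mutable bag of per-token state.
final class TokenState {
    final StringBuilder term = new StringBuilder();
    String type = "word";
}

// Hypothetical child stream: works on whatever TokenState it was constructed with.
final class LowerCaseChild {
    private final TokenState state;

    LowerCaseChild(TokenState state) {
        this.state = state;
    }

    void process() {
        for (int i = 0; i < state.term.length(); i++) {
            state.term.setCharAt(i, Character.toLowerCase(state.term.charAt(i)));
        }
    }
}

final class SharingSketch {
    // Shared variant: parent and child hold the same TokenState, so nothing is copied.
    static List<String> runShared(String[] tokens) {
        TokenState shared = new TokenState();
        LowerCaseChild child = new LowerCaseChild(shared);
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            shared.term.setLength(0);
            shared.term.append(t);
            child.process();
            out.add(shared.term.toString());
        }
        return out;
    }

    // Copying variant: the child owns its state, so every token is copied in and back out.
    static List<String> runCopying(String[] tokens) {
        TokenState parent = new TokenState();
        TokenState childState = new TokenState();
        LowerCaseChild child = new LowerCaseChild(childState);
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            parent.term.setLength(0);
            parent.term.append(t);
            // Copy parent -> child before the child can run.
            childState.term.setLength(0);
            childState.term.append(parent.term);
            childState.type = parent.type;
            child.process();
            // Copy child -> parent so downstream consumers see the result.
            parent.term.setLength(0);
            parent.term.append(childState.term);
            parent.type = childState.type;
            out.add(parent.term.toString());
        }
        return out;
    }
}
```

Both variants produce the same tokens; the copying variant simply does two extra state transfers per token and per child.
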
A more concrete use case is the ability to create "specific token streams" for a particular
"token type". For example, our parent tokenizer tokenizes a string into a list of tokens,
each one having a specific type. Each token is then processed downstream by a "child token
stream", selected according to the token's type attribute.
> TokenizerFactory should provide a create method with a given AttributeSource
> ----------------------------------------------------------------------------
>                 Key: LUCENE-4642
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.1
>            Reporter: Renaud Delbru
>            Assignee: Steve Rowe
>              Labels: analysis, attribute, tokenizer
>             Fix For: 4.2, 5.0
>         Attachments: LUCENE-4642.patch, LUCENE-4642.patch
> All tokenizer implementations have a constructor that takes a given AttributeSource as
> parameter (LUCENE-1826). However, the TokenizerFactory does not provide an API to create tokenizers
> with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide constructors that
> take AttributeSource and AttributeFactory.
