lucene-dev mailing list archives

From "Renaud Delbru (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource
Date Fri, 25 Jan 2013 17:29:13 GMT


Renaud Delbru commented on LUCENE-4642:


> have you looked at TeeSinkTokenFilter

Yes, and from my current understanding, it is similar to our current implementation. The problem
with this approach is that attributes are exchanged through the AttributeSource.State API
(AttributeSource#captureState and AttributeSource#restoreState), which copies the values
of every attribute implementation that the state contains. This is very inefficient, as
it has to copy arrays and other objects (e.g., char term arrays) for every single
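
The copying cost described above can be sketched with a small stand-alone model. This is
not Lucene code: TermAttr, capture and restore are hypothetical stand-ins for an attribute
implementation and for captureState/restoreState, and the copy counter exists only for
illustration. The sketch assumes that both capturing and restoring copy the term buffer.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a term attribute; NOT Lucene's CharTermAttribute. */
final class TermAttr {
    char[] buffer = new char[16];
    int length;

    void set(String term) {
        if (term.length() > buffer.length) {
            buffer = new char[term.length()];
        }
        term.getChars(0, term.length(), buffer, 0);
        length = term.length();
    }

    @Override
    public String toString() {
        return new String(buffer, 0, length);
    }
}

final class StateCopySketch {
    /** Number of char[] copies made so far (illustration only). */
    static int arrayCopies = 0;

    /** Models captureState: snapshot the attribute by copying its buffer. */
    static TermAttr capture(TermAttr source) {
        TermAttr snapshot = new TermAttr();
        snapshot.buffer = Arrays.copyOf(source.buffer, source.buffer.length);
        snapshot.length = source.length;
        arrayCopies++;
        return snapshot;
    }

    /** Models restoreState: copy the snapshot's values into the target attribute. */
    static void restore(TermAttr snapshot, TermAttr target) {
        target.buffer = Arrays.copyOf(snapshot.buffer, snapshot.buffer.length);
        target.length = snapshot.length;
        arrayCopies++;
    }

    /** Producer and consumer own separate attributes, so every token is copied across. */
    static String viaState(String[] tokens) {
        TermAttr producer = new TermAttr();
        TermAttr consumer = new TermAttr();
        StringBuilder seen = new StringBuilder();
        for (String token : tokens) {
            producer.set(token);
            restore(capture(producer), consumer);
            seen.append(consumer).append(' ');
        }
        return seen.toString().trim();
    }

    /** Producer and consumer share one attribute instance, so nothing is copied. */
    static String viaSharedSource(String[] tokens) {
        TermAttr shared = new TermAttr();
        StringBuilder seen = new StringBuilder();
        for (String token : tokens) {
            shared.set(token);          // producer writes
            seen.append(shared).append(' '); // consumer reads the same instance
        }
        return seen.toString().trim();
    }
}
```

In this model the state-based path performs two buffer copies per token, while the shared
path performs none, which is the motivation for sharing one AttributeSource across streams.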


Concerning the problem of UOEs (UnsupportedOperationExceptions), Steve's new patch reduces their
number to just one, which is much more reasonable than my first approach. I have looked at the
current state of the Lucene trunk, and UOEs are already thrown in many places. So I would suggest
that this problem may not be a blocking one (but I might be wrong).

Concerning the problem of constructor explosion, maybe we can reach a consensus. Your proposal
to remove Tokenizer(AttributeSource) cannot work for us, as we need it to share the same AttributeSource
across multiple streams. However, as I proposed, removing Tokenizer(AttributeFactory)
could work, as it can be emulated with Tokenizer(AttributeSource).
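
The emulation mentioned above can be sketched as follows. The classes are hypothetical
stand-ins, not Lucene's: the sketch only assumes that an attribute source can be constructed
from an attribute factory, so that a factory-based constructor reduces to wrapping the
factory in a new source. The helper withFactory is an illustrative name.

```java
/** Hypothetical stand-in for an attribute factory. */
final class SketchAttributeFactory {
    final String name;

    SketchAttributeFactory(String name) {
        this.name = name;
    }
}

/** Hypothetical stand-in for an attribute source built on top of a factory. */
final class SketchAttributeSource {
    final SketchAttributeFactory factory;

    SketchAttributeSource(SketchAttributeFactory factory) {
        this.factory = factory;
    }
}

/** Hypothetical tokenizer that keeps only the source-based constructor. */
final class SketchTokenizer {
    final SketchAttributeSource source;

    /** Works on a caller-supplied source, which may be shared with other streams. */
    SketchTokenizer(SketchAttributeSource source) {
        this.source = source;
    }

    /** What a removed factory-based constructor reduces to at the call site. */
    static SketchTokenizer withFactory(SketchAttributeFactory factory) {
        return new SketchTokenizer(new SketchAttributeSource(factory));
    }
}
```

The reverse does not hold in this model: a constructor that only accepts a factory always
creates a fresh source, so two tokenizers could never be handed the same one.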

> TokenizerFactory should provide a create method with a given AttributeSource
> ----------------------------------------------------------------------------
>                 Key: LUCENE-4642
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.1
>            Reporter: Renaud Delbru
>            Assignee: Steve Rowe
>              Labels: analysis, attribute, tokenizer
>             Fix For: 4.2, 5.0
>         Attachments: LUCENE-4642.patch, LUCENE-4642.patch
> All tokenizer implementations have a constructor that takes a given AttributeSource as
> parameter (LUCENE-1826). However, the TokenizerFactory does not provide an API to create tokenizers
> with a given AttributeSource.
> Side note: There are still a lot of tokenizers that do not provide constructors that
> take AttributeSource and AttributeFactory.
