lucene-dev mailing list archives

From "Tim Smith (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1826) All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory
Date Fri, 21 Aug 2009 13:36:16 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745969#action_12745969 ]

Tim Smith commented on LUCENE-1826:
-----------------------------------

bq. This is not possible per design. The AttributeSource cannot be changed.

I fully understand why.

But it should be fairly easy to add a reset(AttributeSource input) method to AttributeSource:
{code}
public void reset(AttributeSource input) {
  if (input == null) {
    throw new IllegalArgumentException("input AttributeSource must not be null");
  }
  // Share the input's attribute state rather than copying it.
  this.attributes = input.attributes;
  this.attributeImpls = input.attributeImpls;
  this.factory = input.factory;
}
{code}

This would require making the attributes and attributeImpls fields non-final (potentially
giving up some of the optimizations the JVM can apply to final fields).
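
To make the idea concrete, here is a toy model of it. This is only a sketch: ToySource and ToyTermAttr are made-up stand-ins, not Lucene's AttributeSource or TermAttribute, and the Supplier-based addAttribute() is an assumption chosen to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for an attribute such as TermAttribute.
class ToyTermAttr {
    String term = "";
}

// Hypothetical stand-in for AttributeSource; not Lucene's real class.
class ToySource {
    // Non-final so that reset(ToySource) can repoint it at another source's map.
    private Map<Class<?>, Object> attributes = new LinkedHashMap<>();

    // Returns the existing instance for this type, creating it on first use.
    <T> T addAttribute(Class<T> type, Supplier<T> factory) {
        return type.cast(attributes.computeIfAbsent(type, k -> factory.get()));
    }

    // Mirrors the proposed reset(AttributeSource): adopt the other source's state.
    void reset(ToySource input) {
        if (input == null) {
            throw new IllegalArgumentException("input must not be null");
        }
        this.attributes = input.attributes;
    }
}
```

After b.reset(a), a lookup on b hands back the very same attribute instance that a holds, so nothing needs to be copied between them.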

However, this would then make even more Attribute reuse possible.
For example, if this method existed, the indexer could keep a ThreadLocal of raw AttributeSources
(one AttributeSource per thread).
Then, prior to calling TokenStream.reset(), it could call TokenStream.reset(AttributeSource)
with the current thread's AttributeSource.

This would result in all token streams for the same document using the same AttributeSource
(reusing TermAttribute, etc.).

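This would require that no TokenStreams/Filters/Tokenizers call addAttribute() in the
constructor (they would have to do this in reset() instead), since a reference obtained in the
constructor would still point at the old attribute after the AttributeSource is swapped.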
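
A rough sketch of how the per-thread reuse might look, again as a toy model rather than real Lucene code. All of the names here (SketchTerm, SketchSource, SketchStream, SketchIndexer, prepare()) are hypothetical and only illustrate the shape of the idea.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical attribute holder standing in for something like TermAttribute.
class SketchTerm {
    String text = "";
}

// Hypothetical, simplified source of attributes; not Lucene's AttributeSource.
class SketchSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    // Returns the existing instance for this type, creating it on first use.
    <T> T addAttribute(Class<T> type, Supplier<T> factory) {
        return type.cast(attributes.computeIfAbsent(type, k -> factory.get()));
    }
}

// Hypothetical stream that looks its attributes up in reset(), not in the constructor.
class SketchStream {
    private SketchTerm term;

    void reset(SketchSource shared) {
        // Re-fetch on every reset so the stream never holds a stale attribute.
        term = shared.addAttribute(SketchTerm.class, SketchTerm::new);
    }

    SketchTerm term() {
        return term;
    }
}

// Hypothetical indexer-side holder: one SketchSource per thread.
class SketchIndexer {
    private static final ThreadLocal<SketchSource> PER_THREAD =
            ThreadLocal.withInitial(SketchSource::new);

    // Hands the calling thread's source to the stream before it is consumed.
    static void prepare(SketchStream stream) {
        stream.reset(PER_THREAD.get());
    }
}
```

Two streams prepared on the same thread end up holding the same SketchTerm instance, which is the reuse described above.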

I totally get that this is a tall order.
If you want, I can open a separate ticket for this (AttributeSource.reset(AttributeSource))
for further consideration.



> All Tokenizer implementations should have constructors that take AttributeSource and AttributeFactory
> -----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1826
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1826
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>            Assignee: Michael Busch
>             Fix For: 2.9
>
>
> I have a TokenStream implementation that joins together multiple sub TokenStreams (I
> then do additional filtering on top of this, so I can't just have the indexer do the merging).
> In 2.4, this worked fine:
> once one sub stream was exhausted, I just started using the next stream.
> However, in 2.9, this is very difficult, and requires copying Term buffers for every
> token being aggregated.
> However, if all the sub TokenStreams share the same AttributeSource, and my "concat"
> TokenStream shares the same AttributeSource, this goes back to being very simple (and very
> efficient).
> So, for example, I would like to see the following constructor added to StandardTokenizer:
> {code}
>   public StandardTokenizer(AttributeSource source, Reader input, boolean replaceInvalidAcronym) {
>     super(source);
>     ...
>   }
> {code}
> I would likewise want similar constructors added to all Tokenizer subclasses provided
> by Lucene.
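
For illustration, the "concat" case from the description can be sketched with a toy model. None of these classes are Lucene's: ConcatTerm, ConcatSubStream and ConcatStream are hypothetical names, and the shared ConcatTerm stands in for attributes held by a common AttributeSource.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical shared attribute; stands in for TermAttribute.
class ConcatTerm {
    String text = "";
}

// Hypothetical sub-stream that writes each token into the shared attribute.
class ConcatSubStream {
    private final ConcatTerm term;
    private final Iterator<String> tokens;

    ConcatSubStream(ConcatTerm shared, String... tokens) {
        this.term = shared;
        this.tokens = Arrays.asList(tokens).iterator();
    }

    boolean incrementToken() {
        if (!tokens.hasNext()) {
            return false;
        }
        term.text = tokens.next();
        return true;
    }
}

// Hypothetical concatenating stream: when one sub-stream is exhausted, use the next.
class ConcatStream {
    private final Iterator<ConcatSubStream> subs;
    private ConcatSubStream current;

    ConcatStream(List<ConcatSubStream> subs) {
        this.subs = subs.iterator();
    }

    boolean incrementToken() {
        while (true) {
            // The sub-stream already wrote into the shared attribute, so no copy is needed.
            if (current != null && current.incrementToken()) {
                return true;
            }
            if (!subs.hasNext()) {
                return false;
            }
            current = subs.next();
        }
    }
}
```

Because every sub-stream writes into the same ConcatTerm, the concatenating stream only has to delegate incrementToken() and never copies a term buffer.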

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

