lucene-dev mailing list archives

From Michael Busch <busch...@gmail.com>
Subject Re: attribute thoughts
Date Thu, 13 Aug 2009 20:32:11 GMT
On 8/13/09 7:29 AM, Yonik Seeley wrote:
> I'm liking the new attribute based analysis (in conjunction with
> reusability), but I'm running into some questions...
>
> Is it valid for tokenizers or token filters to add new attributes
> after their constructor (after they have processed some tokens)?
>
>    

At the moment the javadocs of TokenStream say that all Attributes
should be added up front. We could change these semantics; I had some
thoughts about it in the original JIRA issue (LUCENE-1422).

> Should restoreState() be able to add attributes (it currently throws
> an exception)?  If not, does that mean that it's not supported/advised
> to use state across different TokenStreams?
>
>    

See answer below.

> We've previously seen that the native java clone() can be much slower
> than implementing it ourselves in Java.  Should we have our own
> clone() method on Attribute?  Or just implement clone() ourselves and
> require that subclasses override if needed?  This is inner-loop
> per-token stuff, and a single captureState() will invoke many clone
> operations (6 attributes make up the legacy Token object).
>
>    

Improving the cloning performance was actually the main reason for
LUCENE-1693. It separates the Attribute interfaces from the actual
implementations, and as you probably know, Token now implements all
token attributes. So in a TokenStream chain that does cloning (e.g.
with a TeeSinkTokenFilter or CachingTokenFilter) one could use a
different AttributeFactory to get much better performance.

Internally, the AttributeSource builds a simple linked list (State),
which captureState() then clones by calling the clone() method of each
AttributeImpl. The linked-list approach performed best for me. We could
change the clone() implementations of the AttributeImpls, or even add
our own clone method, if that improves performance.
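
As a rough illustration of the linked-list capture described above, here
is a minimal self-contained sketch. It does not use the Lucene classes:
`Impl`, `TermImpl`, `PosIncImpl`, `State` and `Source` are hypothetical
stand-ins for AttributeImpl, State and AttributeSource, and the
hand-written `copy()` stands in for clone().

```java
// Hypothetical stand-ins for AttributeImpl, State and AttributeSource;
// none of these are the real Lucene classes.
final class StateListSketch {

    abstract static class Impl {
        abstract Impl copy();               // hand-written clone
        abstract void copyTo(Impl target);  // assumes target has the same class
    }

    static final class TermImpl extends Impl {
        String term = "";
        @Override Impl copy() { TermImpl c = new TermImpl(); c.term = term; return c; }
        @Override void copyTo(Impl target) { ((TermImpl) target).term = term; }
    }

    static final class PosIncImpl extends Impl {
        int positionIncrement = 1;
        @Override Impl copy() {
            PosIncImpl c = new PosIncImpl();
            c.positionIncrement = positionIncrement;
            return c;
        }
        @Override void copyTo(Impl target) {
            ((PosIncImpl) target).positionIncrement = positionIncrement;
        }
    }

    // One list node per attribute implementation.
    static final class State {
        Impl impl;
        State next;
    }

    static final class Source {
        private State head;
        private State tail;

        // Appends an implementation to the internal list and returns it.
        <T extends Impl> T add(T impl) {
            State node = new State();
            node.impl = impl;
            if (head == null) { head = node; } else { tail.next = node; }
            tail = node;
            return impl;
        }

        // Walks the list once and copies every implementation.
        State captureState() {
            State copyHead = null;
            State copyTail = null;
            for (State s = head; s != null; s = s.next) {
                State node = new State();
                node.impl = s.impl.copy();
                if (copyHead == null) { copyHead = node; } else { copyTail.next = node; }
                copyTail = node;
            }
            return copyHead;
        }

        // Assumes the captured list has the same length, order and impl
        // classes as this source; nothing is added here.
        void restoreState(State captured) {
            for (State from = captured, to = head; from != null;
                 from = from.next, to = to.next) {
                from.impl.copyTo(to.impl);
            }
        }
    }
}
```

In this sketch one captureState() call costs one `copy()` per list node,
so fewer and smaller implementations mean less work per token.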

The nice thing about LUCENE-1693 is that if cloning performance is
really crucial for your use case, you can simply implement a class that
implements only the token attributes you need. Often term,
positionIncrement and offset are enough. Then the object to be cloned
is smaller.
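
A minimal sketch of such a slimmed-down class follows. The interfaces
`TermAttr`, `PositionIncrementAttr` and `OffsetAttr` are hypothetical
simplifications written for this example, not the Lucene attribute
interfaces, and `SlimToken` is an assumed name.

```java
// Hypothetical, simplified attribute interfaces; not the Lucene ones.
final class SlimTokenSketch {

    interface TermAttr {
        String term();
        void setTerm(String term);
    }

    interface PositionIncrementAttr {
        int positionIncrement();
        void setPositionIncrement(int increment);
    }

    interface OffsetAttr {
        int startOffset();
        int endOffset();
        void setOffset(int start, int end);
    }

    // One object backs all three interfaces, so copying the per-token
    // state is a single small allocation plus four field assignments.
    static final class SlimToken
            implements TermAttr, PositionIncrementAttr, OffsetAttr {
        private String term = "";
        private int positionIncrement = 1;
        private int startOffset;
        private int endOffset;

        @Override public String term() { return term; }
        @Override public void setTerm(String term) { this.term = term; }

        @Override public int positionIncrement() { return positionIncrement; }
        @Override public void setPositionIncrement(int increment) {
            this.positionIncrement = increment;
        }

        @Override public int startOffset() { return startOffset; }
        @Override public int endOffset() { return endOffset; }
        @Override public void setOffset(int start, int end) {
            this.startOffset = start;
            this.endOffset = end;
        }

        // Hand-written copy instead of Object.clone().
        SlimToken copy() {
            SlimToken c = new SlimToken();
            c.term = term;
            c.positionIncrement = positionIncrement;
            c.startOffset = startOffset;
            c.endOffset = endOffset;
            return c;
        }
    }
}
```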

Ideally it'd be cool if we could automatically synthesize a class at
runtime that implements all Attribute interfaces in use, but I think
with Java you can only do that if you add a special jar from the JDK to
the classpath.
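
One standard-library mechanism that can implement a set of interfaces at
runtime is `java.lang.reflect.Proxy`. The sketch below is only an
illustration of that mechanism, not something the thread proposes: the
interfaces `TermAttr` and `PosIncAttr` and the `synthesize` helper are
hypothetical, values live in a map keyed by property name, and every call
goes through reflective dispatch, which may well be too slow for
per-token use.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the interfaces below are not the Lucene ones.
final class ProxyAttributeSketch {

    interface TermAttr {
        String getTerm();
        void setTerm(String term);
    }

    interface PosIncAttr {
        Integer getPositionIncrement();
        void setPositionIncrement(Integer increment);
    }

    // Returns one object that implements every interface passed in.
    // Only getX()/setX(value) pairs are handled; any other method,
    // including toString(), ends in UnsupportedOperationException.
    static Object synthesize(Class<?>... attributeInterfaces) {
        final Map<String, Object> values = new HashMap<>();
        InvocationHandler handler = (proxy, method, args) -> {
            String name = method.getName();
            if (name.startsWith("set") && args != null && args.length == 1) {
                values.put(name.substring(3), args[0]);
                return null;
            }
            if (name.startsWith("get") && args == null) {
                return values.get(name.substring(3));
            }
            throw new UnsupportedOperationException(name);
        };
        return Proxy.newProxyInstance(
                ProxyAttributeSketch.class.getClassLoader(),
                attributeInterfaces,
                handler);
    }
}
```

The returned object can be cast to any of the interfaces it was created
with, for example `(TermAttr) synthesize(TermAttr.class, PosIncAttr.class)`.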

So back to your question of whether we should allow restoreState() to
add attributes and allow a state to be used across different
AttributeSources: the complication is that we can only allow that if
the different AttributeSources were filled using the same
AttributeFactory; otherwise the sources could contain different
AttributeImpls and the copying wouldn't work anymore.
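
To make the complication concrete, here is a small sketch under
simplifying assumptions. `TermAttr`, `Impl`, `PlainTermImpl`,
`TokenLikeImpl` and `compatible` are hypothetical names invented for this
example; the two implementation classes play the role of what two
different factories might hand out for the same interface.

```java
// Hypothetical sketch; not the Lucene classes.
final class FactoryMismatchSketch {

    interface TermAttr {
        String term();
        void setTerm(String term);
    }

    interface Impl {
        void copyTo(Impl target);
    }

    // What one factory might create for TermAttr.
    static final class PlainTermImpl implements TermAttr, Impl {
        private String term = "";
        @Override public String term() { return term; }
        @Override public void setTerm(String term) { this.term = term; }
        // Field-level copy: only works if target is the same class.
        @Override public void copyTo(Impl target) {
            ((PlainTermImpl) target).term = term;
        }
    }

    // What another factory might create for the same interface.
    static final class TokenLikeImpl implements TermAttr, Impl {
        private String term = "";
        private int positionIncrement = 1;
        @Override public String term() { return term; }
        @Override public void setTerm(String term) { this.term = term; }
        @Override public void copyTo(Impl target) {
            TokenLikeImpl t = (TokenLikeImpl) target;
            t.term = term;
            t.positionIncrement = positionIncrement;
        }
    }

    // A possible guard before restoring captured state into a source.
    static boolean compatible(Impl captured, Impl target) {
        return captured.getClass() == target.getClass();
    }
}
```

Both classes expose the same interface, yet copying a `PlainTermImpl`
into a `TokenLikeImpl` fails with a ClassCastException in this sketch,
because the copy works per implementation class and not per interface.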

I haven't found a good (efficient) way of doing the cloning/copying per
Attribute interface yet, which is why I did it this way. I'll try to
think about it a bit more... maybe you have an idea?!

  Michael
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>    