incubator-lucy-dev mailing list archives

From Marvin Humphrey <>
Subject TokenBatch
Date Sun, 25 Jun 2006 21:18:13 GMT

Lucene's TokenStream concept does not translate very well from Java  
to our target languages, primarily because calling next() once for  
each analyzer-token pairing creates a lot of method-call overhead.   
Also, the analysis process creates a *lot* of Token objects; in  
Plucene these were hash-based objects, relatively expensive to create  
and destroy.

The solution adopted by KinoSearch is the TokenBatch.  Instead of  
passing tokens one by one through an analysis chain, they are grouped  
together and passed en masse from analyzer to analyzer.  Analyzers do  
not supply a next() method; they supply an analyze() method, which  
both accepts and returns a TokenBatch object.  Analysis chains are  
set up using a PolyAnalyzer object, which is an ordered array of  
Analyzers, each of which will be called upon to analyze() a  
TokenBatch in turn.
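The chaining described above can be sketched in C. This is a hypothetical illustration, not the real Lucy or KinoSearch API: the names `PolyAnalyzer_analyze`, `children`, and the toy `AddOne_analyze` analyzer are all my own inventions, and `TokenBatch` is reduced to a stub so the sketch stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of batch-at-a-time analysis, not the real API.
 * Each Analyzer transforms an entire TokenBatch and hands it on. */

typedef struct TokenBatch {
    int size;   /* stand-in for the real batch contents */
} TokenBatch;

typedef struct Analyzer Analyzer;
struct Analyzer {
    TokenBatch* (*analyze)(Analyzer *self, TokenBatch *batch);
};

typedef struct PolyAnalyzer {
    Analyzer **children;       /* ordered array of Analyzers */
    size_t     num_children;
} PolyAnalyzer;

/* Pass the batch through each child analyzer in turn. */
TokenBatch*
PolyAnalyzer_analyze(PolyAnalyzer *self, TokenBatch *batch) {
    for (size_t i = 0; i < self->num_children; i++) {
        batch = self->children[i]->analyze(self->children[i], batch);
    }
    return batch;
}

/* Toy child analyzer: pretends to add one token to the batch. */
static TokenBatch*
AddOne_analyze(Analyzer *self, TokenBatch *batch) {
    (void)self;
    batch->size += 1;
    return batch;
}

int run_demo(void) {
    Analyzer     add_one = { AddOne_analyze };
    Analyzer    *chain[] = { &add_one, &add_one, &add_one };
    PolyAnalyzer poly    = { chain, 3 };
    TokenBatch   batch   = { 0 };
    return PolyAnalyzer_analyze(&poly, &batch)->size;
}
```

The key property is that there is exactly one analyze() call per analyzer per batch, rather than one next() call per analyzer per token.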

Implementing TokenBatch as a doubly-linked list of Token structs  
seems to work pretty well.

typedef struct lucy_Token {
     char              *text;
     lucy_i32_t         len;
     lucy_i32_t         start_offset;
     lucy_i32_t         end_offset;
     lucy_i32_t         pos_inc;
     struct lucy_Token *next;
     struct lucy_Token *prev;
} lucy_Token;

typedef struct lucy_TokenBatch {
     lucy_Token  *first;
     lucy_Token  *last;
     lucy_Token  *current;
     lucy_i32_t   size;
     lucy_bool_t  initialized;
} lucy_TokenBatch;

(We might define token->text as either an 8 or 16 bit type depending  
on how the target platform handles strings, but that's a topic for  
another day.)
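As a sketch of how the doubly-linked list would be maintained, here is a hypothetical append helper. It reuses the struct layout from the post (reading lucy_i32_t as int32_t and dropping the lucy_ prefixes for brevity); `TokenBatch_append` itself is my assumption, not Lucy code.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Struct layout from the post, with lucy_i32_t read as int32_t.
 * The append helper below is a sketch, not actual Lucy code. */

typedef struct Token {
    char         *text;
    int32_t       len;
    int32_t       start_offset;
    int32_t       end_offset;
    int32_t       pos_inc;
    struct Token *next;
    struct Token *prev;
} Token;

typedef struct TokenBatch {
    Token   *first;
    Token   *last;
    Token   *current;
    int32_t  size;
    int      initialized;
} TokenBatch;

/* Splice a token onto the tail of the doubly-linked list. */
void
TokenBatch_append(TokenBatch *batch, Token *token) {
    token->next = NULL;
    token->prev = batch->last;
    if (batch->last) { batch->last->next = token; }
    else             { batch->first      = token; }
    batch->last = token;
    batch->size++;
}
```

Because the batch owns the whole list, an entire batch can be freed in one pass from first to last, which is part of what makes the GC question below tractable.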

The tricky part here is how to expose TokenBatch and its constituent  
tokens as a native API, which is necessary for subclassing Analyzer.

Core Analyzer subclasses distributed with Lucy might peek into the  
guts of Token structs.  However, it would be a bad idea to make *any*  
C struct's makeup part of Lucy's public API.  The only supported way  
to get at a struct's members should be through methods.

The obvious way to affect individual Tokens would be to spin them off  
as native objects when iterating over the TokenBatch object.   
Illustrated in Java:

   while ((token = != null) {
     token.setText( lowerCase(token.getText()) );
   }

However, that introduces a garbage collection issue.  Who's  
responsible for destroying the individual Token structs?  We can't  
have them destroyed twice, once when the TokenBatch gets destroyed,  
and once when the spun-off token gets destroyed.  I see two  
possibilities, neither of which is attractive.  We might implement  
our own reference-counting scheme, which would be messy and  
complicated.  Or we could start off by creating each token as a  
native object and defer to native GC, which is messy, complicated,  
and expensive to boot.

The least worst solution I see is to avoid exposing the individual  
tokens via native API, and allow access to them one at a time via  
methods against the TokenBatch object.  next() would return a boolean  
indicating whether the TokenBatch was currently located at a valid  
Token position, instead of a native Token object.

   while ( ) {
     batch.setText( lowerCase(batch.getText()) );
   }

This is perhaps a little less intuitive than iterating over a faux-array of Tokens, but it's similar to how Lucene's TermDocs, TermEnum and Scorer classes all work.
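The cursor approach can be sketched in C as well. This is a guess at the shape of the methods, not Lucy code: the names `TokenBatch_next`, `TokenBatch_get_text`, and `TokenBatch_lower_case` are hypothetical, and the structs are trimmed to the fields the cursor needs.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cursor-style API: next() advances an internal pointer
 * and reports whether it landed on a valid Token; the get/set-style
 * methods act on the token at the current position. */

typedef struct Token {
    char         *text;
    struct Token *next;
} Token;

typedef struct TokenBatch {
    Token *first;
    Token *current;
    bool   initialized;
} TokenBatch;

/* Advance the cursor; return false once the list is exhausted. */
bool
TokenBatch_next(TokenBatch *batch) {
    if (!batch->initialized) {
        batch->current     = batch->first;
        batch->initialized = true;
    }
    else if (batch->current) {
        batch->current = batch->current->next;
    }
    return batch->current != NULL;
}

char*
TokenBatch_get_text(TokenBatch *batch) {
    return batch->current->text;
}

/* Lower-case the current token's text in place. */
void
TokenBatch_lower_case(TokenBatch *batch) {
    for (char *p = batch->current->text; *p; p++) {
        *p = (char)tolower((unsigned char)*p);
    }
}
```

A lower-casing loop then reads `while (TokenBatch_next(&batch)) { TokenBatch_lower_case(&batch); }`, mirroring the and batch.setText() loop above; no Token is ever handed out as a native object, so the batch retains sole ownership of every Token struct.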

It's also the fastest scheme I've come up with.  Without making  
TokenBatch a native array of Token objects...

   for my $token (@tokens) {
     $token->set_text( lc( $token->get_text ) );
   }

... there's no way to avoid the method-call overhead of next().  Any user-defined Analyzer subclass implemented using the native API is therefore going to be a little slower than the core Analyzer classes, which are allowed to access struct members directly, but that's the way it goes.  If we supply a good set of flexible Analyzer classes, hopefully it won't be necessary for people to create many custom Analyzers, and their indexing apps will stay fast enough.

Marvin Humphrey
Rectangular Research
