incubator-lucy-dev mailing list archives

From "David Balmain" <>
Subject Re: TokenBatch
Date Wed, 28 Jun 2006 00:24:01 GMT
On 6/26/06, Marvin Humphrey <> wrote:
> Greets,
>
> Lucene's TokenStream concept does not translate very well from Java
> to our target languages, primarily because calling next() once for
> each analyzer-token pairing creates a lot of method-call overhead.
> Also, the analysis process creates a *lot* of Token objects; in
> Plucene these were hash-based objects, relatively expensive to create
> and destroy.
>
> The solution adopted by KinoSearch is the TokenBatch.  Instead of
> passing tokens one by one through an analysis chain, they are grouped
> together and passed en masse from analyzer to analyzer.  Analyzers do
> not supply a next() method; they supply an analyze() method, which
> both accepts and returns a TokenBatch object.  Analysis chains are
> set up using a PolyAnalyzer object, which is an ordered array of
> Analyzers, each of which will be called upon to analyze() a
> TokenBatch in turn.
>
> Implementing TokenBatch as a doubly-linked list of Token structs
> seems to work pretty well.
>
> typedef struct lucy_Token {
>      char              *text;
>      lucy_i32_t         len;
>      lucy_i32_t         start_offset;
>      lucy_i32_t         end_offset;
>      lucy_i32_t         pos_inc;
>      struct lucy_Token *next;
>      struct lucy_Token *prev;
> } lucy_Token;
>
> typedef struct lucy_TokenBatch {
>      lucy_Token  *first;
>      lucy_Token  *last;
>      lucy_Token  *current;
>      lucy_i32_t   size;
>      lucy_bool_t  initialized;
> } lucy_TokenBatch;
>
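> To sketch how PolyAnalyzer's chaining might look in C (the Analyzer
> type and function names below are hypothetical, nothing settled),
> the whole pipeline reduces to a loop over analyze() pointers:
>
> typedef struct lucy_Analyzer lucy_Analyzer;
> struct lucy_Analyzer {
>      /* Transform a batch: return the same batch modified in place,
>       * or a freshly built replacement. */
>      lucy_TokenBatch* (*analyze)(lucy_Analyzer *self,
>                                  lucy_TokenBatch *batch);
> };
>
> lucy_TokenBatch*
> lucy_PolyAnalyzer_analyze(lucy_Analyzer **analyzers,
>                           lucy_i32_t num_analyzers,
>                           lucy_TokenBatch *batch)
> {
>      lucy_i32_t i;
>      for (i = 0; i < num_analyzers; i++) {
>          batch = analyzers[i]->analyze(analyzers[i], batch);
>      }
>      return batch;
> }
>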
> (We might define token->text as either an 8 or 16 bit type depending
> on how the target platform handles strings, but that's a topic for
> another day.)
>
> The tricky part here is how to expose TokenBatch and its constituent
> tokens as a native API, which is necessary for subclassing Analyzer.
> Core Analyzer subclasses distributed with Lucy might peek into the
> guts of Token structs.  However, it would be a bad idea to make *any*
> C struct's makeup part of Lucy's public API.  The only supported way
> to get at a struct's members should be through methods.
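> For instance, the public face of a Token might be nothing more than
> a handful of accessor functions (a sketch only; these names and the
> memory handling are not settled):
>
> /* Assumes <stdlib.h> and <string.h>, and that token->text was
>  * malloc'd by whoever built the batch. */
> const char*
> lucy_Token_get_text(const lucy_Token *self)
> {
>      return self->text;
> }
>
> void
> lucy_Token_set_text(lucy_Token *self, const char *text, lucy_i32_t len)
> {
>      free(self->text);
>      self->text = (char*)malloc((size_t)len + 1);
>      memcpy(self->text, text, (size_t)len);
>      self->text[len] = '\0';
>      self->len = len;
> }
>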
> The obvious way to affect individual Tokens would be to spin them off
> as native objects when iterating over the TokenBatch object.
> Illustrated in Java:
>    while ((token = != null) {
>      token.setText( lowerCase(token.getText()) );
>    }
>
> However, that introduces a garbage collection issue.  Who's
> responsible for destroying the individual Token structs?  We can't
> have them destroyed twice, once when the TokenBatch gets destroyed,
> and once when the spun-off token gets destroyed.  I see two
> possibilities, neither of which is attractive.  We might implement
> our own reference-counting scheme, which would be messy and
> complicated.  Or we could start off by creating each token as a
> native object and defer to native GC, which is messy, complicated,
> and expensive to boot.
>
> The least worst solution I see is to avoid exposing the individual
> tokens via native API, and allow access to them one at a time via
> methods against the TokenBatch object.  next() would return a boolean
> indicating whether the TokenBatch was currently located at a valid
> Token position, instead of a native Token object.
>
>    while ( {
>      batch.setText( lowerCase(batch.getText()) );
>    }
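>
> Under the hood, next() would presumably just walk the current
> pointer down the linked list. A sketch, assuming the initialized
> flag marks whether the cursor has started:
>
> lucy_bool_t
> lucy_TokenBatch_next(lucy_TokenBatch *self)
> {
>      if (!self->initialized) {
>          /* first call: land on the first token */
>          self->current     = self->first;
>          self->initialized = 1;
>      }
>      else if (self->current != NULL) {
>          /* subsequent calls: advance one link */
>          self->current = self->current->next;
>      }
>      return self->current != NULL;
> }
>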
> This is perhaps a little less intuitive than iterating over a faux-
> array of Tokens, but it's similar to how Lucene's TermDocs, TermEnum
> and Scorer classes all work.
>
> It's also the fastest scheme I've come up with.  Without making
> TokenBatch a native array of Token objects...
>
>    for my $token (@tokens) {
>      $token->set_text( $token->get_text );
>    }
>
> ... there's no way to avoid the method-call overhead of next().  Any
> user-defined subclass of Analyzer implemented using the native API is
> therefore going to be a little slower than the core Analyzer classes,
> which are allowed to access struct members directly, but that's the
> way it goes.  If we supply a good set of flexible Analyzer classes,
> hopefully it won't be necessary for people to create custom Analyzers
> and their indexing apps will stay fast enough.

FYI, Ferret's Analyzers each have a single Token object with space for
MAX_WORD_LENGTH characters. TokenFilters just use the Token owned by
the Tokenizer that they filter. This way a new token doesn't need to
be allocated for each term. I'm not sure how much time is to be gained
by having a batch of tokens. I'll have to do some experiments.
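
In rough outline, the reuse strategy looks something like this (the
struct and the length constant below are illustrative, not lifted
from Ferret's source):

#define MAX_WORD_LENGTH 255   /* illustrative value */

typedef struct Token {
    char text[MAX_WORD_LENGTH + 1]; /* fixed buffer, reused per term */
    int  len;
    int  start_offset;
    int  end_offset;
    int  pos_inc;
} Token;

/* A Tokenizer owns one Token and overwrites it for each term; its
 * TokenFilters operate on that same Token instead of allocating a
 * new one. */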

