lucene-java-user mailing list archives

From "Ard Schrijvers" <>
Subject RE: Token offset values for custom Tokenizer
Date Mon, 16 Jul 2007 08:13:55 GMT

> Hi,
> I am storing custom values in the Tokens provided by a Tokenizer but 
> when retrieving them from the index the values don't match. 

What do you mean by retrieving? Do you mean reading the terms back from the index, or do you
mean running a search for words you know should be in the index and not getting a match?

In the latter case, make sure you use the same analyzer for searching as you used for
indexing.

> I've looked 
> in the LIA book but it's not current since it mentioned term vectors 
> aren't stored. I'm using Lucene Nightly 146 but the same thing has 
> happened with older versions. Looking at the internals, 
> DocumentWriter 
> seems to keep track of the end offset that was placed into 
> the index and 
> modifies the token values (with +1) but I'm not sure whether 
> I should be 
> concerned with it.
> No existing analyzers are used when adding the document so all the 
> offsets are generated manually.
> Any suggestions of how the token offsets should be stored?

Look at other classes that implement TokenStream. Also take a look at setPositionIncrement
when you are adding your own terms.

Regards Ard

> Is this valid?
> Token, start, end
> aaa, 0, 3
> bbb, 4, 7
> ccc, 8, 11
> Thanks,
> Shahan
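
For what it's worth, the offsets listed above are what you get if the source text is "aaa bbb ccc" and each end offset is one past the token's last character, which is how the Token javadoc describes endOffset. Here is a minimal sketch of that arithmetic in plain Java. It does not touch Lucene at all: SimpleToken and tokenize are hypothetical names made up for illustration, and the whitespace splitting is only an assumption about how the custom Tokenizer finds its tokens.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetSketch {

    /** Hypothetical stand-in for a token; this is NOT Lucene's Token class. */
    static final class SimpleToken {
        final String text;
        final int start; // index of the first character (inclusive)
        final int end;   // index one past the last character (exclusive)

        SimpleToken(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + ", " + start + ", " + end;
        }
    }

    /** Splits on whitespace and records each token's character offsets. */
    static List<SimpleToken> tokenize(String input) {
        List<SimpleToken> tokens = new ArrayList<>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            // Skip the whitespace that separates tokens.
            while (i < n && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i >= n) {
                break;
            }
            int start = i;
            // Advance to the first character after the token.
            while (i < n && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            tokens.add(new SimpleToken(input.substring(start, i), start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (SimpleToken t : tokenize("aaa bbb ccc")) {
            System.out.println(t);
        }
    }
}
```

With this convention input.substring(start, end) gives back the token text, which is a quick sanity check on offsets you generate by hand.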
