lucene-java-user mailing list archives

From "Steven A Rowe"
Subject RE: EmailAddressAnalyzer & TokenStreams
Date Wed, 20 Aug 2008 23:21:35 GMT
Hi Dino,

The Lucene KeywordTokenizer is about as simple as tokenizers get: it just outputs its entire
input as a single token.
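
To make the idea concrete, here is a rough sketch of the "whole input as one token" behaviour
using only the JDK. WholeInputTokenizerSketch and its next() method are hypothetical names made
up for this illustration; this is not Lucene's KeywordTokenizer source. It only assumes the
general shape of a tokenizer that reads from a Reader and signals end of stream by returning null.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for illustration only; NOT Lucene's KeywordTokenizer.
class WholeInputTokenizerSketch {
    private final Reader input;
    private boolean done = false;

    WholeInputTokenizerSketch(Reader input) {
        this.input = input;
    }

    // First call: drains the reader and returns everything as one token.
    // Every later call: returns null to signal that the stream is exhausted.
    String next() {
        if (done) {
            return null;
        }
        done = true;
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[256];
        try {
            int n;
            while ((n = input.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        WholeInputTokenizerSketch t =
            new WholeInputTokenizerSketch(new StringReader("some input text"));
        // Pull tokens until the sketch reports end of stream.
        for (String tok = t.next(); tok != null; tok = t.next()) {
            System.out.println("[" + tok + "]");
        }
    }
}
```

A custom tokenizer would replace the body of next() with logic that carves the input into
several tokens instead of one.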


Check out the other Tokenizer descendants in the Lucene source for more hints. Warning: a few
of them are generated by scanner generator tools (JavaCC and JFlex), so the code is a bit
impenetrable in places.

To set the position for a Token, call its setPositionIncrement() method.  From the javadocs:

    Set the position increment.  This determines the position of
    this token relative to the previous Token in a TokenStream,
    used in phrase searching.

(Read the rest of the javadoc for that method.  Go on, you know you want to.)
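
As a rough illustration of what "relative to the previous Token" means, the sketch below turns
a list of position increments into absolute positions. Tok and absolutePositions() are
hypothetical helpers, not Lucene classes, and the increments are arbitrary values picked for
the example. It assumes the running position starts at -1, so that a first token with
increment 1 lands on position 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PositionIncrementSketch {
    // Hypothetical minimal token (text plus increment); NOT Lucene's Token class.
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    // Each token's position is the previous token's position plus its own increment.
    // Assumption: the running position starts at -1.
    static List<Integer> absolutePositions(List<Tok> tokens) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        for (Tok t : tokens) {
            position += t.positionIncrement;
            positions.add(position);
        }
        return positions;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
            new Tok("John", 1),     // ordinary token: one step forward
            new Tok("Smith", 1),    // ordinary token: one step forward
            new Tok("J.Smith", 0),  // increment 0: same position as "Smith"
            new Tok("world", 2));   // increment 2: leaves a one-position gap
        List<Integer> positions = absolutePositions(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(tokens.get(i).text + " -> " + positions.get(i));
        }
    }
}
```

With these made-up increments, John and Smith land on positions 0 and 1, J.Smith shares
position 1 with Smith, and world lands on 3. Which increments suit your email tokens depends
on how you want phrase queries to match them.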

Good luck,

On 08/20/2008 at 12:58 PM, Dino Korah wrote:
> Hi guys,
> If I am to tokenize an email address like "John Smith" <
> <>>  into
>     [ <>]
>     [John] [Smith] [J.Smith] [] []
>     [] [world] [net]
> Is it possible to have a different Position increment for each of these
> tokens? If it is, could you please help me with the same sample, with
> numbers against each token.
> Also could you please point me to a skeleton code for a custom Tokenizer.
> Many Thanks
> Dino
