lucene-java-user mailing list archives

From "Steven A Rowe" <sar...@syr.edu>
Subject RE: EmailAddressAnalyzer & TokenStreams
Date Wed, 20 Aug 2008 23:21:35 GMT
Hi Dino,

The Lucene KeywordTokenizer is about as simple as tokenizers get - it just outputs its entire
input as a single token:

<http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/KeywordTokenizer.java?revision=687357&view=markup>
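
As a rough illustration of that idea, here is a minimal, Lucene-free sketch. The class WholeInputTokenizerSketch and its next() method are hypothetical stand-ins invented for this example; they are not the Lucene Tokenizer API. They only mimic the general shape of "return one token, then signal end of stream":

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in, NOT the Lucene KeywordTokenizer: it only models
// "emit the entire input as a single token, then report end of stream".
class WholeInputTokenizerSketch {
    private final Reader input;
    private boolean done = false;

    WholeInputTokenizerSketch(Reader input) {
        this.input = input;
    }

    /** Returns the whole input as one token on the first call, null afterwards. */
    String next() {
        if (done) {
            return null; // end of stream
        }
        done = true;
        StringBuilder term = new StringBuilder();
        char[] buffer = new char[256];
        try {
            int n;
            // Drain the reader completely; everything read becomes the one token.
            while ((n = input.read(buffer)) != -1) {
                term.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return term.toString();
    }

    public static void main(String[] args) {
        WholeInputTokenizerSketch tokenizer =
            new WholeInputTokenizerSketch(new StringReader("J.Smith@london.gb.world.net"));
        for (String token = tokenizer.next(); token != null; token = tokenizer.next()) {
            System.out.println("[" + token + "]");
        }
    }
}
```

A real Tokenizer subclass would return Token objects rather than Strings, so treat this only as an outline of the control flow.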

Check out the source code of the other Tokenizer subclasses in Lucene for more hints.
Warning: a few of them are generated by scanner generator tools (JavaCC and JFlex), so the
code is a bit impenetrable in places.

To set the position for a Token, call its setPositionIncrement() method.  From the javadocs:

    Set the position increment.  This determines the position of
    this token relative to the previous Token in a TokenStream,
    used in phrase searching.

(Read the rest of the javadoc for that method.  Go on, you know you want to.)
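
To put numbers against the sample tokens, here is a small, Lucene-free sketch of how increments turn into positions. It assumes, as I understand the indexing behaviour, that the position counter starts at -1 and each token's increment is added to it, so an increment of 1 moves to the next position and an increment of 0 stacks the token on the previous one. The Tok class and the particular increments chosen below are hypothetical, picked only for illustration; they are not a recommendation for how email addresses should be indexed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PositionIncrementSketch {

    /** Hypothetical stand-in for a Lucene Token: a term plus its position increment. */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Resolves increments to absolute positions: start at -1, add each increment in turn. */
    static List<Integer> absolutePositions(List<Tok> tokens) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        for (Tok token : tokens) {
            position += token.positionIncrement;
            positions.add(position);
        }
        return positions;
    }

    /** One possible (assumed) assignment of increments for the sample address. */
    static List<Tok> sampleTokens() {
        return Arrays.asList(
            new Tok("J.Smith@london.gb.world.net", 1),
            new Tok("John", 1),
            new Tok("Smith", 1),                 // adjacent to "John", so a phrase can match
            new Tok("J.Smith", 1),
            new Tok("london.gb.world.net", 1),
            new Tok("gb.world.net", 0),          // increment 0: same position as the full domain
            new Tok("world.net", 0),
            new Tok("world", 0),
            new Tok("net", 1));
    }

    public static void main(String[] args) {
        List<Tok> tokens = sampleTokens();
        List<Integer> positions = absolutePositions(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(positions.get(i) + "\t" + tokens.get(i).term);
        }
    }
}
```

With these assumed increments the tokens land at positions 0, 1, 2, 3, 4, 4, 4, 4, 5. In a real Tokenizer you would pass the same increments to setPositionIncrement() on each Token before returning it.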

Good luck,
Steve

On 08/20/2008 at 12:58 PM, Dino Korah wrote:
> Hi guys,
> 
> If I am to tokenize an email address like
> "John Smith" <J.Smith@london.gb.world.net> into
> 
>     [J.Smith@london.gb.world.net]
>     [John] [Smith] [J.Smith] [london.gb.world.net] [gb.world.net]
>     [world.net] [world] [net]
> 
> Is it possible to have a different Position increment for each of these
> tokens? If it is, could you please help me with the same sample, with
> numbers against each token.
> 
> Also could you please point me to a skeleton code for a custom Tokenizer.
> 
> Many Thanks
> Dino


