lucene-java-user mailing list archives

From John Byrne
Subject Re: searching for C++
Date Tue, 24 Jun 2008 16:03:01 GMT
I don't think there is a simpler way. I think you will have to modify
the tokenizer. Once you go beyond basic human-readable text, you always
end up having to do that. I have modified the JavaCC version of
StandardTokenizer to allow symbols to pass through, but I've never
used the JFlex version, and I'm afraid I don't know anything about JFlex.

A good strategy might be to define a new type of lexical token called
"SYMBOL" and try to catch as many symbols as you can think of; then
maybe define further token types that are ALPHANUM tokens which can
carry prefixed or suffixed symbols.

That way, you'll be able to catch things like "c++" in a TokenFilter, 
and you can choose to pass it through as a single token, or split it up 
into two tokens, or whatever you want.
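
To make the idea concrete, here is a minimal sketch in plain Java. It
deliberately does not use the Lucene API or the JavaCC/JFlex grammar; the
class SymbolAwareTokenizer, the token type ALPHANUM_WITH_SUFFIX and the
symbol set "+#" are all hypothetical names chosen for illustration. It
only handles suffixed symbols, and it would also glue the "+" in "a+b"
onto "a", so treat it as a starting point rather than a finished design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: not a Lucene Tokenizer or TokenFilter.
class SymbolAwareTokenizer {

    // Hypothetical token types following the strategy described above.
    enum Type { ALPHANUM, SYMBOL, ALPHANUM_WITH_SUFFIX }

    static final class Token {
        final String text;
        final Type type;

        Token(String text, Type type) {
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return text + "/" + type;
        }
    }

    // Assumption: the symbols worth keeping; all other non-alphanumeric
    // characters are treated as separators and dropped.
    private static final String SYMBOLS = "+#";

    static boolean isSymbol(char c) {
        return SYMBOLS.indexOf(c) >= 0;
    }

    // Tokenizer stage: lower-cases the input and emits typed tokens.
    static List<Token> tokenize(String input) {
        List<Token> out = new ArrayList<>();
        String s = input.toLowerCase(Locale.ROOT);
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < s.length() && Character.isLetterOrDigit(s.charAt(i))) i++;
                int wordEnd = i;
                // Absorb any symbols that directly follow the word.
                while (i < s.length() && isSymbol(s.charAt(i))) i++;
                Type type = (i > wordEnd) ? Type.ALPHANUM_WITH_SUFFIX : Type.ALPHANUM;
                out.add(new Token(s.substring(start, i), type));
            } else if (isSymbol(c)) {
                int start = i;
                while (i < s.length() && isSymbol(s.charAt(i))) i++;
                out.add(new Token(s.substring(start, i), Type.SYMBOL));
            } else {
                i++; // whitespace or punctuation we do not keep
            }
        }
        return out;
    }

    // Filter stage: pass suffixed tokens through whole, or split them
    // into an ALPHANUM token followed by a SYMBOL token.
    static List<Token> filter(List<Token> in, boolean split) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (split && t.type == Type.ALPHANUM_WITH_SUFFIX) {
                int cut = 0;
                while (cut < t.text.length()
                        && Character.isLetterOrDigit(t.text.charAt(cut))) cut++;
                out.add(new Token(t.text.substring(0, cut), Type.ALPHANUM));
                out.add(new Token(t.text.substring(cut), Type.SYMBOL));
            } else {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("I use C++ and C# daily");
        System.out.println(filter(tokens, false)); // "c++" kept as one token
        System.out.println(filter(tokens, true));  // "c++" split into "c" and "++"
    }
}
```

In a real setup the first stage would live in the modified tokenizer
grammar and the second in a TokenFilter; whichever choice you make
(single token or split), it has to be the same at index time and at
query time, otherwise the terms will not match.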

Hope that helps.


Alex Soto wrote:
> Hello:
> I have a problem where I need to search for the term "C++".
> If I use StandardAnalyzer, the "+" characters are removed and the
> search is done on just the "c" character which is not what is
> intended.
> Yet, I need to use standard analyzer for the other benefits it provides.
> I think I need to write a specialized tokenizer (and accompanying
> analyzer) that lets the "+" characters pass.
> I would use the JFlex provided one, modify it and add it to my project.
> My question is:
> Is there any simpler way to accomplish the same?
> Best regards,
> Alex Soto
> -
> Amicus Plato, sed magis amica veritas.
