lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Soto" <lexs...@gmail.com>
Subject Re: searching for C++
Date Tue, 24 Jun 2008 16:40:02 GMT
Thanks everyone. I appreciate the help.

I think I will write my own tokenizer, because I do not have a
predefined list of words with symbols.
I will modify the grammar by defining a SYMBOL token as John suggested
and redefine ALPHANUM to include it.

Regards,
Alex Soto



On Tue, Jun 24, 2008 at 12:12 PM, N. Hira <nhira@cognocys.com> wrote:
> This isn't ideal, but if you have a defined list of such terms, you may find
> it easier to filter these terms out into a separate field for indexing.
>
> -h
> ----------------------------------------------------------------------
> Hira, N.R.
> Solutions Architect
> Cognocys, Inc.
> (773) 251-7453
>
> On 24-Jun-2008, at 11:03 AM, John Byrne wrote:
>
>> I don't think there is a simpler way. I think you will have to modify the
>> tokenizer. Once you go beyond basic human-readable text, you always end up
>> having to do that. I have modified the JavaCC version of StandardTokenizer
>>  for allowing symbols to pass through, but I've never used the JFlex version
>> - don't know anything about JFlex I'm afraid!
>>
>> A good strategy might be to make a new type of lexical token called
>> "SYMBOL" and try to catch as many symbols as you can think of; then maybe
>> create new token types which are ALPHANUM types that can have pre-fixed or
>> post-fixed symbols.
>>
>> That way, you'll be able to catch things like "c++" in a TokenFilter, and
>> you can choose to pass it through as a single token, or split it up into two
>> tokens, or whatever you want.
>>
>> Hope that helps.
>>
>> Regards,
>> JB
>>
>> Alex Soto wrote:
>>>
>>> Hello:
>>>
>>> I have a problem where I need to search for the term "C++".
>>> If I use StandardAnalyzer, the "+" characters are removed and the
>>> search is done on just the "c" character which is not what is
>>> intended.
>>> Yet, I need to use standard analyzer for the other benefits it provides.
>>>
>>> I think I need to write a specialized tokenizer (and accompanying
>>> analyzer) that let the "+" characters pass.
>>> I would use the JFlex provided one, modify it and add it to my project.
>>>
>>> My question is:
>>>
>>> Is there any simpler way to accomplish the same?
>>>
>>>
>>> Best regards,
>>> Alex Soto
>>> lexsoto@gmail.com
>>>
>>> -
>>> Amicus Plato, sed magis amica veritas.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 
Alex Soto
lexsoto@gmail.com

-
Amicus Plato, sed magis amica veritas.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message