lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Inconsistent tokenizing of words containing underscores.
Date Tue, 30 Aug 2005 14:00:42 GMT
Another solution would be to create a custom TokenFilter that splits
tokens at "_" characters, and then a custom Analyzer that applies that
filter after the StandardTokenizer.

     Erik
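To make the suggestion above concrete, here is a minimal sketch of only the splitting step such a filter would perform. It is plain Java and deliberately uses no Lucene classes, because the TokenFilter and Analyzer APIs differ between Lucene versions; the class and method names (UnderscoreSplitSketch, splitAtUnderscores) are hypothetical and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: the core logic a custom TokenFilter
// could apply to each token coming out of the tokenizer.
class UnderscoreSplitSketch {

    // Splits every incoming token at "_" and drops empty fragments
    // (for example those produced by a leading underscore).
    static List<String> splitAtUnderscores(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String part : token.split("_")) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("XYZZZY_DE_SA0001", "plain");
        // prints [XYZZZY, DE, SA0001, plain]
        System.out.println(splitAtUnderscores(tokens));
    }
}
```

In a real filter the same loop would emit one token per fragment; position and offset bookkeeping is left out of this sketch.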


On Aug 30, 2005, at 6:52 AM, Is, Studcio wrote:

> Hello,
>
> First of all, thanks to everyone for the replies and suggestions. I
> solved my problem by adapting StandardTokenizer.jj and recompiling it
> with javacc.
>
> I replaced line 90:
>
> <ALPHANUM: (<LETTER>|<DIGIT>)+ >
>
> with
>
> <ALPHANUM: (<LETTER>|<DIGIT>|"_")+ >
>
> so that the underscore is treated like an alphanumeric character. In my
> first tests it seems to work perfectly. Still, I can't understand how the
> described behaviour could be the expected behaviour, and I couldn't find
> the relevant documentation in the javacc source of the tokenizer either.
> I suppose the problem with underscores stems from the definition of NUM
> (floating point, serial, model numbers, IP addresses, etc.). In any case,
> I guess my problem is solved.
>
> Thanks again and regards
>
> Sebastian
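The effect of the ALPHANUM change described in the message above can be approximated with ordinary Java regular expressions. This is only an illustration of that one rule in isolation: the real grammar defines LETTER and DIGIT as explicit character ranges and contains further rules (such as NUM) that this sketch ignores. The class name and the two patterns are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the ALPHANUM rule before and after the edit.
class AlphanumRuleSketch {

    // Rough stand-in for (<LETTER>|<DIGIT>)+
    static final Pattern ORIGINAL = Pattern.compile("[\\p{L}\\p{Nd}]+");

    // Rough stand-in for (<LETTER>|<DIGIT>|"_")+
    static final Pattern MODIFIED = Pattern.compile("[\\p{L}\\p{Nd}_]+");

    // Collects every maximal match of the rule in the input text.
    static List<String> tokens(Pattern rule, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = rule.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(ORIGINAL, "foo_bar")); // prints [foo, bar]
        System.out.println(tokens(MODIFIED, "foo_bar")); // prints [foo_bar]
    }
}
```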
>
> -----Original Message-----
> From: Aigner, Thomas [mailto:TAigner@WescoDist.com]
> Sent: Tuesday, August 30, 2005 12:12 AM
> To: java-user@lucene.apache.org
> Subject: RE: Inconsistent tokenizing of words containing underscores.
>
> What seems to be working for me is a punctuation filter that removes
> / - _ etc. and emits the token without them. Then "most" of the time the
> word XYZZZY_DE_SA0001 will be tokenized as XYZZZYDESA0001. For this to
> work, you have to apply the same punctuation filter to the query strings
> before you search for them.
>
> Tom
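The punctuation-filter approach described above can be sketched as follows. The exact character set is an assumption (the message only lists "/ - _ etc"), and the class and method names are hypothetical. The point of the example is that the same normalization is applied to the indexed text and to the query string.

```java
// Hypothetical illustration: strip selected punctuation before indexing
// and before searching, so both sides produce the same token.
class PunctuationStripSketch {

    // Removes '/', '_' and '-' from the text. The set is assumed for
    // the example; extend the character class as needed.
    static String stripPunctuation(String text) {
        return text.replaceAll("[/_-]", "");
    }

    public static void main(String[] args) {
        String indexed = stripPunctuation("XYZZZY_DE_SA0001");
        String query = stripPunctuation("XYZZZY-DE/SA0001");
        System.out.println(indexed);               // prints XYZZZYDESA0001
        System.out.println(indexed.equals(query)); // prints true
    }
}
```

One consequence of this design is that values differing only in the stripped characters become indistinguishable in the index.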
>
> -----Original Message-----
> From: Daniel Naber [mailto:lucenelist@danielnaber.de]
> Sent: Monday, August 29, 2005 3:15 PM
> To: java-user@lucene.apache.org
> Subject: Re: Inconsistent tokenizing of words containing underscores.
>
> On Monday 29 August 2005 19:21, Jeremy Meyer wrote:
>
>> The expected behavior is to sometimes treat a character as indicating a
>> new token and other times to ignore the same character?
>
> It depends on whether there are digits in the token. It's documented in
> the javacc source for the tokenizer(?).
>
> Regards
>  Daniel
>
> -- 
> http://www.danielnaber.de
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



