Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (asf.osuosl.org: local policy)
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="US-ASCII"
Content-Transfer-Encoding: quoted-printable
Subject: RE: Inconsistent tokenizing of words containing underscores.
Date: Tue, 30 Aug 2005 12:52:29 +0200
Message-ID: 
 <9391EFF529350A4A9AAC16AAAAD9B5C60A3A07@DEEXVS01.wincor-nixdorf.com>
Thread-Topic: Inconsistent tokenizing of words containing underscores.
Thread-Index: AcWszezTBwZ3sFrBRx+vMSlhgEKP5AAGG69AABllElA=
From: "Is, Studcio" <Studcio.is@wincor-nixdorf.com>
To: <java-user@lucene.apache.org>

Hello,

First of all, thanks to everyone for the replies and suggestions. I solved
my problem by adapting StandardTokenizer.jj and recompiling it with
JavaCC.

I replaced line 90:

<ALPHANUM: (<LETTER>|<DIGIT>)+ >

with

<ALPHANUM: (<LETTER>|<DIGIT>|"_")+ >

so that the underscore is treated like an alphanumeric character. In my
first tests it seems to work perfectly. Still, I can't understand how the
behaviour described earlier could be the expected behaviour, and I couldn't
find any documentation of it in the JavaCC source of the tokenizer either.
I suppose the source of the problem with underscores lies in the definition
of NUM (floating point, serial, model numbers, ip addresses, etc.). Either
way, I guess my problem is solved.
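
To illustrate the effect of the change outside of Lucene, here is a minimal
sketch in plain Java. It is not the StandardTokenizer and it ignores the NUM
rule and all other token types; it only approximates the two ALPHANUM
definitions with regular expressions. The class name UnderscoreTokenDemo and
the method tokenize are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnderscoreTokenDemo {
    // Rough stand-in for <ALPHANUM: (<LETTER>|<DIGIT>)+ > (an approximation, not the real grammar)
    static final Pattern ORIGINAL = Pattern.compile("[\\p{L}\\p{Nd}]+");

    // Rough stand-in for <ALPHANUM: (<LETTER>|<DIGIT>|"_")+ >
    static final Pattern WITH_UNDERSCORE = Pattern.compile("[\\p{L}\\p{Nd}_]+");

    // Collect every maximal run of characters accepted by the given pattern.
    static List<String> tokenize(Pattern alphanum, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = alphanum.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "XYZZZY_DE_SA0001";
        System.out.println(tokenize(ORIGINAL, text));        // underscore splits the word
        System.out.println(tokenize(WITH_UNDERSCORE, text)); // word stays in one piece
    }
}
```

With the underscore in the character class, the word is kept as a single
token regardless of where the digits are.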

Thanks again and regards

Sebastian


-----Original Message-----
From: Aigner, Thomas [mailto:TAigner@WescoDist.com]
Sent: Tuesday, August 30, 2005 12:12 AM
To: java-user@lucene.apache.org
Subject: RE: Inconsistent tokenizing of words containing underscores.

What seems to be working for me is a punctuation filter that removes
characters such as / - _ and builds the token without them.  Then "most"
of the time the word XYZZZY_DE_SA0001 will be tokenized as XYZZZYDESA0001.
For this to work, you have to apply the same punctuation filter to the
strings you search for.
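
A minimal sketch of that idea in plain Java, assuming the filter only needs
to drop '/', '-' and '_' (the exact character set is an assumption, as are
the class and method names). It is not a Lucene TokenFilter; it only shows
that the same normalization has to run on both the indexed text and the
query string.

```java
public class PunctuationStripDemo {
    // Remove '/', '_' and '-' so the remaining characters form one token.
    // The character set is an assumption made for this illustration.
    static String stripPunctuation(String s) {
        return s.replaceAll("[/_-]", "");
    }

    public static void main(String[] args) {
        String indexed = stripPunctuation("XYZZZY_DE_SA0001");
        String query = stripPunctuation("XYZZZY-DE-SA0001");
        System.out.println(indexed);               // XYZZZYDESA0001
        System.out.println(indexed.equals(query)); // true: both sides normalize to the same term
    }
}
```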

Tom

-----Original Message-----
From: Daniel Naber [mailto:lucenelist@danielnaber.de]
Sent: Monday, August 29, 2005 3:15 PM
To: java-user@lucene.apache.org
Subject: Re: Inconsistent tokenizing of words containing underscores.

On Monday 29 August 2005 19:21, Jeremy Meyer wrote:

> The expected behavior is to sometimes treat a character as indicating a
> new token and other times to ignore the same character?

It depends on whether there are digits in the token.  It's documented in
the javacc source for the tokenizer(?).

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

