lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mark Miller" <markrmil...@gmail.com>
Subject Re: StandardAnalyzer question
Date Fri, 21 Jul 2006 20:09:28 GMT
| < #LETTER:                      // unicode letters
      [
       "\u0041"-"\u005a",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
       "\u0100"-"\u1fff"
      ]

becomes

| < #LETTER:                      // unicode letters
      [
       "\u0041"-"\u005a",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
       "\u0100"-"\u1fff",
       "\u002d"
      ]

On 7/21/06, Ngo, Anh (ISS Southfield) <ango@iss.net> wrote:
>
>
> Hello Mark,
>
>
> Please show me how to add "-" to #LETTER definition
>
>
> Thanks,
>
>
> Anh Ngo
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com]
> Sent: Friday, July 21, 2006 3:51 PM
> To: java-user@lucene.apache.org
> Subject: Re: StandardAnalyzer question
>
> I do not beleive so. If you look above you will see that #P is only used
> when looking for a num: a host ip, a phone number, etc. You will be
> removing
> that ability to recognize a "_" while rooting those tokens out. It will
> still be parsed when tokenizing an EMAIL as well. I dont think this is
> the
> behavior you want.
>
> - Mark
>
> On 7/21/06, Ngo, Anh (ISS Southfield) <ango@iss.net> wrote:
> >
> >
> > What is #LETTER definition in SnardarTokernize.jj?
> >
> >
> > I saw:
> >
> > | <#P: ("_"|"-"|"/"|"."|",") >
> > | <#HAS_DIGIT:                                    // at least one
> digit
> >     (<LETTER>|<DIGIT>)*
> >     <DIGIT>
> >     (<LETTER>|<DIGIT>)*
> >   >
> >
> >
> > Should I remove "_" and recompile the source code?
> >
> > Sincerely,
> >
> >
> > Anh Ngo
> >
> > -----Original Message-----
> > From: Daniel Naber [mailto:lucenelist2005@danielnaber.de]
> > Sent: Friday, July 21, 2006 2:49 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: StandardAnalyzer question
> >
> > On Freitag 21 Juli 2006 16:16, Ngo, Anh (ISS Southfield) wrote:
> >
> > > The lucene 2.0.0 StandardAnalyzer does treat the "_"(underscore) as
> a
> > > token. Is there a way I can make StandardAnalyzer don't tokenize for
> > > "_" or any given characters?
> >
> > You need to add "_" to the #LETTER definition in StandardTokenizer.jj,
> > then
> > rebuild StandardTokenizer.java using the appropriate and task.
> >
> > Regards
> > Daniel
> >
> > --
> > http://www.danielnaber.de
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message