lucene-java-user mailing list archives

From Doron Cohen <DOR...@il.ibm.com>
Subject Re: np-pandock search problem (again, with more detail)
Date Thu, 07 Jun 2007 21:05:31 GMT
"Michael D. Curtin" <mike@curtin.com> wrote on 07/06/2007 13:30:28:

> > I think it splits by hyphens unless the no-hyphen
> > part has digits, so:
> >   np-pandock-a7
> > becomes
> >   np
> >   pandock-a7
> > This is for the indexing part.
>
> Wow!  Do you know the thinking behind that, i.e. why a number in a
> hyphenated expression prevents the split?

I actually asked myself the same question before the previous
post. The javadocs for StandardAnalyzer just state the obvious ("a
grammar-based tokenizer constructed with JavaCC...."), and the wiki
page AnalysisParalysis doesn't explain much of the logic behind it
either.

From the StandardAnalyzer javacc grammar:
  // floating point, serial, model numbers, ip addresses, etc.
  // every other segment must have at least one digit
  <NUM: (<ALPHANUM> <P> <HAS_DIGIT> .... etc.
  <#P: ("_"|"-"|"/"|"."|",") >
My understanding of this: a non-whitespace sequence is broken
at any of these five characters
   _  -  /  .  ,
unless the part that follows has a digit, in which case
it is assumed to be (part of) a serial number, model number, etc.
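
To make that reading concrete, here is a rough sketch in plain Java of
the rule as I described it above. It is NOT the real tokenizer and
does not use any Lucene classes; the class name HyphenSplitSketch and
the method split() are made up for illustration, and the actual
grammar has more cases than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: mimics the rule described in this
// message, not the actual StandardAnalyzer grammar.
class HyphenSplitSketch {
    // The five punctuation characters from the <#P> production quoted above.
    private static final String PUNCT = "_-/.,";

    private static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) return true;
        }
        return false;
    }

    /** Splits one whitespace-free term according to the simplified rule. */
    static List<String> split(String term) {
        // First cut the term into segments and remember each separator.
        List<String> segments = new ArrayList<>();
        List<Character> seps = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (PUNCT.indexOf(c) >= 0) {
                segments.add(cur.toString());
                seps.add(c);
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        segments.add(cur.toString());

        // Then glue a segment back onto the previous one when the
        // segment that follows the separator contains a digit.
        List<String> tokens = new ArrayList<>();
        StringBuilder token = new StringBuilder(segments.get(0));
        for (int i = 1; i < segments.size(); i++) {
            String next = segments.get(i);
            if (hasDigit(next) && token.length() > 0) {
                token.append(seps.get(i - 1)).append(next);
            } else {
                if (token.length() > 0) tokens.add(token.toString());
                token = new StringBuilder(next);
            }
        }
        if (token.length() > 0) tokens.add(token.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("np-pandock-a7")); // [np, pandock-a7]
        System.out.println(split("np-pandock"));    // [np, pandock]
    }
}
```

Under this simplified rule, np-pandock-a7 comes out as np and
pandock-a7, which matches the indexing behaviour quoted above.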

Seems we can improve the documentation here.

