lucene-java-user mailing list archives

From "Michael D. Curtin" <m...@curtin.com>
Subject Re: np-pandock search problem (again, with more detail)
Date Thu, 07 Jun 2007 21:43:54 GMT
Doron Cohen wrote:

> From the StandardAnalyzer javacc grammar:
>   // floating point, serial, model numbers, ip addresses, etc.
>   // every other segment must have at least one digit
>   <NUM: (<ALPHANUM> <P> <HAS_DIGIT> .... etc.
>   <#P: ("_"|"-"|"/"|"."|",") >
> My understanding of this: a non-whitespace sequence is broken
> at either of these 5 chars
>    _  -  /  .  ,
> unless the part that follows has a digit, in which case
> it is assumed to be (part of) a serial no., model, etc.

Weird.  The definition seems to allow expressions of the form
A-B-C-D-E-..., where
-   "-" can be any one of the five characters you mentioned
-   the A, B, C, ... are alphanumeric pseudo-words
-   either A, C, E, ... or B, D, F, ... must all contain digits, i.e.
     every other component has to have a digit
So "A-1-B-2" and "1-A-2-B" would each be kept as a single token, but
"A-B-1-2" would not.  Seems more than a little hokey, but I suppose it's
been working for a long time, for the most part.
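
To make that reading concrete, here is a rough sketch of the rule as I
understand it.  It is plain Java with java.util.regex, not Lucene code.
The class and method names are made up for illustration, and ALPHANUM is
assumed to be ASCII letters and digits only (the real grammar covers more
characters).  It only answers "would this whole string match NUM?"; it
does not reproduce how the tokenizer splits up a string that doesn't.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: approximates the <NUM> rule
// as "segments joined by one of _ - / . , where every other segment
// contains at least one digit".
class AlternatingDigitRule {
    // The five <P> characters from the grammar.
    private static final Pattern SEPARATOR = Pattern.compile("[_\\-/.,]");
    // Assumption: ASCII-only stand-in for <ALPHANUM>.
    private static final Pattern ALPHANUM = Pattern.compile("[A-Za-z0-9]+");
    private static final Pattern DIGIT = Pattern.compile("[0-9]");

    static boolean isSingleToken(String text) {
        // Limit -1 keeps trailing empty segments, so "A-1-" is rejected.
        String[] parts = SEPARATOR.split(text, -1);
        if (parts.length < 2) {
            return false; // NUM needs at least one separator
        }
        for (String part : parts) {
            if (!ALPHANUM.matcher(part).matches()) {
                return false; // empty or non-alphanumeric segment
            }
        }
        // Either the 1st, 3rd, 5th, ... or the 2nd, 4th, 6th, ...
        // segments must all contain a digit.
        return everyOtherHasDigit(parts, 0) || everyOtherHasDigit(parts, 1);
    }

    private static boolean everyOtherHasDigit(String[] parts, int start) {
        for (int i = start; i < parts.length; i += 2) {
            if (!DIGIT.matcher(parts[i]).find()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A-1-B-2", "1-A-2-B", "A-B-1-2"}) {
            System.out.println(s + " -> " + isSingleToken(s));
        }
    }
}
```

With that, "A-1-B-2" and "1-A-2-B" pass and "A-B-1-2" fails, which
matches the examples above.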

--MDC
