lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: StandardTokenizer issue ?
Date Fri, 13 Mar 2009 12:03:36 GMT
That does sound like an issue.  Can you open a JIRA issue for it?

Thanks,
Grant

On Mar 12, 2009, at 5:55 AM, iMe wrote:

>
> I spotted unexpected behavior when using the StandardAnalyzer.
>
>
> This analyzer uses the StandardTokenizer, whose javadoc states:
>
>
> Splits words at hyphens, unless there's a number in the token, in which case
> the whole token is interpreted as a product number and is not split.
>
>
>
> But looking at my index with Luke, I saw that my product reference
> AB-CD-1234 is split into 3 tokens (AB, CD and 1234), while I expected the
> tokenizer to keep it as a whole.
>
>
> So it looks like the StandardTokenizer does not work as it should.
>
>
> Am I right ?
>
>
> I had a deeper look and found the JFlex source used to generate the
> StandardTokenizerImpl:
> https://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>
>
> And here is how "product numbers" are defined (P being the punctuation:
> "_", "-", "/", "." and ","):
>
>
> // floating point, serial, model numbers, ip addresses, etc.
> // every other segment must have at least one digit
> NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
>           | {HAS_DIGIT} {P} {ALPHANUM}
>           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
>           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
>
>
> I am not a JFlex expert, but it looks like the pattern
> {ALPHANUM} ({P} {ALPHANUM} {P} {HAS_DIGIT}) is missing?
>
> The same goes for all other patterns containing two digits or two alphas
> separated by punctuation.
>
>
> -- 
> View this message in context: http://www.nabble.com/StandardTokenizer-issue---tp22471475p22471475.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
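
For anyone who wants to poke at the NUM rule quoted above without
regenerating the tokenizer, here is a rough sketch that transcribes it into
a java.util.regex pattern. It rests on assumptions: the quoted message does
not show the ALPHANUM and HAS_DIGIT macros, so they are taken here to mean
"one or more letters or digits" and "a run of letters or digits containing
at least one digit". The class name NumRuleSketch and the helper isNum are
made up for illustration. The sketch only tests whether a whole string
matches NUM; it does not emulate JFlex's longest-match scanning, so it says
nothing about which tokens the real tokenizer emits.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: a regex transcription of the
// NUM rule quoted from StandardTokenizerImpl.jflex.
class NumRuleSketch {
    // ASSUMPTION: ALPHANUM is one or more letters or digits.
    static final String ALPHANUM = "[\\p{L}\\p{Nd}]+";
    // ASSUMPTION: HAS_DIGIT is a letter/digit run with at least one digit.
    static final String HAS_DIGIT = "[\\p{L}\\p{Nd}]*\\p{Nd}[\\p{L}\\p{Nd}]*";
    // P as described in the message: "_", "-", "/", "." and ",".
    static final String P = "[_\\-/.,]";

    // The six alternatives of NUM, in the order they appear in the grammar.
    static final Pattern NUM = Pattern.compile(
        "(?:" + ALPHANUM + P + HAS_DIGIT
        + "|" + HAS_DIGIT + P + ALPHANUM
        + "|" + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+"
        + "|" + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + ALPHANUM + P + HAS_DIGIT
              + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + HAS_DIGIT + P + ALPHANUM
              + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+"
        + ")");

    // True if the whole string matches the transcribed NUM rule.
    static boolean isNum(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"AB-1234", "AB-12-CD", "AB-CD-1234"};
        for (String s : samples) {
            System.out.println(s + " -> " + isNum(s));
        }
        // Prints:
        // AB-1234 -> true
        // AB-12-CD -> true
        // AB-CD-1234 -> false
    }
}
```

Under these assumptions, AB-CD-1234 as a whole is not a NUM, because the
grammar requires every other segment to contain a digit and neither AB nor
CD does.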

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search



