lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <cutt...@lucene.com>
Subject Re: Incomprehensible (to me) tokenizing behavior
Date Mon, 30 Dec 2002 22:14:42 GMT
Terry Steichen wrote:
 > PS: Is this kind of thing (and more importantly, any other similar
 > design issues) documented any place?

This one is described in the source code, with the comment:

   // floating point, serial, model numbers, ip addresses, etc.
   // every other segment must have at least one digit

> PSS: What is the simplest way to alter this behavior to one that parses the
> same regardless of the presence or absence of numeric characters?

According to:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html

"Many applications have specific tokenizer needs. If this tokenizer does 
not suit your application, please consider copying this source code 
directory to your project and maintaining your own grammar-based tokenizer."

You need to copy StandardTokenizer.jj, change its package statement, add 
some import statements, add a JavaCC task to your build.xml, and, 
finally, modify the clause following the above comment.

Doug


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message