lucene-java-user mailing list archives

From Bill Taylor <>
Subject Re: Installing a custom tokenizer
Date Tue, 29 Aug 2006 23:21:57 GMT
I copied Lucene's StandardTokenizer.jj into my directory, renamed 
it, and globally replaced the names with my class name.

The issue is that the generated code does not compile, for 
2 reasons:

1) in the constructor, this(new FastCharStream(reader)); fails because 
there is no such constructor in the parent class.  I commented it out.

2) I get an error on the next() method, which throws ParseException and 
IOException.  The message is Exception ParseException is not 
compatible with throws clause in  As far as I can 
see, the exceptions are OK.

Since all of this is generated code, my feelings are a bit hurt.  Did 
Lucene use an older version of JavaCC?  I am using javacc-4.0
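
For readers hitting the same message: one general Java rule that produces this kind of error is that an overriding method may only declare checked exceptions that fit the overridden method's throws clause. The sketch below shows that rule in isolation. It does not use Lucene or JavaCC; BaseTokenizer, MyParseException and MyTokenizer are hypothetical stand-ins, and whether this rule is what is going wrong here depends on what the generated ParseException extends, which the message above does not show.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ThrowsClauseSketch {

    /** Hypothetical parent class: next() is allowed to throw IOException only. */
    static abstract class BaseTokenizer {
        protected final Reader input;

        protected BaseTokenizer(Reader input) {
            this.input = input;
        }

        public abstract String next() throws IOException;
    }

    /**
     * Hypothetical parse exception. Because it extends IOException, it is
     * compatible with the parent's throws clause. If it extended Exception
     * directly, the override in MyTokenizer would be rejected by the compiler.
     */
    static class MyParseException extends IOException {
        MyParseException(String message) {
            super(message);
        }
    }

    /** Hypothetical tokenizer: one character per token, '!' is a parse error. */
    static class MyTokenizer extends BaseTokenizer {
        MyTokenizer(Reader input) {
            super(input);
        }

        @Override
        public String next() throws MyParseException, IOException {
            int c = input.read();
            if (c == -1) {
                return null; // end of input
            }
            if (c == '!') {
                throw new MyParseException("unexpected '!'");
            }
            return String.valueOf((char) c);
        }
    }

    /** Concatenates all tokens, or reports the first error as a string. */
    public static String describe(String text) {
        BaseTokenizer tokenizer = new MyTokenizer(new StringReader(text));
        StringBuilder out = new StringBuilder();
        try {
            for (String tok = tokenizer.next(); tok != null; tok = tokenizer.next()) {
                out.append(tok);
            }
        } catch (IOException e) {
            return "error: " + e.getMessage();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("ab"));
        System.out.println(describe("a!"));
    }
}
```

If the exception class in a copied grammar is regenerated with a different superclass than the parent's next() allows, changing what it extends (or wrapping it inside next()) is one way to satisfy the compiler; that is a suggestion under the assumption above, not a confirmed diagnosis.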

On Aug 29, 2006, at 4:57 PM, Erick Erickson wrote:

> Tucked away in the contrib section of  Lucene (I'm using 2.0) there 
> is....
> org.apache.lucene.index.memory.PatternAnalyzer
> which takes a regular expression and tokenizes with it. Would that 
> help?
> Word of warning... the regex determines what is NOT a token, not what 
> IS a
> token (as I remember), which threw me for a bit.
> Don't know if this is really useful, but it might work for you without 
> as
> much work...
> Best
> Erick@I'mNowBeyondMyCompetence.WhyDoTheyStillEmployMeHere?
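
The "regex determines what is NOT a token" idea can be illustrated with plain java.util.regex, where the pattern describes the separators and the tokens are whatever is left between them. This is only a sketch of the concept and does not call PatternAnalyzer; the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SeparatorRegexSketch {

    /** Splits text on the separator pattern and drops empty pieces. */
    public static List<String> tokenize(String text, Pattern separator) {
        return Arrays.stream(separator.split(text))
                .filter(piece -> !piece.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Whitespace as the separator keeps a part number like aa-bb-2 whole.
        System.out.println(tokenize("part aa-bb-2 in stock", Pattern.compile("\\s+")));
        // Any non-word character as the separator splits it at each '-'.
        System.out.println(tokenize("aa-bb-2", Pattern.compile("\\W+")));
    }
}
```

With a whitespace separator the part number survives as a single token, while a non-word separator breaks it into aa, bb and 2, so the choice of separator pattern decides whether hyphenated part numbers stay intact.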
> On 8/29/06, Bill Taylor <> wrote:
>> On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote:
>> >
>> > : Have a look at PerFieldAnalyzerWrapper:
>> >
>> > :
>> >
>> > PerFieldAnalyzerWrapper.html
>> >
>> > ...which can be specified in the constructors for IndexWriter and
>> > QueryParser.
>> As I understand it, this allows me to specify a different analyzer for
>> each field name.  My problem is that the standard analyzer will not
>> work for my content field and I need to define a new one.  I need to
>> make a modification to the StandardTokenizer so that a number does not
>> need to have a digit in every other segment of a part number.
>> For example, the StandardTokenizer breaks aa-bb-2 on the - between aa
>> and bb because it demands that every other string between a - have a
>> digit.
>> I need to modify the .jj file for the Standard Tokenizer and get a new
>> one, but I am confused by the javaCC documentation and do not know how
>> to run it to get what I need.
>> Thanks for the help.
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
