lucene-java-user mailing list archives

From Bill Taylor <watay...@as-st.com>
Subject Re: Installing a custom tokenizer
Date Tue, 29 Aug 2006 23:21:57 GMT
I have copied Lucene's StandardTokenizer.jj into my directory, renamed 
it, and globally replaced the class name with my own, 
LogTokenizer.

The issue is that the generated LogTokenizer.java does not compile for 
2 reasons:

1) in the constructor, this(new FastCharStream(reader)); fails because 
there is no such constructor in the parent class.  I commented it out.

2) I get an error on the next() method, which throws ParseException and 
IOException.  The message is "Exception ParseException is not 
compatible with throws clause in TokenStream.next()".  As far as I can 
see, the exceptions are OK.
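
For what it is worth, the second error is the ordinary Java rule that an overriding method may not declare a checked exception broader than the ones its parent declares. Here is a minimal sketch of one way around it, assuming the parent's next() declares only IOException (as the error message suggests). TokenStreamSketch and LogTokenizerSketch are hypothetical stand-ins, not Lucene classes, and the real generated ParseException has more members than this one. I believe Lucene's own copy of ParseException is hand-edited along the same lines, but I have not checked that against the source.

```java
import java.io.IOException;

// Hypothetical stand-in for TokenStream: the parent declares only IOException.
abstract class TokenStreamSketch {
    abstract String next() throws IOException;
}

// JavaCC generates "class ParseException extends Exception". Changing the
// superclass to IOException makes it acceptable under the parent's clause.
class ParseException extends IOException {
    ParseException(String message) {
        super(message);
    }
}

class LogTokenizerSketch extends TokenStreamSketch {
    private final String[] parts;
    private int pos = 0;

    LogTokenizerSketch(String text) {
        this.parts = text.split(" ");
    }

    // Same throws clause as the generated method. It compiles only because
    // ParseException is now a kind of IOException.
    @Override
    String next() throws ParseException, IOException {
        if (pos >= parts.length) {
            return null; // end of stream
        }
        String part = parts[pos++];
        if (part.isEmpty()) {
            throw new ParseException("empty token at index " + (pos - 1));
        }
        return part;
    }
}
```

The alternative would be to catch ParseException inside next() and rethrow it as an IOException, but that means editing generated code every time the grammar is rebuilt.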

Since all of this is generated code, my feelings are a bit hurt.  Did 
Lucene use an older version of JavaCC?  I am using javacc-4.0

On Aug 29, 2006, at 4:57 PM, Erick Erickson wrote:

> Tucked away in the contrib section of  Lucene (I'm using 2.0) there 
> is....
>
> org.apache.lucene.index.memory.PatternAnalyzer
>
> which takes a regular expression and tokenizes with it. Would that 
> help?
> Word of warning... the regex determines what is NOT a token, not what 
> IS a
> token (as I remember), which threw me for a bit.
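
To make the "regex says what is NOT a token" point concrete, here is a small sketch using only java.util.regex. It does not call PatternAnalyzer itself; SeparatorRegexSketch and tokenize are hypothetical names, and the only assumption is that the analyzer treats its pattern the way Pattern.split does, as a description of the gaps between tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SeparatorRegexSketch {
    // The pattern matches the separators; whatever lies between two matches
    // becomes a token.
    static List<String> tokenize(String text, Pattern separators) {
        return Arrays.asList(separators.split(text));
    }
}
```

With the separator pattern "[\\s,]+", the input "part aa-bb-2, rev 7" comes back as part, aa-bb-2, rev, 7. The hyphens survive because the pattern never mentions them.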
>
> Don't know if this is really useful, but it might work for you without 
> as
> much work...
>
> Best
> Erick@I'mNowBeyondMyCompetence.WhyDoTheyStillEmployMeHere?
>
> On 8/29/06, Bill Taylor <wataylor@as-st.com> wrote:
>>
>>
>> On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote:
>>
>> >
>> > : Have a look at PerFieldAnalyzerWrapper:
>> >
>> > :
>> > http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/
>> > PerFieldAnalyzerWrapper.html
>> >
>> > ...which can be specified in the constructors for IndexWriter and
>> > QueryParser.
>>
>> As I understand it, this allows me to specify a different analyzer for
>> each field name.  My problem is that the standard analyzer will not
>> work for my content field and I need to define a new one.  I need to
>> make a modification to the StandardTokenizer so that a number does not
>> need to have a digit in every other segment of a part number.
>>
>> For example, the StandardTokenizer breaks aa-bb-2 on the - between aa
>> and bb because it demands that every other string between a - have a
>> digit.
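
The rule being asked for can be sketched with plain regular expressions, outside JavaCC. This is only an illustration of the intended behavior, not a replacement for the grammar change: PartNumberSketch is a hypothetical name, and the policy (keep a hyphen-joined run whole if any segment contains a digit, otherwise split it into words) is my reading of the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartNumberSketch {
    // Runs of letters and digits joined by single hyphens.
    private static final Pattern CANDIDATE =
        Pattern.compile("[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            String run = m.group();
            if (run.matches(".*[0-9].*")) {
                // A digit anywhere in the run: keep the whole part number.
                tokens.add(run);
            } else {
                // No digit at all: treat the hyphens as word breaks.
                for (String word : run.split("-")) {
                    tokens.add(word);
                }
            }
        }
        return tokens;
    }
}
```

Under this rule "aa-bb-2" stays a single token, while a digit-free run such as "foo-bar" is still split into foo and bar.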
>>
>> I need to modify the .jj file for the Standard Tokenizer and get a new
>> one, but I am confused by the javaCC documentation and do not know how
>> to run it to get what I need.
>>
>> Thanks for the help.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>