lucene-java-user mailing list archives

From Mark Miller <>
Subject Re: Installing a custom tokenizer
Date Tue, 29 Aug 2006 19:04:36 GMT
Bill Taylor wrote:
> On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
>> I'm in a real rush here, so pardon my brevity, but one of the
>> constructors for IndexWriter takes an Analyzer as a parameter, which
>> can be a PerFieldAnalyzerWrapper. That, if I understand your issue,
>> should fix you right up.
> That almost worked. I can't use a per-field analyzer because I have to
> process the content fields of all documents. I built a custom analyzer
> which extended the StandardAnalyzer and replaced the tokenStream method
> with a new one which used WhitespaceTokenizer instead of
> StandardTokenizer. This meant that my document IDs were not split, but
> I lost the conversion of acronyms such as w.o. to wo and the like.
> So what I need to do is to make a new Tokenizer based on the
> StandardTokenizer, except that the NUM rule on line 83 of
> StandardTokenizer.jj should be
> | <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>) >
> so that a serial number need not have a digit in every other segment,
> and a series of letters and digits without special characters such as
> a dash will be treated as a single word.
> Questions:
> 1) If I change the .jj file in this way, how do I run JavaCC to make a
> new tokenizer? The JavaCC documentation says that JavaCC generates a
> number of output files; I think that I only need the tokenizer code.
> 2) I suppose I have to tell the query parser to parse queries in the
> same way, is that right? The reason I think so is that Luke says I have
> words such as w.o. in the index which the query parser can't find. I
> suspect I have to use the same Analyzer on both, right?
Get JavaCC and run it on StandardTokenizer.jj. This should be as simple
as typing 'javacc StandardTokenizer.jj'. I believe that with no output
directory specified, all of the files will be built in the current
directory. Don't worry about trying to generate only the files you
need; JavaCC will handle everything for you. If you use Eclipse, I
recommend the JavaCC plug-in. I find it very handy.
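
Before regenerating the tokenizer, it may help to check which strings the
proposed NUM rule would accept as a single token. The following is only a
rough sketch using java.util.regex: the ALPHANUM and P patterns are my
assumptions about what those private tokens look like in the grammar, so
check them against your own copy of StandardTokenizer.jj. The class and
method names are hypothetical, and this is not how JavaCC itself matches
input.

```java
import java.util.regex.Pattern;

public class NumRuleSketch {
    // ASSUMPTION: ALPHANUM is one or more letters or digits.
    static final String ALPHANUM = "[\\p{L}\\p{Nd}]+";
    // ASSUMPTION: P is one of the punctuation characters _ - / . ,
    static final String P = "[_\\-/.,]";

    // Proposed rule from the quoted message:
    //   <ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>
    static final Pattern NUM =
        Pattern.compile(ALPHANUM + "(?:" + P + ALPHANUM + ")+|" + ALPHANUM);

    // True when the whole string would be covered by the rule as one token.
    static boolean isSingleToken(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"AB-CD-1234", "oldman-83", "abc123", "two words"};
        for (String s : samples) {
            System.out.println(s + " -> " + isSingleToken(s));
        }
    }
}
```

A string containing whitespace is rejected here because a space is not in
the assumed P set, which matches the intent that only the listed
punctuation joins segments.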

Generally, you must run the same analyzer on your search strings that
you indexed with. If the standard analyzer parses oldman-83 to oldman
while indexing and you use the whitespace analyzer while searching, then
you will attempt to find oldman-83 in the index instead of oldman (which
is what the standard analyzer stored).
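
The mismatch can be reproduced without Lucene at all. The sketch below
uses two toy tokenizers as stand-ins: letterTokens keeps only runs of
letters (so oldman-83 becomes oldman, as in the example above), and
whitespaceTokens splits on whitespace only. These are hypothetical helper
methods for illustration, not Lucene classes, and the "index" is just a
set of terms.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerMismatchSketch {
    // Toy stand-in for the index-time analyzer: lowercases and keeps
    // only runs of ASCII letters, so "oldman-83" yields "oldman".
    static List<String> letterTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Toy stand-in for a whitespace analyzer: splits on whitespace only.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // A query "hits" only if every query term is present in the index.
    static boolean matches(Set<String> index, List<String> queryTerms) {
        return !queryTerms.isEmpty() && index.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // Index time: the stored terms are part, oldman, shipped.
        Set<String> index = new HashSet<>(letterTokens("part oldman-83 shipped"));

        // Search time with a different tokenizer looks for "oldman-83".
        System.out.println(matches(index, whitespaceTokens("oldman-83")));
        // Search time with the same tokenizer looks for "oldman".
        System.out.println(matches(index, letterTokens("oldman-83")));
    }
}
```

The first lookup misses because the term oldman-83 was never stored; the
second succeeds because the query went through the same tokenization as
the indexed text.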

- Mark
