lucene-java-user mailing list archives

From Bill Taylor <watay...@as-st.com>
Subject Re: Installing a custom tokenizer
Date Tue, 29 Aug 2006 19:12:03 GMT

On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:

> I'm in a real rush here, so pardon my brevity, but..... one of the
> constructors for IndexWriter takes an Analyzer as a parameter, which 
> can be
> a PerFieldAnalyzerWrapper. That, if I understand your issue, should 
> fix you
> right up.
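
(For anyone finding this thread in the archives: as I understand that
suggestion, it amounts to something like the following.  This is only a
sketch against the Lucene 2.0 API; the field name is from my own code
and the index path is a placeholder.)

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // StandardAnalyzer for every field except the content field, which
    // is tokenized on whitespace only so tokens like 310N-P-Q stay whole.
    PerFieldAnalyzerWrapper wrapper =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    wrapper.addAnalyzer(DocFormatters.CONTENT_FIELD,
                        new WhitespaceAnalyzer());
    IndexWriter writer = new IndexWriter("/path/to/index", wrapper, true);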

That almost worked.  I can't use a per-field analyzer, though, because
I have to process the content fields of all documents.  I built a
custom analyzer which extended StandardAnalyzer and overrode the
tokenStream method with one that used WhitespaceTokenizer instead of
StandardTokenizer.  That meant my document IDs were no longer split,
but I lost the conversion of acronyms such as w.o. to wo, and the like.
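
In code, the analyzer looks roughly like this (again Lucene 2.0 API;
the class name is mine, and I kept StandardAnalyzer's lower-casing and
stop words):

    import java.io.Reader;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // StandardAnalyzer's chain with WhitespaceTokenizer swapped in for
    // StandardTokenizer.  Splitting only on whitespace keeps 310N-P-Q
    // whole, but losing StandardTokenizer/StandardFilter also loses the
    // acronym handling (w.o. -> wo), which is the problem noted above.
    public class WhitespaceContentAnalyzer extends StandardAnalyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream result = new WhitespaceTokenizer(reader);
            result = new LowerCaseFilter(result);
            result = new StopFilter(result, StandardAnalyzer.STOP_WORDS);
            return result;
        }
    }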

So what I need to do is make a new Tokenizer based on
StandardTokenizer, except that the NUM production on line 83 of
StandardTokenizer.jj should read

| <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>) >

so that a serial number need not have a digit in every other segment 
and a series of letters and digits without special characters such as a 
dash will be treated as a single word.
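
For contrast, the stock production (quoting approximately from my copy
of the 2.0 grammar, so check it against yours) is what forces a digit
into every other segment:

    // every other segment must have at least one digit
    | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
           | <HAS_DIGIT> <P> <ALPHANUM>
           | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
           | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
           | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
           | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
            )
      >

Since P and Q contain no digits, the whole of 310N-P-Q can never match
NUM, and it gets split at the dashes.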

Questions:

1) If I change the .jj file in this way, how do I run JavaCC to make a
new tokenizer?  The JavaCC documentation says that JavaCC generates a
number of output files; I think I only need the tokenizer code.
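
(From the JavaCC documentation, the answer to 1) appears to be simply

    javacc StandardTokenizer.jj

run in the directory holding the grammar.  That emits
StandardTokenizer.java along with its supporting classes (the token
manager, the constants interface, and so on), and editing the package
declaration in the PARSER_BEGIN section should let the generated files
live in a package of my own, leaving the standard jar untouched.  I am
going by the docs rather than experience here, so corrections welcome.)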

2) I suppose I have to tell the query parser to parse queries in the
same way, is that right?

The reason I think so is that Luke shows words such as w.o. in the
index which the query parser can't find.  I suspect I have to use the
same Analyzer on both sides, right?
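
In other words, something like this, with the same analyzer on both
sides (2.0 API again; WhitespaceContentAnalyzer is the sketch from
above, and DocFormatters.CONTENT_FIELD is from my own code):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class SameAnalyzerBothWays {
        public static void main(String[] args) throws Exception {
            // The analyzer used at index time...
            Analyzer analyzer = new WhitespaceContentAnalyzer();
            IndexWriter writer =
                new IndexWriter("/path/to/index", analyzer, true);
            // ... add documents here ...
            writer.close();

            // ...must also be the one handed to the query parser, so a
            // query for 310N-P-Q is normalized exactly the way it was
            // indexed (here: kept whole and lower-cased).
            QueryParser parser =
                new QueryParser(DocFormatters.CONTENT_FIELD, analyzer);
            Query query = parser.parse("310N-P-Q");
            System.out.println(query.toString());
        }
    }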

> On 8/29/06, Bill Taylor <wataylor@as-st.com> wrote:
>>
>> I am indexing documents which are filled with government jargon.  As
>> one would expect, the standard tokenizer has problems with
>> governmentese.
>>
>> In particular, the documents use words such as 310N-P-Q as references
>> to other documents.  The standard tokenizer breaks this "word" at the
>> dashes so that I can find P or Q but not the entire token.
>>
>> I know how to write a new tokenizer.  I would like hints on how to
>> install it and get my indexing system to use it.  I don't want to
>> modify the standard .jar file.  What I think I want to do is set up my
>> indexing operation to use the WhitespaceTokenizer instead of the 
>> normal
>> one, but I am unsure how to do this.
>>
>> I know that the IndexTask has a setAnalyzer method.  The document
>> formats are rather complicated and I need special code to isolate the
>> text strings which should be indexed.   My file analyzer isolates the
>> string I want to index, then does
>>
>> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
>> Field.Store.YES, Field.Index.TOKENIZED));
>>
>> I suspect that my issue is getting the Field constructor to use a
>> different tokenizer.  Can anyone help?
>>
>> Thanks.
>>
>> Bill Taylor
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

