lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Disabling modifiers?
Date Tue, 16 Dec 2003 12:28:26 GMT
On Tuesday, December 16, 2003, at 05:46  AM, Iain Young wrote:
>> Treating them as two separate words when quoted is indicative of your
>> analyzer not being sufficient for your domain.  What Analyzer are you
>> using?  Do you have knowledge of what it is tokenizing text into?
>
> I have created a custom analyzer (CobolAnalyzer) which contains some 
> custom
> stop words for the language, but it's using the StandardTokenizer and
> StandardFilters. I'll have a look and see if I can see what it's 
> actually
> tokenizing the text into...

Look at my article at java.net and try out the AnalyzerDemo code using 
some sample text and your custom analyzer:

	http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

One of the things I plan to do with an enhanced Lucene demo to ship 
with Lucene's binary distributions is integrate in this type of 
"analyzing the analyzer" feature.  It is the root of a lot of questions 
about Lucene.  You can really only search for what you index, and you 
only index what the Analyzer creates, so understanding it is key to a 
lot.

And yes, if you are using StandardTokenizer, you are probably not 
tokenizing COBOL quite like you expect.  Is there a COBOL parser you 
could tap into that could give you the tokens you want?

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message