lucene-general mailing list archives

From Bernd Fondermann <bernd.fonderm...@googlemail.com>
Subject Re: Tokenizer, TokenStream, Token Filters
Date Tue, 11 Aug 2009 07:43:12 GMT
On Tue, Aug 11, 2009 at 00:52, K. M. McCormick<kyliemccormick@gmail.com> wrote:
> Hello Again:
>
> I'm trying to figure out what Filters do to terms in Lucene, specifically
>
> StandardTokenizer
> StandardFilter
>
> While these are usually 'enough' for my work, I need to know specifically
> what happens to the tokens in this, how they are split, etc. in order to
> make sure my indexes match my queries, which are being parsed/modified very
> specifically. I was tempted to make my own filter (like MyCrazyFilter) but I
> hesitate to throw away the 'standards' for no reason.
>
> Also, I have had a hard time finding information about writing your own
> Tokenizers and Token Filters, other than the fact that you can do this. Most
> of the work I want to do is fairly simple stuff, but I can't find much
> information on how Lucene does it.

What helped me in the past was browsing the javadoc, for example for
the filter classes you mentioned and their superclasses.
In addition, you may not be aware of the package javadoc for the
analysis package, which you can find here:

http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/analysis/package-summary.html#package_description

Furthermore, I often found reading the source code to be helpful:

http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/analysis/standard/StandardFilter.java

For detailed help with all of this, the java-user mailing list is the
better place to ask:
java-user-subscribe@lucene.apache.org

HTH,

  Bernd

>
> I specifically know I want to ensure the following:
> - tokens are broken at whitespace only, not at any other kinds of marks
> - tokens have no accents (I use a normalizer for this)
> - tokens do not only consist of punctuation (I use a simple function for
> this)
> - tokens do not have 'oddball' circumstances (such as the end of a sentence
> retaining that punctuation... I  truncate this).
>
> Thanks,
> drago
>
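As a rough illustration of the four requirements listed above, here is a
minimal sketch in plain Java. It deliberately does not use any Lucene
classes, so it says nothing about how StandardTokenizer or StandardFilter
behave. The class and method names (WhitespaceTokenSketch, stripAccents,
trimTrailingPunctuation, isOnlyPunctuation) are made up for this example,
and the choice of which trailing characters to strip is an assumption. In
a real analyzer the same per-token logic would probably live in a custom
Tokenizer or TokenFilter subclass.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: applies the four rules from the
// quoted list to a plain string and returns the surviving tokens.
public class WhitespaceTokenSketch {

    // Remove accents: decompose to NFD, then drop the combining marks.
    public static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Drop sentence punctuation left at the end of a token.
    // Assumption: only these six characters count as "oddball" endings.
    public static String trimTrailingPunctuation(String s) {
        return s.replaceAll("[.,;:!?]+$", "");
    }

    // True if the token consists solely of ASCII punctuation characters.
    public static boolean isOnlyPunctuation(String s) {
        return s.matches("\\p{Punct}+");
    }

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        // Break at whitespace only, never at other marks.
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty() || isOnlyPunctuation(raw)) {
                continue; // skip empty and punctuation-only tokens
            }
            String token = trimTrailingPunctuation(stripAccents(raw));
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, \u00ee is i-circumflex
        String sample = "Caf\u00e9 au lait, s'il vous pla\u00eet. --- ok!";
        System.out.println(tokenize(sample));
    }
}
```

Note that the apostrophe inside "s'il" is kept, because splitting happens
at whitespace only, while the stand-alone "---" is dropped entirely.
Whatever rules you settle on, the same function has to be applied to both
the indexed text and the query terms, otherwise they will not match.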
