Mailing-List: contact general-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <3b61738b0908101552n298ca3bbvb3b7b6810dc906fb@mail.gmail.com>
References: <3b61738b0908101552n298ca3bbvb3b7b6810dc906fb@mail.gmail.com>
Date: Tue, 11 Aug 2009 09:43:12 +0200
Message-ID: <88f6e29a0908110043r49e6a11fs99f0023ac56484b5@mail.gmail.com>
Subject: Re: Tokenizer, TokenStream, Token Filters
From: Bernd Fondermann <bernd.fondermann@googlemail.com>
To: general@lucene.apache.org

On Tue, Aug 11, 2009 at 00:52, K. M. McCormick <kyliemccormick@gmail.com> wrote:
> Hello Again:
>
> I'm trying to figure out what Filters do to terms in Lucene, specifically
>
> StandardTokenizer
> StandardFilter
>
> While these are usually 'enough' for my work, I need to know specifically
> what happens to the tokens in this, how they are split, etc. in order to
> make sure my indexes match my queries, which are being parsed/modified very
> specifically. I was tempted to make my own filter (like MyCrazyFilter) but I
> hesitate to throw away the 'standards' for no reason.
>
> Also, I have had a hard time finding information about writing your own
> Tokenizers and Token Filters, other than the fact that you can do this. Most
> of the work I want to do is fairly simple stuff, but I can't find much
> information on how Lucene does it.

What helped me in the past was browsing the javadoc, for example for
the filter classes you mentioned and their superclasses.
In addition, you may not be aware of the package javadoc for the
analysis package, which you can find here:

http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/analysis/package-summary.html#package_description

Furthermore, I often found reading the source code to be helpful:

http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/analysis/standard/StandardFilter.java

You will get better support for all of this on the java-user
mailing list. To subscribe, send mail to:
java-user-subscribe@lucene.apache.org

HTH,

  Bernd

>
> I specifically know I want to ensure the following:
> - tokens are broken at whitespace only, not at any other kinds of marks
> - tokens have no accents (I use a normalizer for this)
> - tokens do not only consist of punctuation (I use a simple function for
> this)
> - tokens do not have 'oddball' circumstances (such as the end of a sentence
> retaining that punctuation... I truncate this).
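
The four requirements listed above could be sketched in plain Java
roughly as follows. This is only an illustration under some assumptions,
not how Lucene does it: it uses no Lucene classes at all, the names
WhitespaceTokenSketch, stripAccents, stripTrailingPunctuation and
tokenize are made up for this example, and "punctuation" is assumed to
mean the ASCII set matched by \p{Punct}. In Lucene itself, the same
per-token logic would presumably live in a custom TokenFilter placed
behind a WhitespaceTokenizer.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Lucene.
class WhitespaceTokenSketch {

    // Decompose accented characters (NFD) and drop the combining marks,
    // so that an accented letter is reduced to its base letter.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Remove ASCII punctuation at the end of a token ("sentence." -> "sentence").
    // Punctuation inside a token ("e-mail") is left alone.
    static String stripTrailingPunctuation(String s) {
        return s.replaceAll("\\p{Punct}+$", "");
    }

    // Split at whitespace only, clean each token, and drop tokens that
    // end up empty (which covers tokens made only of punctuation).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.trim().split("\\s+")) {
            String token = stripTrailingPunctuation(stripAccents(raw));
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Hello, world, cafe, e-mail]
        System.out.println(tokenize("Hello, world! -- caf\u00e9. e-mail"));
    }
}
```

Whatever the details, the same steps have to run on the query side as
on the index side, otherwise the terms will not match. Note that this
sketch also trims a trailing dot from abbreviations such as "U.S.",
which may or may not be what you want.
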
>
> Thanks,
> drago
>