Date: Tue, 11 Aug 2009 09:43:12 +0200
Message-ID: <88f6e29a0908110043r49e6a11fs99f0023ac56484b5@mail.gmail.com>
In-Reply-To: <3b61738b0908101552n298ca3bbvb3b7b6810dc906fb@mail.gmail.com>
Subject: Re: Tokenizer, TokenStream, Token Filters
From: Bernd Fondermann
To: general@lucene.apache.org

On Tue, Aug 11, 2009 at 00:52, K. M. McCormick wrote:
> Hello Again:
>
> I'm trying to figure out what Filters do to terms in Lucene, specifically
>
> StandardTokenizer
> StandardFilter
>
> While these are usually 'enough' for my work, I need to know specifically
> what happens to the tokens in this, how they are split, etc. in order to
> make sure my indexes match my queries, which are being parsed/modified very
> specifically. I was tempted to make my own filter (like MyCrazyFilter) but I
> hesitate to throw away the 'standards' for no reason.
>
> Also, I have had a hard time finding information about writing your own
> Tokenizers and Token Filters, other than the fact that you can do this. Most
> of the work I want to do is fairly simple stuff, but I can't find much
> information on how Lucene does it.

What helped me in the past was browsing the javadoc, for example for
the filter classes you mentioned and their superclasses.
In addition, you may not be aware of the package javadoc for the
analysis package, which you can find here:
http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/analysis/package-summary.html#package_description

Furthermore, I have often found reading the source code helpful:
http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/analysis/standard/StandardFilter.java

Proper support for all of this would be best obtained on the java-user
mailing list:
java-user-subscribe@lucene.apache.org

HTH,

  Bernd

>
> I specifically know I want to ensure the following:
> - tokens are broken at whitespace only, not at any other kinds of marks
> - tokens have no accents (I use a normalizer for this)
> - tokens do not only consist of punctuation (I use a simple function for
> this)
> - tokens do not have 'oddball' circumstances (such as the end of a sentence
> retaining that punctuation... I truncate this).
>
> Thanks,
> drago
>
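
The four requirements in the quoted list can be sketched in plain Java
before wiring anything into Lucene. This is only an illustration under
stated assumptions: the class and method names below are made up, it does
not use Lucene's Tokenizer or TokenFilter classes, "punctuation" is assumed
to mean ASCII punctuation (Java's \p{Punct}), and the order of the steps is
a guess at what the original poster intends.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: applies the four rules from the
// quoted list to a plain string and returns the surviving tokens.
final class WhitespaceTokenSketch {

    // Rule 2: remove accents by decomposing to NFD and dropping combining marks.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Rule 3: a token is kept only if it contains at least one letter or digit,
    // which rejects tokens made up entirely of punctuation (and empty strings).
    static boolean hasLetterOrDigit(String s) {
        return s.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    // Rule 4: drop punctuation left at the end of a token, e.g. "end." -> "end".
    // Assumption: only ASCII punctuation is stripped.
    static String trimTrailingPunctuation(String s) {
        return s.replaceAll("\\p{Punct}+$", "");
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Rule 1: split at whitespace only; other marks stay inside the token.
        for (String raw : text.trim().split("\\s+")) {
            String token = stripAccents(raw);
            if (!hasLetterOrDigit(token)) {
                continue;
            }
            tokens.add(trimTrailingPunctuation(token));
        }
        return tokens;
    }
}
```

For example, WhitespaceTokenSketch.tokenize("Café  déjà-vu ... end.")
returns [Cafe, deja-vu, end]: the hyphen survives because splitting happens
at whitespace only, the "..." token is dropped, and the final period is
truncated. Running the same function over both indexed text and query text
is one way to check that the two sides produce matching terms.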