Date: Wed, 22 Aug 2007 09:34:29 +0200
From: "Bertrand Delacretaz"
To: dev@jackrabbit.apache.org
Subject: Re: IndexingConfiguration jr 1.4 release, analyzing, searching and synonymprovider

On 8/21/07, Ard Schrijvers wrote:

> ...So would you like to see parts like chaining of filters for a indexing a property? Think
> that shouldn't be to hard to implement....

If that's within the scope of your work, that would IMHO be very useful, to give people precise control over how the various properties are indexed.

> ...Certainly something like
>
>
>
>
> would ofcourse ease the use of implementing synonyms/stopwords yourself....

Yes, given that many Lucene TokenFilters are available, this is useful I think.

I see two potential issues that you might want to take into account:

1) With configurable indexing analyzers, people sometimes have a hard time figuring out how exactly their data is indexed (and why they don't find it later). Solr provides an analysis test page for that (see "Solr's content analysis test page" in [1]). In the case of Jackrabbit, maybe logging the filtered values of fields at the DEBUG level would help.

2) As discussed previously, one problem with this is which analyzer to use when running a query that applies to several fields. In Solr, you can configure a different analyzer for querying, which is probably the best solution. People then have to make sure their config is consistent for indexing and querying, and might in some cases need to provide their own custom QueryAnalyzer to achieve this: for example, one that provides fake synonyms for a token, with each synonym being the result of one of the analysis methods used. This can get tricky when searching in multiple fields, depending on the configured analysis.

See also http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters for more info on how Solr manages the analyzers.

-Bertrand

[1] http://www.xml.com/lpt/a/1668
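
To make the two points above a bit more concrete, here is a rough sketch in plain Java. It is only an illustration under stated assumptions: it does not use the Lucene or Jackrabbit APIs, and every name in it (AnalysisSketch, Filter, stopWords, STRIP_PLURAL, expandQueryTerm) is hypothetical. It shows a per-field chain of filters whose output is logged (java.util.logging's FINE level stands in for DEBUG), and a query-side helper that builds the "fake synonyms" for a term by running it through each field's chain.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: none of these names are Jackrabbit or Lucene APIs.
class AnalysisSketch {
    private static final Logger LOG = Logger.getLogger(AnalysisSketch.class.getName());

    // A filter maps a token list to a token list (a much simplified TokenFilter).
    interface Filter extends UnaryOperator<List<String>> {}

    static final Filter LOWERCASE = tokens -> {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    };

    // Drops every token contained in the given stop word set.
    static Filter stopWords(Set<String> stop) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) if (!stop.contains(t)) out.add(t);
            return out;
        };
    }

    // Naive plural stripping, standing in for a real stemmer.
    static final Filter STRIP_PLURAL = tokens -> {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.length() > 3 && t.endsWith("s") ? t.substring(0, t.length() - 1) : t);
        }
        return out;
    };

    // Splits on whitespace, applies the chain in order, and logs the result
    // so that one can see what actually ends up in the index for a field.
    static List<String> analyze(String field, List<Filter> chain, String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) if (!t.isEmpty()) tokens.add(t);
        for (Filter f : chain) tokens = f.apply(tokens);
        LOG.log(Level.FINE, "field={0} filtered tokens={1}", new Object[] {field, tokens});
        return tokens;
    }

    // Query side: run the term through every field's chain and collect the
    // distinct results, i.e. one "fake synonym" per analysis method.
    static Set<String> expandQueryTerm(Map<String, List<Filter>> fieldChains, String term) {
        Set<String> variants = new LinkedHashSet<>();
        for (Map.Entry<String, List<Filter>> e : fieldChains.entrySet()) {
            variants.addAll(analyze(e.getKey(), e.getValue(), term));
        }
        return variants;
    }

    public static void main(String[] args) {
        Map<String, List<Filter>> chains = new LinkedHashMap<>();
        chains.put("title", List.of(LOWERCASE));
        chains.put("body", List.of(LOWERCASE, stopWords(Set.of("the", "a")), STRIP_PLURAL));
        System.out.println(analyze("body", chains.get("body"), "The Quick Rabbits"));
        System.out.println(expandQueryTerm(chains, "Rabbits"));
    }
}
```

With the two chains in main, a query for "Rabbits" expands to both "rabbits" (as the title field would index it) and "rabbit" (as the body field would), which is the consistency between indexing and querying mentioned in 2). A real implementation would presumably sit on top of Lucene's Analyzer and TokenFilter classes instead of these toy lists.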