Date: Sun, 29 Apr 2012 09:58:51 +0000 (UTC)
From: "Oliver Schihin (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Message-ID: <1971986105.7294.1335693531609.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1361265634.14930.1322412880286.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

[
https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oliver Schihin updated SOLR-2921:
---------------------------------

    Comment: was deleted

(was: Am I off topic, or is ICUCollationKeyFilterFactory a candidate, as well?)

> Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-3x.patch, SOLR-2921-trunk.patch, SOLR-2921_rest.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr to automatically assemble a "multiterm" analyzer that does the right thing vis-a-vis transforming the individual terms of a multi-term query at query time. Examples are: lower casing, folding accents, etc. Currently (27-Nov-2011), the following classes implement MultiTermAwareComponent:
> * ASCIIFoldingFilterFactory
> * LowerCaseFilterFactory
> * LowerCaseTokenizerFactory
> * MappingCharFilterFactory
> * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the right thing" at query time and the perennial question users have, "why didn't my wildcard query automatically lower-case (or accent fold or....) my terms?" will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that exist, there are a number of possibilities that *might* be good candidates for implementing MultiTermAwareComponent. But I really don't understand the correct behavior here well enough to know whether these should implement the interface or not.
> And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be candidates. If anyone wants to take any of them on, that would be great. If all you can do is provide test cases, I could probably do the code part, just let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
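
The multi-term mechanism described in the quoted issue can be illustrated with a toy model. The sketch below does not use the Solr or Lucene classes: ComponentFactory, MultiTermAware, SuffixStripFactory and the other names are hypothetical stand-ins, and terms are plain strings rather than token streams. It only shows the idea of walking the query-time chain and keeping the components that declare themselves safe for wildcard and prefix terms, including the case noted above where a tokenizer factory hands back a filter instead of itself.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

class MultiTermSketch {
    /** Hypothetical stand-in for an analysis factory: it yields a term transform. */
    interface ComponentFactory { UnaryOperator<String> create(); }

    /** Hypothetical stand-in for the marker interface discussed in the issue. */
    interface MultiTermAware { ComponentFactory getMultiTermComponent(); }

    /** Lower-casing is safe for partial terms, so the factory returns itself. */
    static class LowerCaseFactory implements ComponentFactory, MultiTermAware {
        public UnaryOperator<String> create() { return s -> s.toLowerCase(Locale.ROOT); }
        public ComponentFactory getMultiTermComponent() { return this; }
    }

    /** Accent folding via NFD decomposition, then dropping combining marks. */
    static class AccentFoldingFactory implements ComponentFactory, MultiTermAware {
        public UnaryOperator<String> create() {
            return s -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        }
        public ComponentFactory getMultiTermComponent() { return this; }
    }

    /** Models a tokenizer that also lower-cases: for multi-term use it hands back a filter. */
    static class LowerCaseTokenizerStandIn implements ComponentFactory, MultiTermAware {
        public UnaryOperator<String> create() { return s -> s.toLowerCase(Locale.ROOT); }
        public ComponentFactory getMultiTermComponent() { return new LowerCaseFactory(); }
    }

    /** Not multi-term aware: a crude stemmer that would mangle a partial term. */
    static class SuffixStripFactory implements ComponentFactory {
        public UnaryOperator<String> create() {
            return s -> s.endsWith("s") ? s.substring(0, s.length() - 1) : s;
        }
    }

    /** Keep only the components that opt in, in their original order. */
    static List<UnaryOperator<String>> buildMultiTermChain(List<ComponentFactory> queryChain) {
        List<UnaryOperator<String>> chain = new ArrayList<>();
        for (ComponentFactory f : queryChain) {
            if (f instanceof MultiTermAware) {
                chain.add(((MultiTermAware) f).getMultiTermComponent().create());
            }
        }
        return chain;
    }

    /** Run a wildcard/prefix term fragment through the reduced chain. */
    static String normalize(String term, List<ComponentFactory> queryChain) {
        String out = term;
        for (UnaryOperator<String> op : buildMultiTermChain(queryChain)) {
            out = op.apply(out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<ComponentFactory> chain = Arrays.asList(
            new LowerCaseTokenizerStandIn(), new SuffixStripFactory(), new AccentFoldingFactory());
        // "CAF\u00C9S" is the prefix part of a query such as CAFÉS*
        System.out.println(normalize("CAF\u00C9S", chain)); // prints "cafes"
    }
}
```

In this model the stemmer is skipped because it does not opt in, so the prefix is lower-cased and accent-folded but keeps its trailing "s". Whether a given real factory from the candidate list should opt in is exactly the open question the issue raises; the sketch takes no position on that.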