Date: Sat, 29 Mar 2014 11:09:01 -0400
Subject: Re: WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)
From: Erick Erickson
To: solr-user@lucene.apache.org

Why do you say at the indexing part:

The given search term is: *X-002-99-495*
WordDelimiterFilterFactory indexes the following word parts:
* X (shouldn't be there)
* 00299495 (shouldn't be there)

??

You've set catenateNumbers="1" in your fieldType for the indexing part, so
WDFF is doing exactly what it should... smushing all the numbers it
separated into a single entity.

And the whole _point_ of WDFF is to split on "non alpha nums" and index the
parts it splits.

This seems like it's behaving exactly as it should. Or I'm missing
something totally.

Best,
Erick

On Thu, Mar 27, 2014 at 11:25 AM, Malte Hübner wrote:
> I am using Solr 4.7 and have got a serious problem with
> WordDelimiterFilterFactory.
>
> WordDelimiterFilterFactory behaves differently on hyphenated terms if they
> contain characters (a-Z) or characters AND numbers.
>
> Splitting up hyphenated terms is deactivated in my configuration:
>
> *This is the fieldType setup from my schema:*
>
> {code}
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="lang/stopwords_de.txt" enablePositionIncrements="true" />
>     <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="0"
>             generateWordParts="0" generateNumberParts="0" catenateWords="1"
>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"
>             splitOnNumerics="0" preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de.txt"
>             ignoreCase="true" expand="true" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="lang/stopwords_de.txt" enablePositionIncrements="true" />
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
>             generateNumberParts="0" catenateWords="1" catenateNumbers="0"
>             catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
>             preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
> {code}
>
> The given search term is: *X-002-99-495*
>
> WordDelimiterFilterFactory indexes the following word parts:
>
> * X-002-99-495
> * X (shouldn't be there)
> * 00299495 (shouldn't be there)
> * X00299495
>
> But the 'X' should not be indexed or queried as a single term. You can see
> that splitting is completely deactivated in the schema.
>
> I can move the character part around in the search term:
>
> Searching for *002-abc-99-495* gives me
>
> * 002-abc-99-495
> * 002 (shouldn't be there)
> * abc (shouldn't be there)
> * 99495 (shouldn't be there)
> * 002abc99495
>
> Searching for *002-99-495* (no character) gives me
>
> * 002-99-495
> * 00299495
>
> This result is what I would expect.
>
> Any ideas?
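
For readers following the thread, the catenation behaviour described in the reply can be sketched in a few lines of Python. This is only a simplified model, not Lucene's implementation: the function name `delimiter_tokens` and its keyword arguments are hypothetical, and it assumes that every run of non-alphanumeric characters is a delimiter, that an all-digit part counts as a "number" and anything else as a "word", and that adjacent parts of the same kind are joined when the matching catenate flag is on. It does not try to reproduce every token listed in the thread.

```python
import re
from itertools import groupby


def delimiter_tokens(term, catenate_words=True, catenate_numbers=True,
                     preserve_original=True):
    """Simplified model of the catenation discussed above (not Lucene's code)."""
    # Assumption: every run of non-alphanumeric characters (e.g. "-") delimits parts.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", term) if p]
    tokens = [term] if preserve_original else []
    # Group adjacent parts by kind: all-digit parts are numbers, the rest are words.
    for is_number, run in groupby(parts, key=str.isdigit):
        joined = "".join(run)
        enabled = catenate_numbers if is_number else catenate_words
        if enabled and joined not in tokens:
            tokens.append(joined)
    return tokens


# Flags as in the index analyzer (catenateWords=1, catenateNumbers=1).
print(delimiter_tokens("X-002-99-495"))    # ['X-002-99-495', 'X', '00299495']
print(delimiter_tokens("002-abc-99-495"))  # ['002-abc-99-495', '002', 'abc', '99495']
print(delimiter_tokens("002-99-495"))      # ['002-99-495', '00299495']
```

Under these assumptions the lone "X" comes from word catenation of a run of length one, and "00299495" from number catenation, which matches the explanation that catenateNumbers="1" joins the separated numbers into a single token.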