Date: Sat, 29 Mar 2014 11:09:01 -0400
Subject: Re: WordDelimiterFilterFactory splits up hyphenated terms although splitOnNumerics, generateWordParts and generateNumberParts are set to 0 (false)
From: Erick Erickson <erickerickson@gmail.com>
To: solr-user@lucene.apache.org

Why do you say this about the indexing part?

The given search term is: *X-002-99-495*
WordDelimiterFilterFactory indexes the following word parts:
* X (shouldn't be there)
* 00299495 (shouldn't be there)

You've set catenateNumbers="1" in your fieldType for the indexing part,
so WDFF is doing exactly what it should... smushing all the numbers it
separated into a single entity.

And the whole _point_ of WDFF is to split on "non alpha nums" and
index the parts it splits.

This seems like it's behaving exactly as it should.
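
To make that concrete, here is a rough sketch in Python of the behavior as I
understand it from the output you posted. It is not Lucene's actual code:
`wdf_sketch` and its parameters are names made up for illustration, it assumes
generateWordParts, generateNumberParts, splitOnCaseChange, splitOnNumerics and
catenateAll are all 0 as in your schema, and it assumes every
delimiter-separated part is either all letters or all digits. The idea is that
the delimiters always split the token, and the catenate flags then emit each
run of adjacent parts of the same type, even when a run is a single part
like "X".

```python
import re


def wdf_sketch(token, catenate_words=True, catenate_numbers=True,
               preserve_original=True):
    """Simplified model of the catenate behavior discussed in this thread.

    Hypothetical helper, not Lucene's implementation. Assumes the generate*
    and splitOn* flags and catenateAll are all off.
    """
    out = []
    if preserve_original:
        out.append(token)

    # Anything that is not an ASCII letter or digit acts as a delimiter.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]

    # Group adjacent parts of the same type (word or number) into runs.
    runs = []
    for part in parts:
        kind = "number" if part.isdigit() else "word"
        if runs and runs[-1][0] == kind:
            runs[-1][1].append(part)
        else:
            runs.append((kind, [part]))

    # Each run is emitted as one catenated token if its flag is on.
    for kind, members in runs:
        wanted = catenate_numbers if kind == "number" else catenate_words
        joined = "".join(members)
        if wanted and joined not in out:
            out.append(joined)
    return out


if __name__ == "__main__":
    print(wdf_sketch("X-002-99-495"))    # ['X-002-99-495', 'X', '00299495']
    print(wdf_sketch("002-abc-99-495"))  # ['002-abc-99-495', '002', 'abc', '99495']
    print(wdf_sketch("002-99-495"))      # ['002-99-495', '00299495']
```

In this sketch, turning both catenate flags off while keeping
preserve_original on leaves only the original token, which may be closer to
what you're after.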

Or I'm missing something totally.

Best,
Erick


On Thu, Mar 27, 2014 at 11:25 AM, Malte Hübner <huebner@innobox.de> wrote:
> I am using Solr 4.7 and have got a serious problem with
> WordDelimiterFilterFactory.
>
> WordDelimiterFilterFactory behaves differently on hyphenated terms if they
> contain characters (a-z, A-Z) or characters AND numbers.
>
> Splitting up hyphenated terms is deactivated in my configuration:
>
> *This is the fieldType setup from my schema:*
>
> {code}
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="lang/stopwords_de.txt" enablePositionIncrements="true" />
>     <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="0"
>             generateWordParts="0" generateNumberParts="0" catenateWords="1"
>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"
>             splitOnNumerics="0" preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de.txt"
>             ignoreCase="true" expand="true" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="lang/stopwords_de.txt" enablePositionIncrements="true" />
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
>             generateNumberParts="0" catenateWords="1" catenateNumbers="0"
>             catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
>             preserveOriginal="1"/>
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
> {code}
>
> The given search term is: *X-002-99-495*
>
> WordDelimiterFilterFactory indexes the following word parts:
>
> * X-002-99-495
> * X (shouldn't be there)
> * 00299495 (shouldn't be there)
> * X00299495
>
> But the 'X' should not be indexed or queried as a single term. You can see
> that splitting is completely deactivated in the schema.
>
> I can move the character part around in the search term:
>
> Searching for *002-abc-99-495* gives me
>
> * 002-abc-99-495
> * 002 (shouldn't be there)
> * abc (shouldn't be there)
> * 99495 (shouldn't be there)
> * 002abc99495
>
> Searching for *002-99-495* (no character) gives me
>
> * 002-99-495
> * 00299495
>
> This result is what I would expect.
>
> Any ideas?