lucene-solr-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: WordDelimiterFilter Leading & Trailing Special Character
Date Tue, 21 Jul 2015 23:52:43 GMT
You can also use the types attribute to change the type of specific
characters, for example to treat "!" or "&" as <ALPHA>.
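
As a minimal sketch, assuming the file named by the types attribute
(specialchartypes.txt in the config quoted below) uses the usual
"character => TYPE" mapping format, entries like these would make the
filter treat both characters as letters rather than delimiters. The file
contents are illustrative, not taken from the original poster's setup:

```text
# specialchartypes.txt (hypothetical contents)
! => ALPHA
& => ALPHA
```

With such a mapping, the trailing "!" in a token like "yahoo!" should no
longer be treated as a delimiter. Changes to index-time analysis only
affect documents indexed afterwards.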

-- Jack Krupansky

On Tue, Jul 21, 2015 at 7:43 PM, Sathiya N Sundararajan <ausathya@gmail.com>
wrote:

> Upayavira,
>
> Thanks for the helpful suggestion, that works. I was looking for an option
> to turn off or circumvent that particular WordDelimiterFilter behavior
> completely. Since our indexes are hundreds of terabytes, it will be a
> cumbersome process to reload all the cores every time we find a term that
> needs to be added.
>
>
> thanks
>
> On Tue, Jul 21, 2015 at 12:57 AM, Upayavira <uv@odoko.co.uk> wrote:
>
> > Looking at the javadoc for the WordDelimiterFilterFactory, it suggests
> > this config:
> >
> >  <fieldType name="text_wd" class="solr.TextField"
> >  positionIncrementGap="100">
> >    <analyzer>
> >      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >      <filter class="solr.WordDelimiterFilterFactory"
> >      protected="protectedword.txt"
> >              preserveOriginal="0" splitOnNumerics="1"
> >              splitOnCaseChange="1"
> >              catenateWords="0" catenateNumbers="0" catenateAll="0"
> >              generateWordParts="1" generateNumberParts="1"
> >              stemEnglishPossessive="1"
> >              types="wdfftypes.txt" />
> >    </analyzer>
> >  </fieldType>
> >
> > Note the protected="xxxxx" attribute. I suspect if you put Yahoo! into a
> > file referenced by that attribute, it may survive analysis. I'd be
> > curious to hear whether it works.
> >
> > Upayavira
> >
> > On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote:
> > > Question about WordDelimiterFilter. The search behavior that we get
> > > with WordDelimiterFilter works well for us, except for the case where
> > > there is a special character at either the leading or trailing end of
> > > the term.
> > >
> > > For instance:
> > >
> > > *‘d&b’ *  —>  Works as expected. Finds all docs with ‘d&b’.
> > > *‘p!nk’*  —>  Works fine as above.
> > >
> > > But in cases where there is a special character at the trailing end
> > > of the term, like ‘Yahoo!’:
> > >
> > > *‘yahoo!’* —> Turns out to be a search for just *‘yahoo’*, with the
> > > special character *‘!’* stripped out. This WordDelimiterFilter
> > > behavior is documented:
> > >
> > > http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
> > >
> > > What I would like is for the search to be performed without stripping
> > > out the leading and trailing special characters. Is there a way to
> > > achieve this behavior with WordDelimiterFilter?
> > >
> > > This is the current config that we have for the field:
> > >
> > > <fieldType name="text_wdf" class="solr.TextField"
> > >            positionIncrementGap="100">
> > >     <analyzer type="index">
> > >         <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > >         <filter class="solr.WordDelimiterFilterFactory"
> > >                 splitOnCaseChange="0" generateWordParts="0"
> > >                 generateNumberParts="0"
> > >                 catenateWords="0" catenateNumbers="0" catenateAll="0"
> > >                 preserveOriginal="1"
> > >                 types="specialchartypes.txt"/>
> > >         <filter class="solr.LowerCaseFilterFactory" />
> > >     </analyzer>
> > >     <analyzer type="query">
> > >         <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > >         <filter class="solr.WordDelimiterFilterFactory"
> > >                 splitOnCaseChange="0" generateWordParts="0"
> > >                 generateNumberParts="0"
> > >                 catenateWords="0" catenateNumbers="0" catenateAll="0"
> > >                 preserveOriginal="1"
> > >                 types="specialchartypes.txt"/>
> > >         <filter class="solr.LowerCaseFilterFactory" />
> > >     </analyzer>
> > > </fieldType>
> > >
> > >
> > > thanks
> >
>
