lucene-solr-user mailing list archives

From Sathiya N Sundararajan <ausat...@gmail.com>
Subject Re: WordDelimiterFilter Leading & Trailing Special Character
Date Wed, 29 Jul 2015 23:54:55 GMT
Thanks for the suggestion, Jack. We are already using @ and # as <ALPHA>,
and will see if it makes sense to go that route.

On Tue, Jul 21, 2015 at 4:52 PM, Jack Krupansky <jack.krupansky@gmail.com>
wrote:

> You can also use the types attribute to change the type of specific
> characters, such as to treat the "!" or "&" as an <ALPHA>.
>
> -- Jack Krupansky
>
> On Tue, Jul 21, 2015 at 7:43 PM, Sathiya N Sundararajan <ausathya@gmail.com> wrote:
>
> > Upayavira,
> >
> > Thanks for the helpful suggestion; that works. I was looking for an option
> > to turn off or circumvent that particular WordDelimiterFilter behavior
> > completely. Since our indexes are hundreds of terabytes, reloading all the
> > cores every time we find a term that needs to be added would be a
> > cumbersome process.
> >
> >
> > thanks
> >
> > On Tue, Jul 21, 2015 at 12:57 AM, Upayavira <uv@odoko.co.uk> wrote:
> >
> > > Looking at the javadoc for the WordDelimiterFilterFactory, it suggests
> > > this config:
> > >
> > >  <fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
> > >    <analyzer>
> > >      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >      <filter class="solr.WordDelimiterFilterFactory"
> > >              protected="protectedword.txt"
> > >              preserveOriginal="0" splitOnNumerics="1"
> > >              splitOnCaseChange="1"
> > >              catenateWords="0" catenateNumbers="0" catenateAll="0"
> > >              generateWordParts="1" generateNumberParts="1"
> > >              stemEnglishPossessive="1"
> > >              types="wdfftypes.txt" />
> > >    </analyzer>
> > >  </fieldType>
> > >
> > > Note the protected="xxxxx" attribute. I suspect that if you put Yahoo!
> > > into a file referenced by that attribute, it may survive analysis. I'd be
> > > curious to hear whether it works.
> > >
> > > Upayavira
> > >
> > > On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote:
> > > > I have a question about WordDelimiterFilter. The search behavior that
> > > > we get with WordDelimiterFilter works well for us, except when there is
> > > > a special character at the leading or trailing end of the term.
> > > >
> > > > For instance:
> > > >
> > > > ‘d&b’   ->  Works as expected. Finds all docs with ‘d&b’.
> > > > ‘p!nk’  ->  Works fine, as above.
> > > >
> > > > But when there is a special character at the trailing end of the
> > > > term, as in ‘Yahoo!’:
> > > >
> > > > ‘yahoo!’  ->  Turns into a search for just ‘yahoo’, with the special
> > > > character ‘!’ stripped out. This WordDelimiterFilter behavior is
> > > > documented here:
> > > >
> > > > http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
> > > >
> > > > What I would like is for the search to be performed without stripping
> > > > out the leading and trailing special characters. Is there a way to
> > > > achieve this behavior with WordDelimiterFilter?
> > > >
> > > > This is the current config that we have for the field:
> > > >
> > > > <fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
> > > >     <analyzer type="index">
> > > >         <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > > >         <filter class="solr.WordDelimiterFilterFactory"
> > > >                 splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
> > > >                 catenateWords="0" catenateNumbers="0" catenateAll="0"
> > > >                 preserveOriginal="1"
> > > >                 types="specialchartypes.txt" />
> > > >         <filter class="solr.LowerCaseFilterFactory" />
> > > >     </analyzer>
> > > >     <analyzer type="query">
> > > >         <tokenizer class="solr.WhitespaceTokenizerFactory" />
> > > >         <filter class="solr.WordDelimiterFilterFactory"
> > > >                 splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
> > > >                 catenateWords="0" catenateNumbers="0" catenateAll="0"
> > > >                 preserveOriginal="1"
> > > >                 types="specialchartypes.txt" />
> > > >         <filter class="solr.LowerCaseFilterFactory" />
> > > >     </analyzer>
> > > > </fieldType>
> > > >
> > > >
> > > > thanks
> > >
> >
>
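
Jack's suggestion of using the types attribute could be sketched roughly as
follows. This assumes the mappings go into the specialchartypes.txt file that
the original config already references; the actual contents of that file are
not shown in the thread, so the lines below are hypothetical. Each line maps
one character to a type that WordDelimiterFilterFactory recognizes (LOWER,
UPPER, ALPHA, DIGIT, ALPHANUM or SUBWORD_DELIM), and lines starting with "#"
are comments:

```text
# specialchartypes.txt (hypothetical contents, for illustration only)
# Treat '!' and '&' as letters, so that terms such as 'yahoo!' and 'd&b'
# are no longer seen as containing delimiters.
! => ALPHA
& => ALPHA
```

Since the same file is referenced by both the index and the query analyzer,
a change like this affects both sides, and documents indexed before the change
may need to be reindexed before queries match consistently.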
