lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: "WI" not Wi-Fi
Date Thu, 09 Sep 2010 02:47:28 GMT
As far as I know, there's not much you can do with StandardAnalyzer in
Lucene to emulate what Solr is doing. What you might be able to do is
use a different Analyzer; perhaps SimpleAnalyzer would do the trick.
See the API docs in Lucene...

Beyond that, you might have to make your own analyzer in Lucene. Lucene
in Action has an example of making your own analyzer that can serve as
a model (SynonymAnalyzer).
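
To show why the choice of tokenizer matters here, below is a minimal
plain-Java sketch. It does not use any Lucene classes, and the class and
method names (TokenizeSketch, whitespaceTokens, letterTokens) are made up
for illustration. It contrasts splitting on whitespace only, which is what
the WhitespaceTokenizerFactory in the quoted Solr schema does, with
splitting at every non-letter character. Only the first keeps "Wi-Fi" as a
single token, so an analyzer that breaks at non-letters would probably
still produce a bare "wi". A custom Lucene analyzer would likely want the
whitespace behaviour followed by lowercasing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: plain Java, not Lucene's Analyzer API.
final class TokenizeSketch {

    // Split on whitespace only, then lowercase.
    // "Wi-Fi" stays a single token, "wi-fi".
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Split at every non-letter character, then lowercase.
    // "Wi-Fi" becomes two tokens, "wi" and "fi".
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "Free Wi-Fi in Madison WI";
        System.out.println(whitespaceTokens(doc));
        System.out.println(letterTokens(doc));
    }
}
```

With whitespace splitting, a search for "wi" matches only the standalone
state abbreviation. With non-letter splitting it also matches the first
half of "Wi-Fi", which is the behaviour Max is trying to get rid of.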

Beyond that, I'm going to have to defer to wiser heads than me.

Best
Erick

On Wed, Sep 8, 2010 at 6:29 PM, Max Lynch <ihasmax@gmail.com> wrote:

> Sorry to be confusing.  I'm actually using both.  I use Solr for its web
> application features and Lucene for my background searches.  In this case,
> the issue is with my Lucene side of things.
>
> The analysis feature on the Solr admin page shows the analysis being correct
> and wi-fi no longer matches "WI".  Here is the schema snippet for this type.
> I changed generateWordParts="1" to "0" and that fixed the Solr side of
> things:
>
>    <fieldType name="text_standard" class="solr.TextField"
>               positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="0" generateNumberParts="1" catenateWords="1"
>                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.SnowballPorterFilterFactory" language="English"
>                protected="protwords.txt"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>                ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="1" generateNumberParts="1" catenateWords="0"
>                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.SnowballPorterFilterFactory" language="English"
>                protected="protwords.txt"/>
>      </analyzer>
>    </fieldType>
>
>
> However, I am using a StandardAnalyzer on the index beneath Solr and my hits
> are still showing up with Wi-Fi.  I was curious if there was something
> special I had to do with the StandardAnalyzer on the Lucene side of things
> in order to remove the word split functionality.  I know it's kind of an odd
> relationship with Solr and Lucene, but I haven't had any other issues so far.
>
>
> Please let me know if you think this belongs on the Solr list instead.
>
> Thanks.
>
>
> On Wed, Sep 8, 2010 at 5:23 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>
> > I'm a bit confused, this is the Lucene list, but it sounds like you're using
> > SOLR. If you are, could you post the relevant parts of your schema,
> > especially the field type definition for the field in question? If you are,
> > why not just take WordDelimiterFilterFactory out of your field type
> > definition?
> >
> > The analysis page will help you lots here if you're in SOLR.
> >
> > StandardAnalyzer could well be splitting on '-' if you're using that.
> >
> > Best
> > Erick
> >
> > On Wed, Sep 8, 2010 at 5:27 PM, Max Lynch <ihasmax@gmail.com> wrote:
> >
> > > Hi,
> > > I am using the StandardAnalyzer, but I am not interested in converting words
> > > like Wi-Fi into "Wi" and "Fi".  Rather, "WI" is an important word for my
> > > users (indicating the state of Wisconsin) and I need "WI" to only match the
> > > distinct word.
> > >
> > > I know in Solr I can set generateWordParts="0" for my
> > > solr.WordDelimiterFilterFactory, but for some reason when I read the index
> > > with Lucene the tokens are still separated.
> > >
> > > Is there a way to disable this?  Thanks.
> > >
> >
>
