lucene-solr-user mailing list archives

From "Dyer, James" <James.D...@ingramcontent.com>
Subject RE: Protwords in solr spellchecker
Date Fri, 10 Jul 2015 14:07:07 GMT
Kamal,

Given the constraint that you cannot re-index the data, your best bet might be to simply filter
out the suggestions at the application level, or maybe even have a proxy do it.

Another possible option: you might be able to extend DirectSolrSpellChecker and override #getSuggestions(),
calling super.getSuggestions(), then post-filtering your stop words out of the response.  You'll want to
request a few more terms so you're more likely to get results even if a term or two gets filtered
out.  You can specify your custom spell checker in solrconfig.xml.
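
As a rough illustration of the "request a few more, then filter" idea, here is a minimal sketch in plain Java (kept Java 7 compatible, since that is what the thread is running). It does not use the Solr API: SuggestionSource, FilteringSuggester and the "extra" parameter are hypothetical names standing in for whatever hands back the raw suggestions (the parsed Solr response in an application or proxy, or the result of the superclass call in a custom spell checker), and the canned words are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the component that returns raw suggestions. */
interface SuggestionSource {
    List<String> suggest(String term, int count);
}

/** Over-requests suggestions, drops blocked words, then trims to the wanted size. */
final class FilteringSuggester {
    private final SuggestionSource source;
    private final Set<String> blocked = new HashSet<String>();
    private final int extra; // how many additional suggestions to request (assumed tuning knob)

    FilteringSuggester(SuggestionSource source, Set<String> blockedWords, int extra) {
        this.source = source;
        this.extra = extra;
        for (String w : blockedWords) {
            blocked.add(w.toLowerCase(Locale.ROOT)); // compare case-insensitively
        }
    }

    List<String> suggest(String term, int wanted) {
        // Ask for more than we need so that filtering is less likely to leave us empty-handed.
        List<String> raw = source.suggest(term, wanted + extra);
        List<String> kept = new ArrayList<String>();
        for (String s : raw) {
            if (kept.size() == wanted) {
                break;
            }
            if (!blocked.contains(s.toLowerCase(Locale.ROOT))) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Canned, made-up suggestions in place of a real spell checker.
        SuggestionSource fake = new SuggestionSource() {
            @Override
            public List<String> suggest(String term, int count) {
                List<String> all = Arrays.asList("sex", "seen", "sets", "sent");
                return all.subList(0, Math.min(count, all.size()));
            }
        };
        Set<String> blockedWords = new HashSet<String>(Arrays.asList("sex"));
        FilteringSuggester suggester = new FilteringSuggester(fake, blockedWords, 2);
        System.out.println(suggester.suggest("sexe", 2)); // prints [seen, sets]
    }
}
```

The same loop would sit after the call to the superclass in a custom spell checker; only the source of the raw list changes.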

James Dyer
Ingram Content Group


-----Original Message-----
From: Alessandro Benedetti [mailto:benedetti.alex85@gmail.com] 
Sent: Friday, July 10, 2015 7:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Protwords in solr spellchecker

So let's try to analyse the situation from the spellchecking point of view.
First of all, we follow David's suggestion and add a stop filter
(solr.StopFilterFactory) with our configured "bad" words to the query-time analysis.

*Starting scenario*
- we have the protected words in our index, we still want them to be in
there

Let's explore the different kinds of spellcheckers available. Where do they
take their suggestions from?

*Index Based Spellchecker*
The suggestions will come from an auxiliary index.

*Direct Spellchecker*
The suggestions will come from the current index.

*File based spellchecker*
It uses an external file to get the spelling suggestions from, so we can
curate this file properly with only good words, and we are fine.
But I guess you would like to use a blacklist, whereas this approach gives
you a whitelist.

*Query Time*
At query time *the query is analysed* and a token stream is produced.
Then, depending on the implementation, a different lookup is triggered.
In the case of the Direct Spellchecker, if I remember well:
for each token an FST with all the supported inflections is generated and
intersected with the index FST (based on the field), and the
suggestion is returned.

Unfortunately, proper *query time analysis will not help*.
When we analyse the query we have the misspelled word "sexe", which is not
going to be recognised as the bad word.
Then the inflections are calculated, the FST is built, and the intersection
will actually produce the feared suggestion "sex".
This is because the word is in the index.
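
To make the reasoning above concrete, here is a small self-contained Java sketch. It is only a model, not how Solr implements it: a plain edit-distance scan over a hypothetical list of index terms stands in for the FST intersection described above, and the class and method names are invented for the example. It shows that an exact-match stop filter lets "sexe" through, and that the lookup against the index then still finds "sex".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class QueryTimeStopDemo {

    /** Query-time stop filter: drops only tokens that exactly match a stop word. */
    static List<String> analyze(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the index terms within maxEdits of the token, excluding the token itself. */
    static List<String> suggest(String token, List<String> indexTerms, int maxEdits) {
        List<String> out = new ArrayList<String>();
        for (String term : indexTerms) {
            if (!term.equals(token) && editDistance(token, term) <= maxEdits) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("sex"));
        List<String> indexTerms = Arrays.asList("sex", "seat", "book"); // made-up index content

        // The misspelling is not a stop word, so it survives query analysis.
        List<String> tokens = analyze(Arrays.asList("sexe"), stopWords);
        System.out.println(tokens); // prints [sexe]

        // The lookup runs against the index, where the bad word still lives.
        System.out.println(suggest(tokens.get(0), indexTerms, 1)); // prints [sex]
    }
}
```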

If we can't modify the index, the *Direct Spellchecker is not an option*, if
my understanding is correct.

Let's see if the Index Based spellcheck can help …
Unfortunately also in this case, the auxiliary index produced is based on
the analysed form of the original field.

If you really cannot re-index the content, I would suggest an
implementation based on a concept similar to the AnalyzingSuggester in Solr.

Happy to clarify any further questions.


2015-07-10 9:31 GMT+01:00 davidphilip cherian <davidphilipcherian@gmail.com>:

> Hi Kamal,
>
> Not necessarily. You can have different filters applied at index time and
> query time. (note that the order in which filters are defined matters). You
> could just add the stop filter at query time.
> Have your own custom field type defined (similar to 'text_en', which will be
> in schema.xml) and perhaps use the standard/whitespace tokenizer followed by
> a stop filter at query time.
>
> Tip: Use analysis tool that is available in solr admin page to further
> understand the analysis chain of data types.
>
> HTH
>
>
>
> On Fri, Jul 10, 2015 at 1:03 PM, Kamal Kishore Aggarwal <
> kkroyal.192@gmail.com> wrote:
>
> > Hi David,
> >
> > This is a good suggestion. But if I add these *adult* keywords to the
> > stopwords.txt file, it will require re-indexing the data related to
> > these keywords.
> >
> > How can I see the change instantly? Is there any other approach
> > that you can suggest?
> >
> >
> >
> >
> > On Thu, Jul 9, 2015 at 12:09 PM, davidphilip cherian <
> > davidphilipcherian@gmail.com> wrote:
> >
> > > The best bet is to use solr.StopFilterFactory.
> > > Have all such words added to stopwords.txt and add this filter to your
> > > analyzer.
> > >
> > > Reference links
> > >
> > > https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
> > > https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-StopFilter
> > >
> > > HTH
> > >
> > >
> > > On Thu, Jul 9, 2015 at 11:50 AM, Kamal Kishore Aggarwal <
> > > kkroyal.192@gmail.com> wrote:
> > >
> > > > Hi Team,
> > > >
> > > > I am currently working with Java-1.7, Solr-4.8.1 with tomcat 7. Is there
> > > > any feature by which I can prevent certain words from appearing in spell
> > > > suggestions?
> > > >
> > > > For example: if somebody searches for "sexe", I do not want to show "sex"
> > > > as the spell suggestion via Solr. How can I stop these kinds of keywords
> > > > from being shown in suggestions?
> > > >
> > > > Any help is appreciated.
> > > >
> > > >
> > > > Regards
> > > > Kamal Kishore
> > > > Solr Beginner
> > > >
> > >
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England