lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Indexing word with plus sign
Date Tue, 23 May 2017 18:02:19 GMT
Years ago at Netflix, I had to deal with a DVD from a band named “+/-”. I gave up and translated
that to “plusminus” at index and query time.

http://plusmin.us/

Luckily, “.hack//Sign” and other related dot-hack anime matched if I just deleted all
the punctuation. And everyone searched for “[•REC]²” as “rec2”. The middot is supposed
to be red. Movie studios are clueless about searchable strings.
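
A minimal sketch of that kind of normalization as a Solr field type. The field type name text_titles and the exact patterns are assumptions for illustration, not the configuration actually used at Netflix. An <analyzer> with no type attribute runs at both index and query time.

```xml
<!-- Hypothetical field type; the name text_titles is an assumption. -->
<fieldType name="text_titles" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: the same chain runs at index and query time. -->
  <analyzer>
    <!-- Rewrite the literal band name first, before any punctuation is dropped. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\+/-" replacement="plusminus"/>
    <!-- Then delete remaining ASCII punctuation, so ".hack//Sign" becomes "hackSign". -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\p{Punct}" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The order of the two char filters matters: if punctuation were deleted first, nothing would be left of the band name to translate. The query string still has to get past the query parser's operator handling before this chain ever sees it.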

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On May 23, 2017, at 10:41 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> 
> You need to distinguish between
> 
> PatternReplaceCharFilterFactory
> 
> and
> 
> PatternReplaceFilterFactory
> 
> The first one is applied to the entire input _before_ tokenization.
> The second is applied _after_ tokenization to individual tokens; by
> that time it's too late.
> 
> It's an easy thing to miss.
> 
> And at query time you'll have to be careful to keep the + sign from
> being interpreted as an operator.
> 
> Best,
> Erick
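
A sketch of the analyzer with the replacement moved into a char filter, so that it runs before the tokenizer. The single-token replacement investigacionydesarrollo is a made-up value, chosen so that StandardTokenizer does not split it again; the (?i) flag and the word boundaries are also assumptions, added so that "I+D" matches and longer strings do not.

```xml
<!-- No type attribute: the same chain runs at index and query time. -->
<analyzer>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
  <!-- Char filters see the raw text, so "i+d" is still intact here. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(?i)\bi\+d\b" replacement="investigacionydesarrollo"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
</analyzer>
```

At query time, a backslash escapes the operator in the standard query syntax (i\+d), and a literal + in a request URL has to be sent as %2B, since an unencoded + is decoded as a space.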
> 
> On Tue, May 23, 2017 at 10:12 AM, Fundera Developer
> <funderadeveloper@outlook.com> wrote:
>> I have also tried this option, by using a PatternReplaceFilterFactory, like this:
>> 
>> <filter class="solr.PatternReplaceFilterFactory" pattern="i\+d" replacement="investigación y desarrollo"/>
>> 
>> but it gets processed AFTER the Tokenizer, so when it executes there is no longer
>> an "i+d" token, but two independent tokens, "i" and "d".
>> 
>> Is there a way I could make the filter execute before the Tokenizer? I have tried
>> placing it first in the Analyzer definition, like this:
>> 
>>     <analyzer type="index">
>>       <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
>>       <filter class="solr.PatternReplaceFilterFactory" pattern="i\+d" replacement="investigación y desarrollo"/>
>>       <tokenizer class="solr.StandardTokenizerFactory"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>     </analyzer>
>> 
>> But I had no luck.
>> 
>> Are there any other approaches I could be missing?
>> 
>> Thanks!
>> 
>> 
>> On 22/05/17 at 20:50, Rick Leir wrote:
>> 
>> Fundera,
>> You need a regex which matches a '+' with non-blank chars before and after. It should
>> not replace a '+' preceded by white space, since Solr treats a leading '+' as an
>> operator. This is not a perfect solution, but might improve matters for you.
>> Cheers -- Rick
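
One way to write such a regex as a char filter, as a sketch only. The replacement word plus is an assumption; any string the tokenizer keeps in one piece would do. A capture group and a lookahead are used rather than a lookbehind, which would need its < escaped inside an XML attribute.

```xml
<!-- Replace a '+' only when it has a non-blank character on both sides. -->
<!-- "i+d" becomes "iplusd"; "+foo" and "a + b" are left alone. -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(\S)\+(?=\S)" replacement="$1plus"/>
```

As Rick says, this is not perfect: a run such as "c++" is only partly rewritten, because the second '+' follows a character that the first match has already consumed.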
>> 
>> On May 22, 2017 1:58:21 PM EDT, Fundera Developer <funderadeveloper@outlook.com> wrote:
>> 
>> 
>> Thank you Zahid and Erick,
>> 
>> I was going to try the CharFilter suggestion, but then I had doubts. I can see
>> how the indexing process would handle an occurrence of 'i+d',
>> but what happens at query time? If I use the same filter, I could
>> remove '+' chars that the user adds to mark compulsory
>> tokens in the search results, couldn't I? However, if I do not use the
>> CharFilter, I would not be able to match the 'i+d' search tokens...
>> 
>> Thanks all!
>> 
>> 
>> 
>> On 22/05/17 at 16:39, Erick Erickson wrote:
>> 
>> You can also use any of the other tokenizers. WhitespaceTokenizer for
>> instance. There are a couple that use regular expressions. Etc. See:
>> https://cwiki.apache.org/confluence/display/solr/Tokenizers
>> 
>> Each one has its considerations. WhitespaceTokenizer won't, for
>> instance, separate out punctuation, so you might then have to use a
>> filter to remove it. Regexes can be tricky to get right ;). Etc.
>> 
>> Best,
>> Erick
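
A sketch of the WhitespaceTokenizer route described above, with a token filter that trims punctuation from the ends of each token. The pattern and the LengthFilterFactory limits are assumptions for illustration.

```xml
<analyzer>
  <!-- Splits on whitespace only, so "i+d" survives as a single token. -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Trim punctuation from both ends of each token, e.g. "i+d," becomes "i+d". -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="^\p{Punct}+|\p{Punct}+$" replacement="" replace="all"/>
  <!-- Drop tokens that were nothing but punctuation and are now empty. -->
  <filter class="solr.LengthFilterFactory" min="1" max="255"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Because the '+' sits in the middle of the token, trimming the ends leaves it untouched. The trade-off is that other internal punctuation, such as hyphens or slashes, also stays inside tokens.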
>> 
>> On Mon, May 22, 2017 at 5:26 AM, Muhammad Zahid Iqbal
>> <zahid.iqbal@northbaysolutions.net>
>> wrote:
>> 
>> 
>> Hi,
>> 
>> 
>> Before applying the tokenizer, you can replace your special symbols with
>> some phrase to preserve them, and after tokenizing you can replace it back.
>> 
>> For example:
>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\+)"
>> replacement="xxx" />
>> 
>> 
>> Thanks,
>> Zahid iqbal
>> 
>> On Mon, May 22, 2017 at 12:57 AM, Fundera Developer
>> <funderadeveloper@outlook.com>
>> wrote:
>> 
>> 
>> 
>> Hi all,
>> 
>> I am a bit stuck on a problem that I feel must be easy to solve. In
>> Spanish it is common to find the term 'i+d'. We are working with Solr
>> 5.5, and StandardTokenizer splits it into 'i' and 'd'. Our index holds
>> documents in both Spanish and Catalan, and 'i' is a frequent word in
>> Catalan, so when a user searches for 'i+d' they sometimes get Catalan
>> documents as results.
>> 
>> I have tried to use the SynonymFilter, with something like:
>> 
>> i+d => investigacionYdesarrollo
>> 
>> But it does not seem to change anything.
>> 
>> Is there a way I could set an exception to the Tokenizer so that it
>> does
>> not split this word?
>> 
>> Thanks in advance!
>> 
>> 
>> 
>> 
>> 

