lucene-dev mailing list archives

From "Mauro Asprea (Issue Comment Edited) (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (SOLR-1279) ApostropheTokenizer
Date Thu, 16 Feb 2012 09:02:59 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209231#comment-13209231 ]

Mauro Asprea edited comment on SOLR-1279 at 2/16/12 9:02 AM:
-------------------------------------------------------------

I can confirm this works with the WordDelimiterFilterFactory, as Robert said:

{code}
<filter class="solr.WordDelimiterFilterFactory"
        stemEnglishPossessive="0"
        preserveOriginal="1"
        catenateAll="1"/>
{code}

Then, using the Solr Admin Analysis page, I get the following:
Value: McDonald's

||Indexed Term|
|McDonald's|
|Mc|
|Donald|
|s|
|McDonalds|
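
To make the table above easier to follow, here is a rough Python sketch of the splitting it implies. It is only an approximation written for illustration, not Lucene's actual code: the function name `word_delimiter_sketch` and its parameters are hypothetical, and it assumes the filter splits on non-alphanumeric characters and on lowercase-to-uppercase changes, keeps the original token, and appends the catenation of all parts.

```python
import re

def word_delimiter_sketch(token, preserve_original=True, catenate_all=True):
    """Approximate the token list shown in the table (illustration only)."""
    # Split on anything that is not a letter or digit (here: the apostrophe).
    chunks = [c for c in re.split(r"[^A-Za-z0-9]+", token) if c]
    parts = []
    for chunk in chunks:
        # Split where a lowercase letter is followed by an uppercase one.
        # Splitting on a zero-width match needs Python 3.7 or later.
        parts.extend(re.split(r"(?<=[a-z])(?=[A-Z])", chunk))
    if parts == [token]:
        # Nothing was split, so there is nothing extra to emit.
        return [token]
    out = [token] if preserve_original else []
    out.extend(parts)
    if catenate_all and len(parts) > 1:
        # Join every part back together without the delimiters.
        out.append("".join(parts))
    return out

print(word_delimiter_sketch("McDonald's"))
```

For "McDonald's" this gives the same five terms as the Analysis page output above. It says nothing about token positions or offsets, which the real filter also sets.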

One thing: make sure that no earlier filter in the chain removes the trailing "'s". In my case
I had the StandardFilterFactory, which does remove trailing apostrophes.
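
The ordering problem can be illustrated with a small Python sketch. Everything in it is a hypothetical stand-in: `strip_possessive` plays the role of an earlier filter that drops a trailing "'s", and `catenate_sketch` is a much simplified substitute for the catenation step, not the behaviour of any actual Solr class.

```python
def strip_possessive(tokens):
    # Stand-in for an earlier filter that removes a trailing "'s".
    return [t[:-2] if t.endswith("'s") else t for t in tokens]

def catenate_sketch(tokens):
    # Simplified stand-in: keep each token and add an apostrophe-free variant.
    out = []
    for t in tokens:
        out.append(t)
        if "'" in t:
            out.append(t.replace("'", ""))
    return out

def run_chain(tokens, filters):
    # Apply the filters in order, as an analyzer chain would.
    for f in filters:
        tokens = f(tokens)
    return tokens

# Apostrophe still present when the catenation step runs.
print(run_chain(["McDonald's"], [catenate_sketch]))
# Possessive already stripped, so no "McDonalds" variant is produced.
print(run_chain(["McDonald's"], [strip_possessive, catenate_sketch]))
```

In the second chain the "'s" is gone before the later step sees the token, so "McDonalds" never reaches the index.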
                
> ApostropheTokenizer
> -------------------
>
>                 Key: SOLR-1279
>                 URL: https://issues.apache.org/jira/browse/SOLR-1279
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Sergey Borisov
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: ApostropheTokenizer.zip
>
>
> ApostropheTokenizer creates extra tokens during the analysis stage for fields containing
> apostrophes. The reason for adding this is to ensure that documents that differ only by an
> apostrophe have the same relevancy score.
> For example, if the document contains the string "McDonald's", it will be tokenized as
> "McDonald's McDonalds". This way, a search for either "McDonald's" or "McDonalds" will
> produce a similar score.
> This code handles up to two apostrophes in a token.
> To use this tokenizer, add the following lines to schema.xml:
> <analyzer type="index">
>       <filter class="org.apache.lucene.analysis.ApostropheTokenFactory"/>
> ...
> </analyzer>
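
The idea in the issue description can be sketched in a few lines of Python. This is not the code from the attached ApostropheTokenizer.zip: the names `expand_apostrophes` and `index_terms` are made up for illustration, tokens are split on whitespace only, and the sketch removes every apostrophe instead of reproducing the "up to two apostrophes" limit.

```python
def expand_apostrophes(token):
    """Return the token plus an apostrophe-free variant, if one exists."""
    stripped = token.replace("'", "")
    if stripped == token or not stripped:
        return [token]
    return [token, stripped]

def index_terms(text):
    # Naive whitespace tokenization, then expansion of each token.
    terms = set()
    for token in text.split():
        terms.update(expand_apostrophes(token))
    return terms

doc_terms = index_terms("Lunch at McDonald's")
for query in ("McDonald's", "McDonalds"):
    # Both spellings find an exact term in the indexed document.
    print(query, query in doc_terms)
```

Because both spellings are indexed as terms, either query matches the document directly, which is the effect the description is after.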
