lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
Date Fri, 01 Jun 2012 14:37:24 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287447#comment-13287447 ]

Robert Muir commented on LUCENE-4063:
-------------------------------------

{quote}
That's why on the mailing list I also suggested we could have each stemmer share a common interface
that would filter non-stemmable literals out of the way
{quote}

We actually have this in place, but it's too limited. It's called KeywordAttribute. When this
is set, the stemmer will not touch the word.

Currently the only way to set this out of the box is to use KeywordMarkerFilter, which takes a
Set of protected words.
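
To make the mechanism concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the Token class, its keyword field (standing in for KeywordAttribute), markProtected (standing in for what KeywordMarkerFilter does with its Set) and the toy stem method are all hypothetical names invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class KeywordFlagSketch {

    /** Hypothetical stand-in for a token; `keyword` plays the role of KeywordAttribute. */
    static final class Token {
        final String text;
        boolean keyword;

        Token(String text) {
            this.text = text;
        }
    }

    /** Flags every token whose text appears in the protected set. */
    static void markProtected(List<Token> tokens, Set<String> protectedWords) {
        for (Token t : tokens) {
            if (protectedWords.contains(t.text)) {
                t.keyword = true;
            }
        }
    }

    /** Toy stemmer (not a real Lucene stemmer): drops one trailing "s" unless flagged. */
    static String stem(Token t) {
        if (t.keyword || t.text.length() < 2 || !t.text.endsWith("s")) {
            return t.text;
        }
        return t.text.substring(0, t.text.length() - 1);
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(new Token("villes"), new Token("paris"));
        markProtected(tokens, new HashSet<String>(Arrays.asList("paris")));
        for (Token t : tokens) {
            System.out.println(t.text + " -> " + stem(t));
        }
    }
}
```

In this sketch the protected word "paris" passes through unchanged, while "villes" loses its trailing "s". The point is only that the stemmer checks the flag first and otherwise stays unaware of how the flag was set.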

But to make your idea more flexible, I could imagine a couple more filters:
* one that marks tokens as Keyword based on a set of types. In this case you would just add NUM to
  that set, and no stemmer would touch any numbers. Of course, for French this is solved already,
  but imagine you are using the URLEmail tokenizer: I think a set like { URL, EMAIL } would be
  very useful, otherwise stemmers will probably muck with them.
* one that marks tokens as Keyword based on a regular expression. This could be good for
  fine-tuning stemmers for a lot of general-purpose needs: e.g. someone on the mailing list was
  previously unhappy about how Russian stemmers treated Russian place names, and they had a
  certain set of suffixes they didn't want stemmed.
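
The two proposed filters could look roughly like the sketch below. It is plain Java rather than Lucene code, and every name in it is hypothetical: Token, markByType, markByPattern, the type strings, and the ".*grad" suffix pattern are assumptions for illustration. The toy stem method collapses runs of a repeated character only to mimic the kind of damage described in this issue; it is not FrenchLightStemmer's actual algorithm.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class KeywordMarkerSketch {

    /** Hypothetical token with a text, a tokenizer-assigned type, and a keyword flag. */
    static final class Token {
        final String text;
        final String type;
        boolean keyword;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /** First proposed filter: flag the token if its type is in the protected set. */
    static void markByType(Token t, Set<String> protectedTypes) {
        if (protectedTypes.contains(t.type)) {
            t.keyword = true;
        }
    }

    /** Second proposed filter: flag the token if its whole text matches the pattern. */
    static void markByPattern(Token t, Pattern pattern) {
        if (pattern.matcher(t.text).matches()) {
            t.keyword = true;
        }
    }

    /** Toy stemmer: collapses runs of a repeated character unless the token is flagged. */
    static String stem(Token t) {
        return t.keyword ? t.text : t.text.replaceAll("(.)\\1+", "$1");
    }

    public static void main(String[] args) {
        Set<String> types = new HashSet<String>(Arrays.asList("NUM", "URL", "EMAIL"));
        Pattern placeSuffix = Pattern.compile(".*grad"); // hypothetical suffix rule

        Token[] tokens = {
            new Token("1000", "NUM"),
            new Token("http://www.example.com", "URL"),
            new Token("kaliningrad", "WORD"),
            new Token("balloon", "WORD")
        };
        for (Token t : tokens) {
            markByType(t, types);
            markByPattern(t, placeSuffix);
            System.out.println(t.text + " keyword=" + t.keyword + " -> " + stem(t));
        }
    }
}
```

Both markers only set the flag and never clear it, so they can be chained in any order in front of a stemmer.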

Anyway, I would really like to see these filters; I think they would be pretty simple to implement
as well.
                
> FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4063
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4063
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.4, 4.0
>            Reporter: Tanguy Moal
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-4063.patch, SOLR-3463.patch, SOLR-3463.patch, SOLR-3463.patch
>
>
> FrenchLightStemmer performs aggressive deletions on repeated character sequences, even on numbers.
> This might be unexpected during full text search.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

