lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2198) support protected words in Stemming TokenFilters
Date Sun, 17 Jan 2010 17:35:54 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801446#action_12801446 ]

Simon Willnauer commented on LUCENE-2198:
-----------------------------------------

I kind of agree with both of you. When I started implementing this attribute I had FlagAttribute
in mind, but I didn't choose it because users can arbitrarily set any bit of the flags word, which
might lead to unexpected behavior.

Another solution I had in mind is to introduce another Attribute (or extend FlagAttribute)
holding a Lucene-private (not the Java visibility keyword) enum that can be extended in the
future. Internally this could use a word or a BitSet (a word will do, I guess) where bits are
set according to the enum ordinal. That way we could encode far more than a single boolean,
and the cost of adding new "flags" / enum values would be minimal.

{code}
booleanAttribute.isSet(BooleanAttributeEnum.Keyword)
{code}

something like that, thoughts?
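
A minimal sketch of that idea, assuming a single long as the backing word. The names
TokenFlag, TokenFlags and PROPER_NOUN are hypothetical and only illustrate the shape; none of
them is an existing Lucene API, and a real version would be an Attribute implementation.

```java
/** Hypothetical flag names; only "Keyword" appears in the comment above. */
enum TokenFlag { KEYWORD, PROPER_NOUN }

/** Hypothetical holder standing in for the proposed attribute (not a Lucene class). */
final class TokenFlags {
    // One bit per enum ordinal, so a single long covers up to 64 flags.
    private long bits;

    void set(TokenFlag flag) {
        bits |= 1L << flag.ordinal();
    }

    void clear(TokenFlag flag) {
        bits &= ~(1L << flag.ordinal());
    }

    boolean isSet(TokenFlag flag) {
        return (bits & (1L << flag.ordinal())) != 0;
    }

    /** Resets every flag, as an attribute would do between tokens. */
    void clearAll() {
        bits = 0L;
    }
}

public class TokenFlagsDemo {
    public static void main(String[] args) {
        TokenFlags flags = new TokenFlags();
        flags.set(TokenFlag.KEYWORD);
        System.out.println(flags.isSet(TokenFlag.KEYWORD));     // true
        System.out.println(flags.isSet(TokenFlag.PROPER_NOUN)); // false
    }
}
```

Because callers only ever pass enum constants, they cannot pick an arbitrary bit, and adding a
flag later means adding one enum value.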

> support protected words in Stemming TokenFilters
> ------------------------------------------------
>
>                 Key: LUCENE-2198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2198
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2198.patch, LUCENE-2198.patch
>
>
> This is from LUCENE-1515
> I propose that all stemming TokenFilters have an 'exclusion set' that bypasses any stemming
> for words in this set.
> Some stemming TokenFilters have this, some do not.
> This would be one way for Karl to implement his new Swedish stemmer (as a text file of
> ignore words).
> Additionally, it would remove duplication between Lucene and Solr, as they reimplement
> SnowballFilter since it does not have this functionality.
> Finally, I think this is a pretty common use case, where people want to ignore things
> like proper nouns in the stemming.
> As an alternative design I considered a case where we generalized this to CharArrayMap
> (and ignoring words would mean mapping them to themselves), which would also provide a mechanism
> to override the stemming algorithm. But I think this is too expert, could be its own filter,
> and the only example of this I can find is in the Dutch stemmer.
> So I think we should just provide ignore with CharArraySet, but if you feel otherwise
> please comment.
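
The exclusion-set behaviour described in the issue can be illustrated outside Lucene with a
plain java.util.Set. This is only a sketch: ProtectedStemmer and toyStem are hypothetical names,
the suffix rule is a stand-in for a real stemming algorithm, and an actual patch would use
CharArraySet inside a TokenFilter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical wrapper: skips stemming for any term found in the exclusion set. */
final class ProtectedStemmer {
    private final Set<String> protectedWords;

    ProtectedStemmer(Set<String> protectedWords) {
        this.protectedWords = protectedWords;
    }

    String stem(String term) {
        // Protected terms bypass the stemmer and are returned unchanged.
        if (protectedWords.contains(term)) {
            return term;
        }
        return toyStem(term);
    }

    /** Stand-in for a real stemmer: strips one trailing "s" from terms longer than 3 chars. */
    static String toyStem(String term) {
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }
}

public class ProtectedStemmerDemo {
    public static void main(String[] args) {
        Set<String> exclusions = new HashSet<>(Arrays.asList("Williams"));
        ProtectedStemmer stemmer = new ProtectedStemmer(exclusions);
        System.out.println(stemmer.stem("Williams")); // Williams (protected)
        System.out.println(stemmer.stem("stemmers")); // stemmer
    }
}
```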

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

