lucene-dev mailing list archives

From Jan Høydahl (Commented) (JIRA) <j...@apache.org>
Subject [jira] [Commented] (SOLR-2764) Create a NorwegianLightStemmer and NorwegianMinimalStemmer
Date Thu, 02 Feb 2012 21:12:54 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199227#comment-13199227 ]

Jan Høydahl commented on SOLR-2764:
-----------------------------------

When looking at words ending in -het and -dom in dictionaries (such as OOo nb_NO.dic), the
base word has the same meaning in the vast majority of cases. But of course there will be
exceptions. Take the word "brennhet" (het as in hot): it will be stemmed to "brenn" ->
"bren", which is kind of wrong, but since "bren" is not a valid word it won't cause errors.
There may be cases where the final stem clashes with another word, but no more often than
with the base rules. E.g. there is a Norwegian surname "Brenna" which will be stemmed to "brenn"
by the "-a" rule, which takes the "-a" for a fem. definite ending, and then we get a clash with
the verb "brenn" (burn). And the first names "Tore" (boy) and "Tora" (girl) will be stemmed to
"Tor" (boy), which is another valid first name...
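
To make the clash argument concrete, here is a minimal sketch in Python (not the
patch code, which is Java). The suffix list, the three-letter minimum stem length
and the names light_stem and clashes are all assumptions made for illustration.
It lowercases each word, strips at most one inflectional ending, and reports the
stems shared by more than one input word:

```python
from collections import defaultdict

# Assumed toy rule set: only the inflectional endings mentioned in this thread.
INFLECTIONAL = ("en", "et", "a", "e", "n")


def light_stem(word, min_stem=3):
    """Strip at most one inflectional ending, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in INFLECTIONAL:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def clashes(words):
    """Map each stem to the input words sharing it, keeping only real clashes."""
    groups = defaultdict(set)
    for w in words:
        groups[light_stem(w)].add(w)
    return {stem: sorted(ws) for stem, ws in groups.items() if len(ws) > 1}


print(light_stem("Brenna"))  # brenn
print(clashes(["Tore", "Tora", "Tor", "Brenna", "linje"]))
# {'tor': ['Tor', 'Tora', 'Tore']}
```

Running something like this over a full word list would give a rough count of how
many clashes each rule introduces.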

My hunch is that the -dom/-het rules do more good than harm, mainly because in the majority
of cases they lead to the base word and the -het/-dom word being stemmed to the same stem,
even in cases where the "-en/-et/-a/-e/-n" rules are applied wrongly. Example:

{noformat}
One pass                       Two passes
forlegen        forleg         forlegen        forleg
forlegenhet     forlegen       forlegenhet     forleg
forlegenheten   forlegen       forlegenheten   forleg
forlegenhetens  forlegen       forlegenhetens  forleg
firkantet       firkant        firkantet       firkant
firkantethet    firkantet      firkantethet    firkant
firkantetheten  firkantet      firkantetheten  firkant
{noformat}
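
The table above can be reproduced with a small Python sketch. This is not the code
from the attached patch; the two suffix lists, the minimum stem length and the
function names are assumptions chosen only to illustrate the difference. "One pass"
strips the first matching suffix from a combined list; "two passes" first strips a
-het ending and then an inflectional ending:

```python
# Assumed toy rule sets, longest suffix first within each list.
DERIVATIONAL = ("hetens", "heten", "het")
INFLECTIONAL = ("en", "et", "a", "e", "n")
MIN_STEM = 3  # assumed minimum number of letters left after stripping


def strip_first(word, suffixes):
    """Remove the first suffix in `suffixes` that matches, if the stem stays long enough."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def one_pass(word):
    # A single lookup against all rules: at most one ending is removed.
    return strip_first(word, DERIVATIONAL + INFLECTIONAL)


def two_pass(word):
    # First remove a -het ending, then apply the inflectional rules to the result.
    return strip_first(strip_first(word, DERIVATIONAL), INFLECTIONAL)


for w in ("forlegen", "forlegenhet", "forlegenheten", "forlegenhetens",
          "firkantet", "firkantethet", "firkantetheten"):
    print(f"{w:<16}{one_pass(w):<12}{two_pass(w)}")
```

With one pass the base word and its -het forms end up on different stems
("forleg" vs. "forlegen"); with two passes they all meet on "forleg".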

But I think maybe the rules -dommer and -dommen should be removed, because "dommer"
(judge) and "dommen" (the sentence) are both common words that also occur as the last part
of compounds. So the word "linjedommer" (linesman) would be stemmed to "linje" (line),
which is too aggressive.

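The effect of the -dommer/-dommen rules can be sketched in a few lines of Python.
Again this is only an illustration under assumed rule lists, not the patch itself;
strip_suffix and the two tuples are hypothetical names, and "barndom" (childhood)
-> "barn" (child) is my own example of a case where the plain -dom rule behaves
as intended:

```python
def strip_suffix(word, suffixes, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Assumed rule lists, longest suffix first.
WITH_DOMMER = ("dommer", "dommen", "dom")
WITHOUT_DOMMER = ("dom",)

print(strip_suffix("linjedommer", WITH_DOMMER))     # linje
print(strip_suffix("linjedommer", WITHOUT_DOMMER))  # linjedommer
print(strip_suffix("barndom", WITHOUT_DOMMER))      # barn
```

Dropping the two rules leaves compounds ending in "dommer" untouched, at the cost
that inflected forms ending in -dommen are no longer reduced by this rule.
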
I see that it soon gets complicated when we try to be clever. Should we go back to a single
pass for the light stemmer? Christian?
                
> Create a NorwegianLightStemmer and NorwegianMinimalStemmer
> ----------------------------------------------------------
>
>                 Key: SOLR-2764
>                 URL: https://issues.apache.org/jira/browse/SOLR-2764
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Jan Høydahl
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch
>
>
> We need a simple light-weight stemmer and a minimal stemmer for plural/singular only
> in Norwegian


       
