lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5013) ScandinavianInterintelligableASCIIFoldingFilter
Date Wed, 22 May 2013 18:13:20 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664364#comment-13664364 ]

Robert Muir commented on LUCENE-5013:
-------------------------------------

This is conceptually similar to the one for German (algorithm created by the Snowball folks,
but factored out of their stemmer):
http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanNormalizationFilter.java?view=markup

I think it's nice to provide filters like this with language-specific normalizations, though
maybe the name could be simpler (ScandinavianNormalizationFilter?).
                
> ScandinavianInterintelligableASCIIFoldingFilter
> -----------------------------------------------
>
>                 Key: LUCENE-5013
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5013
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.3
>            Reporter: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-5013.txt
>
>
> This filter augments the output of ASCIIFoldingFilter:
> it collapses the double vowels aa, ae, ao, oe and oo, keeping only the first letter of each pair.
> blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
> räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas
> Caveats:
> Since this filter runs on top of ASCIIFoldingFilter, äöåøæ have already been folded
> down to aoaoae by the time it sees them, which causes effects such as:
> bøen -> boen -> bon
> åene -> aene -> ane
> I find this to be a trivial problem compared to not finding anything at all.
> Background:
> Swedish åäö are in fact the same letters as Norwegian and Danish åæø and are thus interchangeable
> between these languages. They are, however, folded differently when people type
> them on a keyboard lacking these characters, and ASCIIFoldingFilter handles ä and æ differently.
> When a Swedish person lacks umlauted characters on the keyboard, they consistently
> type a, a, o instead of å, ä, ö. Foreigners also tend to use a, a, o.
> In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use a, a,
> o. I've also seen oo, ao, etc., and permutations. Not sure about Denmark, but the pattern is
> probably the same.
> This filter solves that problem, but might also cause new ones.
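
The folding described in the issue can be sketched in plain Java, without any Lucene dependency. This is only an illustration of the two steps as the description states them (ASCII folding first, then collapsing aa, ae, ao, oe and oo to their first letter); it is not the code from the attached LUCENE-5013.txt. The class name ScandinavianFoldSketch and its methods are hypothetical, asciiFold only approximates ASCIIFoldingFilter for the five lowercase letters named in the issue, and input is assumed to be lowercased already.

```java
import java.util.List;

// Hypothetical sketch of the behaviour described in LUCENE-5013; not the attached patch.
class ScandinavianFoldSketch {

    // Step 1: approximate ASCIIFoldingFilter for the five letters named in the issue.
    // Assumption: input is already lowercase; all other characters pass through unchanged.
    static String asciiFold(String s) {
        StringBuilder out = new StringBuilder(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u00e5': // a with ring
                case '\u00e4': // a with diaeresis
                    out.append('a');
                    break;
                case '\u00f6': // o with diaeresis
                case '\u00f8': // o with stroke
                    out.append('o');
                    break;
                case '\u00e6': // ae ligature
                    out.append("ae");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    // Step 2: in a single left-to-right pass, collapse the pairs
    // aa, ae, ao, oe and oo, keeping only the first letter of each pair.
    static String collapseDoubleVowels(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            out.append(c);
            if (i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                boolean pair =
                        (c == 'a' && (next == 'a' || next == 'e' || next == 'o'))
                     || (c == 'o' && (next == 'e' || next == 'o'));
                if (pair) {
                    i++; // skip the second letter of the pair
                }
            }
            i++;
        }
        return out.toString();
    }

    static String fold(String s) {
        return collapseDoubleVowels(asciiFold(s));
    }

    public static void main(String[] args) {
        // Spellings taken from the issue description, written with Unicode escapes.
        List<String> words = List.of(
                "bl\u00e5b\u00e6rsyltet\u00f8j",
                "blaabaarsyltetoej",
                "r\u00e6ksm\u00f6rgaos",
                "b\u00f8en");
        for (String w : words) {
            System.out.println(w + " -> " + fold(w));
        }
    }
}
```

Under these assumptions the spelling variants from the description fold to the same string, and the caveat is reproduced too: bøen becomes boen after step 1 and bon after step 2, because step 2 cannot tell a folded ø followed by e from a typed oe.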

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


