lucene-java-user mailing list archives

From Petite Abeille <petite_abei...@mac.com>
Subject Re: Blåbærsyltetøy v.s. Räksmörgås
Date Wed, 22 May 2013 18:29:03 GMT

On May 22, 2013, at 7:08 PM, Karl Wettin <karl.wettin@kodapan.se> wrote:

>> * Use a filter after ASCIIFoldingFilter that discriminate all use of ae, oe, oo,
>> and other combination of double vowels, just keeping the first one.
> 
> I ended up with that solution.
> 
> https://issues.apache.org/jira/browse/LUCENE-5013

Interesting problem. Perhaps you could generalize your solution a bit: in German, for
example, 'ü' is commonly written as 'ue'. So it looks like what you are after is folding
double vowels, irrespective of how they got there.

So, assuming something along the lines of Sean M. Burke's Unidecode [1] for the purpose of
ASCII transliteration, what's left is simply to fold double vowels, e.g.:

-- Unidecode: any ASCII transliteration function along the lines of
-- Text::Unidecode [1], assumed to be in scope.
local words = {
  'blåbærsyltetøj', 'blåbärsyltetöj', 'blaabaarsyltetoej', 'blabarsyltetoj',
  'Räksmörgås', 'Göteborg', 'Gøteborg', 'Über', 'ueber', 'uber', 'uuber',
}

for index, word in ipairs( words ) do
  -- Keep the first vowel of each pair; the outer parentheses drop gsub's match count.
  print( index, ( Unidecode( word ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) ) )
end

> 1	blabarsyltetoj
> 2	blabarsyltetoj
> 3	blabarsyltetoj
> 4	blabarsyltetoj
> 5	raksmorgas
> 6	goteborg
> 7	goteborg
> 8	uber
> 9	uber
> 10	uber
> 11	uber
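
For anyone without a Unidecode port at hand, the same folding can be sketched in plain Lua.
This is only a sketch under a few assumptions: it needs Lua 5.3 or later for
utf8.charpattern, the source file has to be saved as UTF-8, and TRANSLIT, transliterate
and fold are made-up names. The lookup table is a hypothetical stand-in for Unidecode and
covers only the characters seen in this thread.

```lua
-- Hypothetical stand-in for Unidecode: a tiny lookup table covering only
-- the characters used in this thread (not a general transliterator).
local TRANSLIT = {
  [ 'å' ] = 'a', [ 'ä' ] = 'a', [ 'æ' ] = 'ae',
  [ 'ö' ] = 'o', [ 'ø' ] = 'o', [ 'ü' ] = 'u',
  [ 'Å' ] = 'A', [ 'Ä' ] = 'A', [ 'Æ' ] = 'AE',
  [ 'Ö' ] = 'O', [ 'Ø' ] = 'O', [ 'Ü' ] = 'U',
}

-- Look up each UTF-8 character in TRANSLIT; gsub keeps the original
-- character whenever the table has no entry for it.
local function transliterate( text )
  return ( text:gsub( utf8.charpattern, TRANSLIT ) )
end

-- Transliterate, lowercase, then keep only the first vowel of each pair.
-- The outer parentheses drop gsub's second return value (the match count).
local function fold( text )
  return ( transliterate( text ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
end

print( fold( 'Blåbærsyltetøy' ) )  --> blabarsyltetoy
print( fold( 'Räksmörgås' ) )      --> raksmorgas
print( fold( 'ueber' ) )           --> uber
```

Note that the fold is lossy in both directions: it also shortens ordinary double vowels,
so 'boot' comes out as 'bot'. Whether that is acceptable depends on the index.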



[1] http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

