incubator-ooo-dev mailing list archives

From Herbert Duerr <...@apache.org>
Subject i18nregexp replaced with ICU regexp => heads up
Date Fri, 30 Sep 2011 13:08:18 GMT
Hi,

To remove "category X excluded licenses" from Apache OpenOffice, I
replaced the formerly used LGPL-licensed module i18nregexp with the
regular expression engine of the ICU module, which is already widely
used in OpenOffice.

The replacement fixes a number of problems. For example, in the text
"abcabc" a "find all backwards" for "b" used to find only the last "b";
now it actually finds all of them. It also introduces some changes:
i18nregexp had two modes, "classic" and "extended" regexp, whereas the
ICU-based engine treats all patterns as extended regexps.
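
To make the "find all backwards" expectation concrete, here is a minimal
sketch. It uses Python's re module purely as a stand-in, not the ICU
engine or the OpenOffice search code, and the helper name
find_all_backwards is made up for illustration.

```python
import re

def find_all_backwards(pattern: str, text: str) -> list[int]:
    """Return the start offset of every match, last match first.

    Hypothetical helper: it only illustrates the expected result of a
    backwards "find all", not how the ICU-based search is implemented.
    """
    starts = [m.start() for m in re.finditer(pattern, text)]
    starts.reverse()  # report matches from the end of the text backwards
    return starts

# Both occurrences of "b" are reported, not just the last one.
print(find_all_backwards("b", "abcabc"))  # [4, 1]
```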

I18nregexp used an approach where it transliterated and compared each 
codepoint pair of the pattern and text string. The new engine does the 
transliteration only once per pattern and text string. This is much 
faster, but it only works because the transliteration was tweaked to 
preserve the special regexp control characters.
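
The following is a rough sketch of the once-per-string idea, under
loud assumptions: plain lowercasing stands in for a real transliteration
rule, Python's re module stands in for ICU, and "preserving control
characters" is modelled only as copying backslash escapes through
unchanged. The names fold, fold_pattern and search are hypothetical.

```python
import re

def fold(text: str) -> str:
    # Assumption: lowercasing is a stand-in for a real transliteration.
    return text.lower()

def fold_pattern(pattern: str) -> str:
    """Transliterate a pattern once, leaving backslash escapes untouched.

    Without this, folding r"\S" would yield r"\s" and silently change
    the meaning of the pattern.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # copy the escape verbatim
            i += 2
        else:
            out.append(fold(pattern[i]))
            i += 1
    return "".join(out)

def search(pattern: str, text: str):
    # One transliteration pass per string, then a single regexp search.
    # Match offsets refer to the folded text.
    return re.search(fold_pattern(pattern), fold(text))

print(fold_pattern(r"A\SC"))            # a\Sc
print(search(r"A\SC", "xaBcx").span())  # (1, 4)
```

A naive fold of the whole pattern would turn r"A\SC" into r"a\sc",
which no longer matches "abc"; that is the kind of breakage the tweak
is meant to avoid.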

Reporters of any issues in the lists below are encouraged to check
whether the problems they saw still occur with the new engine.
https://issues.apache.org/ooo/buglist.cgi?quicksearch=regexp
https://issues.apache.org/ooo/buglist.cgi?quicksearch=regular\ expression
Please make sure to have the "More Options -> Regular Expressions" 
checkbox activated for testing.

I'm afraid the regexp replacement changes behavior mostly for Japanese
users, because a lot of non-trivial transliterations are active for
Japanese. For reference, the active rules are:
"ProlongedSoundMark", "IterationMark", "Ignore-Width", "BaFa", "SeZe",
"HyuByu", "IandEfollowedByYa" and "KiKuFollowedBySa".

Herbert
