incubator-ooo-dev mailing list archives

From RGB ES <rgb.m...@gmail.com>
Subject Re: i18nregexp replaced with ICU regexp => heads up
Date Tue, 03 Jan 2012 18:13:39 GMT
2011/9/30 Herbert Duerr <hdu@apache.org>

> Hi,
>
> for removing "category X excluded licenses" from Apache OpenOffice I
> replaced the formerly used LGPL licensed module i18nregexp with the regular
> expression engine of module ICU which is already widely used in OpenOffice.
>
> The replacement fixes a lot of problems: e.g. in a text "abcabc" trying to
> "find all backwards" for "b" resulted in it only finding the last "b", now
> it actually finds all of them. It also introduces some changes, e.g.
> i18nregexp had two modes "classic" and "extended" regexp whereas the ICU
> based engine treats all patterns as extended-regexp.
>
> I18nregexp used an approach where it transliterated and compared each
> codepoint pair of the pattern and text string. The new engine does the
> transliteration only once per pattern and text string. This is much faster,
> but it only works because the transliteration was tweaked to preserve the
> special regexp control characters.
>
> The reporters of any issues in the lists below are encouraged to check the
> problems they saw with the new engine.
> https://issues.apache.org/ooo/buglist.cgi?quicksearch=regexp
> https://issues.apache.org/ooo/buglist.cgi?quicksearch=regular\expression
> Please make sure to have the "More Options -> Regular Expressions"
> checkbox activated for testing.
>
> I'm afraid the regexp replacement mostly affects Japanese users, because
> many non-trivial transliterations are active for that locale. For
> reference I'm enumerating the active rules: "ProlongedSoundMark",
> "IterationMark", "Ignore-Width", "BaFa", "SeZe", "HyuByu",
> "IandEfollowedByYa" and "KiKuFollowedBySa".
>
> Herbert
>

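The single-pass transliteration described in the quoted message can be illustrated with a minimal Python sketch. It is only an analogy: simple lowercasing stands in for the real transliteration rules, Python's re module stands in for ICU, and the helper names fold_pattern and find_all are hypothetical, not taken from the OpenOffice code. The sketch folds the pattern and the text once each and leaves backslash escapes untouched, since folding them naively would turn \D (non-digit) into \d (digit).

```python
import re


def fold_pattern(pattern: str) -> str:
    """Lowercase a pattern once, keeping backslash escapes such as \\D verbatim.

    Lowercasing is a stand-in (assumption) for the real transliteration rules.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # copy the escape unchanged
            i += 2
        else:
            out.append(ch.lower())
            i += 1
    return "".join(out)


def find_all(pattern: str, text: str) -> list:
    """Fold pattern and text once each, then match on the folded text.

    Assumes folding keeps the text length unchanged, so match offsets
    can be applied to the original text (true for this ASCII sample).
    """
    folded_text = text.lower()
    regex = re.compile(fold_pattern(pattern))
    return [text[m.start():m.end()] for m in regex.finditer(folded_text)]


sample = "xAbC-abc1"
preserved = find_all(r"ABC\D", sample)
# Naive folding lowercases the escape too, so \D silently becomes \d.
naive = re.findall(r"ABC\D".lower(), sample.lower())
print(preserved, naive)
```

With the escapes preserved, the pattern still means "abc followed by a non-digit"; the naive variant matches "abc" followed by a digit instead, which shows why the control characters have to survive the transliteration step.
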
Sorry for reactivating this old thread, but I have a question about the new
regexp engine: it seems that some regular expressions do not work any more
on AOO test builds. For example, on OOo 3.3 you can use
\<[0-9]+[,|\.][0-9]*\>
to find decimal numbers whether the decimal separator is a comma or a
dot (the expression will find 125.25 and 1253,586), but this expression does
not work on AOO builds.
Are there changes in the regexp syntax? If so, where are those changes
documented?
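
One possible explanation, which is an assumption and not something confirmed in this thread, is that the new engine no longer treats \< and \> as word-start and word-end anchors but as escaped literal "<" and ">" characters. The sketch below uses Python's re module purely as a stand-in for an engine that behaves that way (it is not ICU) and compares the pattern above with a variant that uses \b for word boundaries. Note also that "|" inside a character class is a literal character, so [,|\.] also accepts "|"; [,.] is enough.

```python
import re

text = "125.25 and 1253,586"

# Pattern as written for OOo 3.3, where \< and \> anchor at word start/end.
ooo_pattern = r"\<[0-9]+[,|\.][0-9]*\>"

# Same intent with \b word boundaries and without the stray "|" in the class.
portable_pattern = r"\b[0-9]+[,.][0-9]*\b"

# In Python's re, \< and \> are literal angle brackets, so nothing matches here.
ooo_matches = re.findall(ooo_pattern, text)

# The \b variant finds both numbers.
portable_matches = re.findall(portable_pattern, text)

# The original pattern only matches when the brackets are literally present.
literal_matches = re.findall(ooo_pattern, "<125.25>")

print(ooo_matches, portable_matches, literal_matches)
```

If the AOO builds behave the same way, replacing \< and \> with \b in the search field would be worth trying.
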
Thanks in advance
Ricardo
