harmony-dev mailing list archives

From Tim Ellison <t.p.elli...@gmail.com>
Subject Re: [jira] Created: (HARMONY-62) java.text.BreakIterator.getSentenceInstance().next() treats '\n' as the end of the sentence
Date Thu, 02 Mar 2006 15:28:26 GMT
Art,

(Found your note languishing in my reader -- sorry it took so long to
reply.)

While the wording of the Java spec may allow us to vary the behavior of
the break iterator, it will be of cold comfort to any apps that we
disrupt if the results are significantly different.

As Richard points out, we are relying upon the fine folk in the ICU
project to implement the break algorithms, and I know that they have
done a lot of work to conform to the latest Unicode specs.  However,
where there is a significant difference (and this bug report may well be
one of those cases) I believe we should tune ICU's default break
iterator with some custom rules to better match the reference
implementation behavior.

Do you have any examples of applications that lay out text which we could
use as test cases?
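
In the meantime, here is a minimal sketch of the kind of probe that could
be run on both the RI and Harmony to compare results. It uses only the
public java.text.BreakIterator API; the class name SentenceBreakProbe, the
helper sentenceBoundaries, and the sample string are my own illustrative
choices, not existing Harmony test code.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical probe: lists every sentence boundary reported for a string.
class SentenceBreakProbe {

    // Collects boundary offsets from first() until next() returns DONE.
    static List<Integer> sentenceBoundaries(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            boundaries.add(pos);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // Sample input with an embedded '\n', as in HARMONY-62.
        String text = "First line\nsecond line. Third sentence.";
        // Run on each implementation and diff the printed offsets.
        System.out.println(sentenceBoundaries(text));
    }
}
```

If the two implementations disagree on whether there is a boundary just
after the '\n', that would pin down the difference reported in this bug.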

Regards,
Tim

Art - Arthit Suriyawongkul wrote:
>> As you may know, our (Harmony) implementation just wraps ICU4J's
>> BreakIterator. And the rules of ICU4J's BreakIterator are compliant with
>> Unicode TR29, which differ from the rules of the RI.
>>
>> This is a common issue for most of the classes in "text". If we want
>> implementation to have the same behavior as RI, we should get the rules
>> of RI. However, I think the rules must be covered by some kind of
>> license. So a better solution may be wrapping ICU4J's implementation for
>> all text (internationalization) classes. As far as I know, ICU4J is
>> specialized for i18n.
> 
> Imho, I don't think that different BreakIterator implementations have
> to produce exactly the same result ("boundary analysis").
> 
> What I meant is, the "Behavior" of them should all be the same,
> conforming to what is described in the Java API doc
>   http://java.sun.com/j2se/1.5.0/docs/api/java/text/BreakIterator.html
> 
>  Line boundary analysis determines where ...
>  Sentence boundary analysis allows ...
>  Word boundary analysis is ...
>  Character boundary analysis ...
> 
> But their result, the "Boundary Analysis", need not be the same;
> it just depends on how well each implementation performs.
> 
> That's my opinion.
> 
> cheers,
> Art
> 
> --
> :: Art / Arthit Suriyawongkul
> :: Applied Computational Linguistics Lab, Uni Potsdam
> :: http://www.ling.uni-potsdam.de/acl-lab/
> :: http://bact.blogspot.com/
> 
> **  Impeach Thaksin   http://tuthaprajan.org

-- 

Tim Ellison (t.p.ellison@gmail.com)
IBM Java technology centre, UK.
