opennlp-dev mailing list archives

From Jörn Kottmann
Subject Re: encoding...
Date Fri, 18 Nov 2011 09:18:35 GMT
On 11/18/11 2:26 AM, James Kosin wrote:
> As everyone may know, I'm on an encoding hunt, and I would like some
> feedback on some changes coming soon.
> This concerns the CLI only. The CLI often uses the platform's
> default encoding, which may or may not be desirable. One reason is that
> the input or output may become corrupted, causing training issues
> or even usage issues for the operator. I'm not sure whether the | pipe
> operator has the same issue; however, a recent check of some converted
> files showed that the platform encoding can be undesirable, especially
> if the output encoding cannot represent the input characters from
> another encoding. Internally, the classes that open and read
> files don't have this issue, so the libraries themselves are safe.
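
To illustrate the corruption described in the quoted text, here is a minimal sketch in plain Java. It is not OpenNLP code; the class name EncodingLossDemo and the method roundTrip are made up for this example. It encodes a string with a charset that cannot represent one of its characters and decodes it again.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OpenNLP.
final class EncodingLossDemo {

    // Encode the text with the given charset, then decode the bytes with the same charset.
    // String.getBytes(Charset) silently replaces characters the charset cannot represent.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Jörn", written with a Unicode escape so the encoding of this source file does not matter.
        String text = "J\u00f6rn";

        System.out.println("Platform default encoding: " + Charset.defaultCharset().name());

        // US-ASCII has no 'ö', so the round trip yields "J?rn".
        System.out.println("Via US-ASCII: " + roundTrip(text, StandardCharsets.US_ASCII));

        // UTF-8 can represent every character, so the string survives unchanged.
        // How it is displayed still depends on the console's own encoding.
        System.out.println("Via UTF-8: " + roundTrip(text, StandardCharsets.UTF_8));
    }
}
```

The replacement happens without any error, which is why this kind of damage can go unnoticed until training or later use.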

In my opinion, it was simply a bad decision to let the format package write the
transformed text to standard output.
I suggest that we change it and always write to an output file instead.

Perhaps we should also echo the encoding to the console, so the user
knows which one was used.
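
A rough sketch of what I mean, assuming plain java.nio and nothing from our code base. The class ConvertedTextWriter, its write method, and the default file name converted.txt are hypothetical and only illustrate the idea: the encoding is passed explicitly, reported on the console, and the text goes to a file rather than to standard output.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not an existing OpenNLP class.
final class ConvertedTextWriter {

    // Write the lines to outFile in the given encoding and tell the user which encoding was used.
    static void write(Path outFile, Charset encoding, Iterable<String> lines) throws IOException {
        // Diagnostics go to stderr so they never mix with data.
        System.err.println("Writing " + outFile + " using encoding " + encoding.name());

        try (BufferedWriter writer = Files.newBufferedWriter(outFile, encoding)) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed argument order for this sketch: <output file> [encoding]
        Path outFile = Paths.get(args.length > 0 ? args[0] : "converted.txt");
        Charset encoding = args.length > 1 ? Charset.forName(args[1]) : StandardCharsets.UTF_8;

        write(outFile, encoding, Arrays.asList("first sentence", "second sentence"));
    }
}
```

With this shape the platform default encoding is never consulted for the data itself; it only affects how the diagnostic line is displayed.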

Should we also change our small demo tools? There I believe it is confusing
when the user uses an encoding and then cannot see the result on the 

