opennlp-dev mailing list archives

From "william.colen@gmail.com" <william.co...@gmail.com>
Subject Re: svn commit: r1145578 - in /incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline: BasicEvaluationParameters.java sentdetect/SentenceDetectorEvaluatorTool.java tokenizer/TokenizerMEEvaluatorTool.java
Date Tue, 12 Jul 2011 14:49:33 GMT
Jörn,

I'm wondering how to implement the EncodingParameter interface.

This is not allowed, because annotation attribute values must be
compile-time constants:

  @ParameterDescription(valueName = "charsetName",
      description = "specifies the encoding which should be used for reading and writing text")
  @OptionalParameter(defaultValue=Charset.defaultCharset().name())
  Charset getEncoding();

Will we need some special handling in ArgumentParser for that? Maybe we
could define a constant "DEFAULT_CHARSET" and handle it in
ArgumentParser.parse?
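
To make the constant idea concrete, here is a minimal sketch. It is only an illustration under assumptions: the EncodingDefaults helper and its resolve method are hypothetical names, and the OptionalParameter annotation declared here is a local stand-in so the example compiles on its own, not the project's real annotation. The point is that a String constant is legal as an annotation value, and the parser can translate it into the platform default at parse time.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.nio.charset.Charset;

// Local stand-in annotation for illustration only (hypothetical).
@Retention(RetentionPolicy.RUNTIME)
@interface OptionalParameter {
    String defaultValue() default "";
}

// Hypothetical helper; not part of OpenNLP.
final class EncodingDefaults {
    // A compile-time constant, so it may be used as an annotation value.
    static final String DEFAULT_CHARSET = "DEFAULT_CHARSET";

    private EncodingDefaults() {}

    // Maps the sentinel (or a missing value) to the platform default;
    // any other value is looked up by charset name.
    static Charset resolve(String value) {
        if (value == null || DEFAULT_CHARSET.equals(value)) {
            return Charset.defaultCharset();
        }
        return Charset.forName(value);
    }

    public static void main(String[] args) throws Exception {
        // Read the default from the annotation, as a parser might do.
        String declared = EncodingParameter.class
                .getMethod("getEncoding")
                .getAnnotation(OptionalParameter.class)
                .defaultValue();
        System.out.println(resolve(declared));     // platform default
        System.out.println(resolve("ISO-8859-1")); // explicit charset
    }
}

interface EncodingParameter {
    @OptionalParameter(defaultValue = EncodingDefaults.DEFAULT_CHARSET)
    Charset getEncoding();
}
```

With this approach the sentinel never reaches Charset.forName, so the special case stays in one place.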


On Tue, Jul 12, 2011 at 10:55 AM, Jörn Kottmann <kottmann@gmail.com> wrote:

> On 7/12/11 3:45 PM, william.colen@gmail.com wrote:
>
>> Yes, Jörn. I don't think UTF-8 is a good choice for the default. None of
>> the data I use for Portuguese benefits from UTF-8 being the default,
>> because all the corpora I have are Latin1 and my system default is
>> neither UTF-8 nor Latin1.
>>
>> Using the system default looks nice because we often have to use the
>> converter tools, and those output the system default. If we convert, train
>> and evaluate on the same system, we would need to set the encoding
>> parameter only once.
>>
>
> This is actually a weakness. I have a MacBook, and my default encoding
> is MacRoman. I once tried to write Japanese text to stdout, but that didn't
> work with MacRoman, and more or less all chars were replaced with a
> question mark (if I remember correctly).
>
> We might need to change that one day, so the output is always written to a
> file.
>
> I don't really know which of the two ways is better, always specifying the
> encoding or using the default; anyway, I am +1 for both. If you think we
> should go the more standard way and use the default encoding, then let's
> do that.
>
> Jörn
>
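
The question-mark effect described above can be reproduced with a small sketch. This is an illustration under assumptions: UnmappableDemo and roundTrip are hypothetical names, and ISO-8859-1 stands in for MacRoman because it is guaranteed to exist on every JVM. String.getBytes(Charset) substitutes the charset's replacement byte for characters it cannot encode, which for ISO-8859-1 is '?'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of OpenNLP.
final class UnmappableDemo {
    private UnmappableDemo() {}

    // Encodes the text with the given charset and decodes it again,
    // which is roughly what happens when text is written and read back.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source
        // file encoding does not matter.
        String japanese = "\u65E5\u672C\u8A9E";

        // The single-byte charset cannot represent them.
        System.out.println(roundTrip(japanese, StandardCharsets.ISO_8859_1)); // prints "???"

        // UTF-8 preserves the text.
        System.out.println(roundTrip(japanese, StandardCharsets.UTF_8).equals(japanese)); // prints "true"
    }
}
```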
