opennlp-users mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: German Umlauts broken while using Command Line?
Date Sat, 02 Mar 2013 03:46:02 GMT
The Mac defaults to the proprietary MacRoman (I think?) encoding from 
decades past (really). Technical decisions can haunt your entire career.

On 03/01/2013 10:08 AM, Leonel de Alencar wrote:
> Running Mac OS 10.4 and the original opennlp bash script, I saved the file input.txt
> in the UTF-8 encoding and got the correct output both on the Terminal and in an output file,
> which was also saved in Unicode UTF-8. My Terminal display is configured for Unicode UTF-8.
> I don't know if these facts are of any help for Linux users...
>
>   $ opennlp SimpleTokenizer < input.txt
> Sein Song " Nightcall " hat den Film " Drive " mit Ryan Gosling erst so richtig bekannt gemacht . Wir haben uns mit Vincent Belorgey , besser bekannt als Kavinsky , über sein Debütalbum , seine Musik und die 80 er Jahre unterhalten .
>
>
> Average: 33,3 sent/s
> Total: 1 sent
> Runtime: 0.03s
>
> $ opennlp SimpleTokenizer < input.txt > output.txt
>
>
> Average: 111,1 sent/s
> Total: 1 sent
> Runtime: 0.0090s
>
> $ cat output.txt
> Sein Song " Nightcall " hat den Film " Drive " mit Ryan Gosling erst so richtig bekannt gemacht . Wir haben uns mit Vincent Belorgey , besser bekannt als Kavinsky , über sein Debütalbum , seine Musik und die 80 er Jahre unterhalten .
>
> ________________________________
> From: Jörn Kottmann <kottmann@gmail.com>
> To: users@opennlp.apache.org
> Sent: Friday, March 1, 2013 5:32
> Subject: Re: German Umlauts broken while using Command Line?
>   
> The problem here is the ASCII encoding can't encode the German Umlauts
> and therefore they are replaced with the question marks you see in the
> output.
>
> Any ideas on how we can improve this? Anyway, if we can't do much about it
> we should at least document the work around to manually set the encoding via
> file.encoding.
>
> Jörn
>
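The replacement Jörn describes can be reproduced in a few lines of Java. This is only a minimal sketch: the class name `AsciiReplacementDemo` and the helper `roundTrip` are made up for illustration and are not part of OpenNLP. It relies on the documented behaviour of `String.getBytes(Charset)`, which substitutes the charset's replacement byte for any character it cannot encode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OpenNLP.
class AsciiReplacementDemo {

    // Encode the text with the given charset and decode it again,
    // roughly what happens when text passes through a stream that
    // uses that charset.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        // \u00fc is "ü"; the escape keeps this source file pure ASCII.
        String word = "Deb\u00fctalbum";

        // US-ASCII cannot represent the umlaut, so it becomes '?'.
        System.out.println(roundTrip(word, StandardCharsets.US_ASCII)); // Deb?talbum

        // UTF-8 can represent it, so the round trip is lossless.
        System.out.println(roundTrip(word, StandardCharsets.UTF_8).equals(word)); // true
    }
}
```

Once the umlaut has been turned into a question mark the original character is gone, which is why the encoding has to be right before the text is written.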
> On 02/28/2013 06:29 PM, Stefan Matheis wrote:
>> On Thursday, February 28, 2013 at 5:26 PM, Jörn Kottmann wrote:
>>
>>> Hmm, pretty sure there is an encoding mismatch, do you know which
>>> encoding is used by
>>> your JVM? I would guess that is not UTF-8. You can probably get around
>>> the issue by re-encoding the input
>>> file to the encoding the JVM is using.
>>>    
>>> Have a look here:
>>> http://stackoverflow.com/questions/1749064/how-to-find-default-charset-encoding-in-java
>>>    
>>> Would be nice if you can run the println statements there.
>>>    
>>> Jörn
>> Wherever this comes from ..
>>
>> $ java CharsetTest
>> Default Charset=US-ASCII
>> file.encoding=Latin-1
>> Default Charset=US-ASCII
>> Default Charset in Use=ASCII
>>
>> $ echo $JAVA_TOOL_OPTIONS
>> (empty)
>>
>> $ export JAVA_TOOL_OPTIONS='-Dfile.encoding=UTF8'
>>
>> $ java CharsetTest
>> Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
>> Default Charset=UTF-8
>> file.encoding=Latin-1
>> Default Charset=UTF-8
>> Default Charset in Use=UTF8
>>
>>
>>
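For anyone who wants to repeat this check without the Stack Overflow code, a sketch along the following lines prints the two values of interest. The class name `CharsetReport` is hypothetical, and this is not the `CharsetTest` program run above; it only uses `Charset.defaultCharset()` and the `file.encoding` system property.

```java
import java.nio.charset.Charset;

// Hypothetical helper; not the CharsetTest program run above.
class CharsetReport {

    // Build a two-line report: the charset the JVM actually uses by
    // default, and the raw value of the file.encoding property.
    static String describe() {
        return "Default Charset=" + Charset.defaultCharset().name()
             + System.lineSeparator()
             + "file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once as is and once with `-Dfile.encoding=UTF8` on the `java` command line should show whether the option reaches the JVM.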
>> But this change alone didn't help: the output remained unchanged. So I took the road
>> down to dirty-hack-land and applied the following change to bin/opennlp. It is surely not how it
>> should be, but it works at least for the moment:
>>
>> -$JAVACMD -Xmx1024m -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@
>> +$JAVACMD -Xmx1024m -Dfile.encoding=UTF8 -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@
>>
>>
>>
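An alternative to patching the launch script would be for a tool to stop relying on the platform default and name the charset explicitly when it wraps its streams. The following is a sketch of that idea only, using a made-up class `Utf8LineCopier`; it is not how OpenNLP's command line tools are implemented, and it assumes the input really is UTF-8.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not part of OpenNLP.
class Utf8LineCopier {

    // Copy text line by line, decoding and encoding as UTF-8 no matter
    // what file.encoding or the platform default charset is set to.
    static void copyLines(InputStream in, OutputStream out) {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
            BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(out, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write('\n');
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        copyLines(System.in, System.out);
    }
}
```

It can be tried with `java Utf8LineCopier < input.txt > output.txt`. The trade-off is that hard-coding UTF-8 would break users whose files are in another encoding, so a real tool would probably want the charset to be configurable.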

