opennlp-dev mailing list archives

From James Kosin <james.ko...@gmail.com>
Subject Re: switch to ISO 639-2 codes for languages?
Date Tue, 17 May 2011 22:11:47 GMT
+1

But I'd also like to see each language mapped to a default encoding.
Or we could support the language and encoding automatically in Java,
taking them from the OS first, with override options for those
processing languages other than the native one.
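
A rough sketch of how that lookup could work, assuming one possible
ordering: an explicit override wins, then a per-language default, then
whatever the JVM picked up from the OS. The class LanguageEncodings, the
method resolve, and the entries in the table are all hypothetical names
and placeholder values for illustration; none of them are OpenNLP API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP.
final class LanguageEncodings {

    // Placeholder per-language defaults; the entries are illustrative only.
    private static final Map<String, Charset> DEFAULTS = new HashMap<>();
    static {
        DEFAULTS.put("eng", StandardCharsets.ISO_8859_1);
        DEFAULTS.put("tha", StandardCharsets.UTF_8);
    }

    private LanguageEncodings() {}

    /**
     * Resolution order (an assumption of this sketch): explicit override,
     * then the per-language default, then the JVM/OS default charset.
     */
    static Charset resolve(String language, Charset override) {
        if (override != null) {
            return override;
        }
        Charset mapped = DEFAULTS.get(language);
        return mapped != null ? mapped : Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println(resolve("eng", null));                   // per-language default
        System.out.println(resolve("eng", StandardCharsets.UTF_16)); // override wins
        System.out.println(resolve("xyz", null));                   // falls back to the OS default
    }
}
```

The last call depends on the platform, which is the point: without a
table entry or an override, the OS setting decides.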

James

On 5/17/2011 4:45 PM, Jason Baldridge wrote:
> +1
>
> On Tue, May 17, 2011 at 3:39 PM, Jörn Kottmann <kottmann@gmail.com> wrote:
>
>> I can see that, so I think switching the language codes should be done
>> when we make bigger changes anyway. Maybe for 1.6, together with a switch
>> to opennlp-ml and maybe bigger changes in our feature generation code.
>>
>> Jörn
>>
>>
>> On 5/17/11 10:32 PM, Benson Margulies wrote:
>>
>>> There are important distinctions missing in the twos: Farsi / Dari
>>> etc., and others.
>>>
>>> On May 17, 2011, at 4:25 PM, "Jörn Kottmann" <kottmann@gmail.com> wrote:
>>>
>>>> Is there support for -3 in Java? Currently all we do is check that the
>>>> language is a valid two-letter code. The idea when we added it was that
>>>> we would be able to have language-dependent feature generation one day,
>>>> but so far we only do something special in the sentence detector for Thai.
>>>>
>>>> Jörn
>>>>
>>>> On 5/17/11 8:50 PM, Benson Margulies wrote:
>>>>
>>>>> -2 is pretty useless. Use -3 if you want to switch.
>>>>>
>>>>> On Tue, May 17, 2011 at 2:40 PM, Oleg Tikhonov <oleg@apache.org> wrote:
>>>>>
>>>>>> My two cents: tesseract-ocr also uses ISO 639-3, and it would be great
>>>>>> for those who build solutions such as OpenNLP + tesseract.
>>>>>>
>>>>>> -Oleg
>>>>>>
>>>>>> On Tue, May 17, 2011 at 9:33 PM, Jason Baldridge
>>>>>> <jasonbaldridge@gmail.com> wrote:
>>>>>>
>>>>>>> I think we should change to the three-character convention for
>>>>>>> language-specific materials, e.g. "eng" rather than "en" for English.
>>>>>>>
>>>>>>> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>>>>>>>
>>>>>>> Do others agree?
>>>>>>>
>>>>>>> --
>>>>>>> Jason Baldridge
>>>>>>> Assistant Professor, Department of Linguistics
>>>>>>> The University of Texas at Austin
>>>>>>> http://www.jasonbaldridge.com
>>>>>>> http://twitter.com/jasonbaldridge
>>>>>>>
>>>>>>>
>
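
On the question above about -3 support in Java: as far as I know,
java.util.Locale can report a three-letter ISO 639-2/T code for any
language it knows by its two-letter code, via getISO3Language(), but the
JDK does not ship the full ISO 639-3 table. Distinctions like Farsi /
Dari would therefore still need an external code list. A minimal sketch
of the current two-letter check plus the conversion; the class
LanguageCodes and its methods are hypothetical names, and
Locale.forLanguageTag assumes Java 7 or later.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.MissingResourceException;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP.
final class LanguageCodes {

    // Two-letter ISO 639 codes known to this JDK.
    private static final Set<String> TWO_LETTER =
            new HashSet<>(Arrays.asList(Locale.getISOLanguages()));

    private LanguageCodes() {}

    /** Mirrors the "valid two-letter code" check described in the thread. */
    static boolean isValidTwoLetter(String code) {
        return code != null && TWO_LETTER.contains(code);
    }

    /** Converts a two-letter code to the three-letter code the JDK reports. */
    static String toThreeLetter(String twoLetter) {
        if (!isValidTwoLetter(twoLetter)) {
            throw new IllegalArgumentException(
                    "Not a two-letter ISO 639 code: " + twoLetter);
        }
        try {
            return Locale.forLanguageTag(twoLetter).getISO3Language();
        } catch (MissingResourceException e) {
            // Thrown when the JDK has no three-letter abbreviation for the language.
            throw new IllegalArgumentException(
                    "No three-letter code for: " + twoLetter, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toThreeLetter("en")); // eng
        System.out.println(toThreeLetter("th")); // tha
    }
}
```

Note that the JDK reports the terminology (T) variant where ISO 639-2
has two codes, so any stored model names would need to agree on that.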

