nutch-user mailing list archives

From Sebastian Nagel <wastl.na...@googlemail.com>
Subject Re: Usage of Tika LanguageIdentifier in language-identifier plugin
Date Tue, 24 Oct 2017 10:18:07 GMT
Hi Yossi,

sorry, while skimming your mail I thought it was about the old LanguageIdentifier.

> it is not possible to initialize the detector in setConf and then reuse it

Could you explain why? The API should allow getting an instance and calling loadModels()
once, shouldn't it?
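
To make the question concrete, here is a minimal sketch of the "load once in setConf, reuse per call" pattern. It deliberately does not use the Tika or optimaize classes: Detector, loadDetector(), LanguageFilter and the marker-word "model" are all hypothetical stand-ins, and the sketch assumes the detector keeps no state between calls. Whether Tika's LanguageDetector can be used this way is exactly the open point.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class ReusableDetectorSketch {

    /** Hypothetical stand-in for a detector that is expensive to build. */
    interface Detector {
        String detect(String text);
    }

    /** Counts how often the pretend model was loaded. */
    static final AtomicInteger MODEL_LOADS = new AtomicInteger();

    /** Pretend model loading: the costly step that should run only once. */
    static Detector loadDetector() {
        MODEL_LOADS.incrementAndGet();
        // Toy "model": one marker word per language, NOT a real language profile.
        Map<String, String> markers = new HashMap<>();
        markers.put("the", "en");
        markers.put("der", "de");
        markers.put("le", "fr");
        // The returned detector holds no per-call state, so it can be reused.
        return text -> {
            for (String token : text.toLowerCase().split("\\s+")) {
                String lang = markers.get(token);
                if (lang != null) {
                    return lang;
                }
            }
            return "unknown";
        };
    }

    /** Hypothetical plugin class; setConf mirrors the hook named in this thread. */
    static class LanguageFilter {
        private Detector detector;

        void setConf(Map<String, String> conf) {
            // Build the detector once; every later call reuses this instance.
            this.detector = loadDetector();
        }

        String identify(String text) {
            return detector.detect(text);
        }
    }

    public static void main(String[] args) {
        LanguageFilter filter = new LanguageFilter();
        filter.setConf(new HashMap<>());
        System.out.println(filter.identify("the quick brown fox"));
        System.out.println(filter.identify("der schnelle braune Fuchs"));
        System.out.println("model loads: " + MODEL_LOADS.get());
    }
}
```

If the real detector accumulates text between calls, sharing a single instance like this would not be safe without resetting it or synchronizing access.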

>>> For my needs, I have modified the plugin to use
>>> com.optimaize.langdetect.LanguageDetector directly, which is what

Of course, that's also possible. Or just add a plugin language-identifier-optimaize.

Btw., I recently had a look at various open source language identifier implementations
and would prefer langid (a port from Python/C) because it's faster and has better precision:
  https://github.com/carrotsearch/langid-java.git
  https://github.com/saffsd/langid.c.git
  https://github.com/saffsd/langid.py.git
Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's C++).

Thanks,
Sebastian

On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> Hi Sebastian,
> 
> Please reread the second paragraph of my email 😊.
> In short, it is not possible to initialize the detector in setConf and then reuse it,
> and initializing it per call would be extremely slow.
> 
> 	Yossi.
> 
> 
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
>> Sent: 24 October 2017 12:41
>> To: user@nutch.apache.org
>> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>>
>> Hi Yossi,
>>
>> why not port it to use
>>
>> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
>>
>> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
>>
>> Sebastian
>>
>> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
>>> Hi
>>>
>>>
>>>
>>> The language-identifier plugin uses
>>> org.apache.tika.language.LanguageIdentifier for extracting the
>>> language from the document text. There are two issues with that:
>>>
>>> 1.	LanguageIdentifier is deprecated in Tika.
>>> 2.	It does not support CJK languages (and I suspect a lot of other
>>> languages -
>>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
>>> and it doesn't even fail gracefully with them - in my experience Chinese
>>> was recognized as Italian.
>>>
>>>
>>>
>>> Since in Tika LanguageIdentifier was superseded by
>>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
>>> make that change in the plugin as well. However, because the design of
>>> LanguageDetector is terrible, it makes the implementation not
>>> reentrant, meaning the full language model would have to be reloaded
>>> on each call to the detector.
>>>
>>>
>>>
>>> For my needs, I have modified the plugin to use
>>> com.optimaize.langdetect.LanguageDetector directly, which is what
>>> Tika's LanguageDetector uses internally (at least by default). My
>>> question is whether that is a change that should be made to the official plugin.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>                Yossi.
>>>
>>>
> 
> 

