nutch-user mailing list archives

From "Yossi Tamari" <yossi.tam...@pipl.com>
Subject RE: Usage of Tika LanguageIdentifier in language-identifier plugin
Date Tue, 24 Oct 2017 11:05:19 GMT
Why not LanguageDetector: the API does not separate the detector object, which contains the
model and should be reused, from the text writer object, which should be request-specific.
A single instance of the API object holds references to both. In code terms, both loadModels()
and addText() are non-static members of LanguageDetector.
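
A minimal sketch of the separation described above. It does not use Tika or optimaize at all:
SharedModel, RequestText and bestLanguage() are hypothetical names invented for illustration,
and the "model" is a toy word list. The point is only that the expensive object is built once
and the text buffer is created per document.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class DetectorReuseSketch {

    /** Hypothetical stand-in for a language model: costly to load, read-only afterwards. */
    static final class SharedModel {
        private final Map<String, List<String>> commonWords = Map.of(
            "en", List.of("the", "and", "is"),
            "de", List.of("der", "und", "ist"));

        /** Returns the language whose word list matches the most tokens, or "unknown". */
        String bestLanguage(String text) {
            String best = "unknown";
            int bestHits = 0;
            for (Map.Entry<String, List<String>> entry : commonWords.entrySet()) {
                int hits = 0;
                for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                    if (entry.getValue().contains(token)) {
                        hits++;
                    }
                }
                if (hits > bestHits) {
                    bestHits = hits;
                    best = entry.getKey();
                }
            }
            return best;
        }
    }

    /** Hypothetical per-request text buffer: cheap to create, never shared between documents. */
    static final class RequestText {
        private final StringBuilder buffer = new StringBuilder();

        RequestText add(String chunk) {
            buffer.append(chunk).append(' ');
            return this;
        }

        String text() {
            return buffer.toString();
        }
    }

    public static void main(String[] args) {
        SharedModel model = new SharedModel(); // built once and reused for every document
        String first = model.bestLanguage(new RequestText().add("the cat is here").text());
        String second = model.bestLanguage(new RequestText().add("der Hund ist da").text());
        System.out.println(first + " " + second); // prints "en de"
    }
}
```

In a plugin, the equivalent of SharedModel would presumably be created once in setConf(), and
the equivalent of RequestText once per document; that mapping is an assumption, not tested code.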

Developing a separate language-identifier-optimaize plugin is basically what I have done
locally, but it seems to me that having both in the Nutch repository would just confuse users.
99% of the code would also be duplicated (the relevant code is about 5 lines).

I chose optimaize mainly because Tika did. Using langid instead should be very simple, but
the fact that the project has not seen a single commit in the last 4 years, and the usage
numbers are also quite low, gives me pause...


> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> Sent: 24 October 2017 13:18
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Yossi,
> 
> Sorry, while skimming I thought it was about the old LanguageIdentifier.
> 
> > it is not possible to initialize the detector in setConf and then reuse it
> 
> Could you explain why? The API/interface should allow getting an instance and
> calling loadModels(), shouldn't it?
> 
> >>> For my needs, I have modified the plugin to use
> >>> com.optimaize.langdetect.LanguageDetector directly, which is what
> 
> Of course, that's also possible. Or just add a plugin,
> language-identifier-optimaize.
> 
> Btw., I recently had a look at various open source language identifier
> implementations and would prefer
> langid (a port from Python/C) because it's faster and has better precision:
>   https://github.com/carrotsearch/langid-java.git
>   https://github.com/saffsd/langid.c.git
>   https://github.com/saffsd/langid.py.git
> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's
> C++).
> 
> Thanks,
> Sebastian
> 
> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> > Hi Sebastian,
> >
> > Please reread the second paragraph of my email 😊.
> > In short, it is not possible to initialize the detector in setConf and then
> > reuse it, and initializing it per call would be extremely slow.
> >
> > 	Yossi.
> >
> >
> >> -----Original Message-----
> >> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> >> Sent: 24 October 2017 12:41
> >> To: user@nutch.apache.org
> >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> >>
> >> Hi Yossi,
> >>
> >> why not port it to use
> >>
> >>
> >> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
> >>
> >> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> >>
> >> Sebastian
> >>
> >> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> >>> Hi
> >>>
> >>>
> >>>
> >>> The language-identifier plugin uses
> >>> org.apache.tika.language.LanguageIdentifier for extracting the
> >>> language from the document text. There are two issues with that:
> >>>
> >>> 1.	LanguageIdentifier is deprecated in Tika.
> >>> 2.	It does not support CJK languages (and I suspect a lot of other
> >>> languages -
> >>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> >>> and it doesn't even fail gracefully with them - in my experience
> >>> Chinese was recognized as Italian.
> >>>
> >>>
> >>>
> >>> Since in Tika LanguageIdentifier was superseded by
> >>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
> >>> make that change in the plugin as well. However, because the design of
> >>> LanguageDetector is terrible, it makes the implementation not
> >>> reentrant, meaning the full language model would have to be reloaded
> >>> on each call to the detector.
> >>>
> >>>
> >>>
> >>> For my needs, I have modified the plugin to use
> >>> com.optimaize.langdetect.LanguageDetector directly, which is what
> >>> Tika's LanguageDetector uses internally (at least by default). My
> >>> question is whether that is a change that should be made to the
> >>> official plugin.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>>                Yossi.
> >>>
> >>>
> >
> >


