Hi,
No, Tika doesn't do language identification (LangID). I haven't used ngramj, so I can't
speak to its accuracy or speed (but I know the code has been around for years). Another
LangID implementation is at the URL below my name.
Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
________________________________
From: revathy arun <revas.34@gmail.com>
To: solr-user@lucene.apache.org
Sent: Tuesday, February 17, 2009 6:39:40 PM
Subject: Re: Multilanguage
Does Apache Tika help find the language of the given document?
On 2/17/09, Till Kinstler <kinstler@gbv.de> wrote:
>
> Paul Libbrecht wrote:
>
>> Clearly, then, something that matches words in a dictionary and decides on
>> the language based on the language of the majority could do a decent job of
>> choosing the analyzer.
>>
>> Does such a tool exist?
>>
>
> I once played around with http://ngramj.sourceforge.net/ for language
> guessing. It did a good job. It doesn't use dictionaries for language
> identification but a statistical approach using ngrams.
> I don't have precise numbers, but out of about 10,000 documents in
> different languages (most in English, German and French, a few in other
> European languages like Polish) only about 10 were not identified
> correctly.
>
> Till
>
> --
> Till Kinstler
> Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
> Platz der Göttinger Sieben 1, D 37073 Göttingen
> kinstler@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
>
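
For illustration, the statistical n-gram approach Till describes can be sketched
roughly as follows. This is not ngramj's code or API. It is a minimal Python sketch
of one common variant (ranked character n-gram profiles compared with an
"out-of-place" distance), and every name in it (ngram_profile, out_of_place,
guess_language, SAMPLES, PROFILES) is made up for the example. The training
samples are single sentences, far too small for real use.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Map each of the `top` most frequent character n-grams (1..n_max) to its rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries so prefixes/suffixes count
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def guess_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data (assumption: a real system would train on large corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the "
          "language of the text that we want to identify",
    "de": "der schnelle braune fuchs springt über den faulen hund und das "
          "ist die sprache des textes den wir erkennen wollen",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et "
          "ceci est la langue du texte que nous voulons identifier",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

Calling guess_language(some_text, PROFILES) returns the key of the closest
profile, which could then be used to pick an analyzer per document. No
dictionary is needed, only per-language n-gram statistics.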