lucene-solr-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Multilanguage
Date Tue, 17 Feb 2009 12:13:42 GMT
Hi,

No, Tika doesn't do LangID.  I haven't used ngramj, so I can't speak to its accuracy or
speed (though I know the code has been around for years).  Another LangID implementation is
at the URL below my name.

Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch 




________________________________
From: revathy arun <revas.34@gmail.com>
To: solr-user@lucene.apache.org
Sent: Tuesday, February 17, 2009 6:39:40 PM
Subject: Re: Multilanguage

Does Apache Tika help find the language of the given document?



On 2/17/09, Till Kinstler <kinstler@gbv.de> wrote:
>
> Paul Libbrecht wrote:
>
>> Clearly, then, something that matches words in a dictionary and decides on
>> the language based on the language of the majority could do a decent job of
>> choosing the analyzer.
>>
>> Does such a tool exist?
>>
>
> I once played around with http://ngramj.sourceforge.net/ for language
> guessing. It did a good job. It doesn't use dictionaries for language
> identification but a statistical approach using ngrams.
> I don't have any precise numbers, but out of about 10000 documents in
> different languages (most in English, German and French, a few in other
> European languages like Polish) there were only some 10 not identified
> correctly.
>
> Till
>
> --
> Till Kinstler
> Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
> Platz der Göttinger Sieben 1, D 37073 Göttingen
> kinstler@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
>
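
The dictionary idea Paul asks about (match words against per-language word
lists and let the majority decide) can be sketched roughly as below. This is
only an illustration under stated assumptions: the WORDLISTS contents and the
vote_language name are made up for this example, and a real setup would load
much larger dictionaries or stopword lists.

```python
import re
from collections import Counter

# Assumption: tiny hand-picked word lists, for illustration only.
WORDLISTS = {
    "en": {"the", "and", "of", "is", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "les", "et", "est", "un", "une", "des"},
}


def vote_language(text, wordlists=WORDLISTS):
    """Return the language whose word list matches the most tokens, or None."""
    tokens = re.findall(r"\w+", text.lower())
    votes = Counter()
    for token in tokens:
        for lang, words in wordlists.items():
            if token in words:
                votes[lang] += 1
    if not votes:
        return None  # no token matched any list, so no guess is made
    return votes.most_common(1)[0][0]
```

The returned language code could then be used to pick the analyzer for the
document. Short texts and words shared between languages are the obvious weak
spots of this approach.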

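The statistical n-gram approach Till mentions can be sketched along the
following lines. This is not ngramj's code or API; it is a minimal sketch of
one common variant (ranked character n-gram profiles compared with an
"out-of-place" distance). The function names, the SAMPLES texts and the
n_max/top parameters are all assumptions made for this example, and real
profiles would be built from far larger training texts.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the most frequent character n-grams (1..n_max), most common first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Assumption: toy training texts; far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs to the house of the old man",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft dann zu dem haus des alten mannes",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et court vers la maison du vieil homme",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

Calling guess_language(some_text, PROFILES) returns one of the keys of
PROFILES. Unlike the dictionary approach, no word lists are needed, only
sample text per language.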