lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "LanguageDetection" by GrantIngersoll
Date Sun, 05 Dec 2010 14:23:46 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageDetection" page has been changed by GrantIngersoll.
http://wiki.apache.org/solr/LanguageDetection

--------------------------------------------------

New page:
= Solr's Language Detection =

<!> [[Solr4.0]]

See https://issues.apache.org/jira/browse/SOLR-1979.

= Introduction =

This feature adds the ability to detect the language of a document before indexing, so that
appropriate decisions can then be made about analysis, etc.  It currently relies on Tika's language
detection capabilities, which cover many, but not all, languages.  See http://tika.apache.org/0.8/detection.html
for more information on the languages supported.

= Configuration =

= Input Parameters =

= Examples =

= Caveats =

Since Tika uses an n-gram based approach to detection, it is prone to poor detection
on short inputs.  We rely on Tika's LanguageIdentifier.isReasonablyCertain() method
to indicate how confident Tika is in the detection.  There is currently no way to pass
in your own threshold, but see https://issues.apache.org/jira/browse/TIKA-568 for more info.
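
To illustrate why short inputs are a problem for n-gram based detection in general, here is a
minimal Python sketch.  It is not Tika's implementation: the function names, the word-boundary
padding and the overlap score are all assumptions made for illustration only.  It shows that a
very short input yields only a handful of n-grams, so any score computed from it rests on very
little evidence.

```python
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams, padding each word with '_'.

    Hypothetical helper for illustration; not part of Tika or Solr.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def overlap_score(profile, reference):
    """Fraction of the input's trigram occurrences that also occur in the reference."""
    total = sum(profile.values())
    if total == 0:
        return 0.0
    shared = sum(n for gram, n in profile.items() if gram in reference)
    return shared / total


if __name__ == "__main__":
    # Toy reference texts (assumed); real profiles are built from large corpora.
    english = trigram_profile("the quick brown fox jumps over the lazy dog")
    german = trigram_profile("der schnelle braune fuchs springt ueber den faulen hund")

    short = trigram_profile("die")  # a valid word in both languages
    longer = trigram_profile("the dog jumps over the fox")

    for name, profile in (("short", short), ("longer", longer)):
        print(name,
              "trigrams:", sum(profile.values()),
              "english:", overlap_score(profile, english),
              "german:", overlap_score(profile, german))
```

With only three trigrams to go on, a single shared trigram moves the score for the short input
by a third, which is why a confidence check such as isReasonablyCertain() matters most on short fields.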

= Resources =

 * http://tika.apache.org
