Delivered-To: mailing list solr-commits@lucene.apache.org
Reply-To: solr-dev@lucene.apache.org
From: Apache Wiki
To: Apache Wiki
Date: Sun, 16 Oct 2011 04:10:28 -0000
Message-ID: <20111016041028.2749.32306@eos.apache.org>
Subject: [Solr Wiki] Update of "LanguageDetection" by RobertMuir

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageDetection" page has been changed by RobertMuir:
http://wiki.apache.org/solr/LanguageDetection?action=diff&rev1=10&rev2=11

Comment:
update documentation for additional implementation

  = Introduction =
  
- This feature adds the ability to detect the language of a document before indexing and then make appropriate decisions about analysis, etc. It is implemented as an UpdateRequestProcessor, and currently relies on Tika's language detection capabilities, which covers many, but not all, languages. See http://tika.apache.org/0.10/detection.html for more information on the languages supported.
+ This feature adds the ability to detect the language of a document before indexing and then make appropriate decisions about analysis, etc. It is implemented as an UpdateRequestProcessor, and there are two implementations:
+ 
+  * Tika implementation based upon Tika's language detection capabilities, which covers many, but not all, languages. See http://tika.apache.org/0.10/detection.html for more information on the languages supported.
+  * LangDetect implementation based upon http://code.google.com/p/language-detection/ which supports more languages (53) and has some advanced CJK support.
  
  The component also supports automatic renaming of fields according to detected language and other advanced parameters, all explained in the next section.
  
  = Configuration =
  The UpdateRequestProcessor is configured in solrconfig.xml, and supports many parameters. All parameters listed may also be overridded on the update request itself.
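
As a rough sketch of how such a processor might be declared in solrconfig.xml, the fragment below assumes the Tika-based factory class and the `langid.fl` / `langid.langField` parameter names from the Solr langid contrib. The XML markup in this notification did not survive, so these names are not taken from the message itself. The chain name `langid` is hypothetical.

{{{
<!-- Sketch only: the chain name "langid" is hypothetical -->
<updateRequestProcessorChain name="langid">
  <!-- Assumed class name for the Tika-based implementation; the LangDetect
       variant is assumed to be LangDetectLanguageIdentifierUpdateProcessorFactory -->
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <!-- Assumed parameter: input fields used for language identification -->
      <str name="langid.fl">title,subject,text,keywords</str>
      <!-- Assumed parameter: output field for the detected language code -->
      <str name="langid.langField">language_s</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
}}}

The language identifier is placed ahead of RunUpdateProcessorFactory so that detection happens before the document is indexed.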
  
  A minimal configuration specifies the input fields for language identification as well as the output field for the detected language code:
  {{{
- 
+ 
+ 
+ title,subject,text,keywords
+ language_s
+ 
+ 
+ }}}
+ 
+ Alternatively, using the implementation based on http://code.google.com/p/language-detection/
+ {{{
+ 
  
  title,subject,text,keywords
  language_s
@@ -152, +164 @@
  
  = Examples =
  
- == Detect and map Scandinavian languages and fallback to generic for other languages ==
+ == Detect and map Scandinavian languages with Tika and fallback to generic for other languages ==
  
  {{{
- 
+ 
  
  true
  title,body
  language
@@ -168, +180 @@
  
  = Caveats =
  
- Since Tika uses an n-gram based approach to detection, it is susceptible to poor detection on especially short inputs. The threshold you specify in langid.threshold is normalized to match a certain similarity score in Tika, but this is not reliable for thresholds lower than 0.8. In the future, the detection quality may be improved due to changes in Tika or use of other language detection libraries.
+ Since the implementations uses an n-gram based approach to detection, they are susceptible to poor detection on especially short inputs. The threshold you specify in langid.threshold is normalized to match a certain similarity score in Tika, but this is not reliable for thresholds lower than 0.8. In the future, the detection quality may be improved due to changes in Tika or use of other language detection libraries.
  
  = Resources =
  
   * [[http://tika.apache.org/|Apache Tika]]
+  * [[http://code.google.com/p/language-detection/|Language detection Library for Java]]
   * [[https://issues.apache.org/jira/browse/SOLR-1979|SOLR-1979]]