lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "LanguageDetection" by RobertMuir
Date Sun, 16 Oct 2011 04:10:28 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageDetection" page has been changed by RobertMuir:
http://wiki.apache.org/solr/LanguageDetection?action=diff&rev1=10&rev2=11

Comment:
update documentation for additional implementation

  
  = Introduction =
  
- This feature adds the ability to detect the language of a document before indexing and then
make appropriate decisions about analysis, etc. It is implemented as an UpdateRequestProcessor,
and currently relies on Tika's language detection capabilities, which covers many, but not
all, languages.  See http://tika.apache.org/0.10/detection.html for more information on the
languages supported.
+ This feature adds the ability to detect the language of a document before indexing and then
make appropriate decisions about analysis, etc. It is implemented as an UpdateRequestProcessor,
and there are two implementations: 
+  * Tika implementation based upon Tika's language detection capabilities, which cover many,
but not all, languages.  See http://tika.apache.org/0.10/detection.html for more information
on the languages supported.
+  * LangDetect implementation based upon http://code.google.com/p/language-detection/ which
supports more languages (53) and has some advanced CJK support.
  
  The component also supports automatic renaming of fields according to detected language
and other advanced parameters, all explained in the next section.
  
  = Configuration =
  The UpdateRequestProcessor is configured in solrconfig.xml, and supports many parameters.
All parameters listed may also be overridden on the update request itself. A minimal configuration
specifies the input fields for language identification as well as the output field for the
detected language code:
  {{{
- <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
+ <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
+    <lst name="defaults">
+      <str name="langid.fl">title,subject,text,keywords</str>
+      <str name="langid.langField">language_s</str>
+    </lst>
+ </processor>
+ }}}
+ 
+ Alternatively, using the implementation based on http://code.google.com/p/language-detection/
+ {{{
+ <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
     <lst name="defaults">
       <str name="langid.fl">title,subject,text,keywords</str>
       <str name="langid.langField">language_s</str>
@@ -152, +164 @@

  
  = Examples =
  
- == Detect and map Scandinavian languages and fallback to generic for other languages ==
+ == Detect and map Scandinavian languages with Tika and fall back to generic for other languages ==
  
  {{{
-  <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
+  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
     <str name="langid">true</str>
     <str name="langid.fl">title,body</str>
     <str name="langid.langField">language</str>
@@ -168, +180 @@

  
  = Caveats =
  
- Since Tika uses an n-gram based approach to detection, it is susceptible to poor detection
on especially short inputs. The threshold you specify in langid.threshold is normalized to
match a certain similarity score in Tika, but this is not reliable for thresholds lower than
0.8. In the future, the detection quality may be improved due to changes in Tika or use of
other language detection libraries.
+ Since both implementations use an n-gram based approach to detection, they are susceptible
to poor detection, especially on short inputs. The threshold you specify in langid.threshold
is normalized to match a certain similarity score in Tika, but this is not reliable for thresholds
lower than 0.8. In the future, the detection quality may be improved due to changes in Tika
or use of other language detection libraries.
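
To illustrate the threshold caveat, a configuration along the following lines would set langid.threshold explicitly. This is only a sketch and is not part of the changed page: the class name, langid.fl and langid.langField are taken from the Configuration examples above, while the value 0.8 and the use of a str element for langid.threshold are assumptions.

{{{
<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
   <lst name="defaults">
     <!-- Input fields and output field, as in the minimal configuration -->
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
     <!-- Assumed value: kept at 0.8, below which the normalized score is described as unreliable -->
     <str name="langid.threshold">0.8</str>
   </lst>
</processor>
}}}

With a threshold this high, very short inputs will presumably often fail to reach it, so it is worth checking how many documents end up without a detected language before relying on the language_s field.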
  
  = Resources =
  
   * [[http://tika.apache.org/|Apache Tika]]
+  * [[http://code.google.com/p/language-detection/|Language detection Library for Java]]
   * [[https://issues.apache.org/jira/browse/SOLR-1979|SOLR-1979]]
  
