lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "LanguageDetection" by JanHoydahl
Date Sun, 11 Sep 2011 23:17:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "LanguageDetection" page has been changed by JanHoydahl:
http://wiki.apache.org/solr/LanguageDetection?action=diff&rev1=1&rev2=2

Comment:
Documentation of the current state of SOLR-1979

  = Solr's Language Detection =
  
- <!> [[Solr4.0]]
+ <!> [[Solr3.5]]
  
- See https://issues.apache.org/jira/browse/SOLR-1979.
+ <<TableOfContents(3)>>
  
  = Introduction =
  
- This feature adds the ability to detect the language of a document before indexing and then
make appropriate decisions about analysis, etc.  It currently relies on Tika's language detection
capabilities, which covers many, but not all, languages.  See http://tika.apache.org/0.8/detection.html
for more information on the languages supported.
+ This feature adds the ability to detect the language of a document before indexing and then
make appropriate decisions about analysis, etc. It is implemented as an UpdateRequestProcessor
and currently relies on Tika's language detection capabilities, which cover many, but not
all, languages. See http://tika.apache.org/0.9/detection.html for more information on the
languages supported.
+ 
+ The component also supports automatic renaming of fields according to detected language
and other advanced parameters, all explained in the next section.
  
  = Configuration =
+ The UpdateRequestProcessor is configured in solrconfig.xml and supports many parameters.
All parameters listed may also be overridden on the update request itself. A minimal configuration
specifies the input fields for language identification as well as the output field for the
detected language code:
+ {{{
+ <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
+    <defaults>
+      <str name="langid.fl">title,subject,text,keywords</str>
+      <str name="langid.langField">language_s</str>
+    </defaults>
+ </processor>
+ }}}
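
As an illustration of overriding parameters on the update request, the sketch below builds an update URL that carries langid.* values as query parameters. The host, port and path in SOLR_UPDATE_URL and the helper name build_update_url are assumptions made for this example, not taken from this page.

```python
from urllib.parse import urlencode

# Hypothetical Solr update endpoint; adjust host, port and path to your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_url(base_url, overrides):
    """Append langid.* overrides to the update URL as query parameters."""
    return base_url + "?" + urlencode(overrides)


url = build_update_url(SOLR_UPDATE_URL, {
    "langid.fl": "title,body",
    "langid.langField": "language_s",
    "langid.threshold": "0.8",
})
print(url)
```

The comma in the field list is percent-encoded by urlencode, which is the usual encoding for query parameter values.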
  
- = Input Parameters =
+ Below is a list of the configuration parameters and their meanings:
+ 
+ == langid ==
+ Lets you enable/disable this processor
+ 
+ '''Value:''' true/false
+ 
+ '''Default:''' true
+ 
+ == langid.fl ==
+ Specifies the list of field names to take as input for the language detection
+ 
+ '''Value:''' Same format as {{{fl}}}, i.e. a comma or space delimited list of field names
+ 
+ '''Default:''' N/A (This parameter is mandatory)
+ 
+ == langid.langField ==
+ Specifies the field to output detected language into. The value written is the language
code as emitted by Tika.
+ 
+ '''Value:''' Name of field
+ 
+ '''Default:''' N/A (This parameter is mandatory)
+ 
+ == langid.langsField ==
+ Specifies the field to output a list of detected languages into. This must be a multiValued
String field
+ 
+ '''Value:''' Name of field
+ 
+ '''Default:''' (Empty - Nothing is written by default)
+ 
+ == langid.overwrite ==
+ Specifies whether the output in {{{langField}}} and {{{langsField}}} shall be overwritten
if it already contains a value.
+ 
+ '''Value:''' true/false
+ 
+ '''Default:''' false
+ 
+ == langid.threshold ==
+ Specifies a threshold between 0 and 1 for how close the language identification match must be
before being accepted. For long texts a high value like 0.8 will give the best results, but
for shorter texts you may need to specify a lower threshold, at the risk of lower-quality
detection. Experiment on your data to find a good value.
+ 
+ '''Value:''' A float value between 0.0 and 1.0
+ 
+ '''Default:''' 0.5
+ 
+ == langid.whitelist ==
+ Specifies an optional list of language codes that shall be the only allowed outputs from
language identification. This means that if another language is detected, it will not be accepted
and the fallback language will be used instead. This is useful in combination with langid.map=true
to make sure you only index documents into fields that exist in your schema.
+ 
+ '''Value:''' A comma separated list of language codes accepted
+ 
+ '''Default:''' (Empty - all languages are allowed)
+ 
+ == langid.map ==
+ To enable field name mapping, set langid.map=true. It will then map field names for all
fields in langid.fl.
+ 
+ If the set of fields to map is different from langid.fl, supply langid.map.fl. Those fields
will then be renamed with a language suffix equal to the language detected
+ from the langid.fl fields.
+ 
+ '''Value:''' true/false
+ 
+ '''Default:''' false
+ 
+ == langid.map.fl ==
+ Optional list of fields to do field name mapping for. See langid.map
+ 
+ '''Value:''' A comma separated list of fields
+ 
+ '''Default:''' (Empty - by default all fields in langid.fl will be mapped)
+ 
+ == langid.map.overwrite ==
+ If set to true, the detected language will always overwrite langid.langField, even if it
has a value already.
+ 
+ '''Value:''' true/false
+ 
+ '''Default:''' false
+ 
+ == langid.map.keepOrig ==
+ If set to true, the mapping operation will leave the original field in place, i.e. it will
act as a field copy instead of a move/map.
+ 
+ '''Value:''' true/false
+ 
+ '''Default:''' false
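
The difference between a move and a copy can be sketched in a few lines. This is not the processor's code, only an illustration under the assumption that the default <field>_<lang> naming is in use; map_fields and its arguments are hypothetical names.

```python
def map_fields(doc, fields, lang, keep_orig=False):
    """Return a copy of doc with each listed field mapped to <field>_<lang>.

    With keep_orig=True the original field stays in place, so the mapping
    behaves like a copy instead of a move.
    """
    mapped = dict(doc)
    for name in fields:
        if name not in mapped:
            continue  # nothing to map for this document
        mapped[f"{name}_{lang}"] = mapped[name]
        if not keep_orig:
            del mapped[name]
    return mapped


doc = {"id": "1", "title": "Hei verden"}
moved = map_fields(doc, ["title"], "no")
copied = map_fields(doc, ["title"], "no", keep_orig=True)
```

Here moved contains only id and title_no, while copied also retains the original title field.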
+ 
+ == langid.map.individual ==
+ If you require detecting languages separately for each field, supply langid.map.individual=true.
The supplied fields will then be renamed according to detected language on an individual field
basis.
+ 
+ '''Value:''' true/false
+ 
+ '''Default:''' false
+ 
+ == langid.map.individual.fl ==
+ If the set of fields to detect individually is different from the already supplied langid.fl
or langid.map.fl, supply langid.map.individual.fl. The fields listed in langid.map.individual.fl
will then be detected individually, while the rest of the mapping fields will be mapped according
to global document language.
+ 
+ '''Value:''' A comma separated list of fields
+ 
+ '''Default:''' (Empty - by default all fields in langid.fl or langid.map.fl will be mapped)
+ 
+ == langid.fallbackField ==
+ If no language is detected with sufficient score (see langid.threshold), or if the detected
language is not in the whitelist (see langid.whitelist), we will use the value from this field
as the fallback value.
+ 
+ '''Value:''' Name of a field which may contain a language code
+ 
+ '''Default:''' (Empty - not used)
+ 
+ == langid.fallback ==
+ If no language is detected with sufficient score (see langid.threshold), or if the detected
language is not in the whitelist (see langid.whitelist), and no value is found in your fallbackField,
the language code specified in langid.fallback will be used.
+ 
+ '''Value:''' Language code to use as fallback
+ 
+ '''Default:''' (Empty - not used)
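
The interplay of langid.threshold, langid.whitelist, langid.fallbackField and langid.fallback described above can be summarized in a short sketch. It is an illustration only: the function and argument names are hypothetical, and comparing the score directly against the threshold is a simplifying assumption, since the processor normalizes the threshold against Tika's similarity score.

```python
def resolve_language(detected, score, doc, threshold=0.5, whitelist=None,
                     fallback_field=None, fallback=None):
    """Pick the language code to use for a document.

    Order of precedence, following the parameter descriptions:
    1. the detected language, if it scores high enough and is whitelisted
    2. the value of the fallback field, if configured and present
    3. the static fallback code (may be None)
    """
    accepted = detected is not None and score >= threshold  # assumed comparison
    if accepted and whitelist and detected not in whitelist:
        accepted = False
    if accepted:
        return detected
    if fallback_field and doc.get(fallback_field):
        return doc[fallback_field]
    return fallback


scandinavian = ["no", "sv", "da"]
a = resolve_language("no", 0.9, {}, whitelist=scandinavian, fallback="generic")
b = resolve_language("en", 0.9, {}, whitelist=scandinavian, fallback="generic")
c = resolve_language("sv", 0.3, {"lang_hint": "da"},
                     fallback_field="lang_hint", fallback="generic")
```

In this sketch a is the detected code, b falls through to the static fallback because "en" is not whitelisted, and c takes its value from the hypothetical lang_hint field because the score is below the threshold.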
+ 
+ == langid.map.lcmap ==
+ If this parameter is specified, it will be used as a language code map. A typical usage
is to map multiple detected languages to the same field name. E.g. to map Japanese, Korean
and Chinese texts to the same schema field "*_cjk", do: {{{langid.map.lcmap=ja:cjk zh:cjk
ko:cjk}}}. Another use is if your language identification outputs something like en_US or
en_GB but you want only one field with *_en; then you say {{{langid.map.lcmap=en_GB:en en_US:en}}}.
Note that this setting does not affect the language codes written to langField.
+ 
+ '''Value:''' A space separated list of language code mappings, in the form <from>:<to>
+ 
+ '''Default:''' (Empty - not used)
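
To make the lcmap format concrete, here is a minimal sketch of how such a value could be parsed and applied to the default <field>_<lang> naming. The helper names parse_lcmap and mapped_field_name are hypothetical, and the processor's own parsing may differ in details such as error handling.

```python
def parse_lcmap(value):
    """Parse a space separated list of <from>:<to> pairs into a dict."""
    mapping = {}
    for pair in value.split():
        source, target = pair.split(":", 1)
        mapping[source] = target
    return mapping


def mapped_field_name(field, lang, lcmap):
    """Build <field>_<lang>, using the mapped code when one is defined."""
    suffix = lcmap.get(lang, lang)
    return f"{field}_{suffix}"


lcmap = parse_lcmap("zh:cjk ko:cjk en_GB:en en_US:en")
chinese_title = mapped_field_name("title", "zh", lcmap)
british_title = mapped_field_name("title", "en_GB", lcmap)
norwegian_title = mapped_field_name("title", "no", lcmap)
```

Codes without an entry in the map, such as "no" here, keep their own suffix.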
+ 
+ == langid.map.pattern and langid.map.replace ==
+ The default field mapping is <field>_<lang>, but you can define your own mapping
pattern using {{{langid.map.pattern}}} and {{{langid.map.replace}}}. You may use normal Java
regex matching with groups. The text "{lang}" in the pattern will be replaced with the detected
language code (or the mapped equivalent).
+ 
+ '''Value:''' {{{pattern}}} is a Java-style regex pattern and {{{replace}}} is a Java-style
replacement string
+ 
+ '''Default:''' (Empty - not used)
+ 
+ == langid.enforceSchema ==
+ Normally the processor will throw an exception if the result of a mapping is not a valid
schema field. By setting this option to false, you turn off validation of field names against
the schema. This can be useful if you want to rename or delete fields later in the UpdateChain,
i.e. you know what you're doing.
+ 
+ '''Value:''' true/false
+ 
+ '''Default:''' true
+ 
  
  = Examples =
  
+ == Detect and map Scandinavian languages and fall back to generic for other languages ==
+ 
+ {{{
+  <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
+    <defaults>
+      <str name="langid">true</str>
+      <str name="langid.fl">title,body</str>
+      <str name="langid.langField">language</str>
+      <str name="langid.whitelist">no,sv,da</str>
+      <str name="langid.map">true</str>
+      <str name="langid.fallback">generic</str>
+    </defaults>
+  </processor>
+ }}}
+ 
+ 
  = Caveats =
  
- Since Tika uses an n-gram based approach to detection, it is susceptible to poor detection
on especially short inputs.  We rely on Tika's LanguageIdentifier.isReasonablyCertain() method
to indicate the confidence Tika has in the detection.  There currently is not a way to pass
in your own threshold, but see https://issues.apache.org/jira/browse/TIKA-568 for more info.
+ Since Tika uses an n-gram based approach to detection, it is susceptible to poor detection,
especially on short inputs. The threshold you specify in langid.threshold is normalized to
match a certain similarity score in Tika, but this is not reliable for thresholds lower than
0.8. In the future, the detection quality may be improved due to changes in Tika or use of
other language detection libraries.
  
  = Resources =
  
-  * http://tika.apache.org
+  * [[http://tika.apache.org/|Apache Tika]]
+  * [[https://issues.apache.org/jira/browse/SOLR-1979|SOLR-1979]]
  
