lucene-dev mailing list archives

From Jan Høydahl (JIRA)
Subject [jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
Date Wed, 22 Jun 2011 12:52:47 GMT


Jan Høydahl updated SOLR-1979:

    Attachment: SOLR-1979.patch

New version. Example of accepted params:

 <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
     <str name="langid">true</str>
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
     <str name="langid.langsField">languages</str>
     <str name="langid.overwrite">false</str>
     <float name="langid.threshold">0.5</float>
     <str name="langid.whitelist">no,en,es,dk</str>
     <str name="">true</str>
     <str name="">title,text</str>
     <bool name="">false</bool>
     <bool name="">false</bool>
     <bool name="">false</bool>
     <str name=""></str>
     <str name="langid.fallbackFields">meta_content_language,lang</str>
     <str name="langid.fallback">en</str>
 </processor>

The only mandatory parameter is langid.fl.
To enable field name mapping, set It will then map field names for all fields
in langid.fl. If the set of fields to map is different from langid.fl, supply
Those fields will then be renamed with a language suffix equal to the language detected from
the langid.fl fields.
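
As an illustration of the renaming described above, here is a sketch of one document before and
after mapping. The field values, the document id and the exact "_en" suffix format (underscore
plus language code) are assumptions made for this example; the patch may build the mapped names
differently.

{code:xml}
<!-- Incoming document (hypothetical values) -->
<doc>
  <field name="id">1</field>
  <field name="title">A short history of search</field>
  <field name="text">Full-text search engines build an inverted index.</field>
</doc>

<!-- After detection on the langid.fl fields yields "en" and title,text are mapped.
     The "_en" suffix format is an assumption, not taken from the patch. -->
<doc>
  <field name="id">1</field>
  <field name="language_s">en</field>
  <field name="title_en">A short history of search</field>
  <field name="text_en">Full-text search engines build an inverted index.</field>
</doc>
{code}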

If you require detecting languages separately for each field, supply
The supplied fields will then be renamed according to detected language on an individual basis.
If the set of fields to detect individually is different from the already supplied langid.fl
or, supply The fields listed in
will then be detected individually, while the rest of the mapping fields will be mapped according
to global document language.
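
A sketch of the mixed case, assuming title is detected individually while text follows the
global document language. The values, the detected languages and the "_xx" suffix format are
hypothetical and only meant to show the difference between the two modes.

{code:xml}
<!-- Incoming document: English title, Norwegian body (hypothetical values) -->
<doc>
  <field name="id">2</field>
  <field name="title">Annual report</field>
  <field name="text">Dette er en lengre tekst skrevet på norsk.</field>
</doc>

<!-- Assumed result: the document as a whole is detected as "no",
     but title is detected on its own as "en" -->
<doc>
  <field name="id">2</field>
  <field name="language_s">no</field>
  <field name="title_en">Annual report</field>
  <field name="text_no">Dette er en lengre tekst skrevet på norsk.</field>
</doc>
{code}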

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>                 Key: SOLR-1979
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> We need the ability to detect the language of arbitrary text in order to act upon it, such
> as indexing the content into language-aware fields. Another use case is to be able to filter/facet
> on language for unstructured content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor
> is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields (name and subject), perform language identification,
> and output the ISO code for the detected language in the outputField. If no language is detected,
> the fallback language is used.
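
To make the fallback behaviour in the quoted description concrete, here is a sketch using the
configuration quoted above (inputFields name,subject; outputField language_s; fallback en). The
field values are made up, and it is only an assumption that the identifier would fail to detect
a language for text this short.

{code:xml}
<!-- Incoming document (hypothetical values, assumed too short to identify) -->
<doc>
  <field name="id">42</field>
  <field name="name">X200</field>
  <field name="subject">N/A</field>
</doc>

<!-- Resulting document: no language detected, so the fallback "en" is written -->
<doc>
  <field name="id">42</field>
  <field name="name">X200</field>
  <field name="subject">N/A</field>
  <field name="language_s">en</field>
</doc>
{code}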

This message is automatically generated by JIRA.