lucene-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
Date Wed, 22 Jun 2011 12:54:54 GMT

     [ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
------------------------------

    Description: 
Language identification from document fields, and mapping of field names to language-specific
fields based on detected language.

Wrap the Tika LanguageIdentifier in an UpdateProcessor.

  was:
We need the ability to detect the language of arbitrary text in order to act upon it, for example
by indexing the content into language-aware fields. Another use case is to filter or facet
on language for arbitrary unstructured content.

To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor is configurable
like this:

{code:xml} 
  <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <str name="inputFields">name,subject</str>
    <str name="outputField">language_s</str>
    <str name="idField">id</str>
    <str name="fallback">en</str>
  </processor>
{code} 

It will then read the text from the inputFields name and subject, perform language identification,
and write the ISO code of the detected language to the outputField. If no language is detected,
the fallback language is used.
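
The behaviour described above can be sketched outside Solr as follows. This is a minimal illustration, not the code from the attached patch: the function name identify_language, the STOPWORDS table and toy_detect are all hypothetical, and the toy detector only stands in for the Tika LanguageIdentifier. The parameter names mirror inputFields, outputField and fallback from the configuration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a real detector such as the Tika LanguageIdentifier:
# it simply counts a few well-known stopwords per language.
STOPWORDS = {
    "en": {"the", "and", "of", "is"},
    "no": {"og", "er", "det", "ikke"},
}


def toy_detect(text: str) -> Optional[str]:
    """Return an ISO code, or None when no stopword of any language matches."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def identify_language(
    doc: Dict[str, str],
    input_fields: List[str],
    output_field: str,
    fallback: str,
    detect: Callable[[str], Optional[str]] = toy_detect,
) -> Dict[str, str]:
    """Concatenate the input fields, detect the language, and store the result.

    The fallback code is written when the detector returns nothing.
    """
    text = " ".join(doc[f] for f in input_fields if doc.get(f))
    lang = detect(text) if text else None
    result = dict(doc)  # leave the caller's document untouched
    result[output_field] = lang or fallback
    return result


doc = {"id": "1", "name": "Det er ikke lett", "subject": "og det er bra"}
print(identify_language(doc, ["name", "subject"], "language_s", "en"))
```

A document whose input fields contain no recognisable words, or no text at all, ends up with the fallback value in the output field.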


> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to language-specific
> fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
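
The field-name mapping mentioned in the description could look roughly like the sketch below. The issue text does not specify a naming scheme, so the "<field>_<lang>" suffix convention and the function name map_fields_to_language are assumptions made purely for illustration.

```python
from typing import Dict, List


def map_fields_to_language(
    doc: Dict[str, str], fields: List[str], lang: str
) -> Dict[str, str]:
    """Rename each listed field to "<field>_<lang>" (an assumed convention)."""
    mapped = dict(doc)  # copy so the original document is not modified
    for field in fields:
        if field in mapped:
            mapped[f"{field}_{lang}"] = mapped.pop(field)
    return mapped


doc = {"id": "1", "name": "Hei", "subject": "Test"}
print(map_fields_to_language(doc, ["name", "subject"], "no"))
```

Fields that are not listed, such as id here, keep their original names, so only the text fields are routed to language-specific fields.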

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

