lucene-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] [Updated] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
Date Sun, 26 Jun 2011 23:13:48 GMT

     [ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1979:
------------------------------

    Attachment: SOLR-1979.patch

Fixed the threshold calculation so that a Tika distance of 0.1 gives a certainty of 0.5 and a
distance of 0.02 gives a certainty of 0.9. The default threshold of 0.5 now works pretty well, at least for the tests...
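
As a rough illustration of the two data points above: a straight line through (distance 0.1, certainty 0.5) and (distance 0.02, certainty 0.9) has slope -5 and intercept 1. The sketch below assumes the mapping is linear; the class name, the clamping to [0, 1] and the ">=" comparison are illustrative assumptions, not taken from the patch.

```java
// Hypothetical sketch: maps a Tika distance to a certainty score.
// Assumes a linear mapping fitted to the two points quoted in the message.
final class CertaintySketch {

    // Line through (0.1, 0.5) and (0.02, 0.9): certainty = 1 - 5 * distance.
    static double certainty(double distance) {
        double c = 1.0 - 5.0 * distance;
        // Clamping is an assumption, so that large distances do not go negative.
        return Math.max(0.0, Math.min(1.0, c));
    }

    // Accept a detection when its certainty reaches the threshold
    // (the message gives 0.5 as the default; ">=" is an assumption).
    static boolean accept(double distance, double threshold) {
        return certainty(distance) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println("distance 0.10 -> " + certainty(0.10));
        System.out.println("distance 0.02 -> " + certainty(0.02));
        System.out.println("distance 0.30 accepted? " + accept(0.30, 0.5));
    }
}
```

Under this assumption, any distance above 0.1 falls below the default threshold of 0.5 and would be rejected.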

*New parameters:*
Field name mapping is now configurable with a user-defined pattern. For example, to map ABC_title
to title_<lang>, set:
{code}
&langid.map.pattern=ABC_(.*)
&langid.map.replace=$1_{lang}
{code}
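
To make the pattern/replace pair concrete, here is a minimal sketch of how such a rewrite could work with java.util.regex. It assumes the {lang} placeholder is substituted first and the result is then used as an ordinary regex replacement string; the class and method names are hypothetical and the patch itself may do this differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the langid.map.pattern / langid.map.replace idea.
final class FieldMapSketch {

    // Returns the mapped field name, or the original name if the pattern
    // does not match the whole field name.
    static String mapField(String field, String pattern, String replace, String lang) {
        // Substitute the {lang} placeholder, escaping the language code so it
        // cannot be misread as a group reference in the replacement string.
        String replacement = replace.replace("{lang}", Matcher.quoteReplacement(lang));
        Matcher m = Pattern.compile(pattern).matcher(field);
        return m.matches() ? m.replaceFirst(replacement) : field;
    }

    public static void main(String[] args) {
        // With the parameters from the message and a detected language "en".
        System.out.println(mapField("ABC_title", "ABC_(.*)", "$1_{lang}", "en"));
        System.out.println(mapField("id", "ABC_(.*)", "$1_{lang}", "en"));
    }
}
```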
A parameter to map multiple detected languages to the same field name. E.g. to map Japanese,
Korean and Chinese texts to a field *_cjk, do:
{code}&langid.map.lcmap=ja:cjk zh:cjk ko:cjk{code}
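
A small sketch of how a space-separated list of from:to pairs like this could be parsed and applied. It assumes the detector emits ISO 639-1 codes (for example "ja" for Japanese) and that unmapped languages keep their own code; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a language-code map in the lcmap style.
final class LcMapSketch {

    // Parses "ja:cjk zh:cjk ko:cjk" into {ja=cjk, zh=cjk, ko=cjk}.
    static Map<String, String> parse(String lcmap) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String pair : lcmap.trim().split("\\s+")) {
            String[] kv = pair.split(":", 2);
            // Skip empty or malformed entries instead of failing.
            if (kv.length == 2 && !kv[0].isEmpty() && !kv[1].isEmpty()) {
                map.put(kv[0], kv[1]);
            }
        }
        return map;
    }

    // Language code to use in the field name; falls back to the detected code.
    static String fieldLang(Map<String, String> map, String detected) {
        return map.getOrDefault(detected, detected);
    }

    public static void main(String[] args) {
        Map<String, String> map = parse("ja:cjk zh:cjk ko:cjk");
        System.out.println("title_" + fieldLang(map, "ko"));
        System.out.println("title_" + fieldLang(map, "en"));
    }
}
```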
Turn off validation of field names against schema (useful if you want to rename or delete
fields later in the UpdateChain):
{code}&langid.enforceSchema=false{code}

*Other changes*
Removed the default for langField, i.e. if langField is not specified, the detected language will
not be written anywhere. A typical minimal config for only detecting the language and writing
it to a field is now:
{code}
<processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
   <defaults>
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
   </defaults>
</processor>
{code}

Also added multiple other languages to the tests.

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Minor
>              Labels: UpdateProcessor
>             Fix For: 3.4
>
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to language-specific
> fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
