lucene-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
Date Tue, 17 Aug 2010 20:39:18 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899568#action_12899568 ]

Jan Høydahl commented on SOLR-1979:
-----------------------------------

I have implemented a first-cut patch using the Tika LanguageIdentifier. Unfortunately it is
quite limited in features, and for short text segments isReasonablyCertain() always returns
false :( The number of supported languages is also still quite low. But it works as a start,
and we can then focus on improving the Tika code in future releases.

I plan on putting the patch in contrib/extraction, since it depends on Tika. If I put it in the
main source tree, Solr will not compile unless the Tika jar is in lib. Agree?

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Priority: Minor
>
> We need the ability to detect the language of arbitrary text in order to act upon it, such
> as indexing the content into language-aware fields. Another use case is to be able to filter/facet
> on language for unstructured content.
> To do this, we should wrap the [Nutch LanguageIdentifier|http://nutch.apache.org/apidocs-1.1/org/apache/nutch/analysis/lang/LanguageIdentifier.html]
> in an UpdateProcessor. The processor should be configured like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">title,teaser,body</str>
>     <str name="isoOutputField">language</str>
>     <str name="fullOutputField">language_display</str>
>   </processor>  
> {code} 
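
For illustration only, here is a rough sketch of how a processor could consume the three
settings proposed above (inputFields, isoOutputField, fullOutputField). It is not the attached
patch and does not use the Solr, Tika or Nutch APIs: the document is a plain Map, and the
LanguageDetector interface and the LanguageTaggerSketch class are hypothetical names made up
for this example. The detector returns null when it has no confident guess, which is roughly
the situation described for short text segments.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical abstraction; a real implementation would delegate to a detection library. */
interface LanguageDetector {
    /** Returns an ISO 639-1 code such as "en", or null when there is no confident guess. */
    String detect(String text);
}

/** Sketch only: mirrors the proposed config keys, using a plain Map as the document. */
final class LanguageTaggerSketch {
    private final List<String> inputFields = new ArrayList<>();
    private final String isoOutputField;
    private final String fullOutputField;
    private final LanguageDetector detector;

    LanguageTaggerSketch(String inputFieldsCsv, String isoOutputField,
                         String fullOutputField, LanguageDetector detector) {
        // inputFields arrives as a comma-separated string, as in the XML config.
        for (String name : inputFieldsCsv.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                inputFields.add(trimmed);
            }
        }
        this.isoOutputField = isoOutputField;
        this.fullOutputField = fullOutputField;
        this.detector = detector;
    }

    /** Concatenates the input fields, detects the language, and writes both output fields. */
    boolean process(Map<String, Object> doc) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            Object value = doc.get(field);
            if (value != null) {
                text.append(value).append(' ');
            }
        }
        String iso = detector.detect(text.toString().trim());
        if (iso == null) {
            return false; // no confident guess: leave the document untouched
        }
        doc.put(isoOutputField, iso);
        // Human-readable language name, e.g. "English" for "en".
        doc.put(fullOutputField, Locale.forLanguageTag(iso).getDisplayLanguage(Locale.ENGLISH));
        return true;
    }

    public static void main(String[] args) {
        // Toy detector for the demo only; real detection is far more involved.
        LanguageDetector toy = text -> text.contains(" the ") ? "en" : null;
        LanguageTaggerSketch tagger =
                new LanguageTaggerSketch("title,teaser,body", "language", "language_display", toy);

        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "Over the hills");
        doc.put("teaser", "and far away");
        tagger.process(doc);
        System.out.println(doc.get("language") + " / " + doc.get("language_display"));
    }
}
```

Keeping the detector behind a small interface means the library underneath can be swapped
without touching the field handling.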

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



