lucene-dev mailing list archives

From Jan Høydahl (JIRA) <>
Subject [jira] [Commented] (SOLR-1979) Create LanguageIdentifierUpdateProcessor
Date Tue, 02 Aug 2011 15:15:28 GMT


Jan Høydahl commented on SOLR-1979:

This has been tested on a real dataset of several hundred thousand documents, including HTML,
office documents and multiple other formats, and it works well.

I'd like some more pairs of eyes on this, however.

One thing which is less than perfect is that the threshold conversion from Tika currently
parses the (internal) distance value out of a String, for lack of a getDistance() method
(TIKA-568). This is a bit of a hack, but I argue it's a beneficial one, since we can now configure
langid.threshold to something meaningful for our own data instead of relying on the preset binary isReasonablyCertain().
As we also normalize to a value between 0 and 1, we abstract away the Tika implementation detail,
and are free to use any improved distance measures from Tika in the future, e.g. as a result
of TIKA-369, or even plug in a non-Tika identifier or a hybrid solution.
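
To make the conversion described above concrete, here is a minimal sketch of the idea, not the code from the patch. It assumes the identifier's string form looks like "en (0.018)", with the distance in trailing parentheses, and it assumes a simple linear mapping from distance to a 0-1 certainty. The class name LangIdCertainty, the method names and the SCALE constant are all hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LangIdCertainty {
    // Assumption: the identifier's string form ends with the distance in
    // parentheses, e.g. "en (0.018)". This format is not a documented contract.
    private static final Pattern DISTANCE =
        Pattern.compile("\\(([0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?)\\)\\s*$");

    // Hypothetical scale factor: controls how quickly certainty falls off
    // as the distance grows. Tune it against your own data.
    private static final double SCALE = 5.0;

    /** Extracts the distance value from the identifier's string form. */
    static double parseDistance(String identifierText) {
        Matcher m = DISTANCE.matcher(identifierText);
        if (!m.find()) {
            throw new IllegalArgumentException(
                "No distance found in: " + identifierText);
        }
        return Double.parseDouble(m.group(1));
    }

    /** Maps a distance (smaller is better) to a certainty clamped to [0, 1]. */
    static double certainty(double distance) {
        double c = 1.0 - SCALE * distance;
        return Math.max(0.0, Math.min(1.0, c));
    }

    /** Accepts the detection if its certainty reaches the configured threshold. */
    static boolean accept(String identifierText, double threshold) {
        return certainty(parseDistance(identifierText)) >= threshold;
    }

    public static void main(String[] args) {
        String text = "en (0.018)";  // made-up sample value
        double threshold = 0.5;      // stands in for langid.threshold
        System.out.println("certainty = " + certainty(parseDistance(text)));
        System.out.println("accepted  = " + accept(text, threshold));
    }
}
```

Because callers only ever see the normalized certainty, the parsing step could later be swapped for a real getDistance() call, or for a different identifier altogether, without changing how the threshold is configured.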

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>                 Key: SOLR-1979
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>            Priority: Minor
>              Labels: UpdateProcessor
>             Fix For: 3.4
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
> Language identification from document fields, and mapping of field names to language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.
