lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2839) add alternative language detection impl
Date Sun, 16 Oct 2011 14:28:12 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128409#comment-13128409 ]

Robert Muir commented on SOLR-2839:
-----------------------------------

{quote}
How does this impl compare with the Tika one for short texts? And wouldn't it make more sense
to add this on the Tika level letting the detection method be configurable? Then all Tika
users would benefit from it.
{quote}

I have no idea, probably not that great? But I didn't compare it to Tika.
Regarding short texts: http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/

{quote}
And wouldn't it make more sense to add this on the Tika level letting the detection method
be configurable? Then all Tika users would benefit from it.
{quote}

If someone wants to do this, then we can remove this implementation at that time. But for
Lucene/Solr, I am able to commit to this project, and I think that it's important for langid
to be pluggable to different implementations.

For example, maybe someone ports Google's detector (http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/)
to Java and we expose that too, which might be interesting for short texts.
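
To make "pluggable" a bit more concrete, here is a minimal sketch of what selecting between
detector implementations could look like. It assumes a hypothetical LanguageDetector interface
and registry; none of these names come from the SOLR-2839 patch, Tika, or language-detection.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical contract (an assumption, not from the patch): every detection
// backend returns a language code such as "en" for the given text, or null
// when it cannot decide.
interface LanguageDetector {
    String detect(String text);
}

// Hypothetical registry that picks a backend by its configured name.
final class LanguageDetectorRegistry {
    private final Map<String, LanguageDetector> detectors = new HashMap<>();

    // Makes a backend available under the given name.
    void register(String name, LanguageDetector detector) {
        detectors.put(name, detector);
    }

    // Returns the backend registered under the name, or fails loudly.
    LanguageDetector get(String name) {
        LanguageDetector detector = detectors.get(name);
        if (detector == null) {
            throw new IllegalArgumentException("No language detector named: " + name);
        }
        return detector;
    }
}
```

With something along these lines, a Tika-backed detector, a language-detection-backed one, and
a possible CLD port could each be registered under their own name and chosen by configuration,
without the calling code changing.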

                
> add alternative language detection impl
> ---------------------------------------
>
>                 Key: SOLR-2839
>                 URL: https://issues.apache.org/jira/browse/SOLR-2839
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.5, 4.0
>
>         Attachments: SOLR-2839.patch
>
>
> based on http://code.google.com/p/language-detection (apache license), supports 53 languages.
