nutch-dev mailing list archives

From "Dennis Kubes (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
Date Thu, 17 Dec 2009 15:09:18 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791960#action_12791960 ]

Dennis Kubes commented on NUTCH-666:
------------------------------------

BTW, the reason we wrote this code, which we created together with an NLP firm, rather than
using the current language identification tool in Nutch was accuracy.  With the current tool we
were getting around 70% accuracy, while the new tool routinely came in above 99.5%.
We trained on Wikipedia, and most of the errors we saw came from English characters appearing
in the other-language versions of the training data.

> Analysis plugins for multiple language and new Language Identifier Tool
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-666
>                 URL: https://issues.apache.org/jira/browse/NUTCH-666
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.1
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.1
>
>         Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch
>
>
> Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and
> Thai.  Also includes a new Language Identifier tool that uses the new indexing framework in
> NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

