nutch-dev mailing list archives

From "Dennis Kubes (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
Date Thu, 17 Dec 2009 15:05:20 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-666:
-------------------------------

    Attachment: NUTCH-666-2-20091217-nf.patch

Here is the patch as I last used it, almost a year ago now.  I am not sure whether it still
works with the current codebase.  It uses a hacky version of textcat to create fingerprint
files from content in known languages.  Those fingerprints form a dictionary, which is
configured through the textcat.conf file in the conf directory.  The Language Identifier tool
is then used to create a database of url -> language code, which was previously included using
the CustomFields job of the fields indexer.  I think the other language analysis plugins from
the previous patch acted off of the locale or the chosen language on the query side.
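
For anyone unfamiliar with the fingerprint idea, here is a rough sketch of how this kind of
identification can work.  It is not the code from the patch.  It assumes the classic
character n-gram rank-order scheme that textcat is known for, and every name in it
(fingerprint, out_of_place, identify, build_language_db) is made up for illustration.

```python
from collections import Counter


def fingerprint(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_fp, lang_fp):
    """Sum of rank differences; n-grams missing from lang_fp get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_fp)}
    max_penalty = len(lang_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )


def identify(text, fingerprints):
    """Return the language code whose fingerprint is closest to the text."""
    doc_fp = fingerprint(text)
    return min(fingerprints, key=lambda lang: out_of_place(doc_fp, fingerprints[lang]))


def build_language_db(pages, fingerprints):
    """Map each url to a language code (hypothetical stand-in for the url -> code database)."""
    return {url: identify(text, fingerprints) for url, text in pages.items()}
```

To use it, build one fingerprint per language from known-language text, e.g.
fingerprints = {"en": fingerprint(english_sample), "nl": fingerprint(dutch_sample)}, and pass
a dict of url -> page text to build_language_db.  Real fingerprints need far more training
text than a sentence or two to be reliable.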

> Analysis plugins for multiple language and new Language Identifier Tool
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-666
>                 URL: https://issues.apache.org/jira/browse/NUTCH-666
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.1
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.1
>
>         Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch
>
>
> Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and
> Thai.  Also includes a new Language Identifier tool that uses the new indexing framework in
> NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

