opennlp-dev mailing list archives

From "Paul (JIRA)" <j...@apache.org>
Subject [jira] Updated: (OPENNLP-67) NameFinderMe detecting organisations in an HTML sample with limited training
Date Wed, 19 Jan 2011 20:41:44 GMT

     [ https://issues.apache.org/jira/browse/OPENNLP-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul updated OPENNLP-67:
------------------------

    Attachment: htmltest.patch

> NameFinderMe detecting organisations in an HTML sample with limited training
> ----------------------------------------------------------------------------
>
>                 Key: OPENNLP-67
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-67
>             Project: OpenNLP
>          Issue Type: Question
>          Components: Name Finder
>    Affects Versions: tools-1.5.0-sourceforge
>            Reporter: Paul
>         Attachments: htmltest.patch
>
>
> I have attached a patch named htmltest.patch.
> The patch contains a test named NameFinderMEHtmlTest and two embedded resources named html1.train and html.html. html1.train is the training sample: a sample HTML document marked up with <START:organization> Org <END> tags. html.html is the same HTML document without the training markup. The HTML has been preprocessed to remove all line break characters.
> In NameFinderMEHtmlTest I train a model on this data and then use find to retrieve the names.
> Was I wrong to assume that NameFinderME would find the exact names in the HTML? NameFinderMEHtmlTest fails because it does not find the first name in full; it finds only part of it. Is this because of the limited training data, or does the find method perform badly on HTML documents?
> I am new to OpenNLP, so there is an element of guesswork as to which streams etc. I should be using.
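
To make the training markup described above concrete, here is a minimal sketch of how a line carrying <START:organization> ... <END> tags can be read as whitespace-separated tokens plus a token range for each name. NameMarkupSketch, Annotation and parse are hypothetical names used for illustration only; this is not OpenNLP's own parser, and the sample line is invented rather than taken from html1.train.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: it only illustrates how
// "<START:type> ... <END>" markup maps onto whitespace-separated tokens.
final class NameMarkupSketch {

    /** Token range [start, end) of one annotated name, plus its type. */
    static final class Annotation {
        final int start;
        final int end;
        final String type;

        Annotation(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /**
     * Splits a marked-up line on whitespace, appends the plain tokens to
     * tokensOut, and returns the annotated token ranges.
     */
    static List<Annotation> parse(String line, List<String> tokensOut) {
        List<Annotation> result = new ArrayList<>();
        int openStart = -1;      // index of the first token of the open name
        String openType = null;  // type taken from the <START:type> tag

        for (String piece : line.trim().split("\\s+")) {
            if (piece.isEmpty()) {
                continue; // an empty input line yields one empty piece
            }
            if (piece.startsWith("<START:") && piece.endsWith(">")) {
                openStart = tokensOut.size();
                openType = piece.substring("<START:".length(), piece.length() - 1);
            } else if (piece.equals("<END>")) {
                if (openStart >= 0) {
                    result.add(new Annotation(openStart, tokensOut.size(), openType));
                    openStart = -1;
                    openType = null;
                }
            } else {
                tokensOut.add(piece); // HTML tags and words are both plain tokens
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample line; the real training data is in html1.train.
        String line = "<td> <START:organization> Acme Widgets Ltd <END> </td>";
        List<String> tokens = new ArrayList<>();
        for (Annotation a : parse(line, tokens)) {
            System.out.println(a.type + ": "
                    + String.join(" ", tokens.subList(a.start, a.end)));
        }
    }
}
```

In this sketch a tag is recognized only when whitespace separates it from the surrounding HTML, so something like `<b><START:organization>` would be treated as an ordinary token. Checking the token boundaries around each marked-up name in this way may help when working out why only part of a name is returned.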

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

