opennlp-dev mailing list archives

From Jörn Kottmann (JIRA) <>
Subject [jira] Closed: (OPENNLP-67) NameFinderMe detecting organisations in an HTML sample with limited training
Date Tue, 25 Jan 2011 12:40:43 GMT


Jörn Kottmann closed OPENNLP-67.

    Resolution: Fixed

Issue can be closed. Used the HTML sample from Paul to create a unit test which verifies that
training data containing HTML tags is parsed correctly. Thanks for your help in providing the data.

We will continue the discussion we started in this issue on the mailing list, since that
is a better place for a more general discussion about how to train the name finder on
HTML data.

> NameFinderMe detecting organisations in an HTML sample with limited training
> ----------------------------------------------------------------------------
>                 Key: OPENNLP-67
>                 URL:
>             Project: OpenNLP
>          Issue Type: Question
>          Components: Name Finder
>    Affects Versions: tools-1.5.0-sourceforge
>            Reporter: Paul
>         Attachments: html.patch
> I have attached a patch named htmltest.patch.  
> The patch contains a test named NameFinderMEHtmlTest and 2 embedded resources named html1.train
> and html.html.  Obviously html1.train is the training sample, which is a sample HTML document
> marked up with <START:organization> Org <END> tags.  html.html is the same HTML
> document without the training markup.  The HTML has been preprocessed, with all the line break
> characters removed.
> In the NameFinderMEHtmlTest I am training the data and then using find to retrieve the
> Was my assumption wrong in thinking that NameFinderME would find the exact names from
> the HTML?  I mean exact in this context because both the training HTML and the test HTML are
> the same.  The NameFinderMEHtmlTest fails because it does not find the first name; it does
> find part of the name.  Is this because it has limited training, or is the find method performing
> badly against HTML documents?
> I am new to OpenNLP so there is an element of guesswork as to which streams etc. I should
> be using.
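
For readers following the thread, here is a minimal, self-contained sketch of how a
whitespace-tokenized training line with <START:organization> ... <END> markup can be split
into tokens and name spans while HTML tags stay in place as ordinary tokens. It is not
OpenNLP's own parser, nor the unit test mentioned above. The class and method names
(NameSampleSketch, parse, TypedSpan) and the sample sentence are made up for illustration,
and the sketch assumes every markup tag and every HTML tag is separated from its neighbours
by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical stand-in, not OpenNLP's implementation.
public class NameSampleSketch {

    /** Half-open token span [start, end) labelled with an entity type. */
    public static final class TypedSpan {
        public final int start;
        public final int end;
        public final String type;

        TypedSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ") " + type;
        }
    }

    /** Tokens with the markup removed, plus the spans the markup described. */
    public static final class Parsed {
        public final List<String> tokens = new ArrayList<>();
        public final List<TypedSpan> spans = new ArrayList<>();
    }

    public static Parsed parse(String line) {
        Parsed result = new Parsed();
        String openType = null; // type of the currently open <START:...>, if any
        int openStart = -1;     // token index where the open span begins

        for (String piece : line.trim().split("\\s+")) {
            if (piece.isEmpty()) {
                continue; // happens only for an empty input line
            }
            if (piece.startsWith("<START:") && piece.endsWith(">")) {
                if (openType != null) {
                    throw new IllegalArgumentException("nested <START:...> tag");
                }
                openType = piece.substring("<START:".length(), piece.length() - 1);
                openStart = result.tokens.size();
            } else if (piece.equals("<END>")) {
                if (openType == null) {
                    throw new IllegalArgumentException("<END> without <START:...>");
                }
                result.spans.add(new TypedSpan(openStart, result.tokens.size(), openType));
                openType = null;
            } else {
                // Everything else, including HTML tags such as <p> or </b>,
                // is kept as a plain token.
                result.tokens.add(piece);
            }
        }
        if (openType != null) {
            throw new IllegalArgumentException("<START:...> without <END>");
        }
        return result;
    }

    public static void main(String[] args) {
        Parsed p = parse(
            "<p> Contact <b> <START:organization> Acme Widgets Ltd <END> </b> today . </p>");
        System.out.println(p.tokens);
        System.out.println(p.spans);
    }
}
```

The point of the sketch is that only the <START:...> and <END> markers are consumed; HTML
tags survive as tokens and therefore shift the token indices of every span that follows them.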

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
