opennlp-dev mailing list archives

From Jörn Kottmann (JIRA) <j...@apache.org>
Subject [jira] Commented: (OPENNLP-67) NameFinderMe detecting organisations in an HTML sample with limited training
Date Wed, 19 Jan 2011 23:47:43 GMT

    [ https://issues.apache.org/jira/browse/OPENNLP-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983936#action_12983936
] 

Jörn Kottmann commented on OPENNLP-67:
--------------------------------------

Hi Paul,

First, that might be a bit too much to include as a training sample and simply
label as AL, since it looks like it was extracted from various websites.

Besides that, is this the file you used to train the name finder to detect the
organization names in the content extraction application you are working on?

You should be really careful about the way you tokenize the HTML files: the
tokenization must be identical in the training data and in the new data you run
the model on.

In the training file all tokens must be whitespace tokenized. In your case
several tags often build one token. I do not think that this is helpful; it is
even harmful, because you would get different results depending on how your
input HTML is formatted.
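
One way to keep the tokenization independent of the HTML formatting is sketched
below in plain Java. It is only an illustration, not part of OpenNLP: the class
name HtmlWhitespaceTokenizer is hypothetical, and reducing every tag to its bare
name (dropping attributes) is an assumption made so that each tag ends up as
exactly one whitespace-separated token. It is meant for raw HTML, before the
<START:...> and <END> markers are added.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP class.
final class HtmlWhitespaceTokenizer {

    // Matches an opening or closing tag and captures the optional slash and
    // the tag name. Attribute values that contain '>' are not handled.
    private static final String TAG = "<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>";

    /** Splits HTML into tokens so that every tag is a token of its own. */
    static List<String> tokenize(String html) {
        // Assumption: attributes are dropped, so <a href="x"> becomes <a>.
        String spaced = html.replaceAll(TAG, " <$1$2> ");
        List<String> tokens = new ArrayList<>();
        for (String token : spaced.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Joins the tokens with single spaces, as one whitespace-tokenized line. */
    static String toLine(String html) {
        return String.join(" ", tokenize(html));
    }

    public static void main(String[] args) {
        // Both inputs differ only in formatting and give the same line.
        System.out.println(toLine("<td><b>Acme Ltd</b></td>"));
        System.out.println(toLine("<td>\n  <b>Acme Ltd</b>\n</td>"));
    }
}
```

Running the same method over the training HTML and over the HTML seen at run
time gives both the same token boundaries, which is the property that matters
here.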

I would also recommend not pushing everything into one sentence. The name
finder only looks at a few tokens before the token it is analyzing and a few
tokens after it, so you gain nothing from having just one super long line in
your training file.
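
A possible way to break one long line into shorter ones is sketched below. It
assumes the input is already a list of whitespace-separated tokens in which each
tag is its own token. The class name ShortLineSplitter and the choice of closing
tags that end a line are hypothetical; which tags make sensible boundaries
depends on the markup at hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not an OpenNLP class.
final class ShortLineSplitter {

    // Assumption: these closing tags mark the end of a self-contained record.
    private static final Set<String> LINE_END_TAGS =
            Set.of("</p>", "</tr>", "</li>", "</div>");

    /** Groups tokens into lines, ending a line after each boundary tag. */
    static List<String> split(List<String> tokens) {
        List<String> lines = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (LINE_END_TAGS.contains(token)) {
                lines.add(String.join(" ", current));
                current.clear();
            }
        }
        // Keep trailing tokens that were not closed by a boundary tag.
        if (!current.isEmpty()) {
            lines.add(String.join(" ", current));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
                "<tr>", "<td>", "Acme", "</td>", "</tr>",
                "<tr>", "<td>", "Foo", "</td>", "</tr>");
        split(tokens).forEach(System.out::println);
    }
}
```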

That said, I think you should make a few simple modifications to your feature
generation. The default feature generation uses window feature generators;
increase the window a little and see if that helps your result. You can create
a training file and a test file, and then measure the result on your test file.
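
To make the idea of a token window concrete, here is a small plain-Java sketch.
It is not OpenNLP's implementation: the class name WindowFeatureSketch and the
feature strings ("w=", "p1w=", "n1w=") are made up for illustration. It only
shows that the features for one token are drawn from a fixed number of
neighbours on each side, and that widening the window adds neighbours.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; feature names are hypothetical.
final class WindowFeatureSketch {

    /**
     * Collects one feature for the token at index, plus one per neighbour
     * for up to prevWindow tokens before it and nextWindow tokens after it.
     */
    static List<String> features(String[] tokens, int index,
                                 int prevWindow, int nextWindow) {
        List<String> features = new ArrayList<>();
        features.add("w=" + tokens[index]);
        for (int i = 1; i <= prevWindow && index - i >= 0; i++) {
            features.add("p" + i + "w=" + tokens[index - i]);
        }
        for (int i = 1; i <= nextWindow && index + i < tokens.length; i++) {
            features.add("n" + i + "w=" + tokens[index + i]);
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"<td>", "Acme", "Ltd", "</td>"};
        System.out.println(features(tokens, 1, 1, 1));
        System.out.println(features(tokens, 1, 2, 2));
    }
}
```

With a window of one token on each side, the closing tag after "Ltd" is
invisible while "Acme" is classified; with a window of two it becomes part of
the context. Tokens further away never contribute, however long the line is.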

Back to the patch: it would be nice to have the train file modified so that it
has a few short lines, each containing a few sample records embedded in HTML
tags. That way we can use the sample data stream to test whether we really get
what we expect. Maybe 10 lines would already be enough.
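
For a test built on such short lines, the expected result can be derived from
the annotated line itself. The sketch below is plain Java and not the library's
own reader for this format: the class AnnotatedLineSketch and its records are
hypothetical, and the sample line in main is invented for illustration. It
strips the <START:type> and <END> markers from a whitespace-tokenized line and
records which token ranges they enclosed, so the spans a trained model returns
can be compared against them. It needs Java 16 or later because of the records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP class.
final class AnnotatedLineSketch {

    /** A labelled token range: start inclusive, end exclusive. */
    record Span(int start, int end, String type) {}

    /** Tokens with the markers removed, plus the spans the markers enclosed. */
    record Sample(List<String> tokens, List<Span> spans) {}

    static Sample parse(String line) {
        List<String> tokens = new ArrayList<>();
        List<Span> spans = new ArrayList<>();
        int start = -1;      // index of the first token of an open span
        String type = null;  // type of the open span, e.g. "organization"
        for (String part : line.trim().split("\\s+")) {
            if (part.startsWith("<START:") && part.endsWith(">")) {
                start = tokens.size();
                type = part.substring("<START:".length(), part.length() - 1);
            } else if (part.equals("<END>") && start >= 0) {
                spans.add(new Span(start, tokens.size(), type));
                start = -1;
            } else if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return new Sample(tokens, spans);
    }

    public static void main(String[] args) {
        // Invented sample line in the annotation style used by the train file.
        Sample sample = parse("<td> <START:organization> Acme Ltd <END> </td>");
        System.out.println(sample.tokens());
        System.out.println(sample.spans());
    }
}
```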

Hope this helps,
Jörn

> NameFinderMe detecting organisations in an HTML sample with limited training
> ----------------------------------------------------------------------------
>
>                 Key: OPENNLP-67
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-67
>             Project: OpenNLP
>          Issue Type: Question
>          Components: Name Finder
>    Affects Versions: tools-1.5.0-sourceforge
>            Reporter: Paul
>         Attachments: htmltest.patch
>
>
> I have attached a patch named htmltest.patch.
> The patch contains a test named NameFinderMEHtmlTest and 2 embedded resources
> named html1.train and html.html.  Obviously html1.train is the training
> sample, which is a sample HTML document marked up with
> <START:organization> Org <END> tags.  html.html is the same HTML document
> without the training markup.  The HTML has been preprocessed with all the
> line break characters removed.
> In the NameFinderMEHtmlTest I am training the data and then using find to
> retrieve the names.
> Was my assumption wrong in thinking that NameFinderME would find the exact
> names from the HTML?  I mean exact in this context because both the training
> HTML and the test HTML are the same.  The NameFinderMEHtmlTest fails because
> it does not find the first name; it does find part of the name.  Is this
> because it has limited training, or is the find method performing badly
> against HTML documents?
> I am new to OpenNLP so there is an element of guesswork as to which streams
> etc. I should be using.


