opennlp-dev mailing list archives

From Boris Galitsky <>
Subject RE: any hints on how to get chunking info from Parse?
Date Thu, 01 Dec 2011 19:08:39 GMT

Hi Jörn,
  I spent the last couple of weeks understanding how the OpenNLP parser does chunking, how chunking
occurs separately in, and I came to the conclusion that applying an independently
trained chunker to the results of the parser gives significantly higher accuracy in the resulting
parse, and therefore makes the 'similarity' component much more accurate.
Let's look at an example (I added stars).
Two chunks, an NP and a VP, are extracted, but what kills the similarity component is the last part of the VP:
****to-TO drive-NN****
Parse Tree Chunk list = [NP [Its-PRP$ classy-JJ design-NN and-CC the-DT Mercedes-NNP name-NN
], VP [make-VBP it-PRP a-DT very-RB cool-JJ vehicle-NN *******to-TO drive-NN**** ]]

When I apply the chunker, which has its own problems (but most importantly was trained independently),
I can then apply rules to fix these cases so that they match other sub-VPs like 'to-VB'.
I understand it works slower that way.
I would propose we have two versions of similarity: one that works without the chunker and
one that uses it (and possibly also an additional 'correction' algorithm?).
I have both versions now, but only the latter passes the current tests.
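
To make the kind of correction rule mentioned above concrete, here is a minimal sketch. It is only an illustration, not the actual code in the similarity component: the class name `InfinitiveTagFixer`, the method `fixInfinitives`, and the rule itself (retag a token as VB when it is tagged NN and directly follows a TO token) are assumptions made for this example. It works on plain `word-TAG` strings, as in the chunk list above, and does not call OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP or of the similarity component.
final class InfinitiveTagFixer {

    // Assumed rule: a token tagged NN that directly follows a token tagged TO
    // is retagged VB, so "to-TO drive-NN" becomes "to-TO drive-VB".
    static List<String> fixInfinitives(List<String> taggedTokens) {
        List<String> fixed = new ArrayList<>(taggedTokens.size());
        boolean previousWasTo = false;
        for (String token : taggedTokens) {
            int sep = token.lastIndexOf('-');
            if (sep <= 0) {
                // No "word-TAG" structure: keep the token untouched.
                fixed.add(token);
                previousWasTo = false;
                continue;
            }
            String word = token.substring(0, sep);
            String tag = token.substring(sep + 1);
            if (previousWasTo && tag.equals("NN")) {
                tag = "VB";
            }
            fixed.add(word + "-" + tag);
            previousWasTo = tag.equals("TO");
        }
        return fixed;
    }

    public static void main(String[] args) {
        List<String> vp = Arrays.asList(
                "make-VBP", "it-PRP", "a-DT", "very-RB",
                "cool-JJ", "vehicle-NN", "to-TO", "drive-NN");
        System.out.println(fixInfinitives(vp));
    }
}
```

A rule this crude would also retag a genuine noun after a prepositional 'to' (for example 'to-TO school-NN'), so a real correction step would need more context than the previous tag alone.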

> Date: Thu, 17 Nov 2011 19:49:50 +0100
> From:
> To:
> Subject: Re: any hints on how to get chunking info from Parse?
> On 11/17/11 7:08 PM, Boris Galitsky wrote:
> > Yes, I will try
> >
> > and meanwhile the question is: what is wrong with using
> > ?
> You are then doing it twice. The chunk information is already present inside
> the parse tree. So if you have a Parse object already, you should extract the
> chunk information from it instead of running the chunker again.
> It is also harder to use, because a user then needs to provide you with a Parse
> object and a chunker instance. For the same reason it is harder to test as well.
> It will be slower because chunking needs to be done twice, and I guess there are
> a couple more reasons why this is not the preferred solution.
> Let me know if you need help.
> Jörn