opennlp-users mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: An observation about the MAXENT tagger and CAPS
Date Tue, 26 Jul 2011 08:23:41 GMT
You can find our current documentation here:
http://incubator.apache.org/opennlp/documentation/manual/opennlp.html

Why do you have more events and fewer outcomes in your second run?
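
One way to look into this is to compare the tag sets of the two sample streams. The snippet below is only a sketch in Python, assuming both streams can be dumped as lists of (word, tag) pairs; the helper names and the toy data are hypothetical and are not part of OpenNLP or the CoNLL corpus.

```python
from collections import Counter

def tag_counts(samples):
    """Count how often each tag occurs in a list of (word, tag) pairs."""
    return Counter(tag for _, tag in samples)

def missing_tags(reference, candidate):
    """Return tags that occur in the reference stream but not in the candidate."""
    return sorted(set(tag_counts(reference)) - set(tag_counts(candidate)))

# Hypothetical toy data standing in for the two sample streams.
reference = [("The", "DT"), ("port", "NN"), ("Port", "NNP"), ("opens", "VBZ")]
candidate = [("THE", "DT"), ("PORT", "NN"), ("PORT", "NN"), ("OPENS", "VBZ")]

# Tags the second stream never produces; each one is a lost outcome.
print(missing_tags(reference, candidate))
```

If the list is not empty, the second stream is losing or remapping tags somewhere before training, which would explain a lower outcome count.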

In 1.5.1 we now have a built-in converter for conll06 data; you can see how
to use it with this command:
bin/opennlp POSTaggerConverter conllx

It is not yet described in our documentation,
so any help is welcome.

Jörn

On 7/26/11 1:26 AM, vishvAs vAsuki wrote:
> Here is an observation about the MAXENT tagger which may be of interest to
> others.
>
> I recently tried to replicate the tagging results described in the wiki
> (http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Conll06#Train_a_tokenizer_model)
> while calling the tagging API
> (http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Postagger)
> from my Scala code. As with the command line tool, I was using the
> parameters numIterations = 100 and event-threshold = 5. The only difference
> was in how the sample stream passed to the tagging API was created: I was
> using my own Scala code to create the sample stream (which looked fine to
> the naked eye). However, my code was reading the words in all CAPS. This
> resulted in a slight but noticeable decline in accuracy: 0.96 vs. 0.95.
> (More detailed output appended.)
>
> Note that the sample streams for both the test and training data were in
> CAPS, so maybe the model treats “Port” and “port” differently.
>
>
> === Command line case ===
> Sorting and merging events... done. Reduced 206678 events to 193001.
> ...
>          Number of Event Tokens: 193001
>              Number of Outcomes: 22
>            Number of Predicates: 29155
> ...done.
> Computing model parameters...
> Performing 100 iterations.
>    1:  .. loglikelihood=-638850.4721742678       0.13807468622688432
> ..
> 100:  .. loglikelihood=-13827.506953520902      0.9901537657612325
> Accuracy: 0.9659110277825124
>
> === My code ===
> Sorting and merging events... done. Reduced 206678 events to 193059.
> Done indexing.
> Incorporating indexed data for training...
> done.
>          Number of Event Tokens: 193059
>              Number of Outcomes: 16
>            Number of Predicates: 27709
> ...done.
> Computing model parameters...
> Performing 100 iterations.
>    1:  .. loglikelihood=-573033.0919349034        0.13807468622688432
> ..
> 100:  .. loglikelihood=-18019.22974368408        0.9831041523529355
> Evaluating ... Accuracy: 0.9500596557013806
>
> --
> Cheers,
> vishvAs
>
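
The guess in the quoted message, that the model treats “Port” and “port” differently, can be illustrated with a small Python sketch. The feature function here is a hypothetical stand-in, not OpenNLP's actual context generator; it only assumes that the tagger sees the token itself plus some capitalization cues.

```python
def surface_features(word):
    """Toy feature set: the token itself plus two capitalization cues."""
    return {
        "w=" + word,
        "initcap=" + str(word[:1].isupper() and not word.isupper()),
        "allcaps=" + str(word.isupper()),
    }

def distinct_after(words, transform):
    """Count distinct feature sets once `transform` is applied to each word."""
    return len({frozenset(surface_features(transform(w))) for w in words})

words = ["Port", "port", "PORT"]

print(distinct_after(words, lambda w: w))  # original casing: 3 distinct feature sets
print(distinct_after(words, str.upper))    # all caps: 1 distinct feature set
```

Under that assumption, upper-casing the input leaves the model with no way to tell the three forms apart, so any cue that casing gave about the tag is gone. This would be consistent with the lower predicate count in the second run, though the logs alone do not confirm the cause.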

