opennlp-users mailing list archives

From James Kosin <>
Subject Re: New to opennlp
Date Sat, 21 Apr 2012 22:43:05 GMT
On 4/21/2012 12:40 PM, Jim - FooBar(); wrote:
> On 13/02/12 23:07, Michael Collins wrote:
>> Does opennlp provide a way to create the *.train file based on a body
>> of text which I provide, or is the *.train file created another way?
> Apart from the sentence detector there is no way to automatically
> create training data for other tasks (POS, NER, etc.); these are often
> language and domain dependent. For the sentence detector, however, it is
> easy to create your own private training data (as Jorn said) targeted
> especially at your problem domain, assuming of course that the
> pre-trained model is not good enough for you. I find it's pretty
> good! :)
> Jim
The training data is based on a corpus of text already annotated for POS
tags, names, or other purposes.  Usually the annotation is done by hand,
or generated automatically and then rechecked by humans to verify accuracy.
Unfortunately for most, the corpora are usually copyrighted text, meaning
they cannot be freely distributed.  Most providers release only the
annotation data, which has to be merged with the original text (i.e., you
have to run scripts that take multiple files and merge them with the data
to get the final corpus), or they provide only small samples of a corpus.
Either way, the copyright usually prohibits commercial usage, or usage
for any reason other than research.
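
To give a rough idea of what such a merge script does, here is a minimal
sketch in Python. It assumes the annotations arrive as token spans
(start, end, type) kept separately from the tokenized text, and it writes
the <START:type> ... <END> markup that the OpenNLP name finder expects in
its training lines. The function name and the span layout are made up for
illustration; real corpora each use their own file formats.

```python
def merge_standoff(tokens, spans):
    """Wrap annotated token spans in <START:type> ... <END> markers.

    tokens: list of tokens from the original (separately obtained) text.
    spans:  list of (start, end, type) tuples, end exclusive,
            assumed to be non-overlapping (hypothetical layout).
    """
    starts = {start: label for start, _end, label in spans}
    ends = {end for _start, end, _label in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in ends:
            out.append("<END>")          # close a span that ended before this token
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(tok)
    if len(tokens) in ends:
        out.append("<END>")              # close a span that runs to the last token
    return " ".join(out)


tokens = "Pierre Vinken will join the board .".split()
spans = [(0, 2, "person")]
line = merge_standoff(tokens, spans)
print(line)
```

Each merged sentence would go on its own line of the training file.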

We do have projects we want to start in order to build our own corpus
from freely available text that we can distribute freely for any purpose
based on OpenNLP.

This is also why our models are currently on SourceForge only: they are
distributed under licenses that are not Apache friendly.

