opennlp-users mailing list archives

From James Kosin <>
Subject Re: New to opennlp
Date Mon, 23 Apr 2012 02:02:31 GMT
On 4/22/2012 6:18 AM, Jim - FooBar(); wrote:
> On 21/04/12 23:30, James Kosin wrote:
>> On 4/21/2012 12:40 PM, Jim - FooBar(); wrote:
>>> On 13/02/12 23:07, Michael Collins wrote:
>>>> Does opennlp provide a way to create the *.train file based on a body
>>>> of text which I provide, or is the *.train file created another way.
>>> Apart from the sentence detector there is no way to automatically
>>> create training data for other tasks (POS,NER etc)...these are often
>>> language and domain dependant. For the sentence detector however it is
>>> easy to create your own private training data (as Jorn said) targeted
>>> especially for your problem domain. assuming of course that the
>>> pre-trained model is not good enough for you...i find it's pretty
>>> good! :)
>>> Jim
>> Also, unlike a lot of the other models, the sentence detector can
>> actually be trained and works quite well with just a few sentences to
>> train on.  ~20-30 does really well.
>> James
> Wow!!! did not know that!!! I thought the sentence detector needs
> thousands of sentences just like the other models! Thanks James...
> Jim

The sentence detector is probably the simplest model; the next simplest would be the tokenizer.

The sentence detector only needs to learn where sentences end.  In most
cases that is a '.' or other terminating punctuation.  I even trained on
a few sentences containing abbreviations with a '.' in them.  Of course,
in my case, with so few sample sentences, I had to change the cutoff
parameter to 1 instead of the default 5.
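For anyone following along, here is a rough sketch of that kind of training run against the OpenNLP 1.5 API.  The file names (en-sent.train, en-sent.bin) are placeholders; the training file is just one sentence per line.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSentDetect {
    public static void main(String[] args) throws Exception {
        // Training data: one sentence per line.
        ObjectStream<String> lines = new PlainTextByLineStream(
            new InputStreamReader(new FileInputStream("en-sent.train"), "UTF-8"));
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        // Lower the event cutoff from the default 5 to 1, so features
        // seen only once in a tiny sample set are still kept.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(1));

        SentenceModel model =
            SentenceDetectorME.train("en", samples, true, null, params);

        FileOutputStream out = new FileOutputStream("en-sent.bin");
        try {
            model.serialize(out);
        } finally {
            out.close();
        }
    }
}
```

The command-line tool should do the same thing; if I remember right it is something like `bin/opennlp SentenceDetectorTrainer -lang en -data en-sent.train -model en-sent.bin -cutoff 1`.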

The tokenizer, though, is trained to do more than just split off
punctuation, so it will require a bit more data.
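As I understand the format, tokenizer training data is whitespace-separated text with a <SPLIT> tag wherever two tokens touch without whitespace, for example:

```
A money-back guarantee<SPLIT>, he said<SPLIT>.
```

So the model has to learn all the places tokens join, not just the final period.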

The harder ones, like POS, NameFinder, etc., require large volumes of
data to be trained reliably.
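For reference, those formats (per my reading of the OpenNLP manual) look roughly like this; the POS trainer takes word_TAG pairs and the name finder takes <START>/<END> markup:

```
About_IN 10_CD years_NNS ago_RB ._.

<START:person> Pierre Vinken <END> , 61 years old , will join the board .
```

Annotating enough of that by hand is the expensive part, which is why these models need so much more data.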
