opennlp-users mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: abbreviation diccionary format
Date Tue, 10 Apr 2012 14:51:26 GMT
On 04/10/2012 04:44 PM, Joan Codina wrote:
> But to train the system I only found this file, which is small:
> http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
>
> It contains only 121 sentences. I don't know whether this is enough or
> whether there are other annotated training models.

No, 121 sentences are not enough. Get a training data set for the
language you need. Most of the data sets referenced in the Corpora
section can be used to train the tokenizer: these corpora are already
tokenized, so they can be de-tokenized into training data for the
tokenizer.
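
To illustrate the de-tokenization step, here is a minimal Python sketch. It
assumes the corpus is available as one list of tokens per sentence, and it
writes the one-sentence-per-line tokenizer training format in which a token
boundary that has no whitespace in the running text is marked with <SPLIT>.
The rule sets MOVE_LEFT and MOVE_RIGHT and the function name detokenize are
hypothetical and only cover a few punctuation marks; they are a simplified
stand-in, not the detokenizer that ships with OpenNLP, and a real corpus
will need language-specific rules.

```python
# Hypothetical rule sets (assumptions for illustration only).
# Tokens in MOVE_LEFT attach to the previous token without a space;
# tokens in MOVE_RIGHT attach to the following token without a space.
MOVE_LEFT = {".", ",", ";", ":", "!", "?", ")", "]", "}"}
MOVE_RIGHT = {"(", "[", "{"}
SPLIT = "<SPLIT>"


def detokenize(tokens):
    """Join tokens into one training line, marking removed spaces with <SPLIT>."""
    parts = []
    for i, tok in enumerate(tokens):
        if i > 0:
            prev = tokens[i - 1]
            if tok in MOVE_LEFT or prev in MOVE_RIGHT:
                # No whitespace here in running text: record a split point.
                parts.append(SPLIT)
            else:
                # Ordinary token boundary: whitespace separates the tokens.
                parts.append(" ")
        parts.append(tok)
    return "".join(parts)


if __name__ == "__main__":
    sample = ["He", "said", "(", "quietly", ")", ",", "no", "."]
    print(detokenize(sample))
```

Running detokenize over every sentence and writing one result per line gives
a file in the same shape as token.train, only much larger.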

Jörn
