opennlp-users mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: abbreviation dictionary format
Date Wed, 11 Apr 2012 07:51:30 GMT
On 04/11/2012 09:16 AM, Joan Codina wrote:
> Ok,
> I will try it,
> but doesn't this introduce a bias, since the de-tokenizer has a few rules?
>
> Is there no way to incrementally train an existing model, or to just
> add a dictionary of abbreviations to an existing model?

No, we cannot complement an existing model with additional training data.
You need to re-train the whole thing with all the data.

Well, you can add a dictionary to the model, but the model would not
know about the new features the dictionary produces, because it was
trained without them.

If you assume that the text was tokenized correctly, undoing the
tokenization with the rule-based de-tokenizer usually produces something
very close to the original text. In some cases you even want to
de-tokenize a bit too much to get a better tokenizer.
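
To illustrate the idea, here is a minimal sketch in Python of how
already tokenized text can be turned back into tokenizer training lines.
It relies on the OpenNLP training format, where a token boundary that has
no whitespace is marked with <SPLIT>. The rule sets MERGE_TO_LEFT and
MERGE_TO_RIGHT and the function detokenize_for_training are hypothetical
names made up for this example; they are not the OpenNLP detokenizer or
its dictionary format, only a rough stand-in for a few merge rules.

```python
# Hypothetical rule tables (assumption): tokens that attach to a neighbour
# without whitespace. A real de-tokenizer dictionary would be larger.
MERGE_TO_LEFT = {".", ",", "!", "?", ";", ":", ")", "]"}
MERGE_TO_RIGHT = {"(", "["}

# Marker used in OpenNLP tokenizer training data for a token boundary
# that is not separated by whitespace.
SPLIT = "<SPLIT>"


def detokenize_for_training(tokens):
    """Join tokens into one training line.

    A boundary where the rules remove the space is written as <SPLIT>;
    every other boundary is written as a single space.
    """
    parts = []
    for i, token in enumerate(tokens):
        if i > 0:
            previous = tokens[i - 1]
            if token in MERGE_TO_LEFT or previous in MERGE_TO_RIGHT:
                parts.append(SPLIT)
            else:
                parts.append(" ")
        parts.append(token)
    return "".join(parts)


def raw_text(training_line):
    """Strip the markers to see the de-tokenized text itself."""
    return training_line.replace(SPLIT, "")


if __name__ == "__main__":
    line = detokenize_for_training(["Mr.", "Smith", "arrived", "."])
    print(line)
    print(raw_text(line))
```

Because "Mr." is a single token in the input, no <SPLIT> ends up inside
it, so the re-trained tokenizer sees the abbreviation as one unit. Adding
more entries to the merge sets is one way to "de-tokenize a bit too much".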

Jörn
