opennlp-users mailing list archives

From "Jim - FooBar();" <>
Subject Re: abbreviation diccionary format
Date Tue, 10 Apr 2012 09:31:05 GMT
To train a model of any type you need training data. The pretrained
English tokenizer was trained on the CoNLL shared task data, if I remember
correctly. Maybe one of the developers can shed some light on
this. Anyway, I don't think you need a dictionary, but rather training data
in the following form (one sentence per line in the file; <SPLIT> marks a
token boundary that has no whitespace around it):

Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.
Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing 
Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold Fields PLC<SPLIT>, was named a nonexecutive director of this British industrial conglomerate<SPLIT>.
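
To make the format concrete, here is a minimal Python sketch (not part of
OpenNLP; the function names are made up for illustration). It assumes that
tokens are separated either by whitespace or by the <SPLIT> tag, and that
you already have a tokenized version of each sentence to generate the
training lines from.

```python
SPLIT = "<SPLIT>"  # tag taken from the sample above


def tokens_from_annotated(line):
    """Recover the token list from one annotated training line.

    Assumption: whitespace and <SPLIT> are the only token boundaries.
    """
    tokens = []
    for chunk in line.split():
        tokens.extend(part for part in chunk.split(SPLIT) if part)
    return tokens


def annotate(sentence, tokens):
    """Build a training line from a raw sentence and its tokens.

    <SPLIT> is inserted wherever two tokens touch in the raw text;
    a single space is written wherever whitespace separated them.
    """
    out = []
    pos = 0
    for tok in tokens:
        start = sentence.index(tok, pos)  # ValueError if the token is missing
        gap = sentence[pos:start]
        if gap.strip():
            raise ValueError(f"text not covered by any token: {gap!r}")
        if out:
            out.append(" " if gap else SPLIT)
        out.append(tok)
        pos = start + len(tok)
    return "".join(out)


if __name__ == "__main__":
    raw = "Pierre Vinken, 61 years old."
    toks = ["Pierre", "Vinken", ",", "61", "years", "old", "."]
    print(annotate(raw, toks))
```

Abbreviations such as "Nov." or "N.V." simply stay as one token in the
token list, so no <SPLIT> is written before their final period.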

Hope that helps,


P.S.: Did you mean an abbreviation dictionary? Well, you can't really 
train a model using an abbreviation dictionary...

On 10/04/12 09:02, Joan Codina wrote:
> I sent this a few days ago, but I got no answer :-(( :
> To train a tokenizer I can use a dictionary, but
> where is the dictionary used to train the current English model? And
> where can I find information about the dictionary format, so I can,
> at least, generate my own?
> Thanks
> Joan Codina
