opennlp-users mailing list archives

From Joan Codina <Joan.Cod...@upf.edu>
Subject Re: abbreviation diccionary format
Date Tue, 10 Apr 2012 14:44:32 GMT
Thanks.
I know I need training data with the <space> markup, but if I can add a
list of domain abbreviations, I hope to solve some of the problems I
have with tokenization.
I will also expand the training set a bit with sentences I find
problematic.
However, the only training file I have found is this one:
http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup
It contains only 121 sentences. I don't know whether that is enough, or
whether other annotated training sets are available.


Joan



On 10/04/12 15:20, Jim - FooBar(); wrote:
> On 10/04/12 14:18, Jörn Kottmann wrote:
>> On 04/10/2012 03:15 PM, Jim - FooBar(); wrote:
>>>
>>> But you still cannot "train" anything (maxent/perceptron) on the
>>> dictionary, can you?
>>> One needs training data for that, yes?
>>
>> The dictionary is used to produce additional features for our standard
>> feature set.
>> Therefore you still need training data to train our statistical
>> tokenizer, even though the feature generation can use a dictionary to
>> produce features.
>>
>> Jörn
>
> aha ok, that makes sense...
>
> Jim
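
To make the point above concrete, here is a minimal Python sketch of how a
dictionary can contribute one extra feature at a candidate split point
without replacing training. It is not OpenNLP's actual feature generator:
the function name split_features, the feature strings and the sample
abbreviations are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; this is NOT OpenNLP's real feature generator.
# All names and feature strings here are hypothetical.
from typing import List, Set


def split_features(token: str, index: int, abbreviations: Set[str]) -> List[str]:
    """Build features for deciding whether to split `token` before `index`."""
    if not 0 < index < len(token):
        raise ValueError("index must point inside the token")

    prefix, suffix = token[:index], token[index:]
    features = [
        f"prefix={prefix}",           # text left of the candidate split
        f"suffix={suffix}",           # text right of the candidate split
        f"next_char={suffix[0]}",     # character just after the split point
    ]

    # Dictionary-derived feature: the candidate split would detach a final
    # period from a token that is listed as a known abbreviation.
    if suffix == "." and token in abbreviations:
        features.append("abbreviation")

    return features


if __name__ == "__main__":
    domain_abbreviations = {"Dr.", "etc."}  # hypothetical domain list
    print(split_features("Dr.", 2, domain_abbreviations))
    print(split_features("end.", 3, domain_abbreviations))
```

The dictionary only adds evidence here. A classifier still has to learn
from annotated sentences how much weight the "abbreviation" feature
deserves, which is why training data remains necessary.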

-- 

Joan Codina Filbà
Departament de Tecnologia
Universitat Pompeu Fabra

