opennlp-users mailing list archives

From Joan Codina <Joan.Cod...@upf.edu>
Subject Re: abbreviation diccionary format
Date Fri, 20 Apr 2012 06:33:37 GMT
From this text:
"
in an Oct. 19 review of `` The Misanthrope '' at Chicago 's Goodman 
Theatre ( `` Revitalized Classics Take the Stage in Windy City , '' 
Leisure & Arts ) , the role of Celimene , played by Kim Cattrall , was 
mistakenly attributed to Christina Haag .
"

I get:
"
in an Oct. 19 review of ``The Misanthrope'' at Chicago's Goodman Theatre 
(``Revitalized Classics Take the Stage in Windy City,'' Leisure & Arts), 
the role of Celimene, played by Kim Cattrall, was mistakenly attributed 
to Christina Haag.
"

So the processing is correct, but the <SPLIT> tags are missing, for
example at "Haag." or "Chicago's".
I wonder whether I am missing a parameter or need a different dictionary.
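
To make the expected output concrete, here is a minimal Python sketch of what
I understand the <SPLIT> insertion to mean. It is a toy and not OpenNLP's
implementation: the rule sets ATTACH_LEFT and ATTACH_RIGHT and both function
names are hypothetical, and they do not reflect the contents of
en-detokenizer.xml. Wherever two tokens are merged without whitespace, the
marker is written instead of a space, so the original token boundaries can be
recovered later.

```python
import re

SPLIT = "<SPLIT>"

# Hypothetical rules for illustration only (not taken from en-detokenizer.xml).
ATTACH_LEFT = {".", ",", ")", "''", "'s"}   # tokens glued to the previous token
ATTACH_RIGHT = {"(", "``"}                  # tokens glued to the next token


def detokenize_with_split(line, marker=SPLIT):
    """Join whitespace-separated tokens, writing `marker` where the space is dropped."""
    tokens = line.split()
    out = []
    for i, tok in enumerate(tokens):
        if i > 0:
            prev = tokens[i - 1]
            merged = tok in ATTACH_LEFT or prev in ATTACH_RIGHT
            out.append(marker if merged else " ")
        out.append(tok)
    return "".join(out)


def tokens_from_marked_line(line, marker=SPLIT):
    """Recover the tokens by splitting on whitespace and on the marker."""
    parts = re.split(r"\s+|" + re.escape(marker), line)
    return [p for p in parts if p]


sentence = "was mistakenly attributed to Christina Haag ."
marked = detokenize_with_split(sentence)
print(marked)
print(tokens_from_marked_line(marked))
```

With output of this shape, "Haag." would appear as "Haag<SPLIT>." and
"Chicago's" as "Chicago<SPLIT>'s", which is what I expected to see.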


On 04/19/2012 07:11 PM, Jörn Kottmann wrote:
> On 04/19/2012 06:20 PM, Joan Codina wrote:
>>
>>
>> Then, with the sentences having all tokens separated by spaces, I need to
>> merge the words, adding <space>, but I don't know how to do that with
>> the DictionaryDetokenizer:
>> ./opennlp DictionaryDetokenizer ../models/en-detokenizer.xml 
>> <../models/CoNLL2009-ST-English-train.sent
>>
>> It merges the sentences but does not add the <space>.
>
> It should insert <SPLIT> tags for certain spaces, so the tokenizer can 
> learn
> that there is something to split. Input should be one sentence per line.
>
> What output do you get?
>
> Jörn
