opennlp-users mailing list archives

From Joan Codina <>
Subject Re: abbreviation dictionary format
Date Thu, 19 Apr 2012 16:20:36 GMT
How can I de-tokenize a CoNLL training set?
I have tried some commands, but none of them seems to work.
I ran:

./detokenizer models/CoNLL2009-ST-English-train.txt

where detokenizer is something like:


exec cat "$1" \
  | perl -pe "s/^${TAG}${SEP}(${TAG}).*$/\1/g" \
  | perl -pe "s/^\s*$/\n/g" \
  | perl -pe "s/^$/${SENTENCESEP}/g" \
  | perl -pe "s/\n/ /g" \
  | perl -pe "s/ ${SENTENCESEP} /\n/g"
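The pipeline above extracts the token column, collapses blank lines into a sentence separator, and joins each sentence's tokens with spaces. A self-contained sketch of the same idea, written by me in awk rather than perl, under the assumptions that the token is the first whitespace-separated column and sentences are separated by blank lines:

```shell
#!/bin/sh
# Hypothetical sketch (not the original script): read a CoNLL-style file,
# keep column 1 of each token line, and print one space-joined sentence
# per output line. Blank lines mark sentence boundaries.
conll_to_sentences() {
  awk 'NF { line = line (line == "" ? "" : " ") $1; next }   # token line: append col 1
       { if (line != "") print line; line = "" }             # blank line: flush sentence
       END { if (line != "") print line }' "$@"              # flush a trailing sentence
}
```

With no file argument the function reads standard input, so it can sit in a pipe the same way the perl version does.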

Then, with the sentences having all tokens separated by spaces, I need to 
merge the words, adding <space>, but I don't know how to do that with

./opennlp DictionaryDetokenizer ../models/en-detokenizer.xml

as it merges the sentences but does not add the <space>.
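As I understand it, OpenNLP's tokenizer training data marks the no-space joins with `<SPLIT>` tags rather than marking the spaces themselves. A minimal sketch of that rule, assuming (my simplification) that only right-attaching punctuation is merged; the real dictionary-driven detokenizer handles more cases:

```shell
#!/bin/sh
# Hypothetical sketch: turn a line of space-separated tokens into
# TokenizerTrainer-style input, joining sentence punctuation to the
# preceding token and marking the join with <SPLIT>.
# Assumption: only simple right-attaching punctuation is handled.
tokens_to_training() {
  awk '{
    out = $1
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^[.,;:?)]$/)           # punctuation: merge without a space
        out = out "<SPLIT>" $i
      else                             # ordinary token: keep the space
        out = out " " $i
    }
    print out
  }' "$@"
}
```

So `Hello , world .` would come out as `Hello<SPLIT>, world<SPLIT>.`, which is the shape the trainer expects.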

thanks in advance


On 04/10/2012 04:51 PM, Jörn Kottmann wrote:
> On 04/10/2012 04:44 PM, Joan Codina wrote:
>> But to train the system I only found that file... which is small.

>> which only contains 121 sentences. I don't know if this is enough, or 
>> whether there are other annotated training sets.
> No, that is not enough. Get some training data set for the language 
> you need. Most of the data sets
> referenced in the Corpora section can be used to train the tokenizer. 
> These corpora are already tokenized
> and can be de-tokenized into training data for the tokenizer.
> Jörn
