opennlp-users mailing list archives

From Joan Codina <Joan.Cod...@upf.edu>
Subject Re: abbreviation dictionary format
Date Thu, 19 Apr 2012 16:20:36 GMT
How can I de-tokenize a CoNLL training set?
I have tried some commands but none seem to work.
I did:

./detokenizer.sh models/CoNLL2009-ST-English-train.txt > models/CoNLL2009-ST-English-train.sent
where detokenizer.sh is:

#!/bin/bash

# Extract the FORM column (second tab-separated field) from each CoNLL
# line, then join the tokens of every sentence onto a single line.
SEP="\t"
TAG="[^${SEP}]*"
SENTENCESEP="<SENTENCE123456789SEP>"

cat "$1" \
  | perl -pe "s/^${TAG}${SEP}(${TAG}).*$/\1/g" \
  | perl -pe "s/^\s*$/\n/g" \
  | perl -pe "s/^$/${SENTENCESEP}/g" \
  | perl -pe "s/\n/ /g" \
  | perl -pe "s/ ${SENTENCESEP} /\n/g"


Then, with the sentences having all tokens separated by spaces, I need to
merge the words back, adding the <space> markers, but I don't know how to
do that with the DictionaryDetokenizer:
./opennlp DictionaryDetokenizer ../models/en-detokenizer.xml <../models/CoNLL2009-ST-English-train.sent

as it merges the sentences but does not add the <space> markers.
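
The only workaround I have for now is a rough rule-based one (not the
DictionaryDetokenizer at all): re-attach common punctuation to the
preceding token and mark each join with <SPLIT>, which as far as I can
tell is the tag the OpenNLP tokenizer training format expects. The file
names are the ones from the commands above:

# Rough heuristic: only handles punctuation that attaches to the left.
perl -pe 's/ ([,.;:!?])/<SPLIT>$1/g' \
  <models/CoNLL2009-ST-English-train.sent \
  >models/CoNLL2009-ST-English-train.train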


thanks in advance

Joan.



On 04/10/2012 04:51 PM, Jörn Kottmann wrote:
> On 04/10/2012 04:44 PM, Joan Codina wrote:
>> But to train the system I only found that file... which is small.
>> http://opennlp.cvs.sourceforge.net/viewvc/opennlp/opennlp/src/test/resources/opennlp/tools/tokenize/token.train?view=markup

>>
>> which only contains 121 sentences. i don't know if this is enough or 
>> there are other training annotated models 
>
> No, that is not enough. Get some training data set for the language 
> you need. Most of the data sets
> referenced in the Corpora section can be used to train the tokenizer. 
> These corpora are already tokenized
> and can be de-tokenized into training data for the tokenizer.
>
> Jörn
