ctakes-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: sentence detector newline behavior
Date Wed, 22 May 2013 12:03:00 GMT
On 05/22/2013 01:17 PM, Miller, Timothy wrote:
> That's awesome! It might be worth trying at least. How does the training
> process change? Previously the training data would be one sentence per
> line, but with newlines as possible mid-sentence characters that could
> be trouble, is there a new representation for training data? Or would we
> have to use the training api?

Good point, yes, that will be a problem with the default training format,
but it shouldn't be hard to solve. In the format itself we could define a
newline tag, e.g. <NEWLINE>, to mark newlines.
As a hack to make it work with 1.5.3, you could instead use a special
char as a replacement for the newline char.
When you pass the text down to the sentence detector, a simple string
replace could be used to convert all newline chars to the special
newline marker char.
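
A minimal sketch of that string replace, assuming a marker char that never
occurs in the input. The class name NewlineMarker and the choice of the
pilcrow (U+00B6) as the marker are hypothetical and not part of OpenNLP.
The encoded string is what would be handed to the sentence detector.

```java
final class NewlineMarker {
    // Hypothetical marker char; pick one that never occurs in your text.
    static final char MARKER = '\u00B6';

    // Replace each newline with the marker before sentence detection.
    // This is a one-char-for-one-char swap, so character offsets into
    // the encoded text are also valid for the original text.
    static String encode(String text) {
        return text.replace('\n', MARKER);
    }

    // Restore the original newlines, e.g. in a detected sentence.
    static String decode(String text) {
        return text.replace(MARKER, '\n');
    }

    public static void main(String[] args) {
        String note = "Patient denies\nchest pain. No fever.";
        String encoded = encode(note);
        // The encoded text is what the sentence detector would receive.
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(note));
    }
}
```

Because only single chars are swapped, any sentence spans computed on the
encoded text should line up with the original document.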

If things work out for you performance-wise as well, we will just
integrate it properly into OpenNLP for the next release.

Could you produce a sentence detector training file with a newline
marker char?
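
One way such a file could be produced, as a rough sketch: keep the usual
one-sentence-per-line layout and replace newlines inside a sentence with
the marker. The class TrainingData, its method, and the marker char
(U+00B6) are hypothetical, and the input is assumed to be a list of
already-segmented gold sentences.

```java
import java.util.List;

final class TrainingData {
    // Hypothetical marker char; it must not occur in the corpus itself.
    static final char MARKER = '\u00B6';

    // Build training text with one sentence per line, where newlines
    // inside a sentence are written as the marker char.
    static String toTrainingData(List<String> sentences) {
        StringBuilder out = new StringBuilder();
        for (String sentence : sentences) {
            out.append(sentence.replace('\n', MARKER)).append('\n');
        }
        return out.toString();
    }
}
```

The returned string can be written to a file in whatever encoding is
later given to the trainer.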

You should try to pick a char you can also pass in on a terminal;
otherwise you have to use the API to train the model. The built-in
cross validation could be used to evaluate the performance.

Jörn
