ctakes-dev mailing list archives

From vijay garla <vnga...@gmail.com>
Subject Re: sentence detector newline behavior
Date Mon, 20 Jan 2014 17:45:27 GMT
The OpenNLP sentence detection model used by cTAKES does not split
sentences at newlines; additional logic in the cTAKES sentence
splitter does this (an alternative implementation that does not is in the
ytex branch). AFAIK, no retraining or change to the feature representation is
necessary.

Vj

On Monday, January 20, 2014, Jörn Kottmann <kottmann@gmail.com> wrote:

> Hi all,
>
> I currently have quite a bit of time to work on OpenNLP and would like to
> help you
> out with this issue.
>
> Here is the follow-up issue for this change:
> https://issues.apache.org/jira/browse/OPENNLP-602
>
> I am still trying to figure out the best way to implement
> this.
> In the training data, a user could just use a special tag to identify the
> characters.
>
> Instead of <NEWLINE>, it might be better to use <CR> and <LF> to encode
> these two characters
> in the training data. Any thoughts?
>
> I am planning to release this as part of OpenNLP 1.6.0.
>
> Thanks,
> Jörn
>
> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>
>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>
>>> That's awesome! It might be worth trying at least. How does the training
>>> process change? Previously the training data would be one sentence per
>>> line, but with newlines as possible mid-sentence characters, that could
>>> be trouble. Is there a new representation for training data? Or would we
>>> have to use the training API?
>>>
>>
>> Good point. Yes, that will be a problem with the default training format,
>> but it shouldn't be hard
>> to solve. In the format itself we could define a newline tag, e.g.
>> <NEWLINE>, to mark newlines.
>> As a hack to make it work with 1.5.3, you could instead use a special char
>> as a replacement
>> for the newline char.
>> When you pass the text down to the sentence detector, a simple string
>> replace could be used to
>> convert all newline chars to the special newline marker char.
>>
>> If things work out for you performance-wise as well, we will just
>> integrate it properly into OpenNLP
>> for the next release.
>>
>> Could you produce a sentence detector training file with a newline
>> marker char?
>>
>> You should try to pick a char you can also pass in on a terminal;
>> otherwise you have to use the
>> API to train the model. The built-in cross-validation could be used to
>> evaluate the performance.
>>
>> Jörn
>>
>
>

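The string-replace workaround described in the quoted 05/22/2013 message could look roughly like the sketch below. The class name `NewlineMarker` and both marker characters are hypothetical choices made for illustration; nothing here is part of OpenNLP or cTAKES, and the markers are assumed never to occur in the input text. Each CR and LF is swapped for exactly one marker character, so character offsets in the encoded text still line up with the original document.

```java
// Hypothetical helper, not part of OpenNLP or cTAKES.
// Assumption: neither marker character ever appears in the clinical text.
final class NewlineMarker {
    // Illustrative marker choices; any otherwise unused characters would do.
    static final char LF_MARKER = '\u00B6'; // pilcrow sign stands in for '\n'
    static final char CR_MARKER = '\u00A4'; // currency sign stands in for '\r'

    // Replace each newline char with its marker before sentence detection.
    // The replacement is one char for one char, so offsets are preserved.
    static String encode(String text) {
        return text.replace('\n', LF_MARKER).replace('\r', CR_MARKER);
    }

    // Restore the original newline chars, e.g. for display.
    static String decode(String text) {
        return text.replace(LF_MARKER, '\n').replace(CR_MARKER, '\r');
    }

    public static void main(String[] args) {
        String note = "Patient denies chest\npain. No fever.\r\nPlan: discharge.";
        String encoded = encode(note);
        System.out.println(encoded);
        // Same length, so sentence offsets found on the encoded text
        // can be applied directly to the original note.
        System.out.println(encoded.length() == note.length());
    }
}
```

The same encoding would have to be applied to the training file so that the model sees the markers as ordinary mid-sentence or end-of-sentence characters; the encoded string is then what gets passed to the sentence detector.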