ctakes-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: sentence detector newline behavior
Date Tue, 21 Jan 2014 10:29:17 GMT
Yes, exactly, OPENNLP-602 is about training a sentence detector model 
which can use a newline as an end-of-sentence character.

If you have specific rules for splitting sentences, we should have a look 
at them. The Sentence Detector could be extended to support
a user-provided, rule-based splitter. If there is interest in that, we 
could probably get it into 1.6.0 as well.

Jörn
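
To make the rule-based idea above concrete, here is a minimal sketch of one such rule: take the sentence spans a statistical model returns and split them again at line breaks. It deliberately does not use the OpenNLP API; the class name `NewlineSplitRule`, the method `splitAtNewlines`, and the `int[] {start, end}` span representation are assumptions made for illustration, not existing OpenNLP or cTAKES code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing rule, not part of OpenNLP or cTAKES.
public class NewlineSplitRule {

    /**
     * Splits each [start, end) sentence span at '\n' and '\r' and trims
     * surrounding whitespace. Empty segments are dropped.
     */
    public static List<int[]> splitAtNewlines(String text, List<int[]> sentences) {
        List<int[]> result = new ArrayList<>();
        for (int[] s : sentences) {
            int segStart = s[0];
            for (int i = s[0]; i <= s[1]; i++) {
                boolean atEnd = (i == s[1]);
                // atEnd is checked first so charAt is never called past the span.
                if (atEnd || text.charAt(i) == '\n' || text.charAt(i) == '\r') {
                    addTrimmed(text, segStart, i, result);
                    segStart = i + 1;
                }
            }
        }
        return result;
    }

    // Adds [start, end) to out after trimming whitespace, if anything is left.
    private static void addTrimmed(String text, int start, int end, List<int[]> out) {
        while (start < end && Character.isWhitespace(text.charAt(start))) start++;
        while (end > start && Character.isWhitespace(text.charAt(end - 1))) end--;
        if (start < end) out.add(new int[] {start, end});
    }
}
```

A rule like this always splits at a line break, which is the behavior described further down the thread for the current cTAKES logic. A trained model, as proposed in OPENNLP-602, could instead learn when a newline really ends a sentence.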

On 01/20/2014 10:02 PM, Chen, Pei wrote:
> I presume Joern was suggesting that if he supports new lines in the OpenNLP
> SentenceDetector (either as part of the trained models or as post-processing
> with some rules?), cTAKES will be able to use it out of the box and we should
> be able to remove any additional custom logic that we currently have, which
> seems like a good idea.
>
> [but when to use it within individual cTAKES components such as negation
> might be another discussion?]
> --Pei
>
>> On Jan 20, 2014, at 12:46 PM, "vijay garla" <vngarla@gmail.com> wrote:
>>
>> The OpenNLP sentence detection model used by cTAKES does not split
>> sentences at newlines - there is additional logic in the cTAKES sentence
>> splitter that does this (and an alternative impl that doesn't is in the
>> ytex branch). Afaik no retraining / change to the feature representation is
>> necessary.
>>
>> Vj
>>
>>> On Monday, January 20, 2014, Jörn Kottmann <kottmann@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> currently I have quite a bit of time to work on OpenNLP, and would like to
>>> help you
>>> out with this issue.
>>>
>>> Here is the follow up issue for this change:
>>> https://issues.apache.org/jira/browse/OPENNLP-602
>>>
>>> I am still trying to figure out what would be the best option to implement
>>> this.
>>> In the training data a user could just use a special tag to identify the
>>> chars.
>>>
>>> Instead of <NEWLINE> it might be better to use <CR> and <LF> to encode
>>> these two chars
>>> in the training data. Any thoughts?
>>>
>>> I am planning to release this as part of OpenNLP 1.6.0.
>>>
>>> Thanks,
>>> Jörn
>>>
>>>> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>>>>
>>>>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>>>>
>>>>> That's awesome! It might be worth trying at least. How does the training
>>>>> process change? Previously the training data would be one sentence per
>>>>> line, but with newlines as possible mid-sentence characters that could
>>>>> be trouble. Is there a new representation for training data? Or would we
>>>>> have to use the training API?
>>>> Good point, yes that will be a problem with the default training format,
>>>> but it shouldn't be hard
>>>> to solve. In the format itself we could define a new line tag, e.g.
>>>> <NEWLINE>, to mark new lines.
>>>> As a hack to make it work with 1.5.3 you could instead use a special char
>>>> as a replacement
>>>> for the new line char.
>>>> When you pass the text down to the sentence detector, a simple string
>>>> replace could be used to
>>>> convert all new line chars to the special new line marker char.
>>>>
>>>> If things work out for you performance wise as well we will just
>>>> integrate it properly into OpenNLP
>>>> for the next release.
>>>>
>>>> Could you produce a sentence detector training file with a new line
>>>> marker char?
>>>>
>>>> You should try to pick a char you can also pass in on a terminal,
>>>> otherwise you have to use the
>>>> API to train the model. The built-in cross-validation could be used to
>>>> evaluate the performance.
>>>>
>>>> Jörn
>>>
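
The 1.5.3 workaround from the oldest quoted message (replace each newline char with a special marker char before the text reaches the sentence detector) could look roughly like the sketch below. The class `NewlineMarker` and the two marker characters are assumptions chosen for illustration; any characters that never occur in the corpus and that can be typed on a terminal would do. One marker per char is used so that CR and LF stay distinguishable, in the spirit of the <CR>/<LF> suggestion.

```java
// Hypothetical helper for the marker-char workaround; not an OpenNLP class.
public class NewlineMarker {

    // Assumed marker chars (pilcrow and section sign); pick ones absent from your data.
    public static final char LF_MARKER = '\u00B6';
    public static final char CR_MARKER = '\u00A7';

    /** Replaces every line-break char with its marker so a sentence fits on one line. */
    public static String encode(String text) {
        return text.replace('\n', LF_MARKER).replace('\r', CR_MARKER);
    }

    /** Restores the original line-break chars. */
    public static String decode(String text) {
        return text.replace(LF_MARKER, '\n').replace(CR_MARKER, '\r');
    }
}
```

Because each char is replaced by exactly one char, the encoded text has the same length as the original, so sentence offsets computed on the encoded text should apply unchanged to the original document. The same `encode` step would be applied to each training sentence so that it stays on one line in the default one-sentence-per-line training format.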

