ctakes-dev mailing list archives

From "Masanz, James J." <Masanz.Ja...@mayo.edu>
Subject RE: apostrophe and sentence detector
Date Mon, 26 Aug 2013 16:44:35 GMT
The 7 lines I referred to as "ending with apostrophe" do indeed have an apostrophe followed immediately
by a newline.

In the training data it is indeed very rare for a sentence to end on an apostrophe: 7 out of >400K sentences.

I second your suggestion of removing the apostrophe from the list of sentence-breaking characters.
It is straightforward and cleaner. Thanks
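
For anyone wanting to reproduce a count like the "7 out of >400K" above, here is a minimal sketch. It assumes a training file with one sentence per line, as described in this thread; the function name and the sample lines are hypothetical and are not taken from cTAKES.

```python
def count_apostrophe_endings(lines):
    """Return (lines ending in an apostrophe, total non-empty lines).

    Assumes one sentence per line. Only the line terminator is stripped,
    so a line counts only if the apostrophe is immediately followed by
    the newline.
    """
    ending = 0
    total = 0
    for line in lines:
        sentence = line.rstrip("\r\n")
        if not sentence.strip():
            continue  # skip blank lines
        total += 1
        if sentence.endswith("'"):
            ending += 1
    return ending, total


# Hypothetical sample; for a real file, pass an open file object instead.
sample = ["He said 'no'\n", "Pt's condition stable.\n", "\n"]
print(count_apostrophe_endings(sample))
```

An open text file can be passed in directly, since iterating over it yields one line at a time.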

-----Original Message-----
From: dev-return-1887-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1887-Masanz.James=mayo.edu@ctakes.apache.org]
On Behalf Of Tim Miller
Sent: Monday, August 26, 2013 11:35 AM
To: dev@ctakes.apache.org
Subject: Re: apostrophe and sentence detector

Ah, so we might suspect that some of those 7 lines in the file were 
indeed followed by newlines in the original training data. In the 
absence of more/better training data, which would help us learn this, I 
think it would be reasonable to restore the list of sentence-breaking 
characters to not include apostrophe. Seems like it is rare for a 
sentence to end on it, and my preference is to accidentally call 2 
sentences one sentence, rather than splitting one sentence in the 
middle. I think it's probably better for downstream processing.
Just my .02,
Tim
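
To illustrate the trade-off described above, here is a toy sketch of how the set of sentence-breaking characters determines where a detector can even consider a break. This is not the cTAKES or OpenNLP implementation; the character sets, function name, and example text are assumptions made for illustration.

```python
def candidate_breaks(text, eos_chars):
    """Return offsets of characters that could be considered sentence ends.

    A real detector would then score each candidate with a trained model;
    this sketch only lists the candidate positions.
    """
    return [i for i, ch in enumerate(text) if ch in eos_chars]


# Hypothetical character sets, with and without the apostrophe.
WITH_APOSTROPHE = {".", "?", "!", "'"}
WITHOUT_APOSTROPHE = {".", "?", "!"}

text = "Pt's pain improved. Denies fever."
print(candidate_breaks(text, WITH_APOSTROPHE))
print(candidate_breaks(text, WITHOUT_APOSTROPHE))
```

With the apostrophe in the set, the possessive in "Pt's" becomes a candidate break that the model could get wrong. Without it, that position is never considered, at the cost of never splitting the rare sentence that ends mid-line on an apostrophe.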

On 08/26/2013 12:29 PM, Masanz, James J. wrote:
> The training data is one sentence per line.
> That's how you feed data to the sentence detector.
>
> -----Original Message-----
> From: dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org]
> On Behalf Of Tim Miller
> Sent: Monday, August 26, 2013 11:12 AM
> To: dev@ctakes.apache.org
> Subject: Re: apostrophe and sentence detector
>
>
> On 08/26/2013 12:05 PM, Masanz, James J. wrote:
>> The recently rebuilt sentence detector (currently in trunk and the 3.1.0 branch)
>> is sometimes taking the apostrophe as a sentence break where the ctakes-3.0.0-incubating model
>> didn't.
>>
>> The training data used for the recently rebuilt model contains only 7 lines
>> that end with an apostrophe (single quote)
> Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
> sentence detector will currently break on newlines no matter what, so
> the important number is how many sentences end mid-line with an
> apostrophe, right?
> Tim

