ctakes-dev mailing list archives

From "John Green" <john.travis.gr...@gmail.com>
Subject RE: apostrophe and sentence detector
Date Mon, 26 Aug 2013 19:22:16 GMT
Karthik, well said. There are many differences. I wonder, what do you think about the logical
division of the two sets? Do they share a domain? Is one a subset of the other? I would propose
that it wouldn't be unreasonable to think of clinical notes as a subset of the English
language. It seems to me that Project Gutenberg is a fairly good average of the English language, so
the superset could contribute to the recognition of the subset.

JG

--
Sent from Mailbox for iPhone

On Mon, Aug 26, 2013 at 2:07 PM, Masanz, James J. <Masanz.James@mayo.edu>
wrote:

> The corpus used for cTAKES sentence detection is a combination of some Mayo Clinic clinical
> notes that were manually separated into sentences and the Penn Treebank (Wall Street
> Journal).
> -- James
> -----Original Message-----
> From: dev-return-1889-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1889-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of John Green
> Sent: Monday, August 26, 2013 11:46 AM
> To: dev@ctakes.apache.org
> Subject: Re: apostrophe and sentence detector
> Just out of curiosity, how was the training data originally built? I mean, who separated
> the lines? By hand? Regex?
>
> Question two: has anyone made attempts at adding Project Gutenberg to the training
> data for things like sentence detection? Wide variety of punctuation in the years a lot of
> those books were written.
>
> Trying to piece together how it all works,
> JG
>
> --
> Sent from Mailbox for iPhone
> On Mon, Aug 26, 2013 at 12:35 PM, Tim Miller
> <timothy.miller@childrens.harvard.edu> wrote:
>> Ah, so we might suspect that some of those 7 lines in the file were 
>> indeed followed by newlines in the original training data. In the 
>> absence of more/better training data which would help us learn this I 
>> think it would be reasonable to restore the list of sentence-breaking 
>> characters to not include apostrophe. Seems like it is rare for a 
>> sentence to end on it, and my preference is to accidentally call 2 
>> sentences one sentence, rather than splitting one sentence in the 
>> middle. I think it's probably better for downstream processing.
>> Just my .02,
>> Tim
>> On 08/26/2013 12:29 PM, Masanz, James J. wrote:
>>> The training data is one sentence per line.
>>> That's how you feed data to the sentence detector.
>>>
>>> -----Original Message-----
>>> From: dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Tim Miller
>>> Sent: Monday, August 26, 2013 11:12 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: apostrophe and sentence detector
>>>
>>>
>>> On 08/26/2013 12:05 PM, Masanz, James J. wrote:
>>>> The recently rebuilt sentence detector (currently in trunk and the 3.1.0
>>>> branch) is sometimes taking the apostrophe as a sentence break where the ctakes-3.0.0-incubating
>>>> model didn't.
>>>>
>>>> The training data used for the recently rebuilt model contains only
>>>> 7 lines that end with an apostrophe (single quote).
>>> Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
>>> sentence detector will currently break on newlines no matter what, so
>>> the important number is how many sentences end mid-line with an
>>> apostrophe, right?
>>> Tim
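
The count discussed in this thread (how many training sentences end with an apostrophe) could be checked with something like the sketch below. It assumes only what the thread states: the training data is one sentence per line. The function name, the sample sentences, and the decision to treat the right single quotation mark as an apostrophe are illustrative assumptions, not part of cTAKES. Note that it cannot tell whether a sentence ended mid-line in the original note, which is the distinction Tim raises.

```python
from typing import Iterable, Tuple

# Assumption: treat both the ASCII single quote and the right single
# quotation mark as "apostrophe" for counting purposes.
APOSTROPHES = ("'", "\u2019")


def apostrophe_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentences ending with an apostrophe, apostrophes elsewhere).

    Each non-blank line is assumed to hold exactly one sentence.
    """
    final = 0
    non_final = 0
    for raw in lines:
        sentence = raw.rstrip()  # drop the newline and trailing spaces
        if not sentence:
            continue  # skip blank lines
        total = sum(sentence.count(ch) for ch in APOSTROPHES)
        if sentence.endswith(APOSTROPHES):
            final += 1
            non_final += total - 1  # every apostrophe except the last one
        else:
            non_final += total
    return final, non_final


# Hypothetical sample data, not taken from the real training corpus.
sample = [
    "The patient's pain is described as 'sharp'",
    "Pt denies chest pain.",
    "He doesn't smoke.",
]

if __name__ == "__main__":
    print(apostrophe_stats(sample))
```

A low first number relative to the second would be consistent with Tim's suggestion that sentences rarely end on an apostrophe, though a sample this small shows nothing about the real corpus.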