ctakes-dev mailing list archives

From Karthik Sarma <ksa...@ksarma.com>
Subject Re: apostrophe and sentence detector
Date Mon, 26 Aug 2013 17:12:31 GMT
Hmm, one problem there is that medical records tend to be punctuated
completely differently from normal text in my experience.

--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical
Association
ksarma@ksarma.com
gchat: ksarma@gmail.com
linkedin: www.linkedin.com/in/ksarma


On Mon, Aug 26, 2013 at 9:46 AM, John Green <john.travis.green@gmail.com> wrote:

> Just out of curiosity, how was the training data originally built? I mean,
> who separated the lines? By hand? Regex?
>
> Question two: has anyone made attempts at adding Project Gutenberg to
> the training data for things like sentence detection? Wide variety of
> punctuation in the years a lot of those books were written.
>
> Trying to piece together how it all works,
>
>     JG
>
>     —
> Sent from Mailbox for iPhone
>
> On Mon, Aug 26, 2013 at 12:35 PM, Tim Miller
> <timothy.miller@childrens.harvard.edu> wrote:
>
> > Ah, so we might suspect that some of those 7 lines in the file were
> > indeed followed by newlines in the original training data. In the
> > absence of more/better training data which would help us learn this I
> > think it would be reasonable to restore the list of sentence-breaking
> > characters to not include apostrophe. Seems like it is rare for a
> > sentence to end on it, and my preference is to accidentally call 2
> > sentences one sentence, rather than splitting one sentence in the
> > middle. I think it's probably better for downstream processing.
> > Just my .02,
> > Tim
> > On 08/26/2013 12:29 PM, Masanz, James J. wrote:
> >> The training data is one sentence per line.
> >> That's how you feed data to the sentence detector.
> >>
> >> -----Original Message-----
> >> From: dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Tim Miller
> >> Sent: Monday, August 26, 2013 11:12 AM
> >> To: dev@ctakes.apache.org
> >> Subject: Re: apostrophe and sentence detector
> >>
> >>
> >> On 08/26/2013 12:05 PM, Masanz, James J. wrote:
> >>> The recently rebuilt sentence detector (currently in trunk and the
> >>> 3.1.0 branch) is sometimes taking the apostrophe as a sentence break
> >>> where the ctakes-3.0.0-incubating model didn't.
> >>>
> >>> The training data used for the recently rebuilt model contains only
> >>> 7 lines that end with an apostrophe (single quote).
> >> Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
> >> sentence detector will currently break on newlines no matter what, so
> >> the important number is how many sentences end mid-line with an
> >> apostrophe, right?
> >> Tim
>
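
For anyone following the thread, here is a rough Python sketch of the two things being discussed: counting how many one-sentence-per-line training lines end in an apostrophe, and seeing what happens when the apostrophe is added to or removed from the set of sentence-breaking characters. This is not the cTAKES sentence detector (which uses a trained model); the function names, the `eos_chars` parameter, and the rule "break after a listed character when whitespace or end of line follows" are all assumptions made for illustration. The only behavior taken from the thread is that newlines always end a sentence.

```python
def count_apostrophe_endings(lines):
    """Count lines whose last non-whitespace character is an apostrophe.

    Assumes one sentence per line, as described for the training data.
    """
    return sum(1 for line in lines if line.rstrip().endswith("'"))


def split_sentences(text, eos_chars=".?!"):
    """Toy rule-based splitter (hypothetical, not the cTAKES model).

    A newline always ends a sentence. Within a line, a character from
    eos_chars ends a sentence when it is followed by whitespace or by
    the end of the line.
    """
    sentences = []
    for line in text.splitlines():
        start = 0
        for i, ch in enumerate(line):
            at_end = i + 1 == len(line)
            if ch in eos_chars and (at_end or line[i + 1].isspace()):
                piece = line[start:i + 1].strip()
                if piece:
                    sentences.append(piece)
                start = i + 1
        # Keep any trailing text that did not end in an eos character.
        tail = line[start:].strip()
        if tail:
            sentences.append(tail)
    return sentences


if __name__ == "__main__":
    note = "Pt reports 'no pain' today. Denies fever."
    # Apostrophe excluded from the break characters: two sentences.
    print(split_sentences(note))
    # Apostrophe included: the first sentence gets split in the middle.
    print(split_sentences(note, eos_chars=".?!'"))
```

With the apostrophe in `eos_chars`, the closing quote in the sample note produces a break in the middle of the first sentence, which is the kind of over-splitting Tim suggests avoiding by leaving the apostrophe out of the list.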
