From: Karthik Sarma
Date: Mon, 26 Aug 2013 19:00:02 -0700
Subject: Re: apostrophe and sentence detector
To: "dev@ctakes.apache.org"
In-Reply-To: <1377544936042.cd9b5d@Nodemailer>
References: <996FC801C05DF64A84246A106FACACD0186918@MSGPEXCHA08A.mfad.mfroot.org> <1377544936042.cd9b5d@Nodemailer>

I'd have to disagree that it is a subset of the "English language" found in
books -- for one thing, one finds a great many more sentence fragments and
lists in clinical records. I have no doubt that training on Gutenberg would
yield a reliable sentence detector, but I fear that sentence detector would
be unlikely to perform much better than the existing one.

To be honest, I've started to develop more and more concern about some of
the models used and the data they were trained on. The structure of clinical
records varies dramatically between institutions (and, of course, even
between departments at a single institution). I've found that I have to
remain vigilant about the quality of sentence detection in just about
everything I run.

This might be unavoidable, but perhaps what we need is an annotated set of
clinical documents culled from a variety of institutions. Probably pie in
the sky, though ;)

Karthik

--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical Association
ksarma@ksarma.com
gchat: ksarma@gmail.com
linkedin: www.linkedin.com/in/ksarma

On Mon, Aug 26, 2013 at 12:22 PM, John Green wrote:

> Karthik, well said. There are many differences. I wonder, what do you
> think about the logical division of the two sets? Do they share a domain? Is
> one a subset of the other? I would propose that it wouldn't be unreasonable
> to think of clinical notes as being a subset of the English language. It
> seems to me that Gutenberg is a fairly good average of that English language,
> so the superset could contribute to the recognition of the subset.
>
> JG
>
> --
> Sent from Mailbox for iPhone
>
> On Mon, Aug 26, 2013 at 2:07 PM, Masanz, James J. wrote:
>
> > The corpus used for cTAKES sentence detection is a combination of some
> > Mayo Clinic clinical notes that were manually separated into sentences,
> > combined with the Penn Treebank (Wall Street Journal).
> >
> > -- James
> >
> > -----Original Message-----
> > From: dev-return-1889-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1889-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of John Green
> > Sent: Monday, August 26, 2013 11:46 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: apostrophe and sentence detector
> >
> > Just out of curiosity, how was the training data originally built? I
> > mean, who separated the lines? By hand? Regex?
> >
> > Question two: has anyone made attempts at adding Project Gutenberg
> > to the training data for things like sentence detection? Wide variety of
> > punctuation in the years a lot of those books were written.
> >
> > Trying to piece together how it all works,
> > JG
> >
> > --
> > Sent from Mailbox for iPhone
> >
> > On Mon, Aug 26, 2013 at 12:35 PM, Tim Miller wrote:
> >
> >> Ah, so we might suspect that some of those 7 lines in the file were
> >> indeed followed by newlines in the original training data. In the
> >> absence of more/better training data, which would help us learn this, I
> >> think it would be reasonable to restore the list of sentence-breaking
> >> characters to not include apostrophe. Seems like it is rare for a
> >> sentence to end on it, and my preference is to accidentally call 2
> >> sentences one sentence, rather than splitting one sentence in the
> >> middle. I think it's probably better for downstream processing.
> >>
> >> Just my .02,
> >> Tim
> >>
> >> On 08/26/2013 12:29 PM, Masanz, James J. wrote:
> >>> The training data is one sentence per line.
> >>> That's how you feed data to the sentence detector.
> >>>
> >>> -----Original Message-----
> >>> From: dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Tim Miller
> >>> Sent: Monday, August 26, 2013 11:12 AM
> >>> To: dev@ctakes.apache.org
> >>> Subject: Re: apostrophe and sentence detector
> >>>
> >>>
> >>> On 08/26/2013 12:05 PM, Masanz, James J. wrote:
> >>>> The recently rebuilt sentence detector (currently in trunk and the
> >>>> 3.1.0 branch) is sometimes taking the apostrophe as a sentence break
> >>>> where the ctakes-3.0.0-incubating model didn't.
> >>>>
> >>>> The training data used for the recently rebuilt model contains
> >>>> only 7 lines that end with an apostrophe (single quote)
> >>> Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
> >>> sentence detector will currently break on newlines no matter what, so
> >>> the important number is how many sentences end mid-line with an
> >>> apostrophe, right?
> >>> Tim