From: Olivier Grisel
Date: Fri, 24 Jun 2011 18:42:52 +0200
Subject: Re: OpenNLP Annotations Proposal
To: opennlp-dev@incubator.apache.org

2011/6/24 Jörn Kottmann:
> On 6/24/11 11:54 AM, Olivier Grisel wrote:
>>
>> but we need to agree on a CAS type system first. I don't
>> know the opennlp-uima myself and won't have time to invest more effort
>> on this project before mid-July unfortunately.
>
> I suggest that there are two classes of types in the type system.
>
> The first class contains annotations which describe the input we collect
> from our annotators and are also suitable to document comments and
> disagreements between annotators.
>
> And the second class of annotations contains standard linguistic annotations
> such as sentences, tokens, entities, chunks, parses, etc.
>
> The idea is that the annotations in the second class can automatically be
> derived from the annotations in the first class. In case the article is not
> completely labeled, the statistical models could fill the gap.
>
> For example, we could ask the annotators to label token splits; from these
> token splits we can derive the actual token annotations. For English texts
> the annotation UI could make use of the alpha num optimization and only
> ask the user for questionable token splits.
>
> A similar approach could be used for sentence annotations.
>
> For named entity annotations the user could do BIO style token labeling
> through a special UI, similar to the one in Walter. The BIO labels can then
> be used to compute the name spans.
>
> Our models can either be trained directly on the derived annotations, or we
> add a sentence level annotation where users need to confirm that the entire
> sentence is labeled correctly, for example that all person annotations are
> marked in this sentence.

I like the ability to move the UI focus from one sentence to another
and to mark a complete sentence as validated.

+1 for the rest of your proposal.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
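
The BIO step mentioned in the quoted proposal (token-level BIO labels used to
compute the name spans) could look roughly like the Python sketch below. The
function name bio_to_spans, the "B-type" / "I-type" / "O" label spelling, and
the lenient handling of a stray "I-" tag are assumptions made for illustration;
none of them are taken from OpenNLP, opennlp-uima, or Walter.

```python
from typing import List, Optional, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(labels: List[str]) -> List[Span]:
    """Derive name spans from per-token BIO labels (hypothetical helper)."""
    spans: List[Span] = []
    start: Optional[int] = None
    current_type: Optional[str] = None

    for i, label in enumerate(labels):
        if label == "O":
            prefix, etype = "O", None
        else:
            # "B-person" -> ("B", "person"); split only on the first hyphen
            prefix, _, etype = label.partition("-")

        # An I- tag continues the open span only if the entity type matches.
        continues = (
            prefix == "I" and current_type is not None and etype == current_type
        )

        # Anything else closes the span that is currently open.
        if current_type is not None and not continues:
            spans.append((start, i, current_type))
            start, current_type = None, None

        # B- always opens a span; a stray I- with no open span is treated
        # as a span start too (assumption: be lenient with annotator input).
        if prefix == "B" or (prefix == "I" and current_type is None):
            start, current_type = i, etype

    # Close a span that runs to the end of the sentence.
    if current_type is not None:
        spans.append((start, len(labels), current_type))
    return spans


tokens = ["Olivier", "Grisel", "visited", "Berlin", "."]
labels = ["B-person", "I-person", "O", "B-location", "O"]
for begin, end, etype in bio_to_spans(labels):
    print(etype, tokens[begin:end])
```

Spans are expressed as token offsets with an exclusive end, so they can be
mapped onto token annotations afterwards; a sentence-level "validated" flag
would then simply state that the derived spans for that sentence are complete.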