From: Olivier Grisel <olivier.grisel@ensta.org>
Date: Fri, 24 Jun 2011 18:42:52 +0200
Subject: Re: OpenNLP Annotations Proposal
To: opennlp-dev@incubator.apache.org

2011/6/24 Jörn Kottmann <kottmann@gmail.com>:
> On 6/24/11 11:54 AM, Olivier Grisel wrote:
>>
>> but we need to agree on a CAS type system first. I don't
>> know opennlp-uima myself and won't have time to invest more effort
>> in this project before mid-July, unfortunately.
>
> I suggest that there are two classes of types in the type system.
>
> The first class contains annotations which describe the input we collect
> from our annotators and are also suitable to document comments and
> disagreements between annotators.
>
> The second class of annotations contains standard linguistic annotations
> such as sentences, tokens, entities, chunks, parses, etc.
>
> The idea is that the annotations in the second class can be derived
> automatically from the annotations in the first class. In case the
> article is not completely labeled, the statistical models could fill
> the gap.
>
> For example, we could ask the annotators to label token splits; from
> these token splits we can derive the actual token annotations. For
> English texts the annotation UI could make use of the alphanumeric
> optimization and only ask the user for questionable token splits.
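
To make the token-split idea concrete, here is a minimal sketch of how
token annotations might be derived from annotator-confirmed splits. It
is only an illustration under assumptions: the class and method names
(TokenSplitSketch, deriveTokens, needsReview) are hypothetical and not
part of OpenNLP, splits are assumed to be character offsets, and
"questionable" is assumed to mean a whitespace-delimited chunk that
contains at least one non-alphanumeric character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not an OpenNLP class.
class TokenSplitSketch {

    /**
     * Derives {start, end} character offsets (end exclusive) for tokens.
     * Whitespace always separates tokens; each offset in splits marks an
     * additional boundary confirmed by an annotator.
     */
    static List<int[]> deriveTokens(String text, Set<Integer> splits) {
        List<int[]> tokens = new ArrayList<>();
        int start = -1; // start of the open token, -1 while outside a token
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            boolean space = !atEnd && Character.isWhitespace(text.charAt(i));
            // Close the open token at whitespace, end of text, or a confirmed split.
            if (start >= 0 && (atEnd || space || splits.contains(i))) {
                tokens.add(new int[] {start, i});
                start = -1;
            }
            // Open a new token at the next non-whitespace character.
            if (start < 0 && !atEnd && !space) {
                start = i;
            }
        }
        return tokens;
    }

    /**
     * Assumed reading of the alphanumeric optimization: a chunk made only
     * of letters and digits is taken as one token without asking the user.
     */
    static boolean needsReview(String chunk) {
        for (int i = 0; i < chunk.length(); i++) {
            if (!Character.isLetterOrDigit(chunk.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```

With the text "Hello, world." and confirmed splits at offsets 5 and 12,
deriveTokens yields four tokens; needsReview would flag "Hello," and
"world." for the annotator and skip a plain chunk such as "Hello".
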
>
> A similar approach could be done for sentence annotations.
>
> For named entity annotations the user could do BIO-style token labeling
> through a special UI, similar to the one in Walter. The BIO labels can
> then be used to compute the name spans.
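
The BIO-to-span step could look roughly like the sketch below. The
names (BioSpanSketch, NameSpan, toSpans) are hypothetical, and the label
format "B-type" / "I-type" / "O" is an assumption; the proposal does not
fix one. Spans are expressed in token indices with an exclusive end. A
stray "I-" label with no open span of the same type is treated as the
start of a new span, which is one possible policy among several.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP class.
class BioSpanSketch {

    /** A typed span over token indices, end exclusive. */
    static final class NameSpan {
        final int start;
        final int end;
        final String type;

        NameSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Converts one BIO label per token into typed name spans. */
    static List<NameSpan> toSpans(String[] labels) {
        List<NameSpan> spans = new ArrayList<>();
        int start = -1;     // index of the first token of the open span
        String type = null; // entity type of the open span
        // Iterate one step past the end with a virtual "O" to flush the last span.
        for (int i = 0; i <= labels.length; i++) {
            String label = i < labels.length ? labels[i] : "O";
            boolean continues = start >= 0 && label.equals("I-" + type);
            // Anything other than a matching I- label closes the open span.
            if (start >= 0 && !continues) {
                spans.add(new NameSpan(start, i, type));
                start = -1;
                type = null;
            }
            // B- always opens a span; a stray I- is tolerated and opens one too.
            if (start < 0 && (label.startsWith("B-") || label.startsWith("I-"))) {
                start = i;
                type = label.substring(2);
            }
        }
        return spans;
    }
}
```

For the labels O, B-person, I-person, O, B-location this produces a
person span over tokens 1 to 3 and a location span over tokens 4 to 5.
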
>
> Our models can either be trained directly on the derived annotations, or
> we add a sentence-level annotation where users need to confirm that the
> entire sentence is labeled correctly, for example that all person
> annotations are marked in this sentence.

I like the ability to move the UI focus from one sentence to another
and being able to mark a complete sentence as validated. +1 for the
rest of your proposal.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel