opennlp-dev mailing list archives

From Jason Baldridge <jasonbaldri...@gmail.com>
Subject Re: OpenNLP Annotations Proposal
Date Wed, 22 Jun 2011 21:12:22 GMT
I defer to all of you on the specifics of the annotation infrastructure.
Great to see this moving forward!

One thing to throw in is that we may be able to take advantage of existing
resources to bootstrap initial components. For example, I'm working with a
student to bootstrap multilingual POS taggers using Wiktionary as the tag
dictionary and a combination of label propagation and HMMs. The resulting
taggers will have lots of errors, but could be a useful starting point.
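
As a rough illustration of how a tag dictionary can constrain an HMM tagger,
here is a minimal sketch. It is not the method used in the work mentioned
above: label propagation is left out entirely, and the function name, the
probability tables, the "<s>" start symbol, and the floor value are all
assumptions made up for the example.

```python
import math


def viterbi_with_tag_dictionary(words, tag_dict, all_tags, trans, emit, floor=1e-6):
    """Best tag sequence under a bigram HMM, restricted by a tag dictionary.

    Hypothetical inputs: tag_dict maps a word to its permitted tags,
    trans maps (previous_tag, tag) to a probability, and emit maps
    (tag, word) to a probability. Missing entries fall back to `floor`.
    Words absent from tag_dict may take any tag in all_tags.
    """
    if not words:
        return []

    def allowed(word):
        return sorted(tag_dict.get(word, all_tags))

    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = (log score of the best path ending in tag t at i, previous tag)
    best = [{}]
    for t in allowed(words[0]):
        best[0][t] = (logp(trans, ("<s>", t)) + logp(emit, (t, words[0])), None)

    for i in range(1, len(words)):
        column = {}
        for t in allowed(words[i]):
            score, prev = max(
                (best[i - 1][p][0] + logp(trans, (p, t)), p) for p in best[i - 1]
            )
            column[t] = (score + logp(emit, (t, words[i])), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    tags = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        tags.append(tag)
    return list(reversed(tags))


# Toy data, invented for the example.
TAGS = {"DET", "NOUN", "VERB"}
TAG_DICT = {"the": {"DET"}, "can": {"NOUN", "VERB"}}
TRANS = {("<s>", "DET"): 0.8, ("DET", "NOUN"): 0.9,
         ("DET", "VERB"): 0.05, ("NOUN", "VERB"): 0.8}
EMIT = {("DET", "the"): 0.9, ("NOUN", "can"): 0.3,
        ("VERB", "can"): 0.3, ("VERB", "rusts"): 0.2}

print(viterbi_with_tag_dictionary(["the", "can", "rusts"], TAG_DICT, TAGS, TRANS, EMIT))
```

The dictionary only prunes the search space: "the" can never be tagged as a
verb, while a word missing from the dictionary ("rusts" here) stays open to
every tag. Errors in the dictionary therefore carry straight into the output.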

+1 for interest in Spark.

Jason


On Wed, Jun 22, 2011 at 2:59 PM, Jörn Kottmann <kottmann@gmail.com> wrote:

> On 6/22/11 8:13 PM, Hannes Korte wrote:
>
>> On 06/22/2011 07:53 PM, Olivier Grisel wrote:
>>
>>> 2011/6/22 Jörn Kottmann<kottmann@gmail.com>:
>>>
>>>> On 6/22/11 6:50 PM, Olivier Grisel wrote:
>>>>
>>>>> I am ok with switching to UIMA CAS. We might need additional metadata
>>>>> outside of the CAS annotations though. For instance, if an annotator
>>>>> fixes a typo in the Sofa itself, we might need to be able to tell
>>>>> that Sofa1 is subject to being replaced by Sofa2 according to
>>>>> annotator A1.
>>>>>
>>>> I am not sure we should fix such mistakes; the system will also
>>>> encounter them in the real data it needs to process. Fixing typos or
>>>> correcting things in the text is always difficult when there are
>>>> already existing annotations.
>>>>
>>>> Do you feel fixing mistakes in the text is important?
>>>>
>>> We can leave that issue as a low priority discussion for later and
>>> just ignore it for now.
>>>
>>>
>>>> We can also handle this by having an option to delete "garbage" texts
>>>> from the corpus.
>>>>
>>> Yes, discarding a whole CAS. But if the CAS is document level instead
>>> of sentence level, that might be an issue.
>>>
>> Let's say we have a CAS type Sentence, which will not be changed, and
>> another type AnnotatedSentence. Each time a user annotates a sentence,
>> a new AnnotatedSentence annotation is created over the same span,
>> containing information about the user and the state of the sentence
>> (e.g. correct, unsure, or discarded). This way we can store all that
>> without the need for changes to the Sofa. Alternatively, each Sentence
>> could have a List of something like AnnotationMetadata.
>>
>
> The only reason to change a Sofa is when the user wants to change the text
> itself, right? How would the AnnotatedSentence annotation do that?
> Would it just store the changed text as a string feature?
>
>
>>>> I believe the Corpus server should be independent of the other components
>>>> and define some kind of remote API for data interchange.
>>>>
>>> Is there a JSON version of XMI? Hannes, what is your opinion on this?
>>>
>> A separate corpus server sounds good to me. But this server can simply
>> deliver the default XMI representation of the CASes. I think the
>> documents have to be preprocessed for annotation on the server side of
>> the WebGUI anyway. The JS client should not call the corpus server
>> directly.
>>
> +1
>
> Jörn
>
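
To make the proposal quoted above concrete (a Sentence that is never changed,
plus one AnnotatedSentence per user judgment over the same span), here is a
minimal sketch in plain Python. It does not use the UIMA API; the class names
Document and State, the method names, and the sample text are assumptions
invented for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    CORRECT = "correct"
    UNSURE = "unsure"
    DISCARDED = "discarded"


@dataclass(frozen=True)
class Sentence:
    """Immutable span over the text; never edited after creation."""
    begin: int
    end: int


@dataclass(frozen=True)
class AnnotatedSentence:
    """One user's judgment, recorded over the same span as a Sentence."""
    begin: int
    end: int
    user: str
    state: State


@dataclass
class Document:
    sofa: str  # the text itself; judgments never modify it
    sentences: list = field(default_factory=list)
    judgments: list = field(default_factory=list)

    def judge(self, sentence, user, state):
        # Append a new record instead of updating the Sentence in place.
        record = AnnotatedSentence(sentence.begin, sentence.end, user, state)
        self.judgments.append(record)
        return record

    def judgments_for(self, sentence):
        span = (sentence.begin, sentence.end)
        return [j for j in self.judgments if (j.begin, j.end) == span]

    def covered_text(self, span):
        return self.sofa[span.begin:span.end]


doc = Document("I saw teh cat. It ran.")
first = Sentence(0, 14)
doc.sentences.append(first)
doc.judge(first, "alice", State.UNSURE)
doc.judge(first, "bob", State.DISCARDED)
print([(j.user, j.state.value) for j in doc.judgments_for(first)])
```

Because every judgment is a separate record, conflicting opinions from
different users can coexist and the text stays byte-for-byte stable. The
sketch does not answer the question of how a corrected text would be stored;
a string field on AnnotatedSentence would be one option.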



-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
