opennlp-dev mailing list archives

From Jason Baldridge <jasonbaldri...@gmail.com>
Subject Re: OpenNLP Annotations Proposal
Date Wed, 08 Jun 2011 16:36:50 GMT
+1 This is awesome.

Here is a tool that could be relevant in getting the ball rolling on some
datasets:

http://code.google.com/p/dualist/

Jason

On Tue, Jun 7, 2011 at 12:58 PM, Chris Collins <chris_j_collins@yahoo.com> wrote:

> Thanks Jörn, I agree with your assessment.  This is exactly where I am at
> the moment, and I am sure many others are too.  You hit the nail on the head:
> currently people have to start from scratch, and that's daunting.  For the
> phase when you start crowdsourcing, I am wondering what this web-based UI
> would look like.  I am assuming that, with some basic instructions, things
> like:
>
> - sentence boundary markup
> - name identification (people, money, dates, locations, products)
>
> These are narrowly focused, crowd-sourceable tasks with a fairly trivial UI
> ("For the following sentences, highlight names of people, such as 'Steve
> Jobs' or 'Prince William'").
>
> When it comes to POS tagging (which is my current challenge), you can
> approach it like the above ("For the following sentences, select all the
> nouns"), re-assemble all the observations, and perhaps use something like
> triple judgements to look for disagreement.  Or you could have an editor that
> lets a user mark up the whole sentence (perhaps with the parts we can already
> guess filled in from a pre-trained model).  I am not sure the triple
> judgement is necessary; maybe sentences labeled by a mixed pool of people
> would still converge well in training.
>
> Both can be assisted by some previously trained model to help keep people
> awake and on track :-}  I think you mentioned in an earlier mail that we can
> even use the models that were built with proprietary data to bootstrap the
> assistance process.
>
> These are two ends of the spectrum: one assumes people with limited
> language skills, the other people who are potentially much more competent.
> With one you need to gather data from many more people; with the other, far
> fewer.  Personally I like the crowd-sourced approach, but I wonder whether
> OpenNLP could find enough language "experts" per language that it would make
> better sense to build a non-web-based app that is a little more expedient to
> operate.
>
> For giggles, assuming we needed to generate labels:
> 60k lines of text
> average words per line == 11
> number of judgements per word == 3
>
> We would be collecting almost 2M judgements (60k * 11 * 3 = 1.98M) from
> people, which we would reassemble into our training data after throwing out
> the bath water.
>
> Maybe in the competent-language-expert case each sentence gets judged only
> once, by one person.  There is then perhaps no labeled sentence to be
> re-assembled, but we may want to keep people's judgements separate so we
> could validate their work against others'.
>
> The data-processing pipeline looks somewhat different in each case; the
> competent-POS-labeler case simplifies it greatly.
>
> I would love to help in whatever way I can, and can also find people to help
> label data at my company's own expense to help accelerate this.
>
> Best
>
> C
>
> On Jun 7, 2011, at 7:26 AM, Jörn Kottmann wrote:
>
> > Hi all,
> >
> > based on some discussion we had in the past I put together
> > a short proposal for a community based labeling project.
> >
> > Here is the link:
> > https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
> >
> > Any comments and opinions are very welcome.
> >
> > Thanks,
> > Jörn
>
>
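[Editor's note: to make the triple-judgement idea and the volume arithmetic in Chris's message concrete, here is a minimal sketch. All names are hypothetical and not part of any OpenNLP API; it simply majority-votes three per-token judgements and flags full disagreement for adjudication.]

```python
from collections import Counter

def aggregate(judgements):
    """Majority vote over a token's judgements; None signals no majority,
    i.e. a token that should be routed to an expert for adjudication."""
    label, count = Counter(judgements).most_common(1)[0]
    return label if count >= 2 else None

# Volume estimate from the message: 60k lines, ~11 words per line,
# 3 judgements per word.
lines, words_per_line, judgements_per_word = 60_000, 11, 3
total = lines * words_per_line * judgements_per_word
print(total)  # 1980000 -- "almost 2M judgements"

# Two annotators agree, one disagrees: keep the majority label.
print(aggregate(["NN", "NN", "VB"]))  # NN
# All three disagree: no majority, flag for review.
print(aggregate(["NN", "VB", "JJ"]))  # None
```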


-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
