opennlp-dev mailing list archives

From Joern Kottmann <kottm...@gmail.com>
Subject Re: Performances of OpenNLP tools
Date Mon, 04 Jul 2016 09:46:47 GMT
You should get a copy of OntoNotes (it is free), and OpenNLP already has
support for training models on it.
So the entry barrier to get started with this corpus is very low.
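
For reference, training and evaluating a name finder on OntoNotes with the OpenNLP CLI looks roughly like the sketch below. This is a hedged example: the exact format name and flag names can vary between OpenNLP versions, so verify them with `opennlp TokenNameFinderTrainer help` on your release, and note that the directory paths shown are placeholders.

```shell
# Sketch only -- flag names and paths are assumptions; verify with
# `opennlp TokenNameFinderTrainer help` for your OpenNLP version.
# OntoNotes itself must be obtained separately (free of charge via the LDC).

# Train a name-finder model directly from an OntoNotes directory:
opennlp TokenNameFinderTrainer.ontonotes \
    -lang eng \
    -ontoNotesDir ./ontonotes-release/data \
    -model en-ner-ontonotes.bin

# Estimate model quality via cross validation on the same corpus:
opennlp TokenNameFinderCrossValidator.ontonotes \
    -lang eng \
    -ontoNotesDir ./ontonotes-release/data
```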

Jörn

On Wed, Jun 29, 2016 at 11:22 AM, Anthony Beylerian <anthony.beylerian@gmail.com> wrote:

> How about we keep track of the sets used for performance evaluation and
> results in this doc for now:
>
>
> https://docs.google.com/spreadsheets/d/15c0-u61HNWfQxiDSGjk49M1uBknIfb-LkbP4BDWTB5w/edit?usp=sharing
>
> Will try to take a better look at OntoNotes and what to use from it.
> Otherwise, if anyone would like to suggest proper data sets for testing
> each component, that would be really helpful.
>
> Anthony
>
> On Thu, Jun 23, 2016 at 12:18 AM, Joern Kottmann <kottmann@gmail.com>
> wrote:
>
> > It would be nice to get MASC support into the OpenNLP formats package.
> >
> > Jörn
> >
> > On Tue, Jun 21, 2016 at 6:18 PM, Jason Baldridge <jasonbaldridge@gmail.com> wrote:
> >
> > > Jörn is absolutely right about that. Another good source of training
> > > data is MASC. I've got some instructions for training models with MASC
> > > here:
> > >
> > > https://github.com/scalanlp/chalk/wiki/Chalk-command-line-tutorial
> > >
> > > Chalk (now defunct) provided a Scala wrapper around OpenNLP
> > > functionality, so the instructions there should make it fairly
> > > straightforward to adapt MASC data to OpenNLP.
> > >
> > > -Jason
> > >
> > > On Tue, 21 Jun 2016 at 10:46, Joern Kottmann <kottmann@gmail.com> wrote:
> > >
> > > > There are some research papers which study and compare the performance
> > > > of NLP toolkits, but be careful: they often don't train the NLP tools
> > > > on the same data, and the training data makes a big difference to the
> > > > performance.
> > > >
> > > > Jörn
> > > >
> > > > On Tue, Jun 21, 2016 at 5:44 PM, Joern Kottmann <kottmann@gmail.com>
> > > > wrote:
> > > >
> > > > > Just don't use the very old existing models; to get good results
> > > > > you have to train on your own data, especially if the domain of the
> > > > > data used for training and the data which should be processed
> > > > > doesn't match. The old models are trained on 90s news; those don't
> > > > > work well on today's news, and probably much worse on tweets.
> > > > >
> > > > > OntoNotes is a good place to start if the goal is to process news.
> > > > > OpenNLP comes with built-in support to train models from OntoNotes.
> > > > >
> > > > > Jörn
> > > > >
> > > > > On Tue, Jun 21, 2016 at 4:20 PM, Mattmann, Chris A (3980) <
> > > > > chris.a.mattmann@jpl.nasa.gov> wrote:
> > > > >
> > > > >> This sounds like a fantastic idea.
> > > > >>
> > > > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >> Chris Mattmann, Ph.D.
> > > > >> Chief Architect
> > > > >> Instrument Software and Science Data Systems Section (398)
> > > > >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > > >> Office: 168-519, Mailstop: 168-527
> > > > >> Email: chris.a.mattmann@nasa.gov
> > > > >> WWW:  http://sunset.usc.edu/~mattmann/
> > > > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >> Director, Information Retrieval and Data Science Group (IRDS)
> > > > >> Adjunct Associate Professor, Computer Science Department
> > > > >> University of Southern California, Los Angeles, CA 90089 USA
> > > > >> WWW: http://irds.usc.edu/
> > > > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >>
> > > > >> On 6/21/16, 12:13 AM, "Anthony Beylerian" <anthonybeylerian@hotmail.com> wrote:
> > > > >>
> > > > >> >+1
> > > > >> >
> > > > >> >Maybe we could put the results of the evaluator tests for each
> > > > >> >component somewhere on a webpage, and update them on every release.
> > > > >> >This is of course provided there are reasonable data sets for
> > > > >> >testing each component.
> > > > >> >What do you think?
> > > > >> >
> > > > >> >Anthony
> > > > >> >
> > > > >> >> From: mondher.bouazizi@gmail.com
> > > > >> >> Date: Tue, 21 Jun 2016 15:59:47 +0900
> > > > >> >> Subject: Re: Performances of OpenNLP tools
> > > > >> >> To: dev@opennlp.apache.org
> > > > >> >>
> > > > >> >> Hi,
> > > > >> >>
> > > > >> >> Thank you for your replies.
> > > > >> >>
> > > > >> >> Please, Jeffrey, accept my apologies once more for the email
> > > > >> >> arriving twice.
> > > > >> >>
> > > > >> >> I also think it would be great to have such studies on the
> > > > >> >> performance of OpenNLP.
> > > > >> >>
> > > > >> >> I have been looking for this information and checked in many
> > > > >> >> places, including obviously Google Scholar, and I haven't found
> > > > >> >> any serious studies or reliable results. Most of the existing
> > > > >> >> ones report the performance of outdated releases of OpenNLP, and
> > > > >> >> focus more on the execution time or CPU/RAM consumption, etc.
> > > > >> >>
> > > > >> >> I think such a comparison will help not only evaluate the
> > > > >> >> overall accuracy, but also highlight the issues with the
> > > > >> >> existing models (as a matter of fact, the existing models fail
> > > > >> >> to recognize many of the hashtags in tweets: the tokenizer
> > > > >> >> splits them into the "#" symbol and a word that the PoS tagger
> > > > >> >> also fails to recognize).
> > > > >> >>
> > > > >> >> Therefore, building Twitter-based models would also be useful,
> > > > >> >> since many of the works in academia and industry are focusing
> > > > >> >> on Twitter data.
> > > > >> >>
> > > > >> >> Best regards,
> > > > >> >>
> > > > >> >> Mondher
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> On Tue, Jun 21, 2016 at 12:45 AM, Jason Baldridge <jasonbaldridge@gmail.com> wrote:
> > > > >> >>
> > > > >> >> > It would be fantastic to have these numbers. This is an
> > > > >> >> > example of something that would be a great contribution by
> > > > >> >> > someone trying to contribute to open source and who is maybe
> > > > >> >> > just getting into machine learning and natural language
> > > > >> >> > processing.
> > > > >> >> >
> > > > >> >> > For Twitter-ish text, it'd be great to look at models trained
> > > > >> >> > and evaluated on the Tweet NLP resources:
> > > > >> >> >
> > > > >> >> > http://www.cs.cmu.edu/~ark/TweetNLP/
> > > > >> >> >
> > > > >> >> > And comparing to how their models performed, etc. Also, it's
> > > > >> >> > worth looking at spaCy (Python NLP modules) for further
> > > > >> >> > comparisons.
> > > > >> >> >
> > > > >> >> > https://spacy.io/
> > > > >> >> >
> > > > >> >> > -Jason
> > > > >> >> >
> > > > >> >> > On Mon, 20 Jun 2016 at 10:41, Jeffrey Zemerick <jzemerick@apache.org> wrote:
> > > > >> >> >
> > > > >> >> > > I saw the same question on the users list on June 17. At
> > > > >> >> > > least I thought it was the same question -- sorry if it
> > > > >> >> > > wasn't.
> > > > >> >> > >
> > > > >> >> > > On Mon, Jun 20, 2016 at 11:37 AM, Mattmann, Chris A (3980) <chris.a.mattmann@jpl.nasa.gov> wrote:
> > > > >> >> > >
> > > > >> >> > > > Well, hold on. He sent that mail (as of the time of this
> > > > >> >> > > > mail) 4 mins previously. Maybe some folks need some time
> > > > >> >> > > > to reply ^_^
> > > > >> >> > > >
> > > > >> >> > > > On 6/20/16, 8:23 AM, "Jeffrey Zemerick" <jzemerick@apache.org> wrote:
> > > > >> >> > > >
> > > > >> >> > > > >Hi Mondher,
> > > > >> >> > > > >
> > > > >> >> > > > >Since you didn't get any replies, I'm guessing no one is
> > > > >> >> > > > >aware of any resources related to what you need. Google
> > > > >> >> > > > >Scholar is a good place to look for papers referencing
> > > > >> >> > > > >OpenNLP and its methods (in case you haven't searched it
> > > > >> >> > > > >already).
> > > > >> >> > > > >
> > > > >> >> > > > >Jeff
> > > > >> >> > > > >
> > > > >> >> > > > >On Mon, Jun 20, 2016 at 11:19 AM,
Mondher Bouazizi <
> > > > >> >> > > > >mondher.bouazizi@gmail.com> wrote:
> > > > >> >> > > > >
> > > > >> >> > > > >> Hi,
> > > > >> >> > > > >>
> > > > >> >> > > > >> Apologies if you received multiple copies of this
> > > > >> >> > > > >> email. I sent it to the users list a while ago, and
> > > > >> >> > > > >> haven't had an answer yet.
> > > > >> >> > > > >>
> > > > >> >> > > > >> I have been looking for a while to see whether there is
> > > > >> >> > > > >> any relevant work that performed tests on the OpenNLP
> > > > >> >> > > > >> tools (in particular the Lemmatizer, Tokenizer and PoS
> > > > >> >> > > > >> Tagger) when used with short and noisy texts such as
> > > > >> >> > > > >> Twitter data, etc., and/or compared them to other
> > > > >> >> > > > >> libraries.
> > > > >> >> > > > >>
> > > > >> >> > > > >> By performance, I mean accuracy/precision rather than
> > > > >> >> > > > >> execution time, etc.
> > > > >> >> > > > >>
> > > > >> >> > > > >> If anyone can refer me to a paper or work done in this
> > > > >> >> > > > >> context, that would be of great help.
> > > > >> >> > > > >>
> > > > >> >> > > > >> Thank you very much.
> > > > >> >> > > > >>
> > > > >> >> > > > >> Mondher
> > > > >> >> > > > >>
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
