opennlp-dev mailing list archives

From Joern Kottmann <kottm...@gmail.com>
Subject Re: Question about deprecated NameFinderME constructors
Date Tue, 08 Mar 2016 10:49:31 GMT
There is a custom XML element that can load a user-defined class
for feature generation.

So you would add an element like this:
<custom
class="com.x.y.AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator"/>

https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
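
A minimal sketch of what such a user-defined class might look like is given
here. To keep it compilable without the OpenNLP jar, it declares a local
stand-in interface (FeatureGeneratorSketch, a hypothetical name) whose
createFeatures method is meant to mirror the one on AdaptiveFeatureGenerator;
a real class would implement the OpenNLP interface instead. The class name
SuffixFeatureGenerator and the "sufN=" feature strings are also assumptions
for illustration, and plain character suffixes stand in for real
morphological analysis.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for OpenNLP's AdaptiveFeatureGenerator (assumption: same
// createFeatures shape), so this sketch compiles without the library.
interface FeatureGeneratorSketch {
    void createFeatures(List<String> features, String[] tokens, int index,
                        String[] previousOutcomes);
}

// Hypothetical generator: emits the last 1..maxLength characters of the
// current token as features. Not a real morphological analyzer.
class SuffixFeatureGenerator implements FeatureGeneratorSketch {
    private final int maxLength;

    // A class loaded by name from the XML descriptor presumably needs a
    // no-argument constructor; 3 is an arbitrary default.
    SuffixFeatureGenerator() {
        this(3);
    }

    SuffixFeatureGenerator(int maxLength) {
        this.maxLength = maxLength;
    }

    @Override
    public void createFeatures(List<String> features, String[] tokens, int index,
                               String[] previousOutcomes) {
        String token = tokens[index];
        // Never ask for a suffix longer than the token itself.
        int limit = Math.min(maxLength, token.length());
        for (int len = 1; len <= limit; len++) {
            features.add("suf" + len + "=" + token.substring(token.length() - len));
        }
    }
}

class SuffixFeatureDemo {
    public static void main(String[] args) {
        List<String> features = new ArrayList<>();
        new SuffixFeatureGenerator().createFeatures(
                features, new String[] {"running"}, 0, new String[0]);
        System.out.println(features);
    }
}
```

With the real interface in place of the stand-in, the fully qualified name of
the class is what would go into the class attribute of the custom element.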

I think we should remove the deprecated training methods so it is no longer
possible to train models which can't be loaded.

Jörn

On Mon, Mar 7, 2016 at 6:45 PM, Cohan Sujay Carlos <cohan@aiaioo.com> wrote:

> Dear Rodrigo,
>
> Thank you for the informative reply.
>
> I just wanted to say I feel there is a use-case that the new constructor
> still does not support.  Let me explain with an example.
>
> Let's first take the example of brown-feature.xml, which is defined as ...
>
> <generators>
>   <cache>
>     <generators>
>       <window prevLength = "2" nextLength = "2">
>         <token/>
>       </window>
>       <window prevLength = "2" nextLength = "2">
>         <brownclustertoken dict="brownBllipClusters" />
>       </window>
>     </generators>
>   </cache>
> </generators>
>
> ... In this feature generator, I believe "window" maps to the
> WindowFeatureGenerator
> <
> https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/WindowFeatureGenerator.html
> >
> and "token" maps to TokenFeatureGenerator
> <
> https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/TokenFeatureGenerator.html
> >
> .
>
> It's clear that we can create new feature generators that are combinations
> of existing feature generators.
>
> However, let's say I have a task / language where none of the existing
> feature generators or combinations work very well.
>
> Say, for example, that I want to create a new feature generator that pulls
> out morphemes from agglutinative South Indian languages ... let's call it
> "AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator".
>
> It's not clear how one could create XML tags for this feature generator
> using the new constructor.
>
> The same thing is easy to do programmatically using the old constructors ->
> I would just extend the AdaptiveFeatureGenerator
> <
> https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html
> >
> .
>
> So, I was wondering ... are we giving up some API flexibility and
> simplicity by removing the constructors that enable me to use subclasses of
> AdaptiveFeatureGenerator
> <
> https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html
> >
> while
> there is no easy way to create something like an
> AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator and use
> it as a feature generator in the NameFinderME using the new constructor's
> XML specification?
>
> Cohan Sujay Carlos
> Aiaioo Labs, +91-77605-80015, http://www.aiaioo.com
>
> On Mon, Mar 7, 2016 at 4:37 PM, Rodrigo Agerri <ragerri@apache.org> wrote:
>
> > Hi,
> >
> > You can do all those tasks by using the create method in the
> > TokenNameFinderFactory:
> >
> >
> >
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/namefind/TokenNameFinderFactory.java?revision=1712553&view=markup#l100
> >
> > For that you need to:
> >
> > 1. Provide the name of the factory class you are using; it could be
> > the factory class itself: TokenNameFinderFactory.class.getName()
> > 2. Create an XML descriptor and pass it as a byte[] array
> > 3. Load the resources (e.g., clusters) in a resources map consisting
> > of the id of the resource and the serializer.
> > 4. Specify the sequenceCodec: BIO or BILOU.
> >
> > The NameFinder documentation has already been updated:
> >
> >
> >
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml?view=markup
> >
> > There is sample code to do that in the CLI class:
> >
> >
> >
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/namefind/TokenNameFinderTrainerTool.java?revision=1674262&view=markup
> >
> > and to run it from the CLI:
> >
> > 1. Create an XML feature descriptor, e.g., brown-feature.xml
> >
> > <generators>
> >   <cache>
> >     <generators>
> >       <window prevLength = "2" nextLength = "2">
> >         <token/>
> >       </window>
> >       <window prevLength = "2" nextLength = "2">
> >         <brownclustertoken dict="brownBllipClusters" />
> >       </window>
> >     </generators>
> >   </cache>
> > </generators>
> >
> > 2. Put your clustering lexicon(s) in a directory, e.g., clusters
> > 3. Train: bin/opennlp TokenNameFinderTrainer -featuregen brown-feature.xml
> > -resources clusters/ -params lang/ml/PerceptronTrainerParams.txt -lang
> > en -model brown.bin -data
> > ~/experiments/nerc/opennlp/en/conll03/en-testb.opennlp -encoding UTF-8
> >
> > If you open the brown.bin model you will see the cluster lexicon
> > serialized inside the model.
> >
> > You can now use it like any other model: the TokenNameFinderFactory
> > will reload all the required resources when the model is loaded in
> > the NameFinderME class.
> >
> > HTH,
> >
> > R
> >
> >
> >
> >
> >
> >
> > On Mon, Feb 15, 2016 at 7:59 AM, Cohan Sujay Carlos <cohan@aiaioo.com>
> > wrote:
> > > Hi,
> > >
> > > I noticed that in the OpenNLP SVN 'trunk', the formerly deprecated
> > > constructors for the class *NameFinderME*:
> > >
> > > *public NameFinderME(TokenNameFinderModel model, AdaptiveFeatureGenerator
> > > generator, int beamSize, SequenceValidator<String> sequenceValidator);*
> > >
> > > and
> > >
> > >
> > > *public NameFinderME(TokenNameFinderModel model, AdaptiveFeatureGenerator
> > > generator, int beamSize)*
> > >
> > > have been removed, along with
> > >
> > > *public NameFinderME(TokenNameFinderModel model, int beamSize)*
> > >
> > > The deprecation comments said:
> > >
> > > @deprecated the beam size is now configured during training time in the
> > > trainer parameter file via beamSearch.beamSize
> > >
> > > and
> > >
> > > @deprecated Use {@link #NameFinderME(TokenNameFinderModel)} instead and use
> > > the {@link TokenNameFinderFactory} to configure it.
> > >
> > > I wanted to point out a few potential problems:
> > >
> > > 1.  The corresponding train methods have not been removed.  So, it is
> > > possible to train a NameFinderME using a *custom* AdaptiveFeatureGenerator
> > > class to do feature engineering, but once a model has been so trained,
> > > there is no way to load and use the stored model with the same
> > > AdaptiveFeatureGenerator.
> > >
> > > 2.  There is still no documentation on the TokenNameFinderFactory which is
> > > supposed to replace the constructor with the AdaptiveFeatureGenerator.
> > >
> > > 3.  I went over the code of TokenNameFinderFactory and a few places where
> > > it is used and it seemed to be designed for working with an XML
> > > specification of feature combinations.  I have also in the references
> > > included a mailing list conversation that says this class should be used
> > > with an XML file.  However, it turns out that custom feature sets for
> > > sequential classification are often important, so might we be dropping
> > > valuable feature engineering support?
> > >
> > > Finally, in light of the above, could we keep the deprecated constructors
> > > around until the alternative constructor (using TokenNameFinderFactory)
> > > enters into production, and examples and documentation for it become widely
> > > available?
> > >
> > > References:
> > >
> > > On the TokenNameFinderFactory using XML:
> > >
> > > https://mail-archives.apache.org/mod_mbox/opennlp-dev/201410.mbox/%3CCAKvDkVDfAx5BMvwVOrbvpZm7xV9erRQzrzbCDpfd+Cq6m=xqQw@mail.gmail.com%3E
> > >
> > > Relevant JIRA issues:
> > > https://issues.apache.org/jira/browse/OPENNLP-718
> > > https://issues.apache.org/jira/browse/OPENNLP-717
> > >
> > > Thank you,
> > >
> > > Cohan Sujay Carlos
> >
>
