opennlp-dev mailing list archives

From Rodrigo Agerri <rodrigo.age...@ehu.es>
Subject Re: [opennlp-dev] TokenNameFinderFactory new features and extension
Date Fri, 03 Oct 2014 22:53:50 GMT
Hi,

As a follow-up, it turns out that currently a feature generator passed
via the -featuregen parameter is only used if you also provide a
subclass via the -factory parameter. I do not know if that is intended.
Also, I have noticed some very odd behaviour: I pass several descriptors
via the CLI (starting with token features only, then adding tokenclass,
etc.) and all goes well until I add either the PrefixFeatureGenerator or
the SuffixFeatureGenerator, at which point performance drops alarmingly.
With prefix and suffix added to the default descriptor I get 49.65 F1:

bin/opennlp TokenNameFinderTrainer -featuregen bigram.xml -factory
opennlp.tools.namefind.TokenNameFinderFactory -sequenceCodec BIO
-params lang/ml/PerceptronTrainerParams.txt -lang nl -model test.bin
-data ~/experiments/nerc/opennlp/data/nl/conll2002/nl_opennlp.testa.train

I get this behaviour with all four conll03 and conll02 datasets. The
descriptor is:

<generators>
  <cache>
    <generators>
      <window prevLength = "2" nextLength = "2">
        <tokenclass/>
      </window>
      <window prevLength = "2" nextLength = "2">
        <token/>
      </window>
      <definition/>
      <prevmap/>
      <bigram/>
      <sentence begin="true" end="false"/>
      <prefix/>
      <suffix/>
    </generators>
  </cache>
</generators>
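
To make concrete what the last two elements add, here is a minimal,
self-contained sketch. It assumes that prefix and suffix features are
simply the leading and trailing substrings of a token up to four
characters long, tagged "pre=" and "suf="; that is how I read the
generators, but the class AffixFeatureSketch and its methods are
hypothetical and are not the OpenNLP classes.

```java
import java.util.ArrayList;
import java.util.List;

public class AffixFeatureSketch {

    // Assumed maximum affix length; not read from any OpenNLP configuration.
    private static final int MAX_AFFIX_LENGTH = 4;

    // Returns "pre=" features for the leading 1..4 characters of the token.
    public static List<String> prefixes(String token) {
        List<String> features = new ArrayList<>();
        int limit = Math.min(MAX_AFFIX_LENGTH, token.length());
        for (int len = 1; len <= limit; len++) {
            features.add("pre=" + token.substring(0, len));
        }
        return features;
    }

    // Returns "suf=" features for the trailing 1..4 characters of the token.
    public static List<String> suffixes(String token) {
        List<String> features = new ArrayList<>();
        int limit = Math.min(MAX_AFFIX_LENGTH, token.length());
        for (int len = 1; len <= limit; len++) {
            features.add("suf=" + token.substring(token.length() - len));
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("Amsterdam")); // [pre=A, pre=Am, pre=Ams, pre=Amst]
        System.out.println(suffixes("Amsterdam")); // [suf=m, suf=am, suf=dam, suf=rdam]
    }
}
```

Each token contributes up to eight extra string features this way, so
these two elements enlarge the feature space considerably compared with
the rest of the descriptor.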

Cheers,

Rodrigo

On Fri, Oct 3, 2014 at 5:55 PM, Rodrigo Agerri <rodrigo.agerri@ehu.es> wrote:
> Hi Jörn,
>
> On Fri, Oct 3, 2014 at 12:40 PM, Jörn Kottmann <kottmann@gmail.com> wrote:
>>
>> There are two things you need to do:
>> 1. Implement the feature generators
>> - Implement AdaptiveFeatureGenerator or extend CustomFeatureGenerator if you
>> need to pass parameters to it
>
> OK, let's say I have this descriptor:
>
> <generators>
>   <cache>
>     <generators>
>       <window prevLength="2" nextLength="2">
>         <token />
>       </window>
>       <custom class="es.ehu.si.ixa.pipe.nerc.features.Prefix34FeatureGenerator" />
>     </generators>
>   </cache>
> </generators>
>
> Now, if I understand the implementation (and your comments) correctly:
>
> 1. I should just create a Prefix34FeatureGenerator class extending
> FeatureGeneratorAdapter.
> 2. If I wanted to pass parameters, e.g. descriptor attributes, then I
> should extend CustomFeatureGenerator.
> 3. If I load such descriptor as argument of -featuregen in the CLI,
> the CLI should complain if such class is not in the classpath, I
> guess. If it is in the classpath, then it should use the custom
> generator.
> 4. As it is now, no matter what value you pass to -featuregen, it
> always trains with the default features. It does not complain even if
> the custom feature generator is not well-formed. Even if I only pass
> the token features, it still loads the default generator. With version
> 1.5.3 it works fine though. I am looking into it, but any hints are
> welcome :)
> 5. When I do this programmatically, e.g., load the feature generator
> descriptor into an extension of the TokenNameFinderFactory, it seems to
> load the custom generators. The GeneratorFactory loads the descriptor
> I pass: if it contains only tokens, then it trains successfully with
> tokens only. However, if I pass a custom generator, it does not
> complain; it trains and the performance drops to 40 F1. For the record,
> I build the descriptor programmatically like this
>
>  Element prefixFeature = new Element("custom");
>  prefixFeature.setAttribute("class", Prefix34FeatureGenerator.class.getName());
>  generators.addContent(prefixFeature);
>
> and then the GeneratorFactory does get it without errors.
>
>> 2. Implement support for loading and serializing the data they need
>> - This class should implement SerializableArtifact
>> - And if you want to load and use it, the Feature Generator should implement
>> ArtifactToSerializerMapper, which tells
>> the loader which class to use to read the data file
>
> This is only for the clustering features resources and such, I guess.
>
>> The above is the procedure you should use if you want to have a real custom
>> feature generator which is not part of
>> the OpenNLP Tools jar.
>
> Yes, what I do is include opennlp as a maven dependency in an uber jar,
> i.e., with all classes inside, including opennlp and my custom feature
> generators. The classpath should be ok in this case, but I still
> cannot make them work.
>
>>
>>> 6. *Some* of the new features work. If an Element name in the
>>> descriptor does not match in the GeneratorFactory, then the
>>> TokenNameFinderFactory.createFeatureGenerators() gives a null and the
>>> TokenNameFinderFactory.createContextGenerator() automatically stops
>>> the feature creation and goes for the
>>> NameFinderME.createFeatureGenerator().
>>> Is this the desired behaviour? Perhaps we could add a log somewhere?
>>> To inform of the backoff to the default features if one descriptor
>>> element does not match?
>>
>>
>> That sounds really bad. If there is a problem in the mapping it should fail
>> hard and throw an
>> exception. The user should be forced to decide by himself what to do, either
>> fix his descriptor
>> or use defaults.
>
> I can open an issue and look into it.
>
>> The idea is that we always use the xml descriptor to define the feature
>> generation, that way we can have different
>> configurations without changing the OpenNLP code itself, and don't need
>> special user code to integrate a
>> customized name finder model. If a model makes use of external classes these
>> of course need to be on the classpath
>> since we can't ship them as a part of the model.
>
> OK, but I think what I did above is what you meant, is it not?
>
> Thanks,
>
> Rodrigo
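
On point 6 in the quoted message (silent backoff to the default
features when an element name does not match), a check that fails hard
could look roughly like the sketch below. It is only an illustration
under my own assumptions: the class DescriptorCheck is hypothetical, the
list of known names is just the set of elements used in the descriptors
in this thread, and none of this is the GeneratorFactory code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DescriptorCheck {

    // Illustrative list only: the element names that appear in this thread.
    private static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
        "generators", "cache", "window", "token", "tokenclass", "definition",
        "prevmap", "bigram", "sentence", "prefix", "suffix", "custom"));

    // Parses the descriptor and throws if any element name is not known.
    public static void check(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Descriptor is not well-formed XML", e);
        }
        checkElement(doc.getDocumentElement());
    }

    // Walks the element tree depth-first and rejects the first unknown name.
    private static void checkElement(Element element) {
        if (!KNOWN.contains(element.getTagName())) {
            throw new IllegalArgumentException(
                "Unknown descriptor element: <" + element.getTagName() + ">");
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                checkElement((Element) child);
            }
        }
    }

    public static void main(String[] args) {
        check("<generators><cache><generators><token/></generators></cache></generators>");
        try {
            check("<generators><prefx/></generators>"); // misspelled on purpose
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With something like this the user gets an exception naming the
offending element instead of a model silently trained on the defaults.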
