opennlp-dev mailing list archives

From Rodrigo Agerri <rodrigo.age...@ehu.es>
Subject Re: [opennlp-dev] TokenNameFinderFactory new features and extension
Date Tue, 07 Oct 2014 16:40:56 GMT
Hello,

One question regarding the WordClusterFeatureGenerator implementation,
which I am using as a template for the Brown features and so on. I
cannot seem to make it work: it keeps complaining that the value
of the "dict" attribute I provide is not an instance of
W2VClassesDictionary:

Exception in thread "main"
opennlp.tools.namefind.TokenNameFinderModel$FeatureGeneratorCreationError:
opennlp.tools.util.InvalidFormatException: Not a W2VClassesDictionary
resource for key: opennlp.tools.util.featuregen.W2VClassesDictionary

I have tried both from the CLI and programmatically, and I get the same result.
From the CLI I write an element like this:
<w2vwordcluster dict="opennlp.tools.util.featuregen.W2VClassesDictionary" />

which I add to the default descriptor. I also pass the relevant
directory containing the word2vec clusters via the -resources
parameter.

Programmatically, I build the same element with:

Element word2vecClusterFeatures = new Element("w2vwordcluster");
InputStream inputStream = new FileInputStream(word2vecClusterPath);
ArtifactSerializer serializer =
    new W2VClassesDictionary.W2VClassesDictionarySerializer();
word2vecClusterFeatures.setAttribute("dict",
    serializer.create(inputStream).getClass().getName());

and it prints the same descriptor as above as well as the same error
in the GeneratorFactory class.
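
The error text ends with "resource for key:" followed by the value of the "dict" attribute, which suggests that the attribute is treated as a lookup key into the loaded resources and not as a class name. This is an inference from the message, not something confirmed in this thread. A minimal sketch of that kind of lookup follows; ClusterDictionary, resolve and the key "clusters.txt" are hypothetical stand-ins and not OpenNLP API.

```java
import java.util.HashMap;
import java.util.Map;

public class ResourceKeyLookupSketch {

    /** Hypothetical stand-in for a loaded cluster dictionary (not the OpenNLP class). */
    static final class ClusterDictionary {
        private final Map<String, String> tokenToCluster = new HashMap<>();

        void put(String token, String cluster) {
            tokenToCluster.put(token, cluster);
        }

        String lookup(String token) {
            return tokenToCluster.get(token);
        }
    }

    /** Resolves the value of a "dict" attribute against the loaded resources. */
    static ClusterDictionary resolve(Map<String, Object> resources, String dictKey) {
        Object resource = resources.get(dictKey);
        if (!(resource instanceof ClusterDictionary)) {
            throw new IllegalArgumentException(
                "Not a ClusterDictionary resource for key: " + dictKey);
        }
        return (ClusterDictionary) resource;
    }

    public static void main(String[] args) {
        ClusterDictionary dictionary = new ClusterDictionary();
        dictionary.put("Madrid", "42");

        // Assumption: resources are registered under a name such as a file name.
        Map<String, Object> resources = new HashMap<>();
        resources.put("clusters.txt", dictionary);

        // A key that names a registered resource resolves.
        System.out.println(resolve(resources, "clusters.txt").lookup("Madrid"));

        // A class name is not a registered key, so the lookup fails.
        try {
            resolve(resources, ClusterDictionary.class.getName());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the real lookup works along these lines, passing the class name as the attribute value would fail in the way shown by the second call, whatever the contents of the resources directory.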

This is the command: bin/opennlp TokenNameFinderTrainer -featuregen
lang/en/namefinder/en-namefinder.xml -factory
opennlp.tools.namefind.TokenNameFinderFactory -resources ~/word2vec/
-params lang/ml/PerceptronTrainerParams.txt -lang en -model test.bin
-data ~/experiments/nerc/opennlp/data/en/conll2003/opennlp-eng.train

Thanks,

Rodrigo

On Fri, Oct 3, 2014 at 12:40 PM, Jörn Kottmann <kottmann@gmail.com> wrote:
> On 10/03/2014 11:58 AM, Rodrigo Agerri wrote:
>>
>> I have implemented a number of new features for the name finder. These
>> include Brown clusters features (duplicated per Brown path for each
>> feature activated involving a token) and Clark cluster features
>> (similar to the WordClusterFeatureGenerator currently available) among
>> other local extra features which interact well with the clustering
>> ones.
>>
>> I think it will be nice to include them before the new release. I will
>> open issues about each of them. What do you think?
>
>
> Yes please open issues for them. It would be really nice to receive them as
> a contribution.
>
> There are two things you need to do:
> 1. Implement the feature generators
> - Implement AdaptiveFeatureGenerator or extend CustomFeatureGenerator if you
> need to pass parameters to it
>
> 2. Implement support for loading and serializing the data they need
> - The class holding the data should implement SerializableArtifact
> - And if you want to load and use it, the feature generator should implement
> ArtifactToSerializerMapper, which tells
> the loader which class to use to read the data file
>
> The above is the procedure you should use if you want to have a real custom
> feature generator which is not part of
> the OpenNLP Tools jar.
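
The two steps above can be sketched in a simplified form. The code below does not use the OpenNLP interfaces; ClusterLexicon, ArtifactReader, ClusterFeatureGenerator and the "clusters" extension are hypothetical stand-ins that only mirror the roles described (a data class that can be loaded, a generator that uses it, and a mapping that tells a loader which reader to use).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CustomGeneratorSketch {

    /** Hypothetical reader role: turns file content into a data object. */
    interface ArtifactReader<T> {
        T read(String content);
    }

    /** Step 2: the data the generator needs, parsed from "token cluster" lines. */
    static final class ClusterLexicon {
        final Map<String, String> clusters = new HashMap<>();

        static ClusterLexicon parse(String content) {
            ClusterLexicon lexicon = new ClusterLexicon();
            for (String line : content.split("\n")) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length == 2) {
                    lexicon.clusters.put(parts[0], parts[1]);
                }
            }
            return lexicon;
        }
    }

    /** Step 1: the generator adds a cluster feature for tokens it knows. */
    static final class ClusterFeatureGenerator {
        private final ClusterLexicon lexicon;

        ClusterFeatureGenerator(ClusterLexicon lexicon) {
            this.lexicon = lexicon;
        }

        void createFeatures(List<String> features, String[] tokens, int index) {
            String cluster = lexicon.clusters.get(tokens[index]);
            if (cluster != null) {
                features.add("cluster=" + cluster);
            }
        }

        /** Mapper role: tells a loader which reader handles which file extension. */
        static Map<String, ArtifactReader<ClusterLexicon>> readersByExtension() {
            Map<String, ArtifactReader<ClusterLexicon>> readers = new HashMap<>();
            readers.put("clusters", ClusterLexicon::parse);
            return readers;
        }
    }

    public static void main(String[] args) {
        // The loader picks the reader by extension, then hands the data to the generator.
        ArtifactReader<ClusterLexicon> reader =
            ClusterFeatureGenerator.readersByExtension().get("clusters");
        ClusterLexicon lexicon = reader.read("Madrid 0110\nOctober 1011\n");

        List<String> features = new ArrayList<>();
        new ClusterFeatureGenerator(lexicon)
            .createFeatures(features, new String[] {"in", "Madrid"}, 1);
        System.out.println(features);
    }
}
```
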
>
> When you contribute it, things are slightly different. You should add an
> XmlFeatureGeneratorFactory inside the GeneratorFactory
> class. This factory creates the feature generator based on a defined xml
> element inside the descriptor.
>
>> 6. *Some* of the new features work. If an Element name in the
>> descriptor does not match in the GeneratorFactory, then the
>> TokenNameFinderFactory.createFeatureGenerators() gives a null and the
>> TokenNameFinderFactory.createContextGenerator() automatically stops
>> the feature creation and goes for the
>> NameFinderME.createFeatureGenerator().
>> Is this the desired behaviour? Perhaps we could add a log somewhere?
>> To inform of the backoff to the default features if one descriptor
>> element does not match?
>
>
> That sounds really bad. If there is a problem in the mapping it should fail
> hard and throw an
> exception. The user should be forced to decide by himself what to do, either
> fix his descriptor
> or use defaults.
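
A minimal sketch of the fail-hard behaviour described above, assuming a registry that maps descriptor element names to factories. FailHardFactorySketch and its methods are hypothetical and not the GeneratorFactory API, and a generator is represented by a plain string to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class FailHardFactorySketch {

    // Element name -> factory; a String stands in for a real feature generator.
    private final Map<String, Supplier<String>> factories = new HashMap<>();

    void register(String elementName, Supplier<String> factory) {
        factories.put(elementName, factory);
    }

    /** Creates the generator for an element, or throws if the name is unknown. */
    String create(String elementName) {
        Supplier<String> factory = factories.get(elementName);
        if (factory == null) {
            // No silent fallback to defaults: the caller must fix the descriptor.
            throw new IllegalArgumentException(
                "Unknown feature generator element: " + elementName);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        FailHardFactorySketch registry = new FailHardFactorySketch();
        registry.register("window", () -> "window generator");

        System.out.println(registry.create("window"));

        try {
            registry.create("brown"); // not registered in this sketch
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With this shape, a typo in the descriptor surfaces as an exception at model creation time instead of quietly training with the default features.
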
>
> The steps 4 and 5 you describe should not be necessary to add new feature
> generators.
>
> The idea is that we always use the xml descriptor to define the feature
> generation, that way we can have different
> configurations without changing the OpenNLP code itself, and don't need
> special user code to integrate a
> customized name finder model. If a model makes use of external classes these
> of course need to be on the classpath
> since we can't ship them as a part of the model.
>
> HTH,
> Jörn
