opennlp-dev mailing list archives

From Rodrigo Agerri
Subject Re: GSoC 2015 - WSD Module
Date Mon, 08 Jun 2015 14:54:50 GMT

On Mon, Jun 8, 2015 at 3:49 PM, Mondher Bouazizi wrote:
> Dear Rodrigo,
> As Anthony mentioned in his previous email, I already started the
> implementation of the IMS approach. The pre-processing and the extraction
> of features have already been finished. Regarding the approach itself, it
> shows some potential according to the author though the features proposed
> are not so many, and are basic.

Hi, yes, the features are not that complex, but it is good to have a
working system first; the feature set can then be improved or enriched
if needed. As stated in the paper, the IMS approach leverages parallel
data to obtain state-of-the-art results on both the lexical sample and
all-words tasks for the Senseval-3 and SemEval-2007 datasets.

I think it would be nice to have a working system with this algorithm
as part of the WSD component in OpenNLP (following the API discussion
earlier in this thread) and to run some evaluations to see where the
system stands with respect to state-of-the-art results on those
datasets. Once this is working, I think it will be a good moment to
start discussing additional or better features.

> I think the approach itself might be
> enhanced if we add more context specific features from some other
> approaches... (To do that, I need to run many experiments using different
> combinations of features, however, that should not be a problem).

Speaking of the feature sets, I have not seen anything in the API
Google doc about the implementation of the feature extractors. Could
you perhaps provide some extra info about that (in that same document,
for example)?

> But the approach itself requires a linear SVM classifier, and as far as I
> know, OpenNLP has only a Maximum Entropy classifier. Is it OK to use
> libsvm?

I think you can start with MaxEnt. In the meantime, @Jörn has mentioned
a few times that OpenNLP has a plugin component for using third-party
ML libraries, and that he tested it with Mallet. Perhaps he could
comment on whether that functionality could be used for SVMs.

> Regarding the training data, I started collecting some from different
> sources. Most of the existing rich corpora are licensed (Including the ones
> mentioned in the paper). The free ones I got for now are from the Senseval
> and Semeval websites. However, these are used just to evaluate the proposed
> methods in the workshops. Therefore, the words to disambiguate are few in
> number though the training data for each word are rich enough.
> In any case, the first tests with Senseval and Semeval collected should be
> finished soon. However, I am not sure if there is a rich enough Dataset we
> can use to make our model for the WSD module in the OpenNLP library.
> If you have any recommendation, I would be grateful if you can help me on
> this point.

Well, as I said in my previous email, research on "word senses" is
moving from WSD towards supersense tagging, where there are recent
papers and freely available tweet datasets, for example. In any case,
we can look into it further, but in the meantime SemCor for training
and the Senseval/SemEval-2007 datasets for evaluation should be enough
to compare your system with the literature.

> As Jörn mentioned sending an initial patch, should we separate our code
> and upload two different patches to the two issues we created on Jira
> (however, this means a lot of redundancy in the code), or shall we keep
> them in one project and upload it? If we opt for the latter, which
> issue should we upload the patch to?

In my opinion, it should be a single patch and a single component with
different algorithm implementations within it. Any other opinions?
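
To illustrate the "one component, several algorithm implementations"
idea, here is a rough sketch. Everything in it is an assumption made up
for illustration: the names Disambiguator, FirstSenseDisambiguator,
CueOverlapDisambiguator and WsdComponent are hypothetical and are not
the OpenNLP API or what the API doc proposes, and the cue-overlap
class is only a toy stand-in for a context-based algorithm, not IMS.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical contract shared by every WSD algorithm (not the OpenNLP API).
interface Disambiguator {
    // Returns a sense id for tokens.get(targetIndex), or null if the word is unknown.
    String disambiguate(List<String> tokens, int targetIndex);
}

// Baseline: always picks the first sense listed for the target word.
class FirstSenseDisambiguator implements Disambiguator {
    private final Map<String, List<String>> senses; // word -> ordered sense ids

    FirstSenseDisambiguator(Map<String, List<String>> senses) {
        this.senses = senses;
    }

    @Override
    public String disambiguate(List<String> tokens, int targetIndex) {
        List<String> candidates = senses.get(tokens.get(targetIndex));
        return (candidates == null || candidates.isEmpty()) ? null : candidates.get(0);
    }
}

// Toy stand-in for a context-based algorithm (NOT IMS): picks the sense
// whose cue words occur most often in the sentence.
class CueOverlapDisambiguator implements Disambiguator {
    private final Map<String, Map<String, Set<String>>> cues; // word -> sense id -> cue words

    CueOverlapDisambiguator(Map<String, Map<String, Set<String>>> cues) {
        this.cues = cues;
    }

    @Override
    public String disambiguate(List<String> tokens, int targetIndex) {
        Map<String, Set<String>> candidates = cues.get(tokens.get(targetIndex));
        if (candidates == null) {
            return null;
        }
        String best = null;
        long bestOverlap = -1;
        // Ties are resolved by the iteration order of the candidate map.
        for (Map.Entry<String, Set<String>> e : candidates.entrySet()) {
            long overlap = tokens.stream().filter(e.getValue()::contains).count();
            if (overlap > bestOverlap) {
                best = e.getKey();
                bestOverlap = overlap;
            }
        }
        return best;
    }
}

// The single component: callers depend on it, and the algorithm is swappable.
class WsdComponent {
    private final Disambiguator algorithm;

    WsdComponent(Disambiguator algorithm) {
        this.algorithm = algorithm;
    }

    String tag(List<String> tokens, int targetIndex) {
        return algorithm.disambiguate(tokens, targetIndex);
    }
}

class WsdSketch {
    public static void main(String[] args) {
        List<String> tokens = List.of("he", "sat", "on", "the", "bank", "near", "the", "water");
        WsdComponent baseline = new WsdComponent(new FirstSenseDisambiguator(
                Map.of("bank", List.of("bank%finance", "bank%river"))));
        WsdComponent contextual = new WsdComponent(new CueOverlapDisambiguator(
                Map.of("bank", Map.of(
                        "bank%finance", Set.of("money", "loan"),
                        "bank%river", Set.of("water", "shore")))));
        System.out.println(baseline.tag(tokens, 4));   // bank%finance
        System.out.println(contextual.tag(tokens, 4)); // bank%river
    }
}
```

With a layout along these lines, the shared pre-processing and
evaluation code would live once next to the component, and each
algorithm would only add its own Disambiguator class, which is the
redundancy concern raised above.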


