opennlp-dev mailing list archives

From Damiano Porta <damianopo...@gmail.com>
Subject Re: Lemmatizer BUG
Date Mon, 05 Dec 2016 14:55:39 GMT
Perfect! Thank you!


2016-12-05 15:46 GMT+01:00 Rodrigo Agerri <rodrigo.agerri@ehu.eus>:

> Hello,
>
> The javadoc says that the implementation of the statistical lemmatizer is
> based on:
>
> http://grzegorz.chrupala.me/papers/phd-single.pdf
>
> Check Chapter 6.
>
> This paper summarizes that chapter well:
>
> http://grzegorz.chrupala.me/papers/chrupala-etal-2008a/paper.pdf
>
> To cut a long story short, the statistical lemmatizer does not learn the
> lemmas themselves, but automatically induced classes that encode the
> permutations required to go from the word form to the lemma. It is much
> easier to generalize over those permutation classes than over the lemmas
> themselves, because many word-lemma pairs are captured by the same
> permutation class.
>
> HTH,
>
> Rodrigo
>
>
> On Mon, Dec 5, 2016 at 3:40 PM, Damiano Porta <damianoporta@gmail.com>
> wrote:
>
> > Hello Rodrigo!
> > Thank you so much! It works perfectly... but what is the reason behind the
> > use of the permutations? Why can we not have the lemma directly?
> >
> > Thanks for the clarification
> > Damiano
> >
> >
> > 2016-12-05 12:12 GMT+01:00 Rodrigo Agerri <ragerri@apache.org>:
> >
> > > Hello,
> > >
> > > The String[] lemmatize(String[] toks, String[] tags) method will give you
> > > the predicted "lemma class", which encodes the permutations
> > > required to go from the word form to the lemma.
> > >
> > > If the output is O, that means that no permutation is required, namely, the
> > > lemma and the word form are considered to be the same string. The last item
> > > in the array is for iniziata, and the class means "replace the letter t in
> > > position 1 with r; replace the letter a with the letter e in position 0",
> > > resulting in "iniziare". The word form and lemma strings are reversed for
> > > comparison, so positions are counted from the end of the word.
> > > I am assuming that you added the asterisks...
> > >
> > > Once you have that lemma class prediction array, you need to apply the
> > > String[] decodeLemmas(String[] toks, String[] preds) method in the same
> > > LemmatizerME class, which, as the javadoc states, requires the arrays of
> > > tokens and predicted lemma classes to perform the decoding (apply the
> > > permutations) and output the actual lemma (iniziare in your example).
> > >
> > > Cheers,
> > >
> > > Rodrigo
> > >
> > > On Mon, Dec 5, 2016 at 11:19 AM, Damiano Porta <damianoporta@gmail.com>
> > > wrote:
> > >
> > > > Hello,
> > > > I am doing some tests with the LemmatizerME.
> > > > It is returning a wrong word, a word that never occurs in the training
> > > > data. Basically it is NOT an Italian word :)
> > > >
> > > > The output is:
> > > >
> > > > [O, O, O, O, *R1trR0ae*]
> > > >
> > > > The code:
> > > >
> > > >         try (InputStream in = new FileInputStream("/home/damiano/lemmas.bin")) {
> > > >             LemmatizerModel lemmatizerModel = new LemmatizerModel(in);
> > > >
> > > >             LemmatizerME lem = new LemmatizerME(lemmatizerModel);
> > > >
> > > >             String[] tokens = new String[] {
> > > >                 "ultimo", "capitolo", "della", "saga", "iniziata"
> > > >             };
> > > >
> > > >             String[] pos = new String[] {
> > > >                 "As", "Ss", "EA", "Ss", "Vp"
> > > >             };
> > > >
> > > >             System.out.println(Arrays.toString(lem.lemmatize(tokens, pos)));
> > > >         }
> > > >
> > > > How can I analyze what happened?
> > > >
> > > > Thanks
> > > > Damiano
> > > >
> > >
> >
>
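
For readers of this thread, the decoding step described above can be sketched
in plain Java. This is only an illustration and not the OpenNLP implementation:
the class LemmaClassSketch and the method applyLemmaClass are hypothetical
names, and the sketch assumes a class built solely from replacement steps of
the form R<position><old letter><new letter> with a single-digit position, as
in the R1trR0ae example. The real encoding may contain other operations that
are not shown in this thread. The extra word forms "mangiata" and "parlata"
are added here only to show how one class can cover several word-lemma pairs.

```java
/**
 * Illustrative sketch only, NOT the OpenNLP implementation.
 * Applies a replace-only "lemma class" such as R1trR0ae to a word form.
 */
final class LemmaClassSketch {

    /** Hypothetical helper: decodes a class made only of R<position><old><new> steps. */
    static String applyLemmaClass(String wordForm, String lemmaClass) {
        // "O" means the lemma is considered identical to the word form.
        if (lemmaClass.equals("O")) {
            return wordForm;
        }
        // Work on the reversed word, so position 0 is the last character.
        StringBuilder reversed = new StringBuilder(wordForm).reverse();
        int i = 0;
        while (i < lemmaClass.length()) {
            if (lemmaClass.charAt(i) != 'R' || i + 4 > lemmaClass.length()) {
                throw new IllegalArgumentException("Unsupported lemma class: " + lemmaClass);
            }
            // Assumption: single-digit position, as in the example from the thread.
            int position = Character.digit(lemmaClass.charAt(i + 1), 10);
            char expected = lemmaClass.charAt(i + 2);
            char replacement = lemmaClass.charAt(i + 3);
            if (position < 0 || position >= reversed.length()
                    || reversed.charAt(position) != expected) {
                throw new IllegalArgumentException(
                        "Class " + lemmaClass + " does not fit word form " + wordForm);
            }
            reversed.setCharAt(position, replacement);
            i += 4;
        }
        // Reverse back to obtain the lemma in normal reading order.
        return reversed.reverse().toString();
    }

    public static void main(String[] args) {
        // The same class maps several participles to their infinitives.
        String[] forms = {"iniziata", "mangiata", "parlata"};
        for (String form : forms) {
            System.out.println(form + " -> " + applyLemmaClass(form, "R1trR0ae"));
        }
    }
}
```

With a trained model, the equivalent step is the decodeLemmas(tokens, preds)
call on LemmatizerME that Rodrigo describes; the sketch is only meant to make
the meaning of a class such as R1trR0ae concrete.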
