Mailing-List: contact opennlp-users-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: opennlp-users@incubator.apache.org
MIME-Version: 1.0
In-Reply-To: <BANLkTimX79BWcYedYYX28iV3nxFhuQGH0w@mail.gmail.com>
References: <BANLkTinzwzTuE9AFAy=S9hoN2LhYUntXZg@mail.gmail.com>
 <BANLkTinR-VYtsR27bWmeFpJ-AW4O1-Mk1Q@mail.gmail.com>
 <BANLkTi=XPP7zKAQ4Jhk9m635Ka15PmNZbA@mail.gmail.com>
 <BANLkTimX79BWcYedYYX28iV3nxFhuQGH0w@mail.gmail.com>
From: Tommaso Teofili <tommaso.teofili@gmail.com>
Date: Mon, 20 Jun 2011 11:14:45 +0200
Message-ID: <BANLkTik4xWmeetzenSynZ-_Zq6BmpoiLKw@mail.gmail.com>
Subject: Re: UIMA TokenizerTrainer component : the model file is not created
To: nicolas.hernandez@univ-nantes.fr
Cc: opennlp-users@incubator.apache.org
Content-Type: multipart/alternative; boundary=001636e0a671f4139d04a6212cc8

--001636e0a671f4139d04a6212cc8
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hello Nicolas,

2011/6/17 Nicolas Hernandez <nicolas.hernandez@gmail.com>

> Tommaso you said you successfully used the OpenNLP UIMA trainers.
>
> I am currently attempting to build French models for the various tasks
> OpenNLP can deal with. But since I am also involved in UIMA stuff, I
> wanted to test the OpenNLP UIMA components for doing that.
> My goal is to donate the models to the OpenNLP community (i.e. in
> http://opennlp.sourceforge.net/models-1.5/)
>
> Before testing the TokenizerTrainer, I tested the SentenceDetector. I
> found at least two problems with the UIMA component
> https://issues.apache.org/jira/browse/OPENNLP-197
> One of them is not yet recorded in JIRA, but I am curious to
> know whether you encountered it.
>
> I noted that models trained with the UIMA component give wrong
> begin/end offsets, even though they do manage to split the text into
> sentences. Each sentence begins with the final punctuation character
> of the previous sentence as its first token, while the previous
> sentence does not include that character as its last token.
>
> Have you noticed the problem?
>

I didn't notice that, but I will rerun my tests to check it out; I may have
missed it.
I'll let you know how it goes.
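
For reference, the check I have in mind can be sketched in plain Java as
below. It is only an illustration under my own assumptions: it has no OpenNLP
or UIMA dependency, the class name SentenceOffsetCheck, the method name and
the sample text are made up, and the {begin, end} pairs stand in for whatever
offsets the sentence annotations report (end is taken to be exclusive).

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceOffsetCheck {

    /**
     * Returns the indexes of sentences whose span is malformed or starts with
     * a sentence-final punctuation character ('.', '!' or '?').
     * Each span is {begin, end}, with end exclusive (an assumption of this sketch).
     */
    public static List<Integer> suspiciousSentences(String text, int[][] spans) {
        List<Integer> suspicious = new ArrayList<>();
        for (int i = 0; i < spans.length; i++) {
            int begin = spans[i][0];
            int end = spans[i][1];
            if (begin < 0 || end > text.length() || begin >= end) {
                suspicious.add(i); // span does not fit the text at all
                continue;
            }
            char first = text.charAt(begin);
            if (first == '.' || first == '!' || first == '?') {
                suspicious.add(i); // punctuation of the previous sentence leaked in
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        String text = "First one. Second one.";
        // Expected offsets: the full stop belongs to the sentence it ends.
        int[][] good = { {0, 10}, {11, 22} };
        // Shifted offsets: the full stop opens the following sentence.
        int[][] bad = { {0, 9}, {9, 22} };
        System.out.println(suspiciousSentences(text, good)); // prints []
        System.out.println(suspiciousSentences(text, bad));  // prints [1]
    }
}
```

Running this kind of check over the offsets produced by a trained model
should make it easy to see how many sentences are affected.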
Regards,
Tommaso


>
> I think that, most of all, my problems are due to the lack of
> documentation for the UIMA integration. I plan to write a blog post
> about my experience. Since there is an open issue for that,
> https://issues.apache.org/jira/browse/OPENNLP-49, if I manage to find
> the time to write the post, I can try to write it in such a way that it
> can also be contributed to the documentation (if you are
> interested).
>
>
>
> On Thu, Jun 16, 2011 at 3:52 PM, Nicolas Hernandez
> <nicolas.hernandez@gmail.com> wrote:
> > Hello Tommaso,
> >
> > after some more tests... I think I have found how to reproduce my
> > problem.
> >
> > Tommaso, you're right, it works fine with the pipeline you described
> > (i.e. with the WhitespaceTokenizer followed by the token trainer
> > (wst-tokenTrainer-AAE)), but only if the input texts are formatted as
> > 'normal' texts...
> > I tested the pipeline with texts already formatted in a 'wst' way (one
> > sentence per line and tokens separated by a whitespace character), and
> > in that case it no longer works (despite the presence of the
> > sentence and token annotations).
> >
> > So my guess is that on the command line the tokenTrainer needs input in
> > wst format (with '<SPLIT>' tags), whereas the OpenNLP UIMA tokenTrainer
> > needs text that is in some way 'detokenized'.
> >
> > If needed, I can open a 'question' issue and attach the texts I used
> > to reproduce the problem.
> >
> > /Nicolas
> >
> > ---------- Forwarded message ----------
> > From: Tommaso Teofili <tommaso.teofili@gmail.com>
> > Date: Wed, Jun 15, 2011 at 5:30 PM
> > Subject: Re: UIMA TokenizerTrainer component : the model file is not created
> > To: opennlp-users@incubator.apache.org, nicolas.hernandez@univ-nantes.fr
> >
> >
> > Hello Nicolas,
> > I successfully used the OpenNLP UIMA TokenizerTrainer and also the
> > other trainers. As a simple proof, I created an aggregate analysis
> > engine descriptor with the UIMA WhitespaceTokenizer and the OpenNLP
> > TokenizerTrainer in a fixed flow, then used a
> > FileSystemCollectionReader to feed the pipeline.
> > In the TokenizerTrainer I set:
> > <nameValuePair>
> >   <name>opennlp.uima.TokenType</name>
> >   <value>
> >     <string>org.apache.uima.TokenAnnotation</string>
> >   </value>
> > </nameValuePair>
> > <nameValuePair>
> >   <name>opennlp.uima.language</name>
> >   <value>
> >     <string>en-US</string>
> >   </value>
> > </nameValuePair>
> > <nameValuePair>
> >   <name>opennlp.uima.ModelName</name>
> >   <value>
> >     <string>target/Tokens.bin</string>
> >   </value>
> > </nameValuePair>
> >
> > This created the Tokens.bin model, which I was able to test from
> > the command line and via the API.
> > Are you using it in a different way?
> > Regards,
> > Tommaso
> >
> > 2011/6/15 Nicolas Hernandez <nicolas.hernandez@gmail.com>
> >>
> >> Hello
> >>
> >> Has anyone already used the UIMA TokenizerTrainer component? I
> >> am a bit confused since it does not create any model file.
> >>
> >> In my stdout I got this:
> >> Indexing events using cutoff of 5
> >>        Computing event counts...
> >>
> >> done. 69669 events
> >>        Indexing...  done.
> >> Sorting and merging events... done. Reduced 69669 events to 16467.
> >> Done indexing.
> >> Incorporating indexed data for training...
> >> done.
> >>        Number of Event Tokens: 16467
> >>            Number of Outcomes: 1
> >>          Number of Predicates: 5624
> >> ...done.
> >> Computing model parameters...
> >> Performing 100 iterations.
> >>  1:  .. loglikelihood=0.0      1.0
> >>  2:  .. loglikelihood=0.0      1.0
> >>
> >> This looks like a problem I had when I trained the model on the command
> >> line without using the '<SPLIT>' tag. The difference is that on the
> >> command line I also got the following exception:
> >> Exception in thread "main" java.lang.IllegalArgumentException: The
> >> maxent model is not compatible!
> >>
> >> I solved this problem by adding the tag, as mentioned in the post
> >> "maxent model is not compatible with Tokenizer training" (Fri, 13 May,
> >> 09:33):
> >> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser
> >>
> >> Does anyone know if it is the same problem? If so, how can I
> >> specify the '<SPLIT>' tag in the UIMA version? As far as I understand
> >> its role, it is important to give the user the possibility of setting
> >> it.
> >>
> >> More generally, I am interested in any feedback from people who
> >> successfully managed to build models with the UIMA OpenNLP *Trainer
> >> components. For now, I have also had some trouble with the
> >> SentenceTrainer, and I have not tested the others.
> >>
> >> /Nicolas
> >>
> >>
> >> --
> >> nicolas.hernandez@univ-nantes.fr
> >> #
> >> http://enicolashernandez.blogspot.com
> >> http://www.univ-nantes.fr/hernandez-n
> >> #
> >> Laboratoire LINA-TALN CNRS UMR 6241
> >> tel. +33 (0)2 51 12 58 55
> >> #
> >> Université de Nantes - Institut Universitaire de Technologie -
> >> Département Informatique
> >> tel. +33 (0)2 40 30 60 67
>
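
P.S. On the '<SPLIT>' question in the quoted thread: the command line
training format is one sentence per line, with tokens separated either by
whitespace or by a '<SPLIT>' tag. The sketch below shows how such a line can
be derived from a 'normal' sentence plus its token offsets. It is plain Java
with no OpenNLP or UIMA dependency; the class SplitTagFormat, its methods and
the sample sentences are invented for illustration, and I am assuming (not
having verified it in the source) that the UIMA trainer finds split points in
a comparable way, from token annotations that touch each other.

```java
public class SplitTagFormat {

    static final String SPLIT = "<SPLIT>";

    /**
     * Builds one training line from a sentence and its token offsets.
     * Tokens that touch each other in the text are joined with a <SPLIT> tag;
     * all other tokens are joined with a single space.
     * Each token is {begin, end}, end exclusive, sorted by begin.
     */
    public static String toTrainingLine(String sentence, int[][] tokens) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                boolean adjacent = tokens[i - 1][1] == tokens[i][0];
                line.append(adjacent ? SPLIT : " ");
            }
            line.append(sentence, tokens[i][0], tokens[i][1]);
        }
        return line.toString();
    }

    /** Counts the <SPLIT> tags in a training line. */
    public static int countSplits(String line) {
        int count = 0;
        int at = line.indexOf(SPLIT);
        while (at >= 0) {
            count++;
            at = line.indexOf(SPLIT, at + SPLIT.length());
        }
        return count;
    }

    public static void main(String[] args) {
        // 'Normal' text: punctuation is attached to the preceding word.
        String normal = toTrainingLine("Hello, world.",
                new int[][] { {0, 5}, {5, 6}, {7, 12}, {12, 13} });
        // Text already in 'wst' form: every token is surrounded by whitespace.
        String wst = toTrainingLine("Hello , world .",
                new int[][] { {0, 5}, {6, 7}, {8, 13}, {14, 15} });
        System.out.println(normal + " -> " + countSplits(normal) + " splits");
        System.out.println(wst + " -> " + countSplits(wst) + " splits");
    }
}
```

With the second input no split point is produced at all. If my assumption
about the UIMA trainer holds, that might explain the "Number of Outcomes: 1"
line in the quoted log, but please treat this as a guess until it is checked
against the trainer code.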

--001636e0a671f4139d04a6212cc8--