From: Tommaso Teofili
Date: Mon, 20 Jun 2011 11:14:45 +0200
Subject: Re: UIMA TokenizerTrainer component : the model file is not created
To: nicolas.hernandez@univ-nantes.fr
Cc: opennlp-users@incubator.apache.org

Hello Nicolas,

2011/6/17 Nicolas Hernandez

> Tommaso you said you successfully used the OpenNLP UIMA trainers.
>
> I am currently attempting to build French models for the various tasks
> OpenNLP can deal with. But since I am also involved in UIMA stuff, I
> wanted to test the OpenNLP UIMA components for doing that.
> My goal is to donate the models to the OpenNLP community (i.e. in
> http://opennlp.sourceforge.net/models-1.5/)
>
> Before testing the tokenizerTrainer, I tested the SentenceDetector. I
> found at least two problems with the UIMA component
> https://issues.apache.org/jira/browse/OPENNLP-197
> One of them is not yet referenced in the jira. But I am curious to
> know whether you encountered it.
>
> I noted that models trained with the UIMA component give wrong
> begin/end offsets despite the fact that they manage to split text into
> sentences. I observed that a sentence begins with the punctuation
> character of the previous sentence as its first token, while the
> previous sentence does not include it as its last token.
>
> Have you noticed the problem ?
>

I didn't notice that, but I will rerun my tests to check; I may have
missed it. I'll let you know how it goes.
Regards,
Tommaso

>
> I think that, most of all, my problems are due to the lack of
> documentation for the UIMA integration. I plan to write a blog post
> about my experience. Since I see there is an open issue for that,
> https://issues.apache.org/jira/browse/OPENNLP-49, if I manage to find
> the time to write the post, I can try to write it in such a way that
> it can also be contributed to the documentation (if you are
> interested).
>
>
>
> On Thu, Jun 16, 2011 at 3:52 PM, Nicolas Hernandez wrote:
> > Hello Tommaso,
> >
> > after some more tests... I think I have found how to reproduce my
> > problem.
> >
> > Tommaso, you're right, it works fine with the pipeline you described
> > (i.e. with the WhitespaceTokenizer followed by the token trainer
> > (wst-tokenTrainer-AAE)) but only if the input texts are formatted as
> > 'normal' texts...
> > I tested the pipeline with texts already formatted in a 'wst' way (a
> > sentence per line and tokens separated by a whitespace character) and
> > like that it does not work any longer (despite the presence of the
> > sentence and token annotations).
> >
> > So my guess is that in command line the tokenTrainer needs as input a
> > wst format (with '' tags) but the opennlp uima tokenTrainer
> > needs (in some way) a 'detokenized' text.
> >
> > If needed, I can open a 'question' issue and attach the texts I used
> > to produce the problem.
> >
> > /Nicolas
> >
> > ---------- Forwarded message ----------
> > From: Tommaso Teofili
> > Date: Wed, Jun 15, 2011 at 5:30 PM
> > Subject: Re: UIMA TokenizerTrainer component : the model file is not created
> > To: opennlp-users@incubator.apache.org, nicolas.hernandez@univ-nantes.fr
> >
> >
> > Hello Nicolas,
> > I successfully used the OpenNLP UIMA TokenizerTrainer and also the
> > other trainers. As a simple proof I created an aggregate analysis
> > engine descriptor with the UIMA WhitespaceTokenizer and the OpenNLP
> > TokenizerTrainer in a fixed flow, then used a
> > FileSystemCollectionReader to feed the pipeline.
> > In the TokenizerTrainer I set:
> >
> > opennlp.uima.TokenType = org.apache.uima.TokenAnnotation
> >
> > opennlp.uima.language = en-US
> >
> > opennlp.uima.ModelName = target/Tokens.bin
> >
> > which then created the Tokens.bin model that I was able to test from
> > command line and via APIs.
> > Are you using it in a different way?
> > Regards,
> > Tommaso
> >
> > 2011/6/15 Nicolas Hernandez
> >>
> >> Hello
> >>
> >> Has anyone already used the UIMA TokenizerTrainer component ? I
> >> am a bit confused since it does not create any model file.
> >>
> >> In my stdout I got this :
> >> Indexing events using cutoff of 5
> >> Computing event counts...
> >>
> >> done. 69669 events
> >> Indexing... done.
> >> Sorting and merging events... done. Reduced 69669 events to 16467.
> >> Done indexing.
> >> Incorporating indexed data for training...
> >> done.
> >> Number of Event Tokens: 16467
> >> Number of Outcomes: 1
> >> Number of Predicates: 5624
> >> ...done.
> >> Computing model parameters...
> >> Performing 100 iterations.
> >> 1: .. loglikelihood=0.0 1.0
> >> 2: .. loglikelihood=0.0 1.0
> >>
> >> This looks like a problem I got when I trained the model in command
> >> line without using the '' tag.
> >> The two cases differ, though, since in command line I also got the
> >> following exception:
> >> Exception in thread "main" java.lang.IllegalArgumentException: The maxent model is not compatible!
> >>
> >> I solved this problem by adding the tag, as mentioned in the post
> >> "maxent model is not compatible with Tokenizer training" (Fri, 13 May, 09:33)
> >> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser
> >>
> >> Does anyone know if it is the same problem ? In that case, how can I
> >> specify the '' tag in the UIMA version? As far as I understand
> >> its role, it is important to give the user the possibility of
> >> setting it.
> >>
> >> More generally, I am interested in feedback from people who
> >> successfully managed to build models with the UIMA OpenNLP * Trainer
> >> components. For now, I also have some trouble with the SentenceTrainer
> >> and I have not tested the others.
> >>
> >> /Nicolas
> >>
> >>
> >> --
> >> nicolas.hernandez@univ-nantes.fr
> >> #
> >> http://enicolashernandez.blogspot.com
> >> http://www.univ-nantes.fr/hernandez-n
> >> #
> >> Laboratoire LINA-TALN CNRS UMR 6241
> >> tel. +33 (0)2 51 12 58 55
> >> #
> >> Université de Nantes - Institut Universitaire de Technologie -
> >> Département Informatique
> >> tel. +33 (0)2 40 30 60 67
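
As a side note to the 'wst' discussion in this thread, here is a minimal sketch of the difference between whitespace-tokenized text and the format the OpenNLP command-line tokenizer trainer reads, where a token boundary that is not marked by whitespace carries a <SPLIT> tag (likely the tag whose name was lost in the quoted messages). If every token is already separated by whitespace, the trainer may see only one kind of decision, which would be consistent with the "Number of Outcomes: 1" line in the log. The class SplitTagSketch, the method toTrainingLine, and the set of punctuation marks that get re-attached are all hypothetical and chosen only for illustration; nothing here is part of OpenNLP or of the UIMA components.

```java
/** Hypothetical helper for illustration; not part of OpenNLP or UIMA. */
final class SplitTagSketch {

    // Assumption: only these closing punctuation marks are glued to the previous token.
    private static final String ATTACH_LEFT = ".,;:!?)";

    /** Turns one whitespace-tokenized line into a detokenized line with <SPLIT> marks. */
    static String toTrainingLine(String whitespaceTokenized) {
        String trimmed = whitespaceTokenized.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] tokens = trimmed.split("\\s+");
        StringBuilder out = new StringBuilder(tokens[0]);
        for (int i = 1; i < tokens.length; i++) {
            String token = tokens[i];
            boolean attach = token.length() == 1 && ATTACH_LEFT.indexOf(token.charAt(0)) >= 0;
            // A glued boundary is recorded with <SPLIT>; every other boundary stays a space.
            out.append(attach ? "<SPLIT>" : " ").append(token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toTrainingLine("Hello , world !"));
    }
}
```

A real detokenizer needs language-specific rules (opening quotes, apostrophes, French spacing before "?" and ":"), so this heuristic only shows the shape of the data, not a recommended conversion.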