opennlp-users mailing list archives

From "Jim - FooBar();" <jimpil1...@gmail.com>
Subject Re: Problem with openNLP Name Finder API....
Date Wed, 08 Feb 2012 20:39:52 GMT
Ok, so anyone who wants to reproduce my problem will need 2 files, plus
an optional third:

 1. A dummy training file (a single annotated pharmacology paper)
 2. The same paper but raw, in other words the file before being annotated.
 3. As an extra I'm including the model I just trained on this paper.

What I'm getting back from the trained model when I feed it the
tokenised sentences of the original raw text is:

("hydroxysaclofen") ("muscimol" "hydroxysaclofen") ("Hydroxysaclofen")

This is really not good! Only 3 sentences are reported as mentioning drugs,
and no multi-word drugs are found even though they exist in the training text.

_File description:_

  * PROPER-TRAIN-DATA.txt ---> as the name implies: one sentence per
    line, spaces around the sgml tags, empty line at the end
  * my-2NER-short.bin ---> the model I just trained using
    PROPER-TRAIN-DATA.txt
  * 48147.nex.txt ---> the original raw paper.
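
As a rough illustration of why the spaces around the tags matter, here is a
minimal Java sketch of the "split by white space, then match the start and
end tags" idea Jörn describes in the quoted reply below. It is not OpenNLP's
parser: the class name NameSampleLineCheck, the method annotatedNames, the
regex and the sample sentence are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class NameSampleLineCheck {
    // Assumed tag shapes: "<START>" or "<START:type>", and "<END>",
    // each standing alone as its own whitespace-separated token.
    private static final Pattern START = Pattern.compile("<START(:[^:>\\s]+)?>");
    private static final String END = "<END>";

    /** Returns the annotated names found in one whitespace-tokenised line. */
    static List<String> annotatedNames(String line) {
        List<String> names = new ArrayList<>();
        StringBuilder current = null; // non-null while inside a START ... END span
        for (String part : line.trim().split("\\s+")) {
            if (START.matcher(part).matches()) {
                if (current != null) {
                    throw new IllegalArgumentException("nested start tag: " + line);
                }
                current = new StringBuilder();
            } else if (END.equals(part)) {
                if (current == null || current.length() == 0) {
                    throw new IllegalArgumentException("unexpected end tag: " + line);
                }
                names.add(current.toString());
                current = null;
            } else if (current != null) {
                // Ordinary token inside a span: append it to the current name.
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(part);
            }
            // Ordinary tokens outside a span are simply skipped.
        }
        if (current != null) {
            throw new IllegalArgumentException("missing end tag: " + line);
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up sentence in the one-sentence-per-line, spaced-tag format.
        System.out.println(annotatedNames(
            "The agonist <START:drug> muscimol <END> was applied ."));
    }
}
```

A tag glued to its neighbour, e.g. `<START:drug>muscimol<END>`, comes out of
the split as one ordinary token, so this sketch reports no names for it.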

It takes less than 5 seconds to train from such a small file and even
less to run the trained NER, so try it out and let me know how you get on...


Regards,
Dimitris


On 08/02/12 19:09, Jörn Kottmann wrote:
> I see the following issues:
> - Multiple sentences in a line
> - Your data is not tokenized
> - Adaptive data is not cleared
>
> You can use our sentence detector to split
> your paragraphs. If you know your document
> boundaries you should write an empty line to
> that file to clear the adaptive data. If you cannot
> do that write an empty line after every sentence.
>
> Do you use our command line tools for training?
>
> Jörn
>
> On 02/08/2012 06:46 PM, Jim - FooBar(); wrote:
>>> Would it be possible for you to show us a sample of your training data?
>>> Maybe one paper.
>>
>> Absolutely, here you go... a sample has been attached. Let me know if
>> you want more, but I can assure you that since the sgml tags are
>> generated automatically (with regex replacement) they are all of the
>> same format...
>>
>> Jim
>>
>> p.s: fire up your favourite editor, press ctrl+f and search for
>> "<START" to locate them easily!
>>
>>
>> On 08/02/12 17:09, Joern Kottmann wrote:
>>> On Wed, Feb 8, 2012 at 5:56 PM, Jim - FooBar();
>>> <jimpil1985@gmail.com> wrote:
>>>
>>>> Ah, OK, I see what you mean... but then again, if it recognised it as a
>>>> mere token it would not throw "IncompatibleFormat" exceptions but rather
>>>> skip it as a token that is not of interest, wouldn't it? I don't have any
>>>> patches to send you, I just think that not including spaces in the sgml
>>>> tag is a wiser approach... unless of course you're extracting the sgml
>>>> tags via regex... The truth is I've not looked at the source, but I would
>>>> expect you to use some sort of xml-ish means to extract the sgml tags. If
>>>> your parser is using regex then I'm sure you have your reasons for
>>>> including the spaces. But anyway, this is a very small problem for me
>>>> because I can indeed sort it manually... My big problem still remains!
>>>>
>>> The code splits the input string by line and then by white space. Then
>>> the individual parts either match our start and end tags or not.
>>>
>>>
>>>
>>>> Anyway I'll stop bugging you... the fact that you tried to help means a
>>>> lot, and certainly if I sort everything out I'll post what the problem
>>>> was for future users...
>>>>
>>>>
>>> We are also interested in why it does not work for you; we usually use
>>> this kind of experience to improve OpenNLP.
>>>
>>> Would it be possible for you to show us a sample of your training data?
>>> Maybe one paper.
>>>
>>> Jörn
>>>
>>
>

