uima-dev mailing list archives

From Nicolas Hernandez <nicolas.hernan...@gmail.com>
Subject Re: [jira] [Commented] (UIMA-2106) Handling tokens not present in the language model (and also with no suffix present in the model) causes a null pointer exception in the tagger process
Date Fri, 01 Apr 2011 09:13:50 GMT
Thanks.

I will do that.

On Thu, Mar 31, 2011 at 8:28 PM, Richard Eckart de Castilho (JIRA)
<dev@uima.apache.org> wrote:
>
>    [ https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014124#comment-13014124 ]
>
> Richard Eckart de Castilho commented on UIMA-2106:
> --------------------------------------------------
>
> I believe only users with the role "developer" can assign issues. But you can already
> attach a patch.
>
>> Handling tokens not present in the language model (and also with no suffix present
>> in the model) causes a null pointer exception in the tagger process
>> ------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>                 Key: UIMA-2106
>>                 URL: https://issues.apache.org/jira/browse/UIMA-2106
>>             Project: UIMA
>>          Issue Type: Bug
>>          Components: Sandbox-Tagger
>>    Affects Versions: 2.3
>>         Environment: OS
>> Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5))
>> #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011
>> JVM
>> java version "1.6.0_17"
>> Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
>> Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
>>            Reporter: Nicolas Hernandez
>>            Priority: Minor
>>             Fix For: 2.3
>>
>>   Original Estimate: 5m
>>  Remaining Estimate: 5m
>>
>> The HMMTagger Analysis Engine class uses the org.apache.uima.examples.tagger.Viterbi.java
>> implementation to determine the POS tag sequence of a given sentence.
>> In practice this implementation is partially dependent on the part-of-speech tagging
>> task (as are the remaining HMMTagger classes, actually).
>> For example, it makes strong assumptions about the kind of tokens it can take as input.
>> It assumes there is no restriction on the covered text values of tokens.
>> As a result, it uses the probabilities of some covered text values for initialization
>> or as default values when the tagger processes an unknown token.
>> As a consequence, if the covered text used for setting the default value is not present
>> in the training model, an error occurs. Roughly speaking, the process first looks for the
>> probability associated with the covered text of the current token. If that covered text is
>> not present in the model, it looks in the model for the probability of its longest suffix.
>> Finally, if it does not find a match, the process assigns to the unknown covered text the
>> probability of the arbitrary covered text "(".
>> The problem is that if the probability of this covered text is not available in the
>> model, the probability of the unknown token is not defined and a null pointer exception
>> occurs later when the variable is used.
>> Why would the probability of the "(" text not be available in the model? In a large
>> training corpus, if we consider all the tokens, there is little chance of not finding at
>> least one occurrence of "(".
>> Nevertheless, if we use the HMM training AE to build a model for predicting noun
>> gender and number, verb tense and person, or "being a part of" a named entity, these
>> tokens will not have the "(" covered text, and consequently an error will occur when
>> tagging is performed.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
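
For reference, the lookup order described in the quoted issue (exact covered text, then longest suffix, then the hard-coded "(" entry) could be sketched as below. This is only a minimal illustration under stated assumptions, not the actual Viterbi.java code: the class name EmissionLookupSketch, the method lookup, the FLOOR constant, and the flat Map<String, Double> model are all hypothetical, and the real tagger model presumably stores per-tag probabilities rather than a single number. The one idea it shows is guarding the last fallback so that a missing "(" entry yields a defined value instead of null.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names are taken from Viterbi.java. */
class EmissionLookupSketch {

    /** Assumed floor probability, used when nothing in the model matches. */
    static final double FLOOR = 1e-6;

    /**
     * Looks up a probability for a token: exact match first, then the
     * longest proper suffix present in the model, then a default key,
     * and finally a fixed floor rather than null.
     */
    static double lookup(Map<String, Double> model, String token, String defaultKey) {
        Double p = model.get(token);
        if (p != null) {
            return p;
        }
        // Try proper suffixes from longest to shortest.
        for (int start = 1; start < token.length(); start++) {
            p = model.get(token.substring(start));
            if (p != null) {
                return p;
            }
        }
        // Fall back to the default entry ("(" in the behaviour reported above).
        p = model.get(defaultKey);
        // Guard: the default entry may be absent from the model as well.
        return p != null ? p : FLOOR;
    }

    public static void main(String[] args) {
        Map<String, Double> model = new HashMap<>();
        model.put("walking", 0.02);
        model.put("ing", 0.10);
        System.out.println(lookup(model, "walking", "(")); // exact match branch
        System.out.println(lookup(model, "talking", "(")); // suffix branch ("ing")
        System.out.println(lookup(model, "xyz", "("));     // floor branch, no exception
    }
}
```

Whether a fixed floor is the right default is a separate design question; a smoothed estimate computed at training time might be preferable, but the guard itself is what avoids the exception.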



-- 
nicolas.hernandez@univ-nantes.fr
#
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
#
Laboratoire LINA-TALN CNRS UMR 6241
tel. +33 (0)2 51 12 58 55
#
Université de Nantes - Institut Universitaire de Technologie -
Département Informatique
tel. +33 (0)2 40 30 60 67
