opennlp-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: Case Sensitive POSDictionary usage
Date Wed, 31 Aug 2011 09:12:59 GMT
Hello,

We now have the following Jira issues to track these improvements:
https://issues.apache.org/jira/browse/OPENNLP-286
https://issues.apache.org/jira/browse/OPENNLP-287
https://issues.apache.org/jira/browse/OPENNLP-288

Contributions are very welcome.

Jörn

On 8/31/11 1:25 AM, Jörn Kottmann wrote:
> On 8/30/11 10:55 PM, mark meiklejohn wrote:
>> Hi,
>>
>> What is the best way to instantiate the POSDictionary with a custom
>> tag dictionary and with the case sensitive flag set to false?
>>
>
> The preferred way is to create it from the XML dictionary file. There
> you can set the case_sensitive attribute to false (in the XML), and it
> will create a case insensitive POS dictionary.
> We don't really have an API to create this dictionary; that looks like
> something we should add.
>
> For example:
> <dictionary case_sensitive="false">
> <entry tags="JJ VB">
> <token>brave</token>
> </entry>
> </dictionary>
>
> Such a dictionary should output JJ and VB for BRAVE as input.
> Looks like we could use a unit test for that.
>
>> I need this for a full parse tree output as I have no control over
>> the input. I've managed to create a parse model, but the only thing
>> holding me back is defining the case insensitive tag dictionary.
>>
>> Now I know a new fix was recently done here and I have a copy of 
>> 1.5.2rc, but I just can't see how to go about it.
>>
>> The majority of the methods within POSDictionary are deprecated and
>> it is recommended to use POSDictionary.create(), but there is no way
>> to set the case sensitivity flag, which is true by default.
>>
>> I can use the deprecated POSDictionary(String file, boolean
>> caseSensitive) constructor, but this leads to an NPE when calling
>> getTags(String word): the lookup does not find a word that was loaded
>> into the dictionary, because the entry is stored in its proper case,
>> i.e. ('Italy', 'italy')
>>
>
> That is a bug. When the dictionary is case insensitive, the constructor
> does not convert the words to lower case tokens, but the lookup assumes
> that it did. We will fix that before we release.
>
> If you create a dictionary this way, then serialize it to xml, and 
> re-create it, it should work.
>
> Hope this helps,
> Jörn
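
To make the lookup mismatch discussed in this thread concrete, here is a minimal sketch in plain Java. It is not OpenNLP code: TagDictionarySketch and its normalizeOnInsert flag are hypothetical names invented for illustration, and the real POSDictionary internals may differ. The sketch only assumes what the thread states, namely that a case insensitive lookup lower-cases the word, so entries must be lower-cased on insert as well.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a tag dictionary; NOT OpenNLP's POSDictionary.
final class TagDictionarySketch {
    private final Map<String, String[]> entries = new HashMap<>();
    private final boolean caseSensitive;
    // Assumption for illustration: false reproduces the described bug,
    // true models the behaviour one would expect after a fix.
    private final boolean normalizeOnInsert;

    TagDictionarySketch(boolean caseSensitive, boolean normalizeOnInsert) {
        this.caseSensitive = caseSensitive;
        this.normalizeOnInsert = normalizeOnInsert;
    }

    // Lower-cases the word unless the dictionary is case sensitive.
    private String normalize(String word) {
        return caseSensitive ? word : word.toLowerCase(Locale.ROOT);
    }

    void put(String word, String... tags) {
        // Without normalization here, "Italy" is stored under the key "Italy".
        entries.put(normalizeOnInsert ? normalize(word) : word, tags);
    }

    // Returns null when the (normalized) word has no entry.
    String[] getTags(String word) {
        return entries.get(normalize(word));
    }

    public static void main(String[] args) {
        // Stored as "Italy", looked up as "italy": the entry is never found.
        TagDictionarySketch buggy = new TagDictionarySketch(false, false);
        buggy.put("Italy", "NNP");
        System.out.println(buggy.getTags("Italy") == null); // prints "true"

        // Normalizing on both sides makes BRAVE resolve to the "brave" entry.
        TagDictionarySketch fixed = new TagDictionarySketch(false, true);
        fixed.put("brave", "JJ", "VB");
        System.out.println(String.join(" ", fixed.getTags("BRAVE"))); // prints "JJ VB"
    }
}
```

A caller that dereferences the null result of the first lookup without checking it would hit the kind of NPE reported above.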

