opennlp-dev mailing list archives

From Jörn Kottmann
Subject Re: Dictionary and Case-Sensitivity
Date Thu, 28 Jul 2011 08:19:26 GMT
James, make sure you also look at the earlier thread we had here about
this issue.

There we decided to add a case flag to our dictionaries.

On 7/28/11 5:23 AM, James Kosin wrote:
> The case sensitivity flag was either a bad idea or somewhere it got 
> lost in the usage.
In the POS Tagger dictionary it should be supported, as it was in 1.3 or
1.4. I simply missed this flag when I added the new training, evaluation
and model package code.

> Questions & Discussion Points:
> -----------------------------------------
> (a)  When building the dictionary, usually if we have case-sensitivity 
> set to false entries really only need to be added once if they are 
> already not there.  'a' and 'A' in a case-insensitive dictionary are 
> really the same and only one will match.  If we impose this assumption 
> then we really need the false setting to mean that we will always 
> compare without regard to case even if we are comparing to an entry 
> that wants case sensitivity and is set to true.
The behavior of our dictionary is currently not really defined if it
contains duplicate entries. I guess our current implementation just adds
every entry and overwrites existing ones, so the last specified entry
wins. We could change this and make the dictionary fail fast. That way a
user can fix any issues.

Maybe that is annoying, because the user might then need to fix a
couple of issues manually.
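
A minimal sketch of the fail-fast idea, assuming a simple token-to-tags
map. FailFastDictionary and its methods are hypothetical names for
illustration, not the existing Dictionary API. With the case flag set to
false, 'a' and 'A' normalize to the same key, so the second put is
rejected instead of silently overwriting the first:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not the OpenNLP Dictionary class. */
final class FailFastDictionary {

    private final boolean caseSensitive;
    private final Map<String, String[]> entries = new HashMap<>();

    FailFastDictionary(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    // Case-insensitive dictionaries store and look up lower-cased keys.
    private String key(String token) {
        return caseSensitive ? token : token.toLowerCase(Locale.ROOT);
    }

    /** Adds an entry, failing fast if an equivalent entry already exists. */
    void put(String token, String... tags) {
        String k = key(token);
        if (entries.containsKey(k)) {
            throw new IllegalArgumentException(
                "Duplicate dictionary entry: " + token);
        }
        entries.put(k, tags.clone());
    }

    /** Returns a copy of the tags for the token, or null if absent. */
    String[] getTags(String token) {
        String[] tags = entries.get(key(token));
        return tags == null ? null : tags.clone();
    }
}
```

The cost is the one mentioned above: a dictionary file with accidental
duplicates no longer loads until someone cleans it up.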

> (b)  When using the dictionary, since the caseSensitivity flag is not 
> final, the dictionary default can be changed for new entries ONLY, the 
> change here doesn't affect already added items to the dictionary.  
> This is both a good and bad thing.  Good in that we could change the 
> default for the comparisons, bad in that if we allow the change the 
> dictionary could be modified to add new entries with the flag not set 
> to the creation setting.  It isn't a problem now; but, if we allow the 
> user to change the flag without forcing it at creation; we could end 
> up with issues.
The dictionary should be immutable, because it can be accessed from more
than one thread, and we encourage our users to do so.
I know Dictionary is not, but it really should be. The POS Dictionary
can only be changed if extended, right?
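
Roughly what I mean by immutable, as a sketch only. The class and method
names here are made up for illustration and are not the current code.
The case flag is fixed at construction, the entries are copied once, and
nothing can be added afterwards, so sharing one instance between threads
needs no locking:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of an immutable tag dictionary. */
final class ImmutableTagDictionary {

    private final boolean caseSensitive;
    private final Map<String, String[]> entries;

    ImmutableTagDictionary(Map<String, String[]> source, boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
        // Defensive copy: later changes to 'source' do not leak in.
        // Keys that collide after normalization overwrite each other here;
        // a real implementation might reject them instead.
        Map<String, String[]> copy = new HashMap<>();
        for (Map.Entry<String, String[]> e : source.entrySet()) {
            copy.put(normalize(e.getKey()), e.getValue().clone());
        }
        this.entries = Collections.unmodifiableMap(copy);
    }

    private String normalize(String token) {
        return caseSensitive ? token : token.toLowerCase(Locale.ROOT);
    }

    boolean isCaseSensitive() {
        return caseSensitive;
    }

    /** Returns a copy of the tags so callers cannot modify internal state. */
    String[] getTags(String token) {
        String[] tags = entries.get(normalize(token));
        return tags == null ? null : tags.clone();
    }
}
```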

> (c)  Coming to usage.  The change I talked about for the 
> isCaseSensitive test for the other entry doesn't really make sense 
> since the dictionary object itself will create a new string list with 
> a caseSensitive flag for the dictionary.  There really isn't any way 
> to change this without creating a new dictionary with the flag set to 
> true/false.

We won't do this anymore once we make the Dictionary immutable, right?

> (d)  The case-sensitivity setting needs to be saved with the 
> dictionary to the file.  This is one place where we really need to be 
> careful.  I've looked somewhat at the problem and unfortunately, there 
> isn't an easy fix.  Saving is okay, it is getting the setting from the 
> file... reason being is that due to the way some of it works, we could 
> append dictionaries causing a mixed case-sensitivity setting.  Really 
> bad news; since, the dictionary has one flag and each entry has 
> another copy of the flag for the StringListWrapper class.  Another way 
> would be adding the settings to the properties for the model and 
> saving the dictionary inside the model as well.

When we load a dictionary (or create one) we need to know the case flag.
When we serialize it, the case flag should be written in the dictionary,
but it is not important for the way the entries are written.
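
To illustrate, a sketch with an invented line-based format. This is not
our actual dictionary file format, and the "case_sensitive=" header and
all names below are assumptions for the example. The flag is written
once at the top, and the entries are written exactly as they are,
whatever the flag says:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical format sketch: one header line, then one token per line. */
final class CaseFlagFormatSketch {

    static final String HEADER = "case_sensitive=";

    /** Writes the flag once; entries are written unchanged. */
    static String serialize(boolean caseSensitive, List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        sb.append(HEADER).append(caseSensitive).append('\n');
        for (String token : tokens) {
            sb.append(token).append('\n');
        }
        return sb.toString();
    }

    /** Reads the flag from the first line, failing if it is missing. */
    static boolean readCaseFlag(String data) {
        String first = data.split("\n", 2)[0];
        if (!first.startsWith(HEADER)) {
            throw new IllegalArgumentException("Missing case flag header");
        }
        return Boolean.parseBoolean(first.substring(HEADER.length()));
    }

    /** Returns every line after the header as a token. */
    static List<String> readTokens(String data) {
        String[] lines = data.split("\n");
        return new ArrayList<>(Arrays.asList(lines).subList(1, lines.length));
    }
}
```

Because the flag lives in one place per dictionary, loading can pass it
straight to the constructor, and there is no per-entry copy to get out
of sync.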

Let's refactor the Dictionary class a bit to get rid of this
StringListWrapper badness.

