opennlp-users mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: Problems training my own sentence splitter, with dictionary
Date Wed, 28 Sep 2011 09:46:16 GMT
On 9/28/11 11:34 AM, Riccardo Tasso wrote:
> This isn't a bug, but why can I load a POSDictionary from an xml 
> format which is undocumented?

We previously had a plain-text format, which was replaced by this XML 
format because of encoding issues. I think we will refactor and redesign 
the POS Tagger, and then improve the POS Dictionary and the other 
dictionaries we currently have.

A couple of things could be done better. For example, when the 
dictionary allows only one tag for a token, we do not need to call the 
classifier to make a decision. The dictionary should also support token 
sequences, etc.
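
To illustrate the single-tag shortcut mentioned above, here is a rough 
sketch. It is not OpenNLP code: the class SingleTagShortcut, the tag 
method, and the use of a plain Map and Function in place of the real 
dictionary and model are all assumptions made for the example.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not OpenNLP code: skip the classifier when the
// dictionary leaves only one possible tag for a token.
final class SingleTagShortcut {

    // Returns the tag for a token. The classifier is consulted only when
    // the dictionary has no entry or allows more than one tag.
    static String tag(String token,
                      Map<String, String[]> dictionary,
                      Function<String, String> classifier) {
        String[] allowed = dictionary.get(token);
        if (allowed != null && allowed.length == 1) {
            return allowed[0]; // unambiguous, no model call needed
        }
        return classifier.apply(token);
    }

    public static void main(String[] args) {
        Map<String, String[]> dict = Map.of(
            "the", new String[] {"DT"},
            "run", new String[] {"NN", "VB"});
        // Stand-in for a real model; always answers "VB".
        Function<String, String> classifier = t -> "VB";
        System.out.println(tag("the", dict, classifier));
        System.out.println(tag("run", dict, classifier));
    }
}
```

A real tagger would also restrict the classifier to the allowed tags in 
the ambiguous case; the sketch leaves that out to keep the shortcut 
visible.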

You are welcome to submit a patch to document our POS dictionary XML format.

> I would prefer String[] get(String word) and void put(String 
> word, String[] tags) methods. 

For safety and thread-safety reasons, all resources used during 
tagging should be immutable. That said, this doesn't mean we should not 
have an easy way to create these resources.

We have the get method, but it is called getTags.
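
One way to combine an immutable resource with easy creation is a 
separate builder. The sketch below is only an illustration of that idea 
and is not the OpenNLP POSDictionary: the class ImmutableTagDictionary, 
its Builder, and the choice to return null for unknown words are 
assumptions. Only the method name getTags is taken from the existing API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the OpenNLP POSDictionary: a read-only tag
// dictionary that is filled through a separate builder.
final class ImmutableTagDictionary {

    private final Map<String, String[]> entries;

    private ImmutableTagDictionary(Map<String, String[]> entries) {
        this.entries = entries;
    }

    // Returns a copy of the tags for the word, or null if it is unknown
    // (the null return is an assumption of this sketch).
    // Copying keeps callers from modifying the stored array.
    String[] getTags(String word) {
        String[] tags = entries.get(word);
        return tags == null ? null : tags.clone();
    }

    // Mutable helper used only while the dictionary is being created.
    static final class Builder {
        private final Map<String, String[]> entries = new HashMap<>();

        Builder put(String word, String... tags) {
            entries.put(word, tags.clone());
            return this;
        }

        ImmutableTagDictionary build() {
            // Copy the map so later put calls cannot reach the built dictionary.
            return new ImmutableTagDictionary(new HashMap<>(entries));
        }
    }

    public static void main(String[] args) {
        ImmutableTagDictionary dict = new Builder()
            .put("the", "DT")
            .put("run", "NN", "VB")
            .build();
        System.out.println(String.join(",", dict.getTags("run")));
    }
}
```

Once build() returns, nothing can change the dictionary, so it can be 
shared between tagging threads without locking.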

Jörn
