uima-user mailing list archives

From Peter Klügl <peter.klu...@averbis.com>
Subject Re: Problem with WORDLISTs and WORDTABLEs where an entry starts with a shared substring of another entry
Date Mon, 28 Sep 2015 14:50:54 GMT
Hi,

this problem is most likely caused by the whitespace in the wordlist.
Let me know if the following description does not help; I will then take
a closer look.

There is a comment in the context of UIMA-4453:
"The problem is caused by the combination of filtering settings in the
rule script and the entries in the table. The table lookup is not able
to see whitespaces since these are filtered by default. However, the
table contains entries with spaces. This can cause problems since the
table uses a trie structure for representing the column data. There is
no lookahead when automatically skipping spaces in the entries.
Therefore, the matches for entries fail that have chars that also occur
after whitespaces in other entries."

Could you remove the whitespace from the entries to test whether this
is the source of the problem?
The wordlist would then contain:

BillClinton
Billy
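
To illustrate the explanation quoted above, here is a toy model in
Python. It is only a sketch of the idea, not the actual Ruta
implementation: the names build_trie and matches_greedy are made up for
this example, and the assumption is that the lookup steps over a space
in the trie unconditionally, without checking whether the next character
would have matched on another branch.

```python
END = "$end"  # hypothetical marker key for "an entry ends here"


def build_trie(entries):
    """Build a character trie (nested dicts) from the wordlist entries."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def matches_greedy(trie, visible):
    """Look up `visible`, i.e. text whose whitespace has been filtered out.

    Assumed behaviour (toy model): whenever the current node has a space
    child, the lookup descends into it without lookahead, so sibling
    branches such as the 'y' of "Billy" are never tried.
    """
    node = trie
    for ch in visible:
        while " " in node:
            node = node[" "]  # skip the space, no lookahead
        if ch not in node:
            return False
        node = node[ch]
    return END in node


entries = ["Bill Clinton", "Billy"]

with_spaces = build_trie(entries)
found_clinton = matches_greedy(with_spaces, "BillClinton")
found_billy = matches_greedy(with_spaces, "Billy")

# Same entries with the whitespace removed before building the trie
without_spaces = build_trie([e.replace(" ", "") for e in entries])
found_billy_fixed = matches_greedy(without_spaces, "Billy")
found_clinton_fixed = matches_greedy(without_spaces, "BillClinton")
```

In this model "Billy" is lost only while the trie contains the space of
"Bill Clinton": after "Bill" the lookup has already moved past the
space, where only "C" is a valid continuation. Once the entries contain
no whitespace, both are found.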

If this is the reason, then there are several options to resolve this
problem. There is, for example, the configuration parameter
"dictRemoveWS". Set it to true, and the engine will automatically remove
the whitespace when loading the wordlist.

This probably sounds quite strange, but there are actual reasons for
this behavior.

Best,

Peter

On 28.09.2015 at 13:06, Ronny Hapke wrote:
> I've stumbled upon a problem with UIMA Ruta Workbench 2.3.1 in Eclipse 
> Luna 4.4.2. Whenever working with a WORDLIST or WORDTABLE where one entry 
> starts with a common substring of another one, it will not be recognized 
> and therefore not annotated.
>
> Consider this minimal example:
>
> WORDLIST "Keywords.txt" in resources directory with the following entries:
> Bill Clinton
> Billy
>
> Input file in input directory with the following contents:
> Billy wished he was president, just like Bill Clinton once was.
>
> Main.ruta script in scripts directory:
> WORDLIST list = 'Keywords.txt';
> DECLARE president;
> Document {->MARKFAST(president, list)};
>
> Upon execution, only Bill Clinton will be annotated while Billy will be 
> ignored.
>
> Any help/hints/comments appreciated!
>
> Best regards, 
> Ronny
>

