lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Created: (LUCENE-2001) wordnet parsing bug
Date Wed, 21 Oct 2009 11:33:59 GMT
wordnet parsing bug
-------------------

                 Key: LUCENE-2001
                 URL: https://issues.apache.org/jira/browse/LUCENE-2001
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/*
    Affects Versions: 2.9
            Reporter: Robert Muir
            Priority: Minor


A user reported that wordnet parses the prolog file incorrectly.

Also need to check the wordnet parser in the memory contrib for this problem.

If this is a false alarm, i'm not worried, because the test will be the first unit test wordnet
package ever had.

{noformat}
For example, looking up the synsets for the
word "king", we get:

java SynLookup wnindex king
baron
magnate
mogul
power
queen
rex
scrofula
struma
tycoon

Here, "scrofula" and "struma" are extraneous. This happens because, the line
parser code in Syns2Index.java interpretes the two consecutive single quotes
in entry s(114144247,3,'king''s evil',n,1,1) in  wn_s.pl file, as
termination
of the string and separates into "king". This entry concerns
synset of words "scrofula" and "struma", and thus they get inserted in the
synset of "king". *There 1382 such entries, in wn_s.pl* and more in other
WordNet
Prolog data-base files, where such use of two consecutive single quotes
appears.

We have resolved this by adding a statement in the line parsing portion of
Syns2Index.java, as follows:

           // parse line
           line = line.substring(2);
          * line = line.replaceAll("\'\'", "`"); // added statement*
           int comma = line.indexOf(',');
           String num = line.substring(0, comma);  ... ... etc.
In short we replace "''" by "`" (a back-quote). Then on recreating the
index, we get:

java SynLookup zwnindex king
baron
magnate
mogul
power
queen
rex
tycoon
{noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message