lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From parag dave <phdaveoffic...@gmail.com>
Subject Parsing Error while indexing in Lucene WordNet package
Date Wed, 21 Oct 2009 06:05:20 GMT
While using the Lucene WordNet package, we found that the Syns2Index program
indexes the Synsets wrongly. For example, looking up the synsets for the
word "king", we get:

java SynLookup wnindex king
baron
magnate
mogul
power
queen
rex
scrofula
struma
tycoon

Here, "scrofula" and "struma" are extraneous. This happens because, the line
parser code in Syns2Index.java interpretes the two consecutive single quotes
in entry s(114144247,3,'king''s evil',n,1,1) in  wn_s.pl file, as
termination
of the string and separates into "king". This entry concerns
synset of words "scrofula" and "struma", and thus they get inserted in the
synset of "king". *There 1382 such entries, in wn_s.pl* and more in other
WordNet
Prolog data-base files, where such use of two consecutive single quotes
appears.

We have resolved this by adding a statement in the line parsing portion of
Syns2Index.java, as follows:

            // parse line
            line = line.substring(2);
           * line = line.replaceAll("\'\'", "`"); // added statement*
            int comma = line.indexOf(',');
            String num = line.substring(0, comma);  ... ... etc.
In short we replace "''" by "`" (a back-quote). Then on recreating the
index, we get:

java SynLookup zwnindex king
baron
magnate
mogul
power
queen
rex
tycoon

*Recently lucene-2.9.0 has been released, but wordnet package included in it
still has the same problem given above.*

-- Parag H. Dave

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message