lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: Parsing Error while indexing in Lucene WordNet package
Date Wed, 21 Oct 2009 15:52:25 GMT
Hi, thanks again for reporting this.

I created an issue here: http://issues.apache.org/jira/browse/LUCENE-2001

On Wed, Oct 21, 2009 at 2:05 AM, parag dave <phdaveofficial@gmail.com>wrote:

> While using the Lucene WordNet package, we found that the Syns2Index
> program
> indexes the Synsets wrongly. For example, looking up the synsets for the
> word "king", we get:
>
> java SynLookup wnindex king
> baron
> magnate
> mogul
> power
> queen
> rex
> scrofula
> struma
> tycoon
>
> Here, "scrofula" and "struma" are extraneous. This happens because, the
> line
> parser code in Syns2Index.java interpretes the two consecutive single
> quotes
> in entry s(114144247,3,'king''s evil',n,1,1) in  wn_s.pl file, as
> termination
> of the string and separates into "king". This entry concerns
> synset of words "scrofula" and "struma", and thus they get inserted in the
> synset of "king". *There 1382 such entries, in wn_s.pl* and more in other
> WordNet
> Prolog data-base files, where such use of two consecutive single quotes
> appears.
>
> We have resolved this by adding a statement in the line parsing portion of
> Syns2Index.java, as follows:
>
>            // parse line
>            line = line.substring(2);
>           * line = line.replaceAll("\'\'", "`"); // added statement*
>            int comma = line.indexOf(',');
>            String num = line.substring(0, comma);  ... ... etc.
> In short we replace "''" by "`" (a back-quote). Then on recreating the
> index, we get:
>
> java SynLookup zwnindex king
> baron
> magnate
> mogul
> power
> queen
> rex
> tycoon
>
> *Recently lucene-2.9.0 has been released, but wordnet package included in
> it
> still has the same problem given above.*
>
> -- Parag H. Dave
>



-- 
Robert Muir
rcmuir@gmail.com

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message