Return-Path:
Delivered-To: apmail-lucene-java-dev-archive@www.apache.org
Received: (qmail 32992 invoked from network); 21 Oct 2009 11:34:24 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 21 Oct 2009 11:34:24 -0000
Received: (qmail 46171 invoked by uid 500); 21 Oct 2009 11:34:22 -0000
Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org
Received: (qmail 46104 invoked by uid 500); 21 Oct 2009 11:34:22 -0000
Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: java-dev@lucene.apache.org
Delivered-To: mailing list java-dev@lucene.apache.org
Received: (qmail 46096 invoked by uid 99); 21 Oct 2009 11:34:22 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 21 Oct 2009 11:34:22 +0000
X-ASF-Spam-Status: No, hits=-10.5 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_HI
X-Spam-Check-By: apache.org
Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 21 Oct 2009 11:34:19 +0000
Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 70926234C1F1 for ; Wed, 21 Oct 2009 04:33:59 -0700 (PDT)
Message-ID: <91623253.1256124839460.JavaMail.jira@brutus>
Date: Wed, 21 Oct 2009 11:33:59 +0000 (UTC)
From: "Robert Muir (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Created: (LUCENE-2001) wordnet parsing bug
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

wordnet parsing bug
-------------------

                 Key: LUCENE-2001
                 URL: https://issues.apache.org/jira/browse/LUCENE-2001
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/*
    Affects Versions: 2.9
            Reporter: Robert Muir
            Priority: Minor

A user reported that wordnet parses the prolog file incorrectly.
Also need to check the wordnet parser in the memory contrib for this problem. If this is a false alarm, I'm not worried, because the test will be the first unit test the wordnet package has ever had.

{noformat}
For example, looking up the synsets for the word "king", we get:

java SynLookup wnindex king
baron magnate mogul power queen rex scrofula struma tycoon

Here, "scrofula" and "struma" are extraneous. This happens because the line parser code in Syns2Index.java interprets the two consecutive single quotes in the entry

s(114144247,3,'king''s evil',n,1,1)

in the wn_s.pl file as the termination of the string, and so extracts the word as "king". This entry belongs to the synset of the words "scrofula" and "struma", and thus they get inserted into the synset of "king".

There are 1382 such entries in wn_s.pl, and more in other WordNet Prolog database files, where two consecutive single quotes appear.

We have resolved this by adding a statement in the line parsing portion of Syns2Index.java, as follows:

// parse line
line = line.substring(2);
line = line.replaceAll("\'\'", "`"); // added statement
int comma = line.indexOf(',');
String num = line.substring(0, comma);
... ... etc.

In short, we replace "''" with "`" (a back-quote). Then on recreating the index, we get:

java SynLookup zwnindex king
baron magnate mogul power queen rex tycoon
{noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
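
For reference, Prolog writes a single quote inside a quoted atom as two consecutive single quotes, so 'king''s evil' denotes the word "king's evil". The following is a minimal sketch of a line parser that honors that escape rather than substituting a back-quote. It is not the actual Syns2Index.java code; the class name PrologLineSketch and the method parseLine are hypothetical and exist only for this illustration, and it assumes each line has the s(id,num,'word',...) shape shown in the report.

```java
final class PrologLineSketch {

    /**
     * Extracts the synset id and the word from a wn_s.pl style line,
     * e.g. s(114144247,3,'king''s evil',n,1,1).
     * Returns a two-element array: {synsetId, word}.
     */
    public static String[] parseLine(String line) {
        int open = line.indexOf('(');
        int firstComma = line.indexOf(',', open);
        String synsetId = line.substring(open + 1, firstComma);

        // The word is the first quoted atom after the synset id.
        int i = line.indexOf('\'', firstComma) + 1;
        StringBuilder word = new StringBuilder();
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == '\'') {
                // Two consecutive quotes are an escaped apostrophe, not the end.
                if (i + 1 < line.length() && line.charAt(i + 1) == '\'') {
                    word.append('\'');
                    i += 2;
                    continue;
                }
                break; // a lone quote closes the atom
            }
            word.append(c);
            i++;
        }
        return new String[] { synsetId, word.toString() };
    }

    public static void main(String[] args) {
        String[] parsed = parseLine("s(114144247,3,'king''s evil',n,1,1).");
        System.out.println(parsed[0] + " -> " + parsed[1]);
    }
}
```

Compared with the back-quote replacement, this variant would index the entry under the word as it is normally spelled ("king's evil"), at the cost of replacing the indexOf-based scan with an explicit loop. Whether that matters depends on how lookups against the index are expected to spell such words.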