From: Chris Hostetter
To: Lucene Users List
Date: Wed, 24 Nov 2004 00:53:52 -0800 (PST)
Subject: RE: fetching similar wordlist as given word

: > can I get the similar wordlist as output. so that I can show the end
: > user in the column --------------- do you mean "foam"?
: > How can I get similar word list in the given content?

This is a non-trivial problem, because the definition of "similar" is subject to interpretation. I would look into various dictionary implementations, and see if you can find a good Java-based dictionary that can suggest alternatives based on an input string. Once you have that, you should be able to use IndexSearcher.docFreq to find out how many docs contain each alternate word, and compare that with the number of docs that contain the initial word ... if one of the alternates has a significantly higher number of matches, then you suggest it.

NOTE: The DICT protocol defines a client/server approach to providing spell correction and definitions. Maybe you can leverage some of the spell correction code mentioned in the "Server Software Written in Java" section of this doc...

    http://www.dict.org/links.html

In particular, you might want to take a look at JavaDict's Database.match function using the LevenshteinStrategy...

    http://ktulu.com.ar/javadict/docs/ar/com/ktulu/dict/Database.html#match(java.lang.String,%20ar.com.ktulu.dict.strategies.Strategy)
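
For what it's worth, here is a rough sketch of the docFreq comparison described above. It is only an illustration under some assumptions: the class name DidYouMean, the suggest method, and the maxDistance and minRatio parameters are all made up for this example, a plain word list plus a hand-rolled Levenshtein distance stands in for a real dictionary such as JavaDict, and the document-frequency lookup is passed in as a function so the sketch runs without an index. With Lucene you would presumably back that function with IndexSearcher.docFreq on a Term for whatever field you index.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.function.ToIntFunction;

final class DidYouMean {

    /** Levenshtein edit distance between a and b (two-row dynamic programming). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns dictionary words within maxDistance edits of word whose document
     * frequency is more than minRatio times that of word, most frequent first.
     *
     * docFreq is a stand-in (assumption): with Lucene it might wrap something
     * like searcher.docFreq(new Term("contents", w)), where "contents" is a
     * hypothetical field name and the IOException needs handling.
     */
    static List<String> suggest(String word,
                                Collection<String> dictionary,
                                ToIntFunction<String> docFreq,
                                int maxDistance,
                                double minRatio) {
        int base = docFreq.applyAsInt(word);
        List<String> result = new ArrayList<>();
        for (String candidate : dictionary) {
            if (candidate.equals(word)) {
                continue; // never suggest the word itself
            }
            if (levenshtein(word, candidate) > maxDistance) {
                continue; // not "similar" enough by edit distance
            }
            if (docFreq.applyAsInt(candidate) > base * minRatio) {
                result.add(candidate); // "significantly higher" number of matches
            }
        }
        result.sort((x, y) -> Integer.compare(docFreq.applyAsInt(y), docFreq.applyAsInt(x)));
        return result;
    }
}
```

Calling suggest("fom", words, freq, 1, 10.0) would return the words one edit away from "fom" that match more than ten times as many docs, which is what you would show after "do you mean". What counts as "significantly higher" is a judgment call, so treat the ratio as a knob to tune against your own index.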