Date: Mon, 9 May 2011 10:10:56 -0500
Subject: Question on the use of Synonym Filter while searching using MoreLikeThis
From: Saurabh Gokhale
To: java-user@lucene.apache.org

Hi All,

This is my first question for this forum. I am fairly familiar with Lucene and am using 2.9.4 in my project (not using Solr). I have the following question about the use of a synonym filter.

While indexing content, I use the following analyzer setup:

[Analyzer1] == StandardTokenizer --> StandardFilter --> LowerCaseFilter --> StopFilter --> PorterStemFilter

While searching with MoreLikeThis, I use an analyzer similar to the previous one, but with a synonym filter added:

[Analyzer2] == StandardTokenizer --> StandardFilter --> LowerCaseFilter --> StopFilter --> SynonymFilter --> PorterStemFilter

*Scenario 1: Analyzer 1 for indexing and searching*

I index documents A, B and C using Analyzer1 and then use MoreLikeThis on document D to find similar documents in the index, also using Analyzer1 (not Analyzer2). I get the following output:

A matched 40%
B matched 20%
C matched 5%

*Scenario 2: Analyzer 2 for indexing and searching*

The moment I use Analyzer2 (with the synonym filter) both to index and to search for documents similar to document D, all my results get a boost and become:

A matched 60%
B matched 40%
C matched 25%

*Scenario 3: Analyzer 1 for indexing and Analyzer 2 for searching*

But if I use Analyzer1 for indexing and Analyzer2 for searching, then my results go way down:

A matched 15%
B matched 11%
C matched 2%

When I dug into why the matching percentages went down, I found that when searching with the synonym analyzer I get many more interesting terms [moreLikeThis.retrieveInterestingTerms(reader)]. Most of these synonym words then match all the documents, which brings down their tf and idf and results in lower matching percentages for the documents.

*So my questions are:*

1. Is it correct to use an analyzer without a synonym filter for indexing and one with a synonym filter for searching?
2. Is there any other setting that I am missing that causes all the matching percentages to go down?

My search settings while using MoreLikeThis are:

    MoreLikeThis mlt = new MoreLikeThis(index);
    SynonymEngine engine = new WordNetSynonymEngine(new File("PATH"));
    mlt.setMinWordLen(3);
    mlt.setBoost(true);
    mlt.setMinTermFreq(2);
    mlt.setMinDocFreq(0);
    mlt.setMaxQueryTerms(100);
    mlt.setAnalyzer(new PorterSynonymStandardAnalyzer(engine));

Thanks in advance,
Saurabh
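
P.S. To make the three scenarios concrete, here is a minimal, Lucene-free sketch of how synonym expansion on only one side could shift a match percentage. Everything in it is an assumption for illustration: the class and method names (SynonymExpansionSketch, expand, matchedFraction), the tiny synonym map, and above all the "share of query terms found in the document" measure, which is a toy stand-in and not Lucene's tf-idf scoring or MoreLikeThis itself. It may or may not reflect what actually happens in my setup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only: it does not use Lucene and does not reproduce Lucene's scoring. */
final class SynonymExpansionSketch {

    /** Mimics a synonym filter: keeps each term and injects its synonyms, skipping duplicates. */
    static List<String> expand(List<String> terms, Map<String, List<String>> synonyms) {
        Set<String> out = new LinkedHashSet<>();
        for (String term : terms) {
            out.add(term);
            List<String> extra = synonyms.get(term);
            if (extra != null) {
                out.addAll(extra);
            }
        }
        return new ArrayList<>(out);
    }

    /** Hypothetical "percent matched": share of query terms that occur in the indexed document. */
    static double matchedFraction(List<String> queryTerms, Set<String> indexedTerms) {
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String term : queryTerms) {
            if (indexedTerms.contains(term)) {
                hits++;
            }
        }
        return (double) hits / queryTerms.size();
    }

    public static void main(String[] args) {
        // Made-up synonym map standing in for the WordNet-based engine.
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("fast", Arrays.asList("quick", "rapid"));
        synonyms.put("car", Arrays.asList("auto", "automobile"));

        List<String> docTerms = Arrays.asList("fast", "car", "road");
        List<String> queryTerms = Arrays.asList("fast", "car", "engine");

        Set<String> plainIndex = new HashSet<>(docTerms);
        Set<String> synonymIndex = new HashSet<>(expand(docTerms, synonyms));
        List<String> expandedQuery = expand(queryTerms, synonyms);

        // Scenario 1: no synonyms on either side (2 of 3 query terms match).
        System.out.println("Scenario 1: " + matchedFraction(queryTerms, plainIndex));
        // Scenario 2: synonyms on both sides (6 of 7 query terms match).
        System.out.println("Scenario 2: " + matchedFraction(expandedQuery, synonymIndex));
        // Scenario 3: synonyms only at search time (2 of 7 query terms match).
        System.out.println("Scenario 3: " + matchedFraction(expandedQuery, plainIndex));
    }
}
```

In this toy model the ordering is the same as in my results: expanding both sides raises the share, while expanding only the query adds terms that were never indexed and lowers it.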