From: Rich Heimann
Date: Fri, 20 May 2011 10:32:58 -0400
Subject: Re: Please help me with a basic question...
To: java-user@lucene.apache.org

Bingo. That appears to be the essence of the problem, which makes sense
given TF/IDF.

I stumbled upon the 'Explain' function yesterday, though it returns a
crowded message when using debug in SOLR admin. Is there another method or
interface that returns more, or cleaner, info?

I feel uncomfortable with omitting the normalization factor at this point.
I like the idea that it accounts for the size of the document. In other
words, I would expect to find the search term where there are many terms.
Normalization (as you know) controls for document size. It does, however,
appear to lead to inflated values.

SweetSpotSimilarity looks promising. Does it avoid favoring shorter docs by
not normalizing, or does it make some attempt to standardize?

Thank you Doron...

Regards,
Rich

On Thu, May 19, 2011 at 4:38 PM, Doron Cohen wrote:

> Hi Rich,
> If I understand correctly you are concerned that short documents are
> preferred too much over long ones, is this really the case?
> To understand what is going on, it would help to look at the Explanation
> of the score for, say, two result documents: one that you think is ranked
> too low, and one that is ranked too high...
> If you are convinced that length normalization is the culprit, you could
> try:
> - omitting norms altogether at indexing
> - using e.g. SweetSpotSimilarity, which does not favor shorter documents.
> Regards,
> Doron
>
> On Thu, May 19, 2011 at 5:20 PM, Rich Heimann wrote:
>
> > Thanks Paul,
> >
> > I do not know what duplicates are in this case, and it is the
> > denominator of the TF that bothers me more than the numerator of the TF
> > (if that is in fact what you are suggesting).
> >
> > What have been the effects of ignoring the IDF? When is it appropriate?
> > It would seem that by doing so rare terms have less (no) weight.
> > Thoughts?
> >
> > Thanks again,
> > Rich
> >
> >
> > On Wed, May 18, 2011 at 3:34 PM, Paul Libbrecht wrote:
> >
> > > Richard,
> > >
> > > in SOLR at least there's an analyzer that avoids duplicates.
> > > I think that would solve it.
> > > There's also somewhere the option to ignore IDF (in similarity? in
> > > solrconfig?).
> > >
> > > paul
> > >
> > >
> > > On 18 May 2011 at 21:30, Rich Heimann wrote:
> > >
> > > > Hello all,
> > > >
> > > > This is my first time on the list and my first question...forgive
> > > > me if this has been hashed out in the past.
> > > >
> > > > We have set up Lucene/Solr and are getting somewhat spurious
> > > > results. It appears to be a result of heterogeneous document sizes.
> > > > In other words, the top results are sometimes (at least when the
> > > > user is using typical search terms) monopolized by a distinct type
> > > > of document, which is otherwise small (in number of terms). It
> > > > appears that TF/IDF, even with cosine similarity, is sensitive to
> > > > document size.
> > > > I have run some tests, and that does in fact appear to be the case.
> > > >
> > > > (Number of times the term appears in a document)/(Total number of
> > > > terms in that document) * Log10(Number of total documents/Number of
> > > > times search term appears in all documents)
> > > >
> > > > Are there any suggestions or best practices for dealing with the
> > > > intrinsic heterogeneity in a corpus?
> > > >
> > > > Thank you,
> > > > Rich
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
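
The length sensitivity described in the original question can be reproduced
with a small sketch of the formula quoted there. This is only the simplified
TF/IDF from that message, not Lucene's actual scoring. The function name
`tf_idf` and the toy documents are made up for illustration, and the IDF
denominator is assumed to mean the number of documents containing the term
(the usual reading), rather than the total number of occurrences.

```python
import math


def tf_idf(term, doc, corpus):
    """Simplified TF/IDF from the thread: (count / doc length) * log10(N / df).

    `doc` is a list of tokens and `corpus` is a list of such documents.
    Assumption: df is the number of documents that contain the term.
    """
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return tf * math.log10(len(corpus) / df)


# Hypothetical corpus: one short and one long document mention "lucene".
short_doc = ["lucene", "search"]                  # 1 hit in 2 terms
long_doc = ["lucene"] * 3 + ["filler"] * 17       # 3 hits in 20 terms
corpus = [short_doc, long_doc, ["other"], ["unrelated", "text"]]

short_score = tf_idf("lucene", short_doc, corpus)
long_score = tf_idf("lucene", long_doc, corpus)

# The short document scores higher despite having fewer occurrences,
# because its term frequency is divided by a much smaller length.
print(short_score > long_score)
```

Under this formula the denominator of the TF dominates: the long document
contains the term three times but is penalized for its 17 other tokens, which
is the behavior the thread attributes to length normalization.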