From: Rich Heimann
Date: Thu, 19 May 2011 10:20:40 -0400
Subject: Re: Please help me with a basic question...
To: java-user@lucene.apache.org

Thanks Paul,

I do not know what duplicates are in this case, and it is the denominator of
the TF that bothers me more than the numerator of the TF (if that is in fact
what you are suggesting).

What have been the effects of ignoring the IDF? When is it appropriate? It
would seem that by doing so rare terms have less (no) weight.

Thoughts?

Thanks again,
Rich

On Wed, May 18, 2011 at 3:34 PM, Paul Libbrecht wrote:

> Richard,
>
> in SOLR at least there's an analyzer that avoids duplicates.
> I think that would solve it.
> There's also somewhere the option to ignore IDF (in similarity? in
> solrconfig?).
>
> paul
>
>
> On 18 May 2011 at 21:30, Rich Heimann wrote:
>
> > Hello all,
> >
> > This is my first time on the list and my first question... forgive me if
> > this has been hashed out in the past.
> >
> > We have set up Lucene/Solr and are getting somewhat spurious results. It
> > appears to be a result of heterogeneous document sizes. In other words,
> > the top results are sometimes (at least when the user is using typical
> > search terms) monopolized by a distinct type of document, which is
> > otherwise small (in number of terms). TF/IDF, even with cosine
> > similarity, seems to be sensitive to document size. I have run some
> > tests and that does in fact appear to be the case.
> >
> > (Number of times the term appears in a document)/(Total number of terms
> > in that document) * Log10(Number of total documents/Number of times
> > search term appears in all documents)
> >
> > Are there any suggestions or best practices for dealing with the
> > intrinsic heterogeneity in a corpus?
> >
> > Thank you,
> > Rich
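For reference, the formula quoted in the original question can be sketched in a few lines of Python. This is only a toy illustration, not how Lucene or Solr actually computes scores. The corpus is made up, the function names (`tf`, `idf`, `tf_idf`) and the `use_idf` flag are hypothetical, and the IDF denominator is read here as the number of documents containing the term, which is an assumption about the wording in the question.

```python
import math

def tf(term, doc_tokens):
    """Term count divided by document length, as in the quoted formula."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """log10(N / df), where df is the number of documents containing the term.

    Assumption: the question's wording could also mean total occurrences
    across the corpus; document frequency is used here instead.
    """
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus, use_idf=True):
    """Score one term against one document; use_idf=False drops the IDF factor."""
    weight = idf(term, corpus) if use_idf else 1.0
    return tf(term, doc_tokens) * weight

# Made-up corpus: a short and a long document each mention "lucene" once.
short_doc = "lucene search".split()                    # 2 tokens
long_doc = ("lucene search " + "filler " * 18).split()  # 20 tokens
other_docs = ["unrelated text here".split() for _ in range(8)]
corpus = [short_doc, long_doc] + other_docs

short_score = tf_idf("lucene", short_doc, corpus)
long_score = tf_idf("lucene", long_doc, corpus)
print(short_score, long_score)
```

With a single occurrence in each document, the short document scores ten times higher than the long one purely because of the length denominator in the TF, which matches the behaviour described in the question. Passing `use_idf=False` leaves that ratio unchanged and only removes the extra weight that rarer terms would otherwise receive.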