From: Kasun Perera
Date: Fri, 20 Apr 2012 20:00:18 +0530
Subject: Re: Weighted cosine similarity calculation using Lucene
To: java-user@lucene.apache.org

Hi Erick

On Fri, Apr 20, 2012 at 5:14 PM, Erick Erickson wrote:

> Maybe I'm missing something here, but why not just boost the
> terms in the fields at query time?

Yes, I can boost the fields at query time. But I'm using TermFreqVector to
get the term frequencies, then calculating the TFIDF values for the documents,
and then calculating the cosine similarity from those TFIDF values. The
Field.setBoost() function has NO effect on the term frequencies. Is there any
other way to do the boosting that will affect the term frequencies?

Thanks

> Best
> Erick
>
> On Fri, Apr 20, 2012 at 4:20 AM, Kasun Perera wrote:
> > I have documents that are marked up with Taxonomy and Ontology terms
> > separately.
> > When I calculate the document similarity, I want to give higher weights to
> > those Taxonomy terms and Ontology terms.
> >
> > When I index the document, I have defined the Document content, Taxonomy
> > and Ontology terms as Fields for each document like this in my program.
> >
> > Field ontologyTerm = new Field("fiboterms", fiboTermList[curDocNo],
> >     Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
> >
> > Field taxonomyTerm = new Field("taxoterms", taxoTermList[curDocNo],
> >     Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
> >
> > Field document = new Field(docNames[curDocNo], strRdElt,
> >     Field.TermVector.YES);
> >
> > I'm using the Lucene index TermFreqVector functions to calculate TFIDF
> > values and then calculate the cosine similarity between two documents
> > using the TFIDF values.
> >
> > To give weights to Ontology and Taxonomy terms when calculating the cosine
> > similarity, what I can do is programmatically multiply the Taxonomy and
> > Ontology term frequencies by a defined weight factor before calculating
> > the TFIDF scores. Will this give higher weight to Taxonomy and Ontology
> > terms in the document similarity calculation?
> >
> > Are there Lucene functions that can be used to give higher weights to
> > certain fields when calculating TFIDF values using TermFreqVector? Can I
> > just use the setBoost() function for this purpose, and if so, how?
> >
> > --
> > Regards
> >
> > Kasun Perera

--
Regards

Kasun Perera
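
The idea raised in this thread, multiplying the term frequencies of the taxonomy and ontology fields by a weight factor before computing TFIDF and the cosine similarity, can be sketched roughly as below. This is only an illustration under stated assumptions, not Lucene API: the class WeightedCosine, its method names, and the plain-map inputs are hypothetical, and it assumes the per-field term frequencies (for example, as read from each field's TermFreqVector) and the IDF values have already been collected into maps. Terms that occur in more than one field are summed into a single vector component, which is one possible choice among several.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: works on plain maps only.
final class WeightedCosine {

    /**
     * Builds one TF-IDF vector for a document from its per-field term
     * frequencies, scaling each field's counts by that field's weight.
     * Fields without an explicit weight use 1.0. Terms without an IDF
     * entry contribute 0 (an assumption made for this sketch).
     */
    static Map<String, Double> weightedTfidf(
            Map<String, Map<String, Integer>> termFreqsByField,
            Map<String, Double> fieldWeights,
            Map<String, Double> idf) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> field : termFreqsByField.entrySet()) {
            double weight = fieldWeights.getOrDefault(field.getKey(), 1.0);
            for (Map.Entry<String, Integer> tf : field.getValue().entrySet()) {
                double termIdf = idf.getOrDefault(tf.getKey(), 0.0);
                // Same term seen in several fields: add the weighted contributions.
                vector.merge(tf.getKey(), weight * tf.getValue() * termIdf, Double::sum);
            }
        }
        return vector;
    }

    /** Cosine similarity of two sparse vectors; returns 0 if either has zero length. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double denom = length(a) * length(b);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    /** Euclidean length of a sparse vector. */
    private static double length(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

With, say, a weight above 1.0 for "taxoterms" and "fiboterms" and the default 1.0 for the content field, the components coming from the weighted fields grow in proportion to the weight, so they count for more in both the dot product and the vector lengths. Whether that produces the desired ranking depends on the data and on the weights chosen.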