From: Kasun Perera
Date: Fri, 20 Apr 2012 13:50:19 +0530
Subject: Weighted cosine similarity calculation using Lucene
To: java-user@lucene.apache.org

I have documents that are marked up with Taxonomy and Ontology terms separately. When I calculate document similarity, I want to give higher weights to the Taxonomy and Ontology terms.

When I index a document, I define the document content, the Taxonomy terms and the Ontology terms as Fields, like this in my program:

    Field ontologyTerm = new Field("fiboterms", fiboTermList[curDocNo],
            Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
    Field taxonomyTerm = new Field("taxoterms", taxoTermList[curDocNo],
            Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
    Field document = new Field(docNames[curDocNo], strRdElt,
            Field.TermVector.YES);

I'm using the Lucene index TermFreqVector functions to calculate TF-IDF values, and then I calculate the cosine similarity between two documents from those TF-IDF values.

To give weights to the Ontology and Taxonomy terms when calculating the cosine similarity, what I can do is programmatically multiply the Taxonomy and Ontology term frequencies by a defined weight factor before calculating the TF-IDF scores. Will this give higher weight to the Taxonomy and Ontology terms in the document similarity calculation?

Are there Lucene functions that can be used to give higher weights to certain fields when calculating TF-IDF values using TermFreqVector? Can I just use the setBoost() function for this purpose, and if so, how?

-- 
Regards
Kasun Perera
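
The manual weighting described in the message (multiplying the term frequencies of selected fields by a weight factor before computing TF-IDF and cosine similarity) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Lucene code: it assumes the per-field term frequencies and IDF values have already been read out of the index into plain maps, and the class name `WeightedCosineSketch`, the methods `addField` and `cosine`, the `"content"` field label, and the fallback IDF of 1.0 are all hypothetical choices made for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class WeightedCosineSketch {

    // Adds weight * tf * idf for each term of one field to the document vector.
    // Keys are prefixed with the field name so the same term in two fields
    // occupies two separate dimensions.
    public static void addField(Map<String, Double> vector, String field,
                                Map<String, Integer> termFreqs,
                                Map<String, Double> idf, double weight) {
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // Assumption of this sketch: a term without an IDF entry gets 1.0.
            double idfValue = idf.getOrDefault(e.getKey(), 1.0);
            double value = weight * e.getValue() * idfValue;
            vector.merge(field + ":" + e.getKey(), value, Double::sum);
        }
    }

    // Cosine similarity of two sparse vectors; returns 0 if either is empty.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> idf = new HashMap<>();
        idf.put("bank", 1.0);
        idf.put("river", 1.0);
        idf.put("loan", 1.0);

        // Two toy documents: different content terms, one shared taxonomy term.
        Map<String, Integer> content1 = new HashMap<>();
        content1.put("bank", 1);
        Map<String, Integer> content2 = new HashMap<>();
        content2.put("river", 1);
        Map<String, Integer> taxo = new HashMap<>();
        taxo.put("loan", 1);

        for (double taxoWeight : new double[] {1.0, 3.0}) {
            Map<String, Double> v1 = new HashMap<>();
            Map<String, Double> v2 = new HashMap<>();
            addField(v1, "content", content1, idf, 1.0);
            addField(v1, "taxoterms", taxo, idf, taxoWeight);
            addField(v2, "content", content2, idf, 1.0);
            addField(v2, "taxoterms", taxo, idf, taxoWeight);
            System.out.println("taxonomy weight " + taxoWeight
                    + " -> cosine " + cosine(v1, v2));
        }
    }
}
```

In this toy setup, raising the taxonomy weight raises the similarity, because the two documents share only the taxonomy term. Only the ratio between field weights matters: scaling every field by the same factor leaves the cosine unchanged.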