Message-ID: <544F5CBC.6070600@gmail.com>
Date: Tue, 28 Oct 2014 10:07:08 +0100
From: Ralf Bierig
To: java-user@lucene.apache.org
Subject: Re: Indexing Weighted Tags per Document

The second solution sounds great and a lot more natural than payloads.

I know how to override the Similarity class, but it would only be called
at search time and would then use the term frequency that already exists
in the index. Looking up the probabilities every time a search is
performed would probably also perform poorly.

So I suspect I need to find a way to store the term frequency directly in
the index at the time I am indexing the documents. Is that correct? Do
you have a code snippet that would highlight that part of your elegant
solution?

Thanks in advance,
Ralf

On 28.10.2014 09:31, Ramkumar R. Aiyengar wrote:
> There are a few approaches possible here, we had a similar use case and
> went for the second one below. I primarily deal with Solr, so I don't know
> of Lucene-only examples, but hopefully you can dig this up..
>
> (1) You can attach payloads to each occurrence of the tag, and modify the
> scoring to use the payload..
>
> (2) Use term frequency as a proxy. You could scale the probability by a
> factor and introduce the term as many times as the scaled value
> (essentially making it the term frequency). Scoring will now account for
> this. Advantage is that you also can achieve score normalisation with
> keywords and amongst tags, and you can also filter results by probability.
>
> (3) There potentially is also a solution using child documents and block
> join, but I may be mistaken, haven't explored this a lot..
> On 27 Oct 2014 16:10, "Ralf Bierig" wrote:
>
>> I want to index documents together with a list of tags (usually between
>> 10-30) that represent meta information about this document. Normally, I
>> would create an extra field "tag", store every tag, by its name, inside
>> that field, and create my 10-30 fields that way, adding them to the
>> document before adding the document to the index and writing the index.
>>
>> However, I have the following extra requirements:
>>
>> a) I need to have a weight in the range of [0,1] being associated with the
>> tag that represents the probability of this tag being true.
>>
>> b) These tags must be associated with the document and not with the terms
>> of the document.
>>
>> c) I must be able to associate many tags to a document instance.
>>
>> d) I must be able to use the weight in the weighting process of the search
>> engine.
>>
>> e) The weight must be for the document instance, as the weight represents
>> the probability for that tag for that particular document. E.g.
>>
>> fieldname: tag
>> fieldvalue: tree
>> fieldweight: 0.8
>>
>> meaning that this particular document is with a probability of 0.8 about
>> trees.
>>
>> What is the best way to do that?
>> Can somebody point me to an example or something quite similar that
>> captures such a problem?
>>
>> Best,
>> Ralf
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
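
Approach (2) from the quoted reply, scaling each probability by a factor and repeating the tag that many times, can be sketched in plain Java as below. This is only an illustration under stated assumptions, not code from the thread: the class name WeightedTagExpander, the scale factor of 10, and the rounding rule are all hypothetical choices. The resulting string would presumably be indexed in the "tag" field with an analyzer that splits on whitespace, so that each repetition counts toward the term frequency at indexing time; that step is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns (tag, probability) pairs into a token stream
// in which each tag is repeated in proportion to its probability.
final class WeightedTagExpander {

    // Assumed scale factor: a probability of 1.0 becomes 10 repetitions.
    static final int SCALE = 10;

    // Converts a probability in [0,1] into an integer repetition count.
    static int repetitions(double probability, int scale) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0,1]");
        }
        return (int) Math.round(probability * scale);
    }

    // Builds a whitespace-separated string with each tag repeated
    // repetitions(probability, scale) times. Tags are assumed to contain
    // no whitespace themselves.
    static String expand(Map<String, Double> tags, int scale) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : tags.entrySet()) {
            int n = repetitions(e.getValue(), scale);
            for (int i = 0; i < n; i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(e.getKey());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> tags = new LinkedHashMap<>();
        tags.put("tree", 0.8);   // the example from the original question
        tags.put("forest", 0.3); // made-up second tag for illustration
        System.out.println(expand(tags, SCALE));
    }
}
```

With this sketch, a tag with probability below 0.05 rounds to zero repetitions and is dropped entirely, so the scale factor also sets the resolution of the weights. Scoring usually dampens raw term frequency, so the influence of the weight on the final score is unlikely to be linear.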