From: "Shailendra Sharma"
To: java-user@lucene.apache.org
Date: Sat, 4 Aug 2007 00:05:58 +0530
Subject: Re: Can I do boosting based on term postions?

Paul,

If I understand Cedric correctly, he wants different boosts depending on where the search terms occur in the document. With SpanFirstQuery he can only consider terms up to a particular position; he cannot do something like the following:

a) Give 100% boost to matches in the first 100 words.
b) Give 80% boost to matches in the next 100 words.
c) Give 60% boost to matches in the next 100 words.

It can be done by writing a DisjunctionMaxQuery that combines multiple SpanFirstQuery clauses with different boosts, but I see that as a workaround rather than a direct and efficient solution.

Cedric,

I am sending the SpanTermQuery implementation to your gmail account (the Lucene mailing list bounces email with attachments). I have named the class VSpanTermQuery and followed the same package hierarchy as Lucene.
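To make the tiers in (a)-(c) concrete, here is a minimal sketch of a position-to-boost mapping of the kind a custom span scorer could apply to each match position. It is plain Java with no Lucene dependency. PositionTierBoost, boostFor, TIER_WIDTH, and the 0.4 floor for positions past the third tier are all made up for illustration; they are not taken from VSpanTermQuery or from any Lucene API.

```java
// Hypothetical helper, not part of Lucene: maps a term position to a tiered boost.
final class PositionTierBoost {
    // Width of each tier in term positions (the "100 words" in the example above).
    static final int TIER_WIDTH = 100;
    // Boost factors for tiers (a), (b) and (c).
    static final float[] TIER_BOOSTS = {1.0f, 0.8f, 0.6f};
    // Assumption: later matches still count, with a lower boost, instead of being dropped.
    static final float FLOOR_BOOST = 0.4f;

    private PositionTierBoost() {}

    // Returns the boost for a zero-based term position.
    static float boostFor(int position) {
        if (position < 0) {
            throw new IllegalArgumentException("position must be >= 0: " + position);
        }
        int tier = position / TIER_WIDTH;
        return tier < TIER_BOOSTS.length ? TIER_BOOSTS[tier] : FLOOR_BOOST;
    }
}
```

With this mapping, a match at position 150 falls into the second tier and gets a factor of 0.8. Matches past position 299 are kept at the lower floor rather than discarded, which is what a single SpanFirstQuery cannot express.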
You will also need to extend the VSimilarity class, which requires implementing the method scoreSpan(..). Let me know how it goes. I have tested it, but it needs more extensive testing before I submit it to contrib.

Thanks,
Shailendra

On 8/3/07, Paul Elschot wrote:
>
> Cedric,
>
> You can choose the end limit for SpanFirstQuery yourself.
>
> Regards,
> Paul Elschot
>
>
> On Friday 03 August 2007 05:38, Cedric Ho wrote:
> > Hi Paul,
> >
> > Isn't SpanFirstQuery only match those with position less than a
> > certain end position?
> >
> > I am rather looking for a query that would score a document higher for
> > terms appear near the start but not totally discard those with terms
> > appear near the end.
> >
> > Regards,
> > Cedric
> >
> > On 8/2/07, Paul Elschot wrote:
> > > Cedric,
> > >
> > > SpanFirstQuery could be a solution without payloads.
> > > You may want to give it your own Similarity.sloppyFreq() .
> > >
> > > Regards,
> > > Paul Elschot
> > >
> > > On Thursday 02 August 2007 04:07, Cedric Ho wrote:
> > > > Thanks for the quick response =)
> > > >
> > > > On 8/1/07, Shailendra Sharma wrote:
> > > > > Yes, it is easily doable through "Payload" facility. During indexing process
> > > > > (mainly tokenization), you need to push this extra information in each
> > > > > token. And then you can use BoostingTermQuery for using Payload value to
> > > > > include Payload in the score. You also need to implement Similarity for this
> > > > > (mainly scorePayload method).
> > > >
> > > > If I store, say a custom boost factor as Payload, does it means that I
> > > > will store one more byte per term per document in the index file? So
> > > > the index file would be much larger?
> > > >
> > > > >
> > > > > Other way can be to extend SpanTermQuery, this already calculates the
> > > > > position of match. You just need to do something to use this position value
> > > > > in the score calculation.
> > > >
> > > > I see that SpanTermQuery takes a TermPositions from the indexReader
> > > > and I can get the term position from there. However I am not sure how
> > > > to incorporate it into the score calculation. Would you mind give a
> > > > little more detail on this?
> > > >
> > > > >
> > > > > One possible advantage of SpanTermQuery approach is that you can play
> > > > > around, without re-creating indices everytime.
> > > > >
> > > > > Thanks,
> > > > > Shailendra Sharma,
> > > > > CTO, Ver se' Innovation Pvt. Ltd.
> > > > > Bangalore, India
> > > > >
> > > > > On 8/1/07, Cedric Ho wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I was wondering if it is possible to do boosting by search terms'
> > > > > > position in the document.
> > > > > >
> > > > > > for example:
> > > > > > search terms appear in the first 100 words, or first 10% words, or in
> > > > > > first two paragraphs would be given higher score.
> > > > > >
> > > > > > Is it achievable through using the new Payload function in lucene 2.2?
> > > > > > Or are there any easier ways to achieve these ?
> > > > > >
> > > > > > Regards,
> > > > > > Cedric
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > >
> > > >
> > > > Thanks,
> > > > Cedric
> >
> > --
> > 愛@上.Keyboard