From: "Cedric Ho"
To: java-user@lucene.apache.org
Date: Sat, 16 Feb 2008 10:54:19 +0800
Subject: Re: How to pass additional information into Similarity.scorePayload(...)

Thanks ~

Yes, it seems this would be quite difficult to achieve with Lucene.
Never mind, I'll try to figure out a workaround for it.

Thanks for helping =)

Cedric

On Feb 16, 2008 5:30 AM, Paul Elschot wrote:
> Hi Cedric,
>
> I think I'm beginning to get the point of the [10/5/2],
> and why you called that requirement a bit strange, see below.
>
> To use both normal position info and paragraph position info
> you'll need two separate fields, one normal, and one for the paragraphs.
>
> To use the normal field to determine the matches, and the
> paragraph field to determine the weightings of these matches,
> the TermPositions of both fields will have to be advanced
> completely in sync. That is possible, but not really nice to do.
> If Lucene had multiple positions for an indexed term, it
> would be straightforward.
> But as long as that is not the case, you'll either have to advance
> the two TermPositions in sync, or use payloads with the
> paragraph numbers.
>
> Or you could relax the paragraph numbering requirement
> into a positional requirement, and use the modified SpanFirstQuery.
> That could be done by using an average paragraph length to
> determine the weight at the matching position.
> As this is easy to implement, I'd first implement this and try to sell
> it to the users :)
>
> At that marketing moment you might as well ask the users
> what they think of matches that cross paragraph borders.
> Do you already have a firm requirement for that case?
>
> SpanNotQuery can be used to prevent matches over paragraph
> borders when these are indexed as such, but I would not expect
> that you would need those, given the fuzziness of the [10/5/2].
>
> Regards,
> Paul Elschot
>
>
> On Friday 15 February 2008 09:45:58, Cedric Ho wrote:
>
> > Hi Paul,
> >
> > Do you mean the following?
> >
> > e.g. to index this: "first second third forth fifth six"
> >
> > originally it would be indexed as:
> > (first,0) (second,1) (third,2) (forth,3) (fifth,4) (six,5)
> >
> > now it will be:
> > (first,0) (second,0) (third,0) (forth,1) (fifth,1) (six,1)
> >
> > So the Query classes that depend on positional information
> > (PhraseQuery, SpanQueries) won't work then? Unfortunately, I'll need
> > those Query classes as well.
> >
> > Cedric
> >
> >
> > > For each word in the input stream, make sure that the position
> > > at which it is indexed in an extra field is the same as the paragraph
> > > number. That will involve only allowing a position increment at
> > > a paragraph border during indexing.
> > > Call this extra field the paragraph field if you will.
> > >
> > > Then, during search, search for a Term in the paragraph field, and
> > > use the position from that field, i.e. the paragraph number,
> > > to find a weight for the found term.
> > > Have a look at PhraseQuery on how to use term positions during
> > > search. It computes relative positions, but it works on the absolute
> > > positions that it gets from the index.
> > >
> > > SpanFirstQuery also allows that; it's a bit more involved, but
> > > in the end it works from the same absolute positions from the index.
> > > The version at the Jira issue will even allow using the length of the
> > > matching spans as the absolute paragraph number, which, in turn,
> > > allows the use of a Similarity for the paragraph weights [10/5/2].
> > >
> > > There is nothing special about indexed term positions; any term can
> > > be indexed at any position in a field. Lucene will take advantage of
> > > the incremental nature of positions by storing only compressed
> > > differences of positions in the index, but during search the original
> > > positions are directly available. You can do the same with payloads,
> > > but why reimplement something that is already available?
> > >
> > > Payloads have better uses than positional info; for one, they are
> > > great for avoiding disjunctions. For example, for verbs, one could
> > > index only the stem and use a payload for the actual inflected
> > > form (singular/plural, past/present, first/second/third person, etc).
> > >
> > > Regards,
> > > Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
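
For illustration, the two-field idea from this thread can be sketched in plain Java without any Lucene calls. This is only a sketch under assumptions, not Lucene API: the class ParagraphPositions, the Posting type, and the methods index and weightFor are hypothetical names, and reading [10/5/2] as "weight 10 for the first paragraph, 5 for the second, 2 for all later ones" is an assumption about the requirement. It shows how each term would get an ordinary position for the normal field and a paragraph number as its position in the extra paragraph field, using the example from the thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the paragraph-field idea; this is not Lucene API. */
class ParagraphPositions {

    /** One term with its position in the normal field and in the paragraph field. */
    public static final class Posting {
        public final String term;
        public final int position;   // ordinary token position (normal field)
        public final int paragraph;  // position used in the extra paragraph field

        public Posting(String term, int position, int paragraph) {
            this.term = term;
            this.position = position;
            this.paragraph = paragraph;
        }
    }

    /**
     * Each inner array is one paragraph. The normal position increases with
     * every token; the paragraph position increases only at a paragraph border.
     */
    public static List<Posting> index(String[][] paragraphs) {
        List<Posting> postings = new ArrayList<>();
        int position = 0;
        for (int p = 0; p < paragraphs.length; p++) {
            for (String term : paragraphs[p]) {
                postings.add(new Posting(term, position++, p));
            }
        }
        return postings;
    }

    // Assumed reading of [10/5/2]: first paragraph, second paragraph, all later ones.
    private static final float[] WEIGHTS = {10f, 5f, 2f};

    /** Weight for a zero-based paragraph number; later paragraphs share the last weight. */
    public static float weightFor(int paragraph) {
        return WEIGHTS[Math.min(paragraph, WEIGHTS.length - 1)];
    }

    public static void main(String[] args) {
        String[][] doc = {{"first", "second", "third"}, {"forth", "fifth", "six"}};
        for (Posting p : index(doc)) {
            System.out.println("(" + p.term + "," + p.position + ") paragraph="
                    + p.paragraph + " weight=" + weightFor(p.paragraph));
        }
    }
}
```

In a real index the two position streams would live in two separate fields, and the search side would still have to read the paragraph field's positions for the matching terms, which is the "advance both TermPositions in sync" difficulty Paul describes.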