From: Michael McCandless
Date: Tue, 27 Nov 2012 19:11:19 -0500
Subject: Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
To: java-user@lucene.apache.org

Flexible indexing is the ability to make your own codec, which controls the reading and writing of all index parts (postings, stored fields, term vectors, deleted docs, etc.). So for example if you want to store some postings as a bit set instead of the block format that's the default coming up in 4.1, that's easy to do.

But what is less easy (as I described below) is changing what is actually stored in the postings, eg adding a new per-position attribute.

The original goal was to allow arbitrary attributes beyond the known docs/freqs/positions/offsets that Lucene supports today, so that you could easily make new application-dependent per-term, per-doc, per-position things, pull them from the analyzer, save them to the index, and access them from an IndexReader / query, but while some APIs do expose this, it's not very well explored yet (eg, you'd have to make a custom indexing chain to get the attributes "through" IndexWriter down to your codec).
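
The bit-set case mentioned above can be pictured with a small sketch in plain Java. It illustrates the data structure only and is not Lucene's codec API: the class BitSetPostings and its methods add, advance, and docFreq are hypothetical names, and a real postings format would also have to deal with freqs, positions, and the on-disk encoding.

```java
import java.util.BitSet;

/** Hypothetical illustration only; not a Lucene class or codec API. */
final class BitSetPostings {
    // One bit per document: bit N is set when the term occurs in doc N.
    private final BitSet docs;

    BitSetPostings(int maxDoc) {
        this.docs = new BitSet(maxDoc);
    }

    /** Record that the term occurs in the given document. */
    void add(int docId) {
        docs.set(docId);
    }

    /** Return the first matching doc ID >= target, or -1 if there is none. */
    int advance(int target) {
        return docs.nextSetBit(target);
    }

    /** Number of documents containing the term. */
    int docFreq() {
        return docs.cardinality();
    }

    public static void main(String[] args) {
        BitSetPostings postings = new BitSetPostings(16);
        for (int docId : new int[] {1, 4, 5, 9}) {
            postings.add(docId);
        }
        // Walk the postings in increasing doc ID order.
        for (int doc = postings.advance(0); doc != -1; doc = postings.advance(doc + 1)) {
            System.out.println("doc " + doc);
        }
        System.out.println("docFreq=" + postings.docFreq());
    }
}
```

A bit set costs one bit per document in the index no matter how many documents match, so it tends to pay off only for terms that occur in a large fraction of documents.
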
It would be great to make progress making this easier, so ideas are very welcome :)

Mike McCandless

http://blog.mikemccandless.com

On Tue, Nov 27, 2012 at 3:37 PM, Wu, Stephen T., Ph.D. wrote:
> Following up on a previous question...
> What is "flexible indexing" in Lucene 4.0? We assumed it was the ability to
> easily make new postings formats/codecs -- but a response below says that
> would be "tricky"?
>
> stephen
>
>
> On 11/27/12 11:48 AM, "David Causse" wrote:
>
>> Hi,
>>
>> We use payloads but we can't use the whole Lucene API.
>> For example, we use them for relation queries such as:
>>
>> @quote(@speaker(obama) @discourse(health))
>>
>> Search for all documents that contain a quote by Obama talking about
>> health.
>> We encode linguistic information (standoff annotations) inside payloads
>> and use a custom search API to query the index.
>> I didn't find a convenient way to attach my code to the Lucene
>> Query/Scorer/Weight API. As with SpanQuery, you have to rewrite the whole
>> Query stack.
>> In short, if you want to go with Payloads that do more than boosting a
>> term, chances are that you'll need to rewrite a big part of the query
>> stack.
>>
>>
>> On 27/11/2012 16:59, Wu, Stephen T., Ph.D. wrote:
>>> I think we're looking at doing something related. I haven't explored the
>>> Enums or know how to make a postings codec... But what is "flexible
>>> indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
>>>
>>> We're trying to incorporate attributes onto terms/spans in indexes. We'd
>>> also like to try out some interesting ways to score things that go beyond
>>> just tokens.
>>>
>>> We were considering using Attributes instead of Payloads, because it seems
>>> like using Payloads ties you to a particular kind of scoring -- just a
>>> weight on a token. Can Payloads be used for more general scoring functions?
>>> E.g., considering a span of text alongside multiple Payloads?
>>>
>>> Does it make sense to move outside of Payloads here?
>>>
>>> Thanks!
>>>
>>> stephen
>>>
>>>
>>> On 11/19/12 8:14 AM, "Michael McCandless" wrote:
>>>
>>>> A new postings format would be tricky because you have new attributes
>>>> you want to index.
>>>>
>>>> The DocsAndPositionsEnum does have an attributes source, but this is
>>>> not well explored, and there are known problems (they can't be easily
>>>> merged in the composite reader case).
>>>>
>>>> So that's why I suggested packing your information into a payload ...
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy wrote:
>>>>> thx, mike.
>>>>> about the 3rd question, "encode them all into the payload" is better than
>>>>> "a new postings format with the codec" ??
>>>>> I mean replace the original posting item (position, startOffset, endOffset,
>>>>> payload) with my own inverted item such as
>>>>> class TestPostingItem
>>>>> {
>>>>>     int termId;
>>>>>     long startOffset;
>>>>>     long endOffset;
>>>>>     float score;
>>>>>     int segId;
>>>>>     long timeStamp;
>>>>> }
>>>>> ?
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html
>>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
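
The "pack your information into a payload" suggestion can be sketched as follows, assuming the fields of the TestPostingItem class quoted above and a fixed-width, big-endian layout. PostingItemPayload, encode, and decode are hypothetical names, not Lucene APIs; attaching the resulting bytes to a token at index time and reading them back at search time is not shown.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper; the fields follow the TestPostingItem class quoted in the thread. */
final class PostingItemPayload {
    // Two ints, three longs, one float: 8 + 24 + 4 = 36 bytes per posting.
    static final int SIZE = 2 * Integer.BYTES + 3 * Long.BYTES + Float.BYTES;

    final int termId;
    final long startOffset;
    final long endOffset;
    final float score;
    final int segId;
    final long timeStamp;

    PostingItemPayload(int termId, long startOffset, long endOffset,
                       float score, int segId, long timeStamp) {
        this.termId = termId;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.score = score;
        this.segId = segId;
        this.timeStamp = timeStamp;
    }

    /** Serialize all fields into a fixed-width byte array (ByteBuffer defaults to big-endian). */
    byte[] encode() {
        return ByteBuffer.allocate(SIZE)
                .putInt(termId)
                .putLong(startOffset)
                .putLong(endOffset)
                .putFloat(score)
                .putInt(segId)
                .putLong(timeStamp)
                .array();
    }

    /** Rebuild an item from a slice of a byte array, reading fields in the order encode() wrote them. */
    static PostingItemPayload decode(byte[] bytes, int offset, int length) {
        if (length != SIZE) {
            throw new IllegalArgumentException("expected " + SIZE + " bytes, got " + length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes, offset, length);
        // Java evaluates arguments left to right, so the reads happen in field order.
        return new PostingItemPayload(buf.getInt(), buf.getLong(), buf.getLong(),
                buf.getFloat(), buf.getInt(), buf.getLong());
    }

    public static void main(String[] args) {
        PostingItemPayload in = new PostingItemPayload(7, 10L, 15L, 0.5f, 3, 1354061499472L);
        byte[] bytes = in.encode();
        PostingItemPayload out = decode(bytes, 0, bytes.length);
        System.out.println(bytes.length + " bytes, termId=" + out.termId + ", score=" + out.score);
    }
}
```

decode takes (bytes, offset, length) so that a slice of a larger buffer can be decoded without copying it first. A fixed-width layout is the simplest choice; a variable-length encoding would make payloads smaller at the cost of a more involved decoder.
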
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org