Message-ID: <4AD4E9E3.4030002@gmail.com>
Date: Tue, 13 Oct 2009 13:58:11 -0700
From: Michael Busch
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
In-Reply-To: <4AD4AE2C.8090406@gmail.com>

On 10/13/09 9:43 AM, Michael Busch wrote:
> Shall we first remove the remaining deprecations from the indexer
> package? There are not many more left, shouldn't be much work.
>

I wasn't quick enough for you :)

Working on LUCENE-1979 now - that will be the first test on how good svn
merge is!

Michael

> Michael
>
> On 10/13/09 5:47 AM, Michael McCandless wrote:
>> OK I will cut a branch & commit Mark's last patch onto it, unless
>> anyone has objections soonish...
>>
>> I'll also branch (twig?) the back compat branch so we can commit the
>> patch there as well.
>>
>> Mike
>>
>> On Mon, Oct 12, 2009 at 10:50 PM, Mark Miller wrote:
>>> SVN is about as good at merging branches as any of us are with a patch
>>> and trunk unfortunately. But that can still be somewhat more convenient
>>> than all these huge patches, with different people at different stages.
>>>
>>> Depends on how many people end up working on this though. Any more than
>>> 2, and I think the branch has got to be worth it.
>>>
>>> From my perspective, it doesn't make any of the merging process any
>>> easier - but it can be easier than juggling all these patches - you have
>>> a central code base that can always be targeted for current merging.
>>>
>>> Michael Busch wrote:
>>>> I think it's supposed to work pretty good - though I have no personal
>>>> experience with merging branches with svn.
>>>>
>>>> I think we should try it - then we'll know! :)
>>>>
>>>> Michael
>>>>
>>>> On 10/12/09 12:32 PM, Michael McCandless (JIRA) wrote:
>>>>> [
>>>>> https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764799#action_12764799
>>>>> ]
>>>>>
>>>>> Michael McCandless commented on LUCENE-1458:
>>>>> --------------------------------------------
>>>>>
>>>>> bq. Shall we create a flexible-indexing branch and commit this?
>>>>>
>>>>> I think this is a good idea.
>>>>>
>>>>> But I haven't played heavily w/ svn & branching. EG if we branch
>>>>> now, and trunk moves fast (which it still is w/ deprecation
>>>>> removals), are we going to have conflicts? Or... is svn good about
>>>>> merging branches?
>>>>>
>>>>>
>>>>>> Further steps towards flexible indexing
>>>>>> ---------------------------------------
>>>>>>
>>>>>>                 Key: LUCENE-1458
>>>>>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>>>>>>             Project: Lucene - Java
>>>>>>          Issue Type: New Feature
>>>>>>          Components: Index
>>>>>>    Affects Versions: 2.9
>>>>>>            Reporter: Michael McCandless
>>>>>>            Assignee: Michael McCandless
>>>>>>            Priority: Minor
>>>>>>         Attachments: LUCENE-1458-back-compat.patch,
>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>>> LUCENE-1458.tar.bz2
>>>>>>
>>>>>>
>>>>>> I attached a very rough checkpoint of my current patch, to get early
>>>>>> feedback. All tests pass, though back compat tests don't pass due to
>>>>>> changes to package-private APIs plus certain bugs in tests that
>>>>>> happened to work (eg call TermPositions.nextPosition() too many times,
>>>>>> which the new API asserts against).
>>>>>> [Aside: I think, when we commit changes to package-private APIs such
>>>>>> that back-compat tests don't pass, we could go back, make a branch on
>>>>>> the back-compat tag, commit changes to the tests to use the new
>>>>>> package private APIs on that branch, then fix nightly build to use the
>>>>>> tip of that branch?]
>>>>>> There's still plenty to do before this is committable! This is a
>>>>>> rather large change:
>>>>>>   * Switches to a new more efficient terms dict format. This still
>>>>>>     uses tii/tis files, but the tii only stores term & long offset
>>>>>>     (not a TermInfo). At seek points, tis encodes term & freq/prox
>>>>>>     offsets absolutely instead of with deltas. Also, tis/tii
>>>>>>     are structured by field, so we don't have to record field number
>>>>>>     in every term.
>>>>>> .
>>>>>>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>>>>>>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>>>>>> .
>>>>>>     RAM usage when loading terms dict index is significantly less
>>>>>>     since we only load an array of offsets and an array of String (no
>>>>>>     more TermInfo array). It should be faster to init too.
>>>>>> .
>>>>>>     This part is basically done.
>>>>>>   * Introduces modular reader codec that strongly decouples terms dict
>>>>>>     from docs/positions readers. EG there is no more TermInfo used
>>>>>>     when reading the new format.
>>>>>> .
>>>>>>     There's nice symmetry now between reading & writing in the codec
>>>>>>     chain -- the current docs/prox format is captured in:
>>>>>> {code}
>>>>>> FormatPostingsTermsDictWriter/Reader
>>>>>> FormatPostingsDocsWriter/Reader (.frq file) and
>>>>>> FormatPostingsPositionsWriter/Reader (.prx file).
>>>>>> {code}
>>>>>>     This part is basically done.
>>>>>>   * Introduces a new "flex" API for iterating through the fields,
>>>>>>     terms, docs and positions:
>>>>>> {code}
>>>>>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>>>>>> {code}
>>>>>>     This replaces TermEnum/Docs/Positions. SegmentReader emulates the
>>>>>>     old API on top of the new API to keep back-compat.
>>>>>>
>>>>>> Next steps:
>>>>>>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>>>>>>     fix any hidden assumptions.
>>>>>>   * Expose new API out of IndexReader, deprecate old API but emulate
>>>>>>     old API on top of new one, switch all core/contrib users to the
>>>>>>     new API.
>>>>>>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>>>>>>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>>>>>>     (not just index-file-format flexibility). EG if someone wanted
>>>>>>     to store payload at the term-doc level instead of
>>>>>>     term-doc-position level, you could just add a new attribute.
>>>>>>   * Test performance & iterate.
>>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://www.lucidimagination.com
>>>
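
For readers following the terms dict point in the quoted LUCENE-1458
description (offsets written absolutely at seek points, as deltas
elsewhere), here is a minimal sketch of that general idea. It is not code
from the patch: the class name SeekPointOffsets, the INTERVAL value and
both method names are hypothetical, and the real tis/tii layout is more
involved. It only shows why an absolute value at each seek point lets a
reader start decoding there without any earlier state.

```java
import java.util.Arrays;

/** Hypothetical sketch (not Lucene code): delta-coded offsets with absolute seek points. */
public class SeekPointOffsets {
    /** Every INTERVAL-th entry is a seek point. Assumed value for illustration only. */
    static final int INTERVAL = 4;

    /** Encodes ascending offsets: absolute at seek points, delta from the previous entry otherwise. */
    public static long[] encode(long[] offsets) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % INTERVAL == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /** Recovers the original offset at index target, starting from the nearest preceding seek point. */
    public static long decodeAt(long[] encoded, int target) {
        int seek = target - (target % INTERVAL);
        long value = encoded[seek];           // absolute value, no earlier state needed
        for (int i = seek + 1; i <= target; i++) {
            value += encoded[i];              // apply deltas up to the target entry
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {100, 130, 175, 260, 300, 312, 390};
        long[] encoded = encode(offsets);
        System.out.println(Arrays.toString(encoded));
        System.out.println(decodeAt(encoded, 5));
    }
}
```

The trade-off in this sketch is the usual one: a smaller interval means
more absolute values (larger file, shorter scans after a seek), a larger
interval means the opposite.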