lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Date Thu, 08 Oct 2009 18:48:05 GMT
+1!

Mike

On Thu, Oct 8, 2009 at 2:41 PM, John Wang <john.wang@gmail.com> wrote:
> Hi guys:
>
>      What are your thoughts about contributing Kamikaze as a Lucene contrib
> package? We just finished porting Kamikaze to Lucene 2.9. The new 2.9
> API allows us some more code tuning and optimization improvements.
>
>      We will be releasing Kamikaze, so it might be a good time to add it to
> the Lucene contrib package if there is interest.
>
> Thanks
>
> -John
>
> On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>
>> By the way: In the last RC of Lucene 2.9 we added a new method to DocIdSet
>> called isCacheable(). It is used by e.g. CachingWrapperFilter to determine
>> if a DocIdSet is easily cacheable or must be copied to an OpenBitSetDISI.
>> The default is false, so all custom DocIdSets are copied to OpenBitSetDISI
>> by CachingWrapperFilter, even if not needed. If a DocIdSet does not do disk
>> IO and has a fast iterator, like e.g. the FieldCache ones in
>> FieldCacheRangeFilter, it should return true; see CHANGES.txt. Maybe this
>> should also be added to Kamikaze, which is a really nice project!
>> In particular, filter DocIdSets should pass this method to their delegate
>> (see FilterDocIdSet in Lucene).
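
To make the delegation point concrete, here is a minimal sketch. The classes
below (SketchDocIdSet, InMemoryDocIdSet, FilteringDocIdSet) are hypothetical
stand-ins that only mirror the shape described above; they are not the real
org.apache.lucene.search classes, and a real DocIdSet hands out a
DocIdSetIterator rather than an int[].

```java
import java.util.Arrays;

// Hypothetical stand-in for illustration; NOT Lucene's DocIdSet.
abstract class SketchDocIdSet {
    /** Doc ids in increasing order (simplified: the real API returns an iterator). */
    abstract int[] docIds();

    /** Mirrors the default described above: not cacheable unless overridden. */
    boolean isCacheable() {
        return false;
    }
}

/** Purely in-memory set: no disk IO and cheap iteration, so it reports true. */
final class InMemoryDocIdSet extends SketchDocIdSet {
    private final int[] ids;

    InMemoryDocIdSet(int... ids) {
        this.ids = ids.clone();
    }

    @Override
    int[] docIds() {
        return ids.clone();
    }

    @Override
    boolean isCacheable() {
        return true;
    }
}

/** Wrapper that filters another set and passes isCacheable() to its delegate. */
abstract class FilteringDocIdSet extends SketchDocIdSet {
    private final SketchDocIdSet delegate;

    FilteringDocIdSet(SketchDocIdSet delegate) {
        this.delegate = delegate;
    }

    /** Decides whether a doc id from the delegate is kept. */
    abstract boolean match(int docId);

    @Override
    int[] docIds() {
        return Arrays.stream(delegate.docIds()).filter(this::match).toArray();
    }

    @Override
    boolean isCacheable() {
        // The wrapper adds no IO of its own, so the delegate's answer applies.
        return delegate.isCacheable();
    }
}
```

With this shape, a wrapper around an in-memory set stays cacheable, while a
wrapper around a set that keeps the default reports false and would be copied.
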
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>
>> > -----Original Message-----
>> > From: John Wang (JIRA) [mailto:jira@apache.org]
>> > Sent: Thursday, September 24, 2009 3:14 PM
>> > To: java-dev@lucene.apache.org
>> > Subject: [jira] Commented: (LUCENE-1458) Further steps towards flexible
>> > indexing
>> >
>> >
>> >     [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759112#action_12759112 ]
>> >
>> > John Wang commented on LUCENE-1458:
>> > -----------------------------------
>> >
>> > Just an FYI: Kamikaze was originally started as our sandbox for Lucene
>> > contributions until 2.4 was ready. (We needed the DocIdSet/Iterator
>> > abstraction that was migrated from Solr.)
>> >
>> > It has three components:
>> >
>> > 1) P4Delta
>> > 2) Logical boolean operations on DocIdSet/Iterators (I created a JIRA
>> > ticket and a patch for Lucene a while ago with performance numbers. It is
>> > significantly faster than DisjunctionScorer.)
>> > 3) An algorithm to determine which DocIdSet implementations to use given
>> > some parameters, e.g. minId, maxId, id count, etc. It learns and adjusts
>> > from the application behavior if not all parameters are given.
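
As a rough illustration of component 3, the sketch below picks a
representation from minId, maxId and the id count by comparing estimated
memory footprints. This is an assumed density heuristic, not Kamikaze's
actual algorithm; SketchDocSetChooser and its Kind values are hypothetical
names, and the adaptive "learning" part is not modelled.

```java
// Hypothetical helper for illustration; not part of Kamikaze or Lucene.
final class SketchDocSetChooser {
    enum Kind { INT_ARRAY, BIT_SET }

    private SketchDocSetChooser() {}

    /**
     * Chooses the representation with the smaller estimated footprint:
     * a sorted int array costs 4 bytes per id, a bit set costs one bit
     * per possible id in [minId, maxId].
     */
    static Kind choose(int minId, int maxId, int count) {
        if (count < 0 || maxId < minId) {
            throw new IllegalArgumentException("need count >= 0 and maxId >= minId");
        }
        long span = (long) maxId - minId + 1;   // number of possible ids
        long bitSetBytes = (span + 7) / 8;      // one bit per possible id, rounded up
        long intArrayBytes = 4L * count;        // one int per stored id
        return intArrayBytes < bitSetBytes ? Kind.INT_ARRAY : Kind.BIT_SET;
    }
}
```

Under these assumptions a sparse set (100 ids spread over a million) ends up
as an int array, and a dense one (500 ids out of a thousand) as a bit set.
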
>> >
>> > So please feel free to incorporate anything you see fit or move it to
>> > contrib.
>> >
>> >
>> > > Further steps towards flexible indexing
>> > > ---------------------------------------
>> > >
>> > >                 Key: LUCENE-1458
>> > >                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>> > >             Project: Lucene - Java
>> > >          Issue Type: New Feature
>> > >          Components: Index
>> > >    Affects Versions: 2.9
>> > >            Reporter: Michael McCandless
>> > >            Assignee: Michael McCandless
>> > >            Priority: Minor
>> > >         Attachments: LUCENE-1458-back-compat.patch,
>> > > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>> > > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> > > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> > > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>> > > LUCENE-1458.tar.bz2
>> > >
>> > >
>> > > I attached a very rough checkpoint of my current patch, to get early
>> > > feedback.  All tests pass, though back compat tests don't pass due to
>> > > changes to package-private APIs plus certain bugs in tests that
>> > > happened to work (eg call TermPositions.nextPosition() too many times,
>> > > which the new API asserts against).
>> > > [Aside: I think, when we commit changes to package-private APIs such
>> > > that back-compat tests don't pass, we could go back, make a branch on
>> > > the back-compat tag, commit changes to the tests to use the new
>> > > package private APIs on that branch, then fix nightly build to use the
>> > > tip of that branch?]
>> > > There's still plenty to do before this is committable! This is a
>> > > rather large change:
>> > >   * Switches to a new more efficient terms dict format.  This still
>> > >     uses tii/tis files, but the tii only stores term & long offset
>> > >     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>> > >     offsets absolutely instead of with deltas.  Also, tis/tii
>> > >     are structured by field, so we don't have to record field number
>> > >     in every term.
>> > > .
>> > >     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>> > >     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>> > > .
>> > >     RAM usage when loading terms dict index is significantly less
>> > >     since we only load an array of offsets and an array of String (no
>> > >     more TermInfo array).  It should be faster to init too.
>> > > .
>> > >     This part is basically done.
>> > >   * Introduces modular reader codec that strongly decouples terms dict
>> > >     from docs/positions readers.  EG there is no more TermInfo used
>> > >     when reading the new format.
>> > > .
>> > >     There's nice symmetry now between reading & writing in the codec
>> > >     chain -- the current docs/prox format is captured in:
>> > > {code}
>> > > FormatPostingsTermsDictWriter/Reader
>> > > FormatPostingsDocsWriter/Reader (.frq file) and
>> > > FormatPostingsPositionsWriter/Reader (.prx file).
>> > > {code}
>> > >     This part is basically done.
>> > >   * Introduces a new "flex" API for iterating through the fields,
>> > >     terms, docs and positions:
>> > > {code}
>> > > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>> > > {code}
>> > >     This replaces TermEnum/TermDocs/TermPositions.  SegmentReader emulates the
>> > >     old API on top of the new API to keep back-compat.
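
The first item in the list above mentions encoding offsets absolutely at seek
points instead of with deltas. A minimal sketch of that idea follows, assuming
a fixed interval between seek points. SeekPointOffsets, encode and decodeAt
are hypothetical names, and this does not reproduce the actual tis/tii byte
format.

```java
// Hypothetical illustration of "absolute at seek points, delta elsewhere".
final class SeekPointOffsets {
    private SeekPointOffsets() {}

    /** Stores every interval-th offset absolutely and the rest as deltas. */
    static long[] encode(long[] offsets, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % interval == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /** Recovers entry i by starting at the nearest preceding seek point. */
    static long decodeAt(long[] encoded, int interval, int i) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        int seek = (i / interval) * interval;   // index of the absolute entry
        long value = encoded[seek];
        for (int j = seek + 1; j <= i; j++) {
            value += encoded[j];                // apply deltas up to entry i
        }
        return value;
    }
}
```

Because each seek point is self-contained, a reader can jump to one and
decode forward without replaying deltas from the start of the file.
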
>> > >
>> > > Next steps:
>> > >   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>> > >     fix any hidden assumptions.
>> > >   * Expose new API out of IndexReader, deprecate old API but emulate
>> > >     old API on top of new one, switch all core/contrib users to the
>> > >     new API.
>> > >   * Maybe switch to AttributeSources as the base class for TermsEnum,
>> > >     DocsEnum, PostingsEnum -- this would give readers API flexibility
>> > >     (not just index-file-format flexibility).  EG if someone wanted
>> > >     to store payload at the term-doc level instead of
>> > >     term-doc-position level, you could just add a new attribute.
>> > >   * Test performance & iterate.
>> >
>> > --
>> > This message is automatically generated by JIRA.
>> > -
>> > You can reply to this email to add a comment to the issue online.
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>
>


