lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Date Fri, 11 Sep 2009 17:04:22 GMT

: Could a git branch make things easier for mega-features like this?

why not just start a subversion branch?

: 
: > Further steps towards flexible indexing
: > ---------------------------------------
: >
: >                 Key: LUCENE-1458
: >                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
: >             Project: Lucene - Java
: >          Issue Type: New Feature
: >          Components: Index
: >    Affects Versions: 2.9
: >            Reporter: Michael McCandless
: >            Assignee: Michael McCandless
: >            Priority: Minor
: >         Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2,
LUCENE-1458.tar.bz2
: >
: >
: > I attached a very rough checkpoint of my current patch, to get early
: > feedback.  All tests pass, though back compat tests don't pass due to
: > changes to package-private APIs plus certain bugs in tests that
: > happened to work (eg call TermPostions.nextPosition() too many times,
: > which the new API asserts against).
: > [Aside: I think, when we commit changes to package-private APIs such
: > that back-compat tests don't pass, we could go back, make a branch on
: > the back-compat tag, commit changes to the tests to use the new
: > package private APIs on that branch, then fix nightly build to use the
: > tip of that branch?o]
: > There's still plenty to do before this is committable! This is a
: > rather large change:
: >   * Switches to a new more efficient terms dict format.  This still
: >     uses tii/tis files, but the tii only stores term & long offset
: >     (not a TermInfo).  At seek points, tis encodes term & freq/prox
: >     offsets absolutely instead of with deltas delta.  Also, tis/tii
: >     are structured by field, so we don't have to record field number
: >     in every term.
: > .
: >     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
: >     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
: > .
: >     RAM usage when loading terms dict index is significantly less
: >     since we only load an array of offsets and an array of String (no
: >     more TermInfo array).  It should be faster to init too.
: > .
: >     This part is basically done.
: >   * Introduces modular reader codec that strongly decouples terms dict
: >     from docs/positions readers.  EG there is no more TermInfo used
: >     when reading the new format.
: > .
: >     There's nice symmetry now between reading & writing in the codec
: >     chain -- the current docs/prox format is captured in:
: > {code}
: > FormatPostingsTermsDictWriter/Reader
: > FormatPostingsDocsWriter/Reader (.frq file) and
: > FormatPostingsPositionsWriter/Reader (.prx file).
: > {code}
: >     This part is basically done.
: >   * Introduces a new "flex" API for iterating through the fields,
: >     terms, docs and positions:
: > {code}
: > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
: > {code}
: >     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
: >     old API on top of the new API to keep back-compat.
: >     
: > Next steps:
: >   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
: >     fix any hidden assumptions.
: >   * Expose new API out of IndexReader, deprecate old API but emulate
: >     old API on top of new one, switch all core/contrib users to the
: >     new API.
: >   * Maybe switch to AttributeSources as the base class for TermsEnum,
: >     DocsEnum, PostingsEnum -- this would give readers API flexibility
: >     (not just index-file-format flexibility).  EG if someone wanted
: >     to store payload at the term-doc level instead of
: >     term-doc-position level, you could just add a new attribute.
: >   * Test performance & iterate.
: 
: -- 
: This message is automatically generated by JIRA.
: -
: You can reply to this email to add a comment to the issue online.
: 
: 
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-dev-help@lucene.apache.org
: 



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message