lucene-dev mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Date Tue, 06 Oct 2009 10:55:37 GMT
Merge away - still sleeping over here. Would love to look more again  
but don't know when, so no use waiting on me.

- Mark

http://www.lucidimagination.com (mobile)

On Oct 6, 2009, at 5:54 AM, "Michael McCandless (JIRA)"  
<jira@apache.org> wrote:

>
>    [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762573#action_12762573 ]
>
> Michael McCandless commented on LUCENE-1458:
> --------------------------------------------
>
> Whoa thanks for the sudden sprint Mark!
>
> bq. Come on old man, stop clinging to emacs
>
> Hey!  I'm not so old :) But yeah I still cling to emacs.  Hey, I know
> people who still cling to vi!
>
> {quote}
> I didn't really look at the code, but some stuff I noticed:
>
> java 6 in pfor Arrays.copy
>
> skiplist stuff in codecs still have package of index - not sure what  
> is going on there - changed them
>
> in IndexWriter:
> + // Mark: read twice?
> segmentInfos.read(directory);
> + segmentInfos.read(directory, codecs);
> {quote}
>
> Excellent catches!  All of these are not right.
>
> bq. (since you don't include contrib in the tar)
>
> Gak, sorry.  I have a bunch of mods there, cutting over to flex API.
>
> bq. You left getEnum(IndexReader reader) in the MultiTerm queries,
> but not in PrefixQuery - just checkin'.
>
> Woops, for back compat I think we need to leave it in (it's a
> protected method), deprecated.  I'll put it back if you haven't.
>
> bq. I guess TestBackwardsCompatibility.java has been removed from  
> trunk or something? kept it here for now.
>
> Eek, it shouldn't be -- indeed it is.  When did that happen?  We
> should fix this (separately from this issue!).
>
> Do you have more fixes coming?  If so, I'll let you sprint some  
> more; else, I'll merge in, add contrib & back-compat branch, and  
> post new patch!  Thanks :)
>
>
>> Further steps towards flexible indexing
>> ---------------------------------------
>>
>>                Key: LUCENE-1458
>>                URL: https://issues.apache.org/jira/browse/LUCENE-1458
>>            Project: Lucene - Java
>>         Issue Type: New Feature
>>         Components: Index
>>   Affects Versions: 2.9
>>           Reporter: Michael McCandless
>>           Assignee: Michael McCandless
>>           Priority: Minor
>>        Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> LUCENE-1458.patch, LUCENE-1458.patch,
>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>> LUCENE-1458.tar.bz2
>>
>>
>> I attached a very rough checkpoint of my current patch, to get early
>> feedback.  All tests pass, though back compat tests don't pass due to
>> changes to package-private APIs plus certain bugs in tests that
>> happened to work (e.g. calling TermPositions.nextPosition() too many
>> times, which the new API asserts against).
>> [Aside: I think, when we commit changes to package-private APIs such
>> that back-compat tests don't pass, we could go back, make a branch on
>> the back-compat tag, commit changes to the tests to use the new
>> package-private APIs on that branch, then fix the nightly build to use
>> the tip of that branch?]
>> There's still plenty to do before this is committable! This is a
>> rather large change:
>>  * Switches to a new more efficient terms dict format.  This still
>>    uses tii/tis files, but the tii only stores term & long offset
>>    (not a TermInfo).  At seek points, tis encodes term & freq/prox
>>    offsets absolutely instead of with deltas.  Also, tis/tii
>>    are structured by field, so we don't have to record field number
>>    in every term.
>> .
>>    On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>>    -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>> .
>>    RAM usage when loading terms dict index is significantly less
>>    since we only load an array of offsets and an array of String (no
>>    more TermInfo array).  It should be faster to init too.
>> .
>>    This part is basically done.
>>  * Introduces modular reader codec that strongly decouples terms dict
>>    from docs/positions readers.  EG there is no more TermInfo used
>>    when reading the new format.
>> .
>>    There's nice symmetry now between reading & writing in the codec
>>    chain -- the current docs/prox format is captured in:
>> {code}
>> FormatPostingsTermsDictWriter/Reader
>> FormatPostingsDocsWriter/Reader (.frq file) and
>> FormatPostingsPositionsWriter/Reader (.prx file).
>> {code}
>>    This part is basically done.
>>  * Introduces a new "flex" API for iterating through the fields,
>>    terms, docs and positions:
>> {code}
>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>> {code}
>>    This replaces TermEnum/TermDocs/TermPositions.  SegmentReader emulates the
>>    old API on top of the new API to keep back-compat.
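
To illustrate the seek-point change in the terms dict bullet above (absolute offsets at seek points, deltas everywhere else), here is a minimal Java sketch. It is not code from the patch: the class SeekPointOffsets, its methods, and the interval parameter are made up for illustration, and a real tis file stores more than one offset per term.

```java
import java.util.Arrays;

// Hypothetical sketch; these names are illustrative, not classes from the LUCENE-1458 patch.
final class SeekPointOffsets {

    // Encodes offsets as deltas, except at every interval-th entry (a "seek point"),
    // where the absolute offset is stored so decoding can start there.
    static long[] encode(long[] offsets, int interval) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            boolean seekPoint = (i % interval == 0);
            out[i] = seekPoint ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    // Decodes the offset at index, starting from the nearest preceding seek point
    // rather than from the start of the array.
    static long decodeAt(long[] encoded, int interval, int index) {
        int seek = (index / interval) * interval;
        long value = encoded[seek];          // absolute value, no earlier state needed
        for (int i = seek + 1; i <= index; i++) {
            value += encoded[i];             // accumulate deltas up to index
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {100, 130, 170, 260, 300, 345};
        long[] encoded = encode(offsets, 3);
        System.out.println(Arrays.toString(encoded)); // [100, 30, 40, 260, 40, 45]
        System.out.println(decodeAt(encoded, 3, 4));  // 300
    }
}
```

Because the entries at seek points are absolute, a reader can jump to any seek point and decode forward without reading the entries before it.
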
>>
>> Next steps:
>>  * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>>    fix any hidden assumptions.
>>  * Expose new API out of IndexReader, deprecate old API but emulate
>>    old API on top of new one, switch all core/contrib users to the
>>    new API.
>>  * Maybe switch to AttributeSources as the base class for TermsEnum,
>>    DocsEnum, PostingsEnum -- this would give readers API flexibility
>>    (not just index-file-format flexibility).  EG if someone wanted
>>    to store payload at the term-doc level instead of
>>    term-doc-position level, you could just add a new attribute.
>>  * Test performance & iterate.
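
The "emulate old API on top of new one" step above is essentially an adapter. A rough Java sketch of the idea follows. All types here (FlexDocs, LegacyDocs, LegacyDocsAdapter, ArrayFlexDocs) are hypothetical stand-ins, not the interfaces in the patch; the old-style cursor only mirrors the familiar next()/doc()/freq() shape.

```java
// Hypothetical sketch; none of these types come from the LUCENE-1458 patch.
final class LegacyAdapterSketch {

    // New-style enumerator: nextDoc() returns the doc ID or a sentinel when exhausted.
    interface FlexDocs {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();
        int freq();   // term frequency in the current doc
    }

    // Old-style cursor: advance with next(), then read doc() and freq().
    interface LegacyDocs {
        boolean next();
        int doc();
        int freq();
    }

    // Adapter exposing the old-style cursor on top of the new-style enumerator.
    static final class LegacyDocsAdapter implements LegacyDocs {
        private final FlexDocs in;
        private int current = -1;

        LegacyDocsAdapter(FlexDocs in) { this.in = in; }

        @Override public boolean next() {
            current = in.nextDoc();
            return current != FlexDocs.NO_MORE_DOCS;
        }
        @Override public int doc() { return current; }
        @Override public int freq() { return in.freq(); }
    }

    // Tiny in-memory enumerator, used only to exercise the adapter.
    static final class ArrayFlexDocs implements FlexDocs {
        private final int[] docs;
        private final int[] freqs;
        private int pos = -1;

        ArrayFlexDocs(int[] docs, int[] freqs) { this.docs = docs; this.freqs = freqs; }

        @Override public int nextDoc() {
            pos++;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
        @Override public int freq() { return freqs[pos]; }
    }

    public static void main(String[] args) {
        LegacyDocs td = new LegacyDocsAdapter(
                new ArrayFlexDocs(new int[]{2, 7}, new int[]{1, 3}));
        while (td.next()) {
            System.out.println("doc=" + td.doc() + " freq=" + td.freq());
        }
    }
}
```

Callers written against the old cursor keep working unchanged, while only the adapter needs to know about the new enumerator.
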
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


