lucene-dev mailing list archives

From Michael Busch <busch...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Date Tue, 13 Oct 2009 16:43:24 GMT
Shall we first remove the remaining deprecations from the indexer 
package? There are not many more left, shouldn't be much work.

  Michael

On 10/13/09 5:47 AM, Michael McCandless wrote:
> OK I will cut a branch & commit Mark's last patch onto it, unless
> anyone has objections soonish...
>
> I'll also branch (twig?) the back compat branch so we can commit the
> patch there as well.
>
> Mike
>
> On Mon, Oct 12, 2009 at 10:50 PM, Mark Miller <markrmiller@gmail.com> wrote:
>    
>> SVN is about as good at merging branches as any of us are with a patch
>> and trunk unfortunately. But that can still be somewhat more convenient
>> than all these huge patches, with different people at different stages.
>>
>> Depends on how many people end up working on this though. Any more than
>> 2, and I think the branch has got to be worth it.
>>
>> From my perspective, it doesn't make any of the merging process any
>> easier - but it can be easier than juggling all these patches - you have
>> a central code base that can always be targeted for current merging.
>>
>> Michael Busch wrote:
>>      
>>> I think it's supposed to work pretty well - though I have no personal
>>> experience with merging branches with svn.
>>>
>>> I think we should try it - then we'll know! :)
>>>
>>>   Michael
>>>
>>> On 10/12/09 12:32 PM, Michael McCandless (JIRA) wrote:
>>>        
>>>>       [
>>>> https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764799#action_12764799
>>>> ]
>>>>
>>>> Michael McCandless commented on LUCENE-1458:
>>>> --------------------------------------------
>>>>
>>>> bq. Shall we create a flexible-indexing branch and commit this?
>>>>
>>>> I think this is a good idea.
>>>>
>>>> But I haven't played heavily w/ svn & branching.  EG if we branch
>>>> now, and trunk moves fast (which it still is w/ deprecation
>>>> removals), are we going to have conflicts?  Or... is svn good about
>>>> merging branches?
>>>>
>>>>
>>>>          
>>>>> Further steps towards flexible indexing
>>>>> ---------------------------------------
>>>>>
>>>>>                   Key: LUCENE-1458
>>>>>                   URL: https://issues.apache.org/jira/browse/LUCENE-1458
>>>>>               Project: Lucene - Java
>>>>>            Issue Type: New Feature
>>>>>            Components: Index
>>>>>      Affects Versions: 2.9
>>>>>              Reporter: Michael McCandless
>>>>>              Assignee: Michael McCandless
>>>>>              Priority: Minor
>>>>>           Attachments: LUCENE-1458-back-compat.patch,
>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>> LUCENE-1458.tar.bz2
>>>>>
>>>>>
>>>>> I attached a very rough checkpoint of my current patch, to get early
>>>>> feedback.  All tests pass, though back compat tests don't pass due to
>>>>> changes to package-private APIs plus certain bugs in tests that
>>>>> happened to work (eg call TermPositions.nextPosition() too many times,
>>>>> which the new API asserts against).
>>>>> [Aside: I think, when we commit changes to package-private APIs such
>>>>> that back-compat tests don't pass, we could go back, make a branch on
>>>>> the back-compat tag, commit changes to the tests to use the new
>>>>> package private APIs on that branch, then fix nightly build to use the
>>>>> tip of that branch?]
>>>>> There's still plenty to do before this is committable! This is a
>>>>> rather large change:
>>>>>     * Switches to a new more efficient terms dict format.  This still
>>>>>       uses tii/tis files, but the tii only stores term & long offset
>>>>>       (not a TermInfo).  At seek points, tis encodes term & freq/prox
>>>>>       offsets absolutely instead of with deltas.  Also, tis/tii
>>>>>       are structured by field, so we don't have to record field number
>>>>>       in every term.
>>>>> .
>>>>>       On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>>>>>       -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>>>>> .
>>>>>       RAM usage when loading terms dict index is significantly less
>>>>>       since we only load an array of offsets and an array of String (no
>>>>>       more TermInfo array).  It should be faster to init too.
>>>>> .
>>>>>       This part is basically done.
>>>>>     * Introduces modular reader codec that strongly decouples terms dict
>>>>>       from docs/positions readers.  EG there is no more TermInfo used
>>>>>       when reading the new format.
>>>>> .
>>>>>       There's nice symmetry now between reading & writing in the codec
>>>>>       chain -- the current docs/prox format is captured in:
>>>>> {code}
>>>>> FormatPostingsTermsDictWriter/Reader
>>>>> FormatPostingsDocsWriter/Reader (.frq file) and
>>>>> FormatPostingsPositionsWriter/Reader (.prx file).
>>>>> {code}
>>>>>       This part is basically done.
>>>>>     * Introduces a new "flex" API for iterating through the fields,
>>>>>       terms, docs and positions:
>>>>> {code}
>>>>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>>>>> {code}
>>>>>       This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>>>>>       old API on top of the new API to keep back-compat.
>>>>>
>>>>> Next steps:
>>>>>     * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>>>>>       fix any hidden assumptions.
>>>>>     * Expose new API out of IndexReader, deprecate old API but emulate
>>>>>       old API on top of new one, switch all core/contrib users to the
>>>>>       new API.
>>>>>     * Maybe switch to AttributeSources as the base class for TermsEnum,
>>>>>       DocsEnum, PostingsEnum -- this would give readers API flexibility
>>>>>       (not just index-file-format flexibility).  EG if someone wanted
>>>>>       to store payload at the term-doc level instead of
>>>>>       term-doc-position level, you could just add a new attribute.
>>>>>     * Test performance & iterate.
>>>>>
>>>>>            
>>>>          
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>        
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>    
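
For readers following the quoted issue description: the point about tis
encoding offsets "absolutely instead of with deltas" at seek points can be
pictured with a small sketch. This is only an illustration of the general
idea, not the patch's file format; the class name, method names and the
interval parameter are all made up for the example.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not Lucene's on-disk format or classes. */
public class SeekPointOffsets {

    /**
     * Stores every interval-th offset as an absolute value (a "seek point")
     * and every other offset as a delta from its predecessor.
     */
    public static long[] encode(long[] offsets, int interval) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % interval == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /**
     * Recovers the offset at index target by jumping to the nearest
     * preceding seek point and summing deltas from there, so no earlier
     * entries need to be read.
     */
    public static long decodeAt(long[] encoded, int interval, int target) {
        int seek = (target / interval) * interval;
        long value = encoded[seek];
        for (int i = seek + 1; i <= target; i++) {
            value += encoded[i];
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {100, 130, 175, 260, 300, 410};
        long[] encoded = encode(offsets, 3);
        System.out.println(Arrays.toString(encoded)); // [100, 30, 45, 260, 40, 110]
        System.out.println(decodeAt(encoded, 3, 4));  // 300
    }
}
```

The trade-off the sketch shows: absolute values cost a few more bytes at
each seek point, but a reader can land on a seek point and start decoding
without replaying everything before it.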

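Similarly, the idea that SegmentReader "emulates the old API on top of the
new API" is essentially an adapter. A rough sketch of that pattern follows.
The interface names and signatures are assumptions made up for
illustration; they are not the ones in the LUCENE-1458 patch.

```java
/** Hypothetical sketch; names and signatures are assumptions, not the patch's API. */
public class FlexEmulationSketch {

    /** Assumed new-style enumerator: returns the next doc ID or NO_MORE_DOCS. */
    public interface SketchDocsEnum {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();
        int freq();
    }

    /** Assumed old-style iterator: advance first, then read the current entry. */
    public interface LegacyTermDocs {
        boolean next();
        int doc();
        int freq();
    }

    /** Wraps a new-style enumerator so that old-style callers keep working. */
    public static LegacyTermDocs emulate(final SketchDocsEnum docs) {
        return new LegacyTermDocs() {
            private int current = -1;
            public boolean next() {
                current = docs.nextDoc();
                return current != SketchDocsEnum.NO_MORE_DOCS;
            }
            public int doc() { return current; }
            public int freq() { return docs.freq(); }
        };
    }

    /** In-memory enumerator over parallel arrays, used only to drive the demo. */
    public static SketchDocsEnum fromArrays(final int[] docIds, final int[] freqs) {
        return new SketchDocsEnum() {
            private int pos = -1;
            public int nextDoc() {
                pos++;
                return pos < docIds.length ? docIds[pos] : NO_MORE_DOCS;
            }
            public int freq() { return freqs[pos]; }
        };
    }

    /** Old-style consumer code: sums term frequencies across all docs. */
    public static int totalFreq(LegacyTermDocs termDocs) {
        int sum = 0;
        while (termDocs.next()) {
            sum += termDocs.freq();
        }
        return sum;
    }

    public static void main(String[] args) {
        SketchDocsEnum docs = fromArrays(new int[] {2, 5, 9}, new int[] {1, 3, 2});
        System.out.println(totalFreq(emulate(docs))); // 6
    }
}
```

Existing callers only ever see the legacy interface, which is what lets the
underlying reader change without breaking back-compat.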


