lucene-dev mailing list archives

From Michael Busch <busch...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Date Tue, 13 Oct 2009 21:29:13 GMT
No problem!  I'm excited about the new branch!
Have to try to write some codecs now...

  Michael

On 10/13/09 2:09 PM, Michael McCandless wrote:
> Woops sorry I missed that!
>
> Yes this'll be our first test :)
>
> Mike
>
> On Tue, Oct 13, 2009 at 4:58 PM, Michael Busch <buschmic@gmail.com> wrote:
>    
>> On 10/13/09 9:43 AM, Michael Busch wrote:
>>      
>>> Shall we first remove the remaining deprecations from the indexer package?
>>> There are not many more left, shouldn't be much work.
>>>
>>>        
>> I wasn't quick enough for you :) Working on LUCENE-1979 now - that will be
>> the first test on how good svn merge is!
>>
>>   Michael
>>
>>      
>>>   Michael
>>>
>>> On 10/13/09 5:47 AM, Michael McCandless wrote:
>>>        
>>>> OK I will cut a branch & commit Mark's last patch onto it, unless
>>>> anyone has objections soonish...
>>>>
>>>> I'll also branch (twig?) the back compat branch so we can commit the
>>>> patch there as well.
>>>>
>>>> Mike
>>>>
>>>> On Mon, Oct 12, 2009 at 10:50 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>>          
>>>>> SVN is about as good at merging branches as any of us are with a patch
>>>>> and trunk unfortunately. But that can still be somewhat more convenient
>>>>> than all these huge patches, with different people at different stages.
>>>>>
>>>>> Depends on how many people end up working on this though. Any more than
>>>>> 2, and I think the branch has got to be worth it.
>>>>>
>>>>> From my perspective, it doesn't make any of the merging process any
>>>>> easier - but it can be easier than juggling all these patches - you have
>>>>> a central code base that can always be targeted for current merging.
>>>>>
>>>>> Michael Busch wrote:
>>>>>            
>>>>>> I think it's supposed to work pretty good - though I have no personal
>>>>>> experience with merging branches with svn.
>>>>>>
>>>>>> I think we should try it - then we'll know! :)
>>>>>>
>>>>>>   Michael
>>>>>>
>>>>>> On 10/12/09 12:32 PM, Michael McCandless (JIRA) wrote:
>>>>>>              
>>>>>>>       [
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764799#action_12764799
>>>>>>> ]
>>>>>>>
>>>>>>> Michael McCandless commented on LUCENE-1458:
>>>>>>> --------------------------------------------
>>>>>>>
>>>>>>> bq. Shall we create a flexible-indexing branch and commit this?
>>>>>>>
>>>>>>> I think this is a good idea.
>>>>>>>
>>>>>>> But I haven't played heavily w/ svn & branching.  EG if we branch
>>>>>>> now, and trunk moves fast (which it still is w/ deprecation
>>>>>>> removals), are we going to have conflicts?  Or... is svn good about
>>>>>>> merging branches?
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>>>> Further steps towards flexible indexing
>>>>>>>> ---------------------------------------
>>>>>>>>
>>>>>>>>                   Key: LUCENE-1458
>>>>>>>>                   URL:
>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-1458
>>>>>>>>               Project: Lucene - Java
>>>>>>>>            Issue Type: New Feature
>>>>>>>>            Components: Index
>>>>>>>>      Affects Versions: 2.9
>>>>>>>>              Reporter: Michael McCandless
>>>>>>>>              Assignee: Michael McCandless
>>>>>>>>              Priority: Minor
>>>>>>>>           Attachments: LUCENE-1458-back-compat.patch,
>>>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>>>>> LUCENE-1458.tar.bz2
>>>>>>>>
>>>>>>>>
>>>>>>>> I attached a very rough checkpoint of my current patch, to get early
>>>>>>>> feedback.  All tests pass, though back compat tests don't pass due to
>>>>>>>> changes to package-private APIs plus certain bugs in tests that
>>>>>>>> happened to work (eg call TermPositions.nextPosition() too many times,
>>>>>>>> which the new API asserts against).
>>>>>>>> [Aside: I think, when we commit changes to package-private APIs such
>>>>>>>> that back-compat tests don't pass, we could go back, make a branch on
>>>>>>>> the back-compat tag, commit changes to the tests to use the new
>>>>>>>> package private APIs on that branch, then fix nightly build to use the
>>>>>>>> tip of that branch?]
>>>>>>>> There's still plenty to do before this is committable! This is a
>>>>>>>> rather large change:
>>>>>>>>     * Switches to a new more efficient terms dict format.  This still
>>>>>>>>       uses tii/tis files, but the tii only stores term & long offset
>>>>>>>>       (not a TermInfo).  At seek points, tis encodes term & freq/prox
>>>>>>>>       offsets absolutely instead of with deltas.  Also, tis/tii
>>>>>>>>       are structured by field, so we don't have to record field number
>>>>>>>>       in every term.
>>>>>>>> .
>>>>>>>>       On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>>>>>>>>       -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>>>>>>>> .
>>>>>>>>       RAM usage when loading terms dict index is significantly less
>>>>>>>>       since we only load an array of offsets and an array of String (no
>>>>>>>>       more TermInfo array).  It should be faster to init too.
>>>>>>>> .
>>>>>>>>       This part is basically done.
>>>>>>>>     * Introduces modular reader codec that strongly decouples terms dict
>>>>>>>>       from docs/positions readers.  EG there is no more TermInfo used
>>>>>>>>       when reading the new format.
>>>>>>>> .
>>>>>>>>       There's nice symmetry now between reading & writing in the codec
>>>>>>>>       chain -- the current docs/prox format is captured in:
>>>>>>>> {code}
>>>>>>>> FormatPostingsTermsDictWriter/Reader
>>>>>>>> FormatPostingsDocsWriter/Reader (.frq file) and
>>>>>>>> FormatPostingsPositionsWriter/Reader (.prx file).
>>>>>>>> {code}
>>>>>>>>       This part is basically done.
>>>>>>>>     * Introduces a new "flex" API for iterating through the fields,
>>>>>>>>       terms, docs and positions:
>>>>>>>> {code}
>>>>>>>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>>>>>>>> {code}
>>>>>>>>       This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>>>>>>>>       old API on top of the new API to keep back-compat.
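A minimal sketch of the terms-index lookup described in the first bullet, assuming the index terms sit in a sorted String array next to a parallel array of absolute .tis offsets. The class name TermsIndexSketch, the method seekOffset and the sample offsets are hypothetical, and String.compareTo ordering stands in for whatever term ordering the real format uses. This is not the code in the patch, just an illustration of why two flat arrays are enough to find a seek point.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not Lucene's terms index reader. */
final class TermsIndexSketch {
    // Sorted index terms, one per seek point (assumption: compareTo order).
    private final String[] indexTerms;
    // Absolute offset into the terms file for the entry at the same position.
    private final long[] tisOffsets;

    public TermsIndexSketch(String[] indexTerms, long[] tisOffsets) {
        if (indexTerms.length != tisOffsets.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.indexTerms = indexTerms.clone();
        this.tisOffsets = tisOffsets.clone();
    }

    /**
     * Returns the offset of the last seek point whose term is <= target,
     * or -1 if target sorts before the first index term.
     */
    public long seekOffset(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        if (pos < 0) {
            // binarySearch returns -(insertionPoint) - 1 on a miss;
            // step back to the entry just before the insertion point.
            pos = -pos - 2;
        }
        return pos < 0 ? -1L : tisOffsets[pos];
    }

    public static void main(String[] args) {
        TermsIndexSketch index = new TermsIndexSketch(
                new String[] {"apple", "lucene", "zebra"},
                new long[] {0L, 4096L, 9000L});
        // "merge" falls between "lucene" and "zebra", so scanning would
        // start at the "lucene" seek point.
        System.out.println(index.seekOffset("merge")); // prints 4096
    }
}
```

Because every seek point carries an absolute offset, a reader can jump straight to it without replaying deltas from the start of the file.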
>>>>>>>>
>>>>>>>> Next steps:
>>>>>>>>     * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>>>>>>>>       fix any hidden assumptions.
>>>>>>>>     * Expose new API out of IndexReader, deprecate old API but emulate
>>>>>>>>       old API on top of new one, switch all core/contrib users to the
>>>>>>>>       new API.
>>>>>>>>     * Maybe switch to AttributeSources as the base class for TermsEnum,
>>>>>>>>       DocsEnum, PostingsEnum -- this would give readers API flexibility
>>>>>>>>       (not just index-file-format flexibility).  EG if someone wanted
>>>>>>>>       to store payload at the term-doc level instead of
>>>>>>>>       term-doc-position level, you could just add a new attribute.
>>>>>>>>     * Test performance & iterate.
>>>>>>>>
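The attribute idea in the third next step could look roughly like the following, assuming an enum keeps its optional per-entry state in a map keyed by attribute class. AttributeHolderSketch and DocPayload are hypothetical names. This is not Lucene's AttributeSource API, only a sketch of the general pattern, and it is written against Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of an attribute-holder pattern. */
final class AttributeHolderSketch {
    // One instance per attribute class; consumers look state up by type.
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    /** Registers an attribute, or returns the one already registered. */
    public <T> T addAttribute(Class<T> type, T instance) {
        Object existing = attributes.putIfAbsent(type, instance);
        return type.cast(existing != null ? existing : instance);
    }

    /** Returns the registered attribute, or null if none was added. */
    public <T> T getAttribute(Class<T> type) {
        return type.cast(attributes.get(type));
    }

    /** Hypothetical attribute: a payload stored once per term-doc pair. */
    public static final class DocPayload {
        public byte[] bytes = new byte[0];
    }

    public static void main(String[] args) {
        // Stand-in for a docs-level enum that a codec extends with a payload.
        AttributeHolderSketch docsEnum = new AttributeHolderSketch();
        DocPayload payload = docsEnum.addAttribute(DocPayload.class, new DocPayload());
        payload.bytes = new byte[] {42};

        // A consumer that knows about DocPayload reads it; others ignore it.
        System.out.println(docsEnum.getAttribute(DocPayload.class).bytes[0]); // prints 42
    }
}
```

The point of the design is that adding DocPayload needs no change to the holder's own interface, which is the kind of API flexibility the bullet is after.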
>>>>>>>>                  
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>>
>>>>>>              
>>>>> --
>>>>> - Mark
>>>>>
>>>>> http://www.lucidimagination.com
>>>>>
>>>        
>>



