lucene-dev mailing list archives

From Michael Busch <busch...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Date Tue, 13 Oct 2009 20:58:11 GMT
On 10/13/09 9:43 AM, Michael Busch wrote:
> Shall we first remove the remaining deprecations from the indexer 
> package? There are not many more left, shouldn't be much work.
>

I wasn't quick enough for you :) Working on LUCENE-1979 now - that will 
be the first test on how good svn merge is!

  Michael

>  Michael
>
> On 10/13/09 5:47 AM, Michael McCandless wrote:
>> OK I will cut a branch & commit Mark's last patch onto it, unless
>> anyone has objections soonish...
>>
>> I'll also branch (twig?) the back compat branch so we can commit the
>> patch there as well.
>>
>> Mike
>>
>> On Mon, Oct 12, 2009 at 10:50 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>> SVN is about as good at merging branches as any of us are with a patch
>>> and trunk unfortunately. But that can still be somewhat more convenient
>>> than all these huge patches, with different people at different stages.
>>>
>>> Depends on how many people end up working on this though. Any more than
>>> 2, and I think the branch has got to be worth it.
>>>
>>> From my perspective, it doesn't make any of the merging process any
>>> easier - but it can be easier than juggling all these patches - you have
>>> a central code base that can always be targeted for current merging.
>>>
>>> Michael Busch wrote:
>>>> I think it's supposed to work pretty good - though I have no personal
>>>> experience with merging branches with svn.
>>>>
>>>> I think we should try it - then we'll know! :)
>>>>
>>>>   Michael
>>>>
>>>> On 10/12/09 12:32 PM, Michael McCandless (JIRA) wrote:
>>>>>     [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764799#action_12764799 ]
>>>>>
>>>>> Michael McCandless commented on LUCENE-1458:
>>>>> --------------------------------------------
>>>>>
>>>>> bq. Shall we create a flexible-indexing branch and commit this?
>>>>>
>>>>> I think this is a good idea.
>>>>>
>>>>> But I haven't played heavily w/ svn & branching.  EG if we branch
>>>>> now, and trunk moves fast (which it still is w/ deprecation
>>>>> removals), are we going to have conflicts?  Or... is svn good about
>>>>> merging branches?
>>>>>
>>>>>
>>>>>> Further steps towards flexible indexing
>>>>>> ---------------------------------------
>>>>>>
>>>>>>                   Key: LUCENE-1458
>>>>>>                   URL: https://issues.apache.org/jira/browse/LUCENE-1458
>>>>>>               Project: Lucene - Java
>>>>>>            Issue Type: New Feature
>>>>>>            Components: Index
>>>>>>      Affects Versions: 2.9
>>>>>>              Reporter: Michael McCandless
>>>>>>              Assignee: Michael McCandless
>>>>>>              Priority: Minor
>>>>>>           Attachments: LUCENE-1458-back-compat.patch,
>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>>> LUCENE-1458.tar.bz2
>>>>>>
>>>>>>
>>>>>> I attached a very rough checkpoint of my current patch, to get early
>>>>>> feedback.  All tests pass, though back compat tests don't pass due to
>>>>>> changes to package-private APIs plus certain bugs in tests that
>>>>>> happened to work (eg call TermPositions.nextPosition() too many times,
>>>>>> which the new API asserts against).
>>>>>> [Aside: I think, when we commit changes to package-private APIs such
>>>>>> that back-compat tests don't pass, we could go back, make a branch on
>>>>>> the back-compat tag, commit changes to the tests to use the new
>>>>>> package private APIs on that branch, then fix nightly build to use the
>>>>>> tip of that branch?]
>>>>>> There's still plenty to do before this is committable! This is a
>>>>>> rather large change:
>>>>>>     * Switches to a new more efficient terms dict format.  This still
>>>>>>       uses tii/tis files, but the tii only stores term & long offset
>>>>>>       (not a TermInfo).  At seek points, tis encodes term & freq/prox
>>>>>>       offsets absolutely instead of with deltas.  Also, tis/tii
>>>>>>       are structured by field, so we don't have to record field number
>>>>>>       in every term.
>>>>>> .
>>>>>>       On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>>>>>>       -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>>>>>> .
>>>>>>       RAM usage when loading terms dict index is significantly less
>>>>>>       since we only load an array of offsets and an array of String (no
>>>>>>       more TermInfo array).  It should be faster to init too.
>>>>>> .
>>>>>>       This part is basically done.
>>>>>>     * Introduces modular reader codec that strongly decouples terms dict
>>>>>>       from docs/positions readers.  EG there is no more TermInfo used
>>>>>>       when reading the new format.
>>>>>> .
>>>>>>       There's nice symmetry now between reading & writing in the codec
>>>>>>       chain -- the current docs/prox format is captured in:
>>>>>> {code}
>>>>>> FormatPostingsTermsDictWriter/Reader
>>>>>> FormatPostingsDocsWriter/Reader (.frq file) and
>>>>>> FormatPostingsPositionsWriter/Reader (.prx file).
>>>>>> {code}
>>>>>>       This part is basically done.
>>>>>>     * Introduces a new "flex" API for iterating through the fields,
>>>>>>       terms, docs and positions:
>>>>>> {code}
>>>>>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>>>>>> {code}
>>>>>>       This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>>>>>>       old API on top of the new API to keep back-compat.
>>>>>>
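The field -> term -> doc -> position nesting of the proposed "flex" API can be pictured with a small in-memory stand-in. This is only a sketch under stated assumptions: the class `FlexChainSketch`, its `walk` method, and the nested-map layout are hypothetical illustrations and are not the signatures of FieldProducer, TermsEnum, DocsEnum, or PostingsEnum in the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the four-level enumeration; not Lucene's API. */
class FlexChainSketch {

    /**
     * Walks field -> term -> docID -> positions and records one line per position.
     * Each loop level corresponds to one stage of the chain named in the issue.
     */
    static List<String> walk(Map<String, Map<String, Map<Integer, int[]>>> index) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Map<Integer, int[]>>> field : index.entrySet()) { // fields
            for (Map.Entry<String, Map<Integer, int[]>> term : field.getValue().entrySet()) { // terms
                for (Map.Entry<Integer, int[]> doc : term.getValue().entrySet()) {            // docs
                    for (int pos : doc.getValue()) {                                          // positions
                        out.add(field.getKey() + ":" + term.getKey()
                                + " doc=" + doc.getKey() + " pos=" + pos);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data: one field, one term, one document, two positions.
        Map<Integer, int[]> docs = new TreeMap<>();
        docs.put(3, new int[] {1, 7});
        Map<String, Map<Integer, int[]>> terms = new TreeMap<>();
        terms.put("lucene", docs);
        Map<String, Map<String, Map<Integer, int[]>>> index = new TreeMap<>();
        index.put("body", terms);
        for (String line : walk(index)) {
            System.out.println(line);
        }
    }
}
```

Sorted maps are used here only so the walk visits fields, terms, and documents in a stable order; a real codec would stream these from the index files rather than hold them in memory.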
>>>>>> Next steps:
>>>>>>     * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>>>>>>       fix any hidden assumptions.
>>>>>>     * Expose new API out of IndexReader, deprecate old API but emulate
>>>>>>       old API on top of new one, switch all core/contrib users to the
>>>>>>       new API.
>>>>>>     * Maybe switch to AttributeSources as the base class for TermsEnum,
>>>>>>       DocsEnum, PostingsEnum -- this would give readers API flexibility
>>>>>>       (not just index-file-format flexibility).  EG if someone wanted
>>>>>>       to store payload at the term-doc level instead of
>>>>>>       term-doc-position level, you could just add a new attribute.
>>>>>>     * Test performance & iterate.
>>>>>>
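The terms dict item in the issue description says that offsets are written absolutely at seek points instead of as deltas. A minimal sketch of why that helps random access follows. The class `SeekPointOffsets`, the interval of 4, and the flat `long[]` layout are assumptions made for illustration; they are not the tii/tis on-disk format from the patch.

```java
/** Hypothetical sketch of absolute-at-seek-point offset encoding; not Lucene's file format. */
class SeekPointOffsets {

    // Every INDEX_INTERVAL-th entry is treated as a seek point (value chosen for illustration).
    static final int INDEX_INTERVAL = 4;

    /** Stores the absolute offset at seek points and the delta from the previous entry elsewhere. */
    static long[] encode(long[] offsets) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % INDEX_INTERVAL == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /** Recovers the offset of entry i by starting at the nearest preceding seek point. */
    static long decodeAt(long[] encoded, int i) {
        int seek = (i / INDEX_INTERVAL) * INDEX_INTERVAL;
        long offset = encoded[seek]; // absolute value, so no earlier entries are needed
        for (int j = seek + 1; j <= i; j++) {
            offset += encoded[j];    // apply deltas only within this block
        }
        return offset;
    }

    public static void main(String[] args) {
        long[] offsets = {10, 25, 40, 70, 100, 130, 131, 200, 260}; // made-up file offsets
        long[] encoded = encode(offsets);
        for (int i = 0; i < offsets.length; i++) {
            System.out.println(i + " -> " + decodeAt(encoded, i));
        }
    }
}
```

Because each seek point carries a full offset, a reader that jumps into the middle of the file needs at most `INDEX_INTERVAL - 1` delta additions, rather than replaying every delta from the start.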
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>
>>>
>>> -- 
>>> - Mark
>>>
>>> http://www.lucidimagination.com

