lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
Date Tue, 13 Oct 2009 21:09:12 GMT
Whoops, sorry I missed that!

Yes, this'll be our first test :)

Mike

On Tue, Oct 13, 2009 at 4:58 PM, Michael Busch <buschmic@gmail.com> wrote:
> On 10/13/09 9:43 AM, Michael Busch wrote:
>>
>> Shall we first remove the remaining deprecations from the indexer package?
>> There are not many more left, shouldn't be much work.
>>
>
> I wasn't quick enough for you :) Working on LUCENE-1979 now - that will be
> the first test on how good svn merge is!
>
>  Michael
>
>>  Michael
>>
>> On 10/13/09 5:47 AM, Michael McCandless wrote:
>>>
>>> OK I will cut a branch & commit Mark's last patch onto it, unless
>>> anyone has objections soonish...
>>>
>>> I'll also branch (twig?) the back compat branch so we can commit the
>>> patch there as well.
>>>
>>> Mike
>>>
>>> On Mon, Oct 12, 2009 at 10:50 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>>
>>>> SVN is about as good at merging branches as any of us are with a patch
>>>> and trunk unfortunately. But that can still be somewhat more convenient
>>>> than all these huge patches, with different people at different stages.
>>>>
>>>> Depends on how many people end up working on this though. Any more than
>>>> 2, and I think the branch has got to be worth it.
>>>>
>>>> From my perspective, it doesn't make any of the merging process any
>>>> easier - but it can be easier than juggling all these patches - you have
>>>> a central code base that can always be targeted for current merging.
>>>>
>>>> Michael Busch wrote:
>>>>>
>>>>> I think it's supposed to work pretty well - though I have no personal
>>>>> experience with merging branches with svn.
>>>>>
>>>>> I think we should try it - then we'll know! :)
>>>>>
>>>>>  Michael
>>>>>
>>>>> On 10/12/09 12:32 PM, Michael McCandless (JIRA) wrote:
>>>>>>
>>>>>>      [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764799#action_12764799 ]
>>>>>>
>>>>>> Michael McCandless commented on LUCENE-1458:
>>>>>> --------------------------------------------
>>>>>>
>>>>>> bq. Shall we create a flexible-indexing branch and commit this?
>>>>>>
>>>>>> I think this is a good idea.
>>>>>>
>>>>>> But I haven't played heavily w/ svn & branching.  EG if we branch
>>>>>> now, and trunk moves fast (which it still is w/ deprecation
>>>>>> removals), are we going to have conflicts?  Or... is svn good about
>>>>>> merging branches?
>>>>>>
>>>>>>
>>>>>>> Further steps towards flexible indexing
>>>>>>> ---------------------------------------
>>>>>>>
>>>>>>>                  Key: LUCENE-1458
>>>>>>>                  URL: https://issues.apache.org/jira/browse/LUCENE-1458
>>>>>>>              Project: Lucene - Java
>>>>>>>           Issue Type: New Feature
>>>>>>>           Components: Index
>>>>>>>     Affects Versions: 2.9
>>>>>>>             Reporter: Michael McCandless
>>>>>>>             Assignee: Michael McCandless
>>>>>>>             Priority: Minor
>>>>>>>          Attachments: LUCENE-1458-back-compat.patch,
>>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>>>> LUCENE-1458.tar.bz2
>>>>>>>
>>>>>>>
>>>>>>> I attached a very rough checkpoint of my current patch, to get early
>>>>>>> feedback.  All tests pass, though back compat tests don't pass due to
>>>>>>> changes to package-private APIs plus certain bugs in tests that
>>>>>>> happened to work (eg call TermPositions.nextPosition() too many times,
>>>>>>> which the new API asserts against).
>>>>>>> [Aside: I think, when we commit changes to package-private APIs such
>>>>>>> that back-compat tests don't pass, we could go back, make a branch on
>>>>>>> the back-compat tag, commit changes to the tests to use the new
>>>>>>> package-private APIs on that branch, then fix nightly build to use
>>>>>>> the tip of that branch?]
>>>>>>> There's still plenty to do before this is committable! This is a
>>>>>>> rather large change:
>>>>>>>    * Switches to a new more efficient terms dict format.  This still
>>>>>>>      uses tii/tis files, but the tii only stores term & long offset
>>>>>>>      (not a TermInfo).  At seek points, tis encodes term & freq/prox
>>>>>>>      offsets absolutely instead of with deltas.  Also, tis/tii
>>>>>>>      are structured by field, so we don't have to record field number
>>>>>>>      in every term.
>>>>>>> .
>>>>>>>      On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>>>>>>>      -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>>>>>>> .
>>>>>>>      RAM usage when loading terms dict index is significantly less
>>>>>>>      since we only load an array of offsets and an array of String
>>>>>>>      (no more TermInfo array).  It should be faster to init too.
>>>>>>> .
>>>>>>>      This part is basically done.
>>>>>>>    * Introduces modular reader codec that strongly decouples terms
>>>>>>>      dict from docs/positions readers.  EG there is no more TermInfo
>>>>>>>      used when reading the new format.
>>>>>>> .
>>>>>>>      There's nice symmetry now between reading & writing in the
>>>>>>>      codec chain -- the current docs/prox format is captured in:
>>>>>>> {code}
>>>>>>> FormatPostingsTermsDictWriter/Reader
>>>>>>> FormatPostingsDocsWriter/Reader (.frq file) and
>>>>>>> FormatPostingsPositionsWriter/Reader (.prx file).
>>>>>>> {code}
>>>>>>>      This part is basically done.
>>>>>>>    * Introduces a new "flex" API for iterating through the fields,
>>>>>>>      terms, docs and positions:
>>>>>>> {code}
>>>>>>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>>>>>>> {code}
>>>>>>>      This replaces TermEnum/Docs/Positions.  SegmentReader emulates
>>>>>>>      the old API on top of the new API to keep back-compat.
>>>>>>>
>>>>>>> Next steps:
>>>>>>>    * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>>>>>>>      fix any hidden assumptions.
>>>>>>>    * Expose new API out of IndexReader, deprecate old API but emulate
>>>>>>>      old API on top of new one, switch all core/contrib users to the
>>>>>>>      new API.
>>>>>>>    * Maybe switch to AttributeSources as the base class for TermsEnum,
>>>>>>>      DocsEnum, PostingsEnum -- this would give readers API flexibility
>>>>>>>      (not just index-file-format flexibility).  EG if someone wanted
>>>>>>>      to store payload at the term-doc level instead of
>>>>>>>      term-doc-position level, you could just add a new attribute.
>>>>>>>    * Test performance & iterate.
>>>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://www.lucidimagination.com
>>
>
>
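
The quoted issue description says that, at seek points, the tis file stores
freq/prox offsets absolutely instead of as deltas. The thread does not show the
actual on-disk layout, so the following is only a minimal sketch of the general
idea, assuming a fixed seek interval. The class name SeekPointOffsets, the
INTERVAL value and both method names are hypothetical and are not taken from
the patch.

```java
/**
 * Hypothetical sketch (not the LUCENE-1458 code): file offsets are stored as
 * deltas from the previous entry, except at every INTERVAL-th entry (a "seek
 * point"), where the absolute offset is stored.
 */
public class SeekPointOffsets {
    /** Assumed distance between seek points; the real value is not given in the thread. */
    static final int INTERVAL = 4;

    /** Encode offsets: absolute at seek points, delta from the previous entry otherwise. */
    static long[] encode(long[] offsets) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % INTERVAL == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /** Decode one entry by starting at the nearest preceding seek point. */
    static long decodeAt(long[] encoded, int target) {
        int seek = (target / INTERVAL) * INTERVAL;
        long value = encoded[seek];          // absolute value, no earlier state needed
        for (int i = seek + 1; i <= target; i++) {
            value += encoded[i];             // apply deltas up to the target entry
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {10, 25, 40, 90, 120, 121, 300};
        long[] encoded = encode(offsets);
        // Entry 5 is recovered from the seek point at entry 4 plus one delta.
        System.out.println(decodeAt(encoded, 5));
    }
}
```

Because a seek point carries an absolute offset, a reader that jumps there via
the terms index can start decoding without replaying deltas from the start of
the file; only the entries between seek points depend on their predecessors.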