lucene-dev mailing list archives

From John Wang <john.w...@gmail.com>
Subject Re: custom segment files
Date Fri, 18 Sep 2009 11:02:11 GMT
Thank you very much Michael for the information!

-John

On Fri, Sep 18, 2009 at 6:01 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> > Say you have a type of field with fixed length data per doc, e.g.
> > 8 bytes.
>
> OK this makes sense -- thanks for the example!  This sounds like
> getting column-stride-fields before that feature is added to Lucene
> "for real".
>
> For flushing, you can plug in your own indexing chain to IndexWriter.
> This (customizing what's indexed per-doc and what's written for the
> new segment) is exactly what the pluggable indexing chain is for.
> BUT: this API is still very experimental and package private.
>
> I suppose, for looser integration we could add a hook that's called in
> IndexWriter giving you a chance to do something at flush.
> Hmm... actually could you use doAfterFlush()?
>
> Merging, however, doesn't yet have hooks / pluggability in place to do
> something custom, and I agree it's sorely needed.  Patches very
> welcome here!
>
> This could enable "loose" customization on what's flushed and how it's
> merged, and you'd have to make your own reader external to Lucene.
>
> LUCENE-1458 is aiming to cover this sort of use case, but in a more
> tightly integrated way.  EG the new enumeration API in LUCENE-1458 (to
> replace TermEnum, TermDocs, TermPositions) is based on AttributeSource
> so that you could add your own attribute at the field, term, doc or
> positions level.  However I haven't explored this at all yet, and eg
> customizable merging is not done.
>
> > It [flush] probably doesn't need to be final Mike?
>
> I agree.  Wanna include un-final'ing it in a patch?
>
> > Is there a wiki or some sort of write up on LUCENE-1458?
>
> Sorry not just yet.  I agree it's badly needed... it's an enormous set
> of changes at this point.  I'll add a wiki page that I'll try to keep
> current as the design iterates.
>
> Mike
>
> On Thu, Sep 17, 2009 at 8:14 PM, John Wang <john.wang@gmail.com> wrote:
> > Sure.
> >
> > A simple example:
> >
> > Say you have a type of field with fixed length data per doc, e.g.
> > 8 bytes.
> > It might be good to store in a segment:
> > <numdocs><v1><v2>....<vn>
> >
> > so if you have 1000 docs, your seg file is 8k+4 bytes.
> >
> > Merging would be rather trivial as well.
> >
> > Doing this right now involves storing into payload, which pays a cost of
> > parsing byte[] to say a long per doc.
> >
> > I think this problem is orthogonal to 1458.
> >
> > There are other use cases, so I thought it might be a good idea to
> > abstract it out, since on a high level it is rather similar:
> >
> > start
> > write per doc
> > end
> > merge
> >
> > Hopefully I am describing it clearly.
> >
> > Thanks
> >
> > -John
> >
> >
> > On Thu, Sep 17, 2009 at 10:35 PM, Michael McCandless
> > <lucene@mikemccandless.com> wrote:
> >>
> >> I'm actively working on LUCENE-1458, to enable different codecs for
> >> reading/writing the terms dict and doc/freq/prox/payload postings.
> >> I'm working now towards getting PforDelta working...
> >>
> >> However, that change doesn't [yet] do anything for norms, stored
> >> fields nor term vectors.
> >>
> >> Can you describe more details about what kinds of customization you're
> >> looking to do?
> >>
> >> Mike
> >>
> >> On Thu, Sep 17, 2009 at 10:00 AM, John Wang <john.wang@gmail.com> wrote:
> >> > Hi guys:
> >> >
> >> >      I am trying to figure out how to add the ability to create custom
> >> > segment files. Hopefully it is possible to create a plugin framework
> >> > where one can provide some sort of callback to add to a segment given
> >> > a doc and provide some sort of merge logic. This is in light of the
> >> > flexible indexing effort.
> >> >
> >> >      After digging through the latest trunk code in that area, I see a
> >> > Writer/WriterPerThread pattern for different types of segment files,
> >> > e.g. stored data, norms, inverted doc, etc.
> >> >
> >> >      Do you think it is a good idea to consolidate them? Are there
> >> > intricacies where there are cross dependencies between different types
> >> > of writers?
> >> >
> >> >      Merge logic seems to be in the SegmentMerger class. To do this, it
> >> > seems it would be good to separate it out per writer type.
> >> >
> >> >      I am still trying to understand the code; any help is greatly
> >> > appreciated.
> >> >
> >> > Thoughts?
> >> >
> >> > Thanks
> >> >
> >> > -John
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >
> >
>
>
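
For illustration, the fixed-width layout described in the quoted thread
(<numdocs><v1><v2>....<vn>, with the start / write per doc / end / merge
steps) can be sketched in plain Java. This is only a sketch under stated
assumptions, not Lucene code: the class FixedWidthSegment and its methods are
hypothetical names, <numdocs> is assumed to be a 4-byte int (which matches the
"8k+4" size estimate), values are big-endian longs, and the merge simply
concatenates segments without handling deleted docs.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for the <numdocs><v1>...<vn> layout; not a Lucene class. */
final class FixedWidthSegment {
    static final int HEADER_BYTES = 4; // assumption: <numdocs> stored as a 4-byte int
    static final int VALUE_BYTES = 8;  // one fixed-length 8-byte value per doc

    private FixedWidthSegment() {}

    /** start / write per doc / end: encodes one value per doc, in docID order. */
    static byte[] write(long[] values) {
        ByteBuffer out = ByteBuffer.allocate(HEADER_BYTES + VALUE_BYTES * values.length);
        out.putInt(values.length);   // start: header with the doc count
        for (long v : values) {
            out.putLong(v);          // write per doc
        }
        return out.array();          // end
    }

    /** Reads the doc count from the header. */
    static int numDocs(byte[] segment) {
        return ByteBuffer.wrap(segment).getInt(0);
    }

    /** Reads the value for one doc directly at its fixed offset. */
    static long get(byte[] segment, int docId) {
        return ByteBuffer.wrap(segment).getLong(HEADER_BYTES + VALUE_BYTES * docId);
    }

    /**
     * merge: sums the doc counts and concatenates the value blocks in segment
     * order. Assumption: no deleted docs, so docIDs are only shifted.
     */
    static byte[] merge(byte[]... segments) {
        int total = 0;
        for (byte[] s : segments) {
            total += numDocs(s);
        }
        ByteBuffer out = ByteBuffer.allocate(HEADER_BYTES + VALUE_BYTES * total);
        out.putInt(total);
        for (byte[] s : segments) {
            out.put(s, HEADER_BYTES, s.length - HEADER_BYTES);
        }
        return out.array();
    }
}
```

With these assumptions, 1000 docs encode to 4 + 8 * 1000 = 8004 bytes, and a
value is read at offset 4 + 8 * docID without first materializing a byte[]
payload per doc.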
