lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: custom segment files
Date Fri, 18 Sep 2009 10:01:08 GMT
> Say you have a type of field with fixed length data per doc, e.g.
> 8 bytes.

OK this makes sense -- thanks for the example!  This sounds like
getting column-stride-fields before that feature is added to Lucene
"for real".

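To make the quoted layout concrete, here is a minimal sketch of the
<numdocs><v1><v2>....<vn> file for 8-byte values. FixedWidthColumn and its
method names are hypothetical and not part of Lucene; the 4-byte count
header is inferred from the "8k+4 bytes" figure in the quoted message, and
big-endian byte order is an assumption.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper, not a Lucene class: encodes <numdocs><v1>...<vn>. */
final class FixedWidthColumn {
    static final int HEADER_BYTES = 4; // numdocs stored as a 4-byte int (assumed)
    static final int VALUE_BYTES = 8;  // one long per document

    private FixedWidthColumn() {}

    /** Serializes one value per document, indexed by docID. */
    static byte[] write(long[] valuesByDoc) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + VALUE_BYTES * valuesByDoc.length);
        buf.putInt(valuesByDoc.length);
        for (long v : valuesByDoc) {
            buf.putLong(v);
        }
        return buf.array();
    }

    /** Reads the value for one docID directly from its fixed offset. */
    static long valueAt(byte[] segmentFile, int docID) {
        ByteBuffer buf = ByteBuffer.wrap(segmentFile);
        int numDocs = buf.getInt(0);
        if (docID < 0 || docID >= numDocs) {
            throw new IndexOutOfBoundsException("docID " + docID + " not in [0, " + numDocs + ")");
        }
        return buf.getLong(HEADER_BYTES + VALUE_BYTES * docID);
    }
}
```

For 1000 docs, write() returns 4 + 8 * 1000 = 8004 bytes, and a lookup is
a single fixed-offset read rather than parsing a payload byte[] per doc.
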
For flushing, you can plug in your own indexing chain to IndexWriter.
This (customizing what's indexed per-doc and what's written for the
new segment) is exactly what the pluggable indexing chain is for.
BUT: this API is still very experimental and package private.

I suppose, for looser integration we could add a hook that's called in
IndexWriter giving you a chance to do something at flush.
Hmm... actually could you use doAfterFlush()?

Merging, however, doesn't yet have hooks / pluggability in place to do
something custom, and I agree it's sorely needed.  Patches very
welcome here!

This could enable "loose" customization on what's flushed and how it's
merged, and you'd have to make your own reader external to Lucene.
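
As a rough illustration of what such "loose" customization might look like
outside Lucene, the sketch below follows the start / write per doc / end /
merge steps from the quoted message for a <numdocs><v1>....<vn> file of
8-byte values. PerDocLongWriter is a hypothetical class, not a Lucene API;
the merge assumes segments are concatenated in order and that no documents
were deleted (dropping deleted docs would require renumbering docIDs).

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not a Lucene API: start / write per doc / end / merge. */
final class PerDocLongWriter {
    private List<Long> pending;

    /** start: begin buffering values for a new segment. */
    void start() {
        pending = new ArrayList<>();
    }

    /** write per doc: called once per document, in docID order. */
    void add(long value) {
        pending.add(value);
    }

    /** end: emit <numdocs><v1>...<vn> for the flushed segment. */
    byte[] finish() {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 * pending.size());
        buf.putInt(pending.size());
        for (long v : pending) {
            buf.putLong(v);
        }
        pending = null;
        return buf.array();
    }

    /** merge: concatenate values in segment order (assumes no deletions). */
    static byte[] merge(List<byte[]> segments) {
        int total = 0;
        for (byte[] seg : segments) {
            total += ByteBuffer.wrap(seg).getInt(0);
        }
        ByteBuffer out = ByteBuffer.allocate(4 + 8 * total);
        out.putInt(total);
        for (byte[] seg : segments) {
            int n = ByteBuffer.wrap(seg).getInt(0);
            out.put(seg, 4, 8 * n); // skip each segment's own 4-byte count
        }
        return out.array();
    }
}
```

A flush hook would call start(), add() once per doc, then finish(); a merge
hook, once one exists, would call merge() on the per-segment files.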

LUCENE-1458 is aiming to cover this sort of use case, but in a more
tightly integrated way.  EG the new enumeration API in LUCENE-1458 (to
replace TermEnum, TermDocs, TermPositions) is based on AttributeSource
so that you could add your own attribute at the field, term, doc or
positions level.  However I haven't explored this at all yet, and eg
customizable merging is not done.

> It [flush] probably doesn't need to be final Mike?

I agree.  Wanna include un-final'ing it in a patch?

> Is there a wiki or some sort of write up on LUCENE-1458?

Sorry not just yet.  I agree it's badly needed... it's an enormous set
of changes at this point.  I'll add a wiki page that I'll try to keep
current as the design iterates.

Mike

On Thu, Sep 17, 2009 at 8:14 PM, John Wang <john.wang@gmail.com> wrote:
> Sure.
>
> A simple example:
>
> Say you have a type of field with fixed length data per doc, e.g. 8 bytes.
> It might be good to store in a segment:
> <numdocs><v1><v2>....<vn>
>
> so if you have 1000 docs, your seg file is 8k+4 bytes.
>
> Merging would be rather trivial as well.
>
> Doing this right now involves storing into payload, which pays a cost of
> parsing byte[] to say a long per doc.
>
> I think this problem is orthogonal to 1458.
>
> There are other usecases, so I thought it might be a good idea to abstract
> it out, since on a high level it is rather similar:
>
> start
> write per doc
> end
> merge
>
> Hopefully I am describing it clearly.
>
> Thanks
>
> -John
>
>
> On Thu, Sep 17, 2009 at 10:35 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>>
>> I'm actively working on LUCENE-1458, to enable different codecs for
>> reading/writing the terms dict and doc/freq/prox/payload postings.
>> I'm working now towards getting PforDelta working...
>>
>> However, that change doesn't [yet] do anything for norms, stored
>> fields nor term vectors.
>>
>> Can you describe more details about what kinds of customization you're
>> looking to do?
>>
>> Mike
>>
>> On Thu, Sep 17, 2009 at 10:00 AM, John Wang <john.wang@gmail.com> wrote:
>> > Hi guys:
>> >
>> >      I am trying to figure out how to add the ability to create custom
>> > segment files. Hopefully it is possible to create a plugin framework
>> > where one can provide some sort of callback to add to a segment given a
>> > doc, and provide some sort of merge logic. This is in light of the
>> > flexible indexing effort.
>> >
>> >      After digging through the latest trunk code in that area, I see a
>> > Writer/WriterPerThread pattern for different types of segment files,
>> > e.g. stored data, norms, inverted doc, etc.
>> >
>> >      Do you think it is a good idea to consolidate them? Are there
>> > intricacies where there are cross dependencies between different types
>> > of writers?
>> >
>> >      Merge logic seems to be in the SegmentMerger class. To do this, it
>> > would be good to separate it out per writer type.
>> >
>> >      I am still trying to understand the code; any help is greatly
>> > appreciated.
>> >
>> > Thoughts?
>> >
>> > Thanks
>> >
>> > -John
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>


