lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2125) Ability to store and retrieve attributes in the inverted index
Date Mon, 08 Mar 2010 00:32:27 GMT


Michael McCandless commented on LUCENE-2125:

One question...

Say I make an attr and it serializes to variable number of bytes, per

How can we design serializer API so that this attr can do the same
encoding trick we do with payload today?

Ie where we steal 1 bit from the pos-delta to state whether the length
changed from last time?

If we can do this then I think we should remove payload from the
flex postings API and just move directly to attrs?

Though: how, also, should we encode more than 1 attr per position?
Can we somehow make this the responsibility of the serializer (if we
can somehow get one serializer for all attrs that are serializable in
the source)?

This way a user could make their own serializer if they know
interesting things about the attrs they need to serialize.  EG maybe
either A or B needs to be serialized but never both... or C only is
serialized if A is not null... etc.

If we do this then from Lucene's standpoint the serialization will
"feel" just like payload feels today -- an optional byte[] that may or
may not be variable length.

> Ability to store and retrieve attributes in the inverted index
> --------------------------------------------------------------
>                 Key: LUCENE-2125
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: Flex Branch
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: Flex Branch
> Now that we have the cool attribute-based TokenStream API and also the
> great new flexible indexing features, the next logical step is to
> allow storing the attributes inline in the posting lists. Currently
> this is only supported for the PayloadAttribute.
> The flex search APIs already provide an AttributeSource, so there will
> be a very clean and performant symmetry. It should be seamlessly
> possible for the user to define a new attribute, add it to the
> TokenStream, and then retrieve it from the flex search APIs.
> What I'm planning to do is to add additional methods to the token
> attributes (e.g. by adding a new class TokenAttributeImpl, which
> extends AttributeImpl and is the super class of all impls in
> o.a.l.a.tokenattributes):
> - void serialize(DataOutput)
> - void deserialize(DataInput)
> - boolean storeInIndex()
> The indexer will only call the serialize method of an
> TokenAttributeImpl in case its storeInIndex() returns true. 
> The big advantage here is the ease-of-use: A user can implement in one
> place everything necessary to add the attribute to the index.
> Btw: I'd like to introduce DataOutput and DataInput as super classes
> of IndexOutput and IndexInput. They will contain methods like
> readByte(), readVInt(), etc., but methods such as close(),
> getFilePointer() etc. will stay in the super classes.
> Currently the payload concept is hardcoded in 
> TermsHashPerField and FreqProxTermsWriterPerField. These classes take
> care of copying the contents of the PayloadAttribute over into the 
> intermediate in-memory postinglist representation and reading it
> again. Ideally these classes should not know about specific
> attributes, but only call serialze() on those attributes that shall
> be stored in the posting list.
> We also need to change the PositionsEnum and PositionsConsumer APIs to
> deal with attributes instead of payloads.
> I think the new codecs should all support storing attributes. Only the
> preflex one should be hardcoded to only take the PayloadAttribute into
> account.
> We'll possibly need another extension point that allows us to influence 
> compression across multiple postings. Today we use the
> length-compression trick for the payloads: if the previous payload had
> the same length as the current one, we don't store the length
> explicitly again, but only set a bit in the shifted position VInt. Since
> often all payloads of one posting list have the same length, this
> results in effective compression.
> Now an advanced user might want to implement a similar encoding, where
> it's not enough to just control serialization of a single value, but
> where e.g. the previous position can be taken into account to decide
> how to encode a value. 
> I'm not sure yet how this extension point should look like. Maybe the
> flex APIs are actually already sufficient.
> One major goal of this feature is performance: It ought to be more 
> efficient to e.g. define an attribute that writes and reads a single 
> VInt than storing that VInt as a payload. The payload has the overhead
> of converting the data into a byte array first. An attribute on the other 
> hand should be able to call 'int value = dataInput.readVInt();' directly
> without the byte[] indirection.
> After this part is done I'd like to use a very similar approach for
> column-stride fields.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message