lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2125) Ability to store and retrieve attributes in the inverted index
Date Sat, 06 Mar 2010 14:34:27 GMT


Michael McCandless commented on LUCENE-2125:

We may need to allow for stateful serializers?

EG (contrived example) imagine an attr that stays the same for most
docs, so, attr writes 1 byte for "it's the same or not" and then many
bytes when there is a change.  The serializer will want to remember
last value it wrote?  (Hmm though I guess attr could also eg keep a
bit inside noting that it had changed on the last call to .next(), as
well).  (The payload encoding length only when length changes is a
similar example, but, this encoding "takes advantage" of being deeply
tied to the codec since that bit is merged with the position length
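
A minimal sketch of the "same or not" idea above, under loud assumptions:
the class and method names are hypothetical and not part of any Lucene API,
and java.nio.ByteBuffer stands in for the DataOutput/DataInput proposed in
the issue. The writer keeps the last value it wrote, emits a single flag
byte when the value repeats, and emits the flag plus the full value when it
changes; the reader has to keep matching state.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; none of these names come from Lucene.
final class SameAsLastSketch {

    /** Stateful serializer: remembers the last value it wrote. */
    static final class Writer {
        private boolean hasLast = false;
        private long last;

        void write(long value, ByteBuffer out) {
            if (hasLast && value == last) {
                out.put((byte) 0);      // unchanged: flag byte only
            } else {
                out.put((byte) 1);      // changed: flag byte, then the full value
                out.putLong(value);
                last = value;
                hasLast = true;
            }
        }
    }

    /** Stateful deserializer: mirrors the writer by remembering the last value read. */
    static final class Reader {
        private long last;

        long read(ByteBuffer in) {
            if (in.get() == 1) {
                last = in.getLong();
            }
            return last;
        }
    }

    /** Encodes all values with one writer and returns a buffer ready for reading. */
    static ByteBuffer encode(long... values) {
        ByteBuffer out = ByteBuffer.allocate(values.length * 9); // worst case: 9 bytes each
        Writer writer = new Writer();
        for (long v : values) {
            writer.write(v, out);
        }
        out.flip();
        return out;
    }

    /** Decodes count values with one reader. */
    static long[] decode(ByteBuffer in, int count) {
        Reader reader = new Reader();
        long[] result = new long[count];
        for (int i = 0; i < count; i++) {
            result[i] = reader.read(in);
        }
        return result;
    }
}
```

Because the state lives in the serializer rather than in the attr, the same
Writer instance has to be used for a whole run of values, which is the
reason a stateless serialize(DataOutput) call may not be enough.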

Or imagine writing strings to the index, but the strings have dups,
yet you don't know the full universe of strings up front.  So you make
a dict as you go (first time you see a string you assign it the next
int).  This case goes beyond first one because this dict must be
saved on .close() (maybe optionally taking a different DataOutput to
save its state to), and, codec must remember which file that attr had
been .close()d on so that at read time it can seek there and init the
stateful deserializer (which should be lazy... ie if you don't request
the attr it shouldn't load the dict).
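
The dictionary case could look roughly like the sketch below. Everything in
it is an assumption made for illustration: the names are hypothetical,
java.nio.ByteBuffer stands in for DataOutput/DataInput, and ids are written
as fixed-width ints to keep the example short. The writer assigns the next
int the first time it sees a string and saves the dictionary to a separate
output on close(); the reader only loads the dictionary on the first real
read.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from Lucene.
final class StringDictSketch {

    static final class Writer {
        private final Map<String, Integer> ids = new LinkedHashMap<>();

        /** Writes the string's id, assigning the next int on first sight. */
        void write(String value, ByteBuffer postings) {
            Integer id = ids.get(value);
            if (id == null) {
                id = ids.size();
                ids.put(value, id);
            }
            postings.putInt(id);
        }

        /** Saves the dictionary to a separate output when the writer is closed. */
        void close(ByteBuffer dictOut) {
            dictOut.putInt(ids.size());
            for (String s : ids.keySet()) {   // insertion order equals id order
                byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
                dictOut.putInt(utf8.length);
                dictOut.put(utf8);
            }
        }
    }

    static final class Reader {
        private final ByteBuffer dictIn;
        private String[] dict;                // stays null until first needed

        Reader(ByteBuffer dictIn) {
            this.dictIn = dictIn;
        }

        boolean isDictLoaded() {
            return dict != null;
        }

        /** Skips one value without touching the dictionary. */
        void skip(ByteBuffer postings) {
            postings.position(postings.position() + Integer.BYTES);
        }

        /** Reads one value, loading the dictionary lazily on first use. */
        String read(ByteBuffer postings) {
            int id = postings.getInt();
            if (dict == null) {
                loadDict();
            }
            return dict[id];
        }

        private void loadDict() {
            dict = new String[dictIn.getInt()];
            for (int i = 0; i < dict.length; i++) {
                byte[] utf8 = new byte[dictIn.getInt()];
                dictIn.get(utf8);
                dict[i] = new String(utf8, StandardCharsets.UTF_8);
            }
        }
    }
}
```

In this sketch the caller hands the dictionary buffer to the Reader; in a
real codec that is the part where the codec would have to remember where
the dict was saved so it can seek there at read time.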

Also: codec would need to know if serialization is fixed width... or
maybe expose a .skip() method on deserializer?  EG I may be enuming
only docs/positions but not attrs (say, running a normal PhraseQuery),
and I want to just skip (like how we skip payload today when it's not
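
To make the fixed-width vs .skip() question concrete, here is a rough
sketch. The AttrDeserializer interface and both implementations are
hypothetical, and java.nio.ByteBuffer again stands in for DataInput. A
fixed-width attr can be skipped by advancing the position; a
variable-width one (here a VInt-style encoding with 7 data bits per byte)
needs its own skip() that walks the bytes without decoding them.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; none of these names come from Lucene.
final class SkipSketch {

    /** Assumed deserializer contract with an explicit skip(). */
    interface AttrDeserializer {
        void skip(ByteBuffer in);
    }

    /** Fixed-width attr: skipping is a plain position bump. */
    static final class FixedIntAttr implements AttrDeserializer {
        int read(ByteBuffer in) {
            return in.getInt();
        }

        @Override
        public void skip(ByteBuffer in) {
            in.position(in.position() + Integer.BYTES);
        }
    }

    /** Variable-width attr: 7 data bits per byte, high bit means "more bytes follow". */
    static final class VIntAttr implements AttrDeserializer {
        /** Writes a non-negative int in VInt form. */
        static void write(int value, ByteBuffer out) {
            while ((value & ~0x7F) != 0) {
                out.put((byte) ((value & 0x7F) | 0x80));
                value >>>= 7;
            }
            out.put((byte) value);
        }

        int read(ByteBuffer in) {
            byte b = in.get();
            int value = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = in.get();
                value |= (b & 0x7F) << shift;
            }
            return value;
        }

        @Override
        public void skip(ByteBuffer in) {
            // Consume bytes until one has the continuation bit clear.
            while ((in.get() & 0x80) != 0) {
                // nothing to decode
            }
        }
    }
}
```

With something like this, an enum that only wants docs/positions could
call skip() per position and never pay for decoding the attr.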

I wonder if StandardCodec should inline serialized attrs into existing
postings lists, or, make separate file to hold them...?

> Ability to store and retrieve attributes in the inverted index
> --------------------------------------------------------------
>                 Key: LUCENE-2125
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: Flex Branch
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: Flex Branch
> Now that we have the cool attribute-based TokenStream API and also the
> great new flexible indexing features, the next logical step is to
> allow storing the attributes inline in the posting lists. Currently
> this is only supported for the PayloadAttribute.
> The flex search APIs already provide an AttributeSource, so there will
> be a very clean and performant symmetry. It should be seamlessly
> possible for the user to define a new attribute, add it to the
> TokenStream, and then retrieve it from the flex search APIs.
> What I'm planning to do is to add additional methods to the token
> attributes (e.g. by adding a new class TokenAttributeImpl, which
> extends AttributeImpl and is the super class of all impls in
> o.a.l.a.tokenattributes):
> - void serialize(DataOutput)
> - void deserialize(DataInput)
> - boolean storeInIndex()
> The indexer will only call the serialize method of a
> TokenAttributeImpl in case its storeInIndex() returns true.
> The big advantage here is the ease-of-use: A user can implement in one
> place everything necessary to add the attribute to the index.
> Btw: I'd like to introduce DataOutput and DataInput as super classes
> of IndexOutput and IndexInput. They will contain methods like
> readByte(), readVInt(), etc., but methods such as close(),
> getFilePointer() etc. will stay in the subclasses.
> Currently the payload concept is hardcoded in 
> TermsHashPerField and FreqProxTermsWriterPerField. These classes take
> care of copying the contents of the PayloadAttribute over into the 
> intermediate in-memory postinglist representation and reading it
> again. Ideally these classes should not know about specific
> attributes, but only call serialize() on those attributes that shall
> be stored in the posting list.
> We also need to change the PositionsEnum and PositionsConsumer APIs to
> deal with attributes instead of payloads.
> I think the new codecs should all support storing attributes. Only the
> preflex one should be hardcoded to only take the PayloadAttribute into
> account.
> We'll possibly need another extension point that allows us to influence 
> compression across multiple postings. Today we use the
> length-compression trick for the payloads: if the previous payload had
> the same length as the current one, we don't store the length
> explicitly again, but only set a bit in the shifted position VInt. Since
> often all payloads of one posting list have the same length, this
> results in effective compression.
> Now an advanced user might want to implement a similar encoding, where
> it's not enough to just control serialization of a single value, but
> where e.g. the previous position can be taken into account to decide
> how to encode a value. 
> I'm not sure yet what this extension point should look like. Maybe the
> flex APIs are actually already sufficient.
> One major goal of this feature is performance: It ought to be more 
> efficient to e.g. define an attribute that writes and reads a single 
> VInt than storing that VInt as a payload. The payload has the overhead
> of converting the data into a byte array first. An attribute on the other 
> hand should be able to call 'int value = dataInput.readVInt();' directly
> without the byte[] indirection.
> After this part is done I'd like to use a very similar approach for
> column-stride fields.
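
The payload length-compression trick described in the quoted issue can be
sketched as follows. This is an illustration only, not the actual Lucene
writer code: the class and method names are hypothetical, and
java.nio.ByteBuffer stands in for the index output. Each position delta is
shifted left by one bit; the low bit is set only when the payload length
differs from the previous one, in which case the new length follows as a
VInt.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; none of these names come from Lucene.
final class PayloadLengthSketch {

    /** Writes a non-negative int with 7 data bits per byte. */
    static void writeVInt(int value, ByteBuffer out) {
        while ((value & ~0x7F) != 0) {
            out.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.put((byte) value);
    }

    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Encodes increasing positions with one payload each. */
    static ByteBuffer encode(int[] positions, byte[][] payloads) {
        int size = 0;
        for (byte[] p : payloads) {
            size += 10 + p.length;          // two VInts of at most 5 bytes, plus payload
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        int lastPos = 0;
        int lastLen = -1;                   // forces the first length to be written
        for (int i = 0; i < positions.length; i++) {
            int delta = positions[i] - lastPos;
            int len = payloads[i].length;
            if (len != lastLen) {
                writeVInt((delta << 1) | 1, out);   // low bit set: length follows
                writeVInt(len, out);
                lastLen = len;
            } else {
                writeVInt(delta << 1, out);         // low bit clear: same length as before
            }
            out.put(payloads[i]);
            lastPos = positions[i];
        }
        out.flip();
        return out;
    }

    /** Decodes count entries; returns positions and fills payloadsOut. */
    static int[] decode(ByteBuffer in, int count, byte[][] payloadsOut) {
        int[] positions = new int[count];
        int pos = 0;
        int len = 0;
        for (int i = 0; i < count; i++) {
            int code = readVInt(in);
            pos += code >>> 1;
            if ((code & 1) != 0) {
                len = readVInt(in);
            }
            payloadsOut[i] = new byte[len];
            in.get(payloadsOut[i]);
            positions[i] = pos;
        }
        return positions;
    }
}
```

Note that the reader needs the previous length to decode the current
entry, which is exactly the kind of cross-posting state that a per-value
serialize()/deserialize() pair does not capture on its own.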

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.