lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-2125) Ability to store and retrieve attributes in the inverted index
Date Thu, 09 May 2013 23:06:14 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Uwe Schindler updated LUCENE-2125:
----------------------------------

    Fix Version/s:     (was: 4.3)
                   4.4
    
> Ability to store and retrieve attributes in the inverted index
> --------------------------------------------------------------
>
>                 Key: LUCENE-2125
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2125
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>    Affects Versions: 4.0-ALPHA
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 4.4
>
>
> Now that we have the cool attribute-based TokenStream API and also the
> great new flexible indexing features, the next logical step is to
> allow storing the attributes inline in the posting lists. Currently
> this is only supported for the PayloadAttribute.
> The flex search APIs already provide an AttributeSource, so there will
> be a very clean and performant symmetry. It should be seamlessly
> possible for the user to define a new attribute, add it to the
> TokenStream, and then retrieve it from the flex search APIs.
> What I'm planning to do is to add additional methods to the token
> attributes (e.g. by adding a new class TokenAttributeImpl, which
> extends AttributeImpl and is the super class of all impls in
> o.a.l.a.tokenattributes):
> - void serialize(DataOutput)
> - void deserialize(DataInput)
> - boolean storeInIndex()
> The indexer will only call the serialize method of an
> TokenAttributeImpl in case its storeInIndex() returns true. 
> The big advantage here is the ease-of-use: A user can implement in one
> place everything necessary to add the attribute to the index.
> Btw: I'd like to introduce DataOutput and DataInput as super classes
> of IndexOutput and IndexInput. They will contain methods like
> readByte(), readVInt(), etc., but methods such as close(),
> getFilePointer() etc. will stay in the super classes.
> Currently the payload concept is hardcoded in 
> TermsHashPerField and FreqProxTermsWriterPerField. These classes take
> care of copying the contents of the PayloadAttribute over into the 
> intermediate in-memory postinglist representation and reading it
> again. Ideally these classes should not know about specific
> attributes, but only call serialze() on those attributes that shall
> be stored in the posting list.
> We also need to change the PositionsEnum and PositionsConsumer APIs to
> deal with attributes instead of payloads.
> I think the new codecs should all support storing attributes. Only the
> preflex one should be hardcoded to only take the PayloadAttribute into
> account.
> We'll possibly need another extension point that allows us to influence 
> compression across multiple postings. Today we use the
> length-compression trick for the payloads: if the previous payload had
> the same length as the current one, we don't store the length
> explicitly again, but only set a bit in the shifted position VInt. Since
> often all payloads of one posting list have the same length, this
> results in effective compression.
> Now an advanced user might want to implement a similar encoding, where
> it's not enough to just control serialization of a single value, but
> where e.g. the previous position can be taken into account to decide
> how to encode a value. 
> I'm not sure yet how this extension point should look like. Maybe the
> flex APIs are actually already sufficient.
> One major goal of this feature is performance: It ought to be more 
> efficient to e.g. define an attribute that writes and reads a single 
> VInt than storing that VInt as a payload. The payload has the overhead
> of converting the data into a byte array first. An attribute on the other 
> hand should be able to call 'int value = dataInput.readVInt();' directly
> without the byte[] indirection.
> After this part is done I'd like to use a very similar approach for
> column-stride fields.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message