lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <>
Subject [jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
Date Mon, 09 Aug 2010 16:53:19 GMT


Simon Willnauer commented on LUCENE-2186:

It should be more extensible, ie, you can make your own attrs to
store whatever you want. EG we should be able to use this to
store the flex scoring stats (LUCENE-2392).

This is actually the first real use case, together with the norms, which are more or less
part of LUCENE-2392 anyway.

The end-user API is rather cumbersome now (ie, that the user must
interact directly w/ attrs). It seems like we should have a sugar
layer on top, eg an IntField(Type) and I can do IntField.set/get.
Yeah, I guess lots of users would have a rather hard time with that. I remember Grant saying
that he has had to explain Document and Fields in his trainings since forever, so with users in
mind this should be done with the least amount of changes. Nevertheless, this is something that
should be fixed outside of this particular issue; LUCENE-2310 would be one I could think of.
Guess I need to talk to chrismale on Friday about that.
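
To make the "sugar layer" idea concrete, here is a minimal sketch of what an IntField with set/get could look like. Everything in it is hypothetical: SketchIntValueAttribute stands in for whatever attribute the patch ends up using, and neither class is part of the patch or of Lucene.

```java
// Hypothetical stand-in for the values attribute; not Lucene's real API.
final class SketchIntValueAttribute {
    private long value;

    void setInt(long v) { value = v; }

    long getInt() { return value; }
}

// Hypothetical sugar layer: the user calls set/get and never touches the attribute.
final class SketchIntField {
    private final String name;
    private final SketchIntValueAttribute attr = new SketchIntValueAttribute();

    SketchIntField(String name) { this.name = name; }

    String name() { return name; }

    // Returns this so the field can be created and filled in one expression.
    SketchIntField set(long v) {
        attr.setInt(v);
        return this;
    }

    long get() { return attr.getInt(); }

    // What an indexer would consume internally (assumption).
    SketchIntValueAttribute attribute() { return attr; }
}
```

With something like this a user would write new SketchIntField("price").set(42) and only the indexer would ever call attribute().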


Also... maybe we should use Attrs the way NumericField does. Ie, for
CSF we'd have a TokenStream (single valued, for now anyway), and then
attrs could be added to it. If we can get attr serialization
(LUCENE-2125) online, then we can refactor all the read/write code in
this issue as the default attr serializers? And, then, indexer would
have no special code for CSF in particular. It just asks attrs to
serialize themselves...
LUCENE-2125 would be nice to have together with CSF. I don't think the two depend on each
other, but they should eventually use the same or very closely related APIs.
LUCENE-2125 has different problems to tackle first, I guess, but I am following it closely!
I will update the patch to use the NumericField approach (let's call it a workaround) to
make this patch "less hairy". Still hairy, but I like the idea of using a TokenStream to attach
the ValuesAttribute.

Shouldn't FloatsRef be FloatRef (same for IntsRef)? It's ref'ing a
single value right?

Yes and no. I was too lazy to add all the capabilities BytesRef has, but I could imagine
that this can benefit from being able to hold more values, maybe an entire page when paging
is used. If it only held a single value we wouldn't need offset and length either. I will leave
it like that for now; we can still change it later if it turns out that we don't need this flexibility.
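
As a rough illustration of why offset and length only pay off once the ref can hold more than one value, here is a sketch of such a ref as a window over a shared array. The class name, the use of long[] and the helper methods are assumptions made for this example, not the code in the patch.

```java
// Hypothetical multi-value ref: a view into a shared array, in the spirit of BytesRef.
final class SketchIntsRef {
    final long[] ints;   // backing array, possibly shared by many refs (e.g. one page)
    final int offset;    // index of the first value of this ref
    final int length;    // number of values visible through this ref

    SketchIntsRef(long[] ints, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > ints.length) {
            throw new IllegalArgumentException("window out of bounds");
        }
        this.ints = ints;
        this.offset = offset;
        this.length = length;
    }

    // Single-value case: offset and length degenerate to 0 and 1.
    static SketchIntsRef single(long value) {
        return new SketchIntsRef(new long[] { value }, 0, 1);
    }

    // Reads the i-th value relative to the window, not to the backing array.
    long get(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i + ", length " + length);
        }
        return ints[offset + i];
    }
}
```

In the single-value case the two extra fields carry no information, which is the point of the question above; they only become useful when several refs share one page.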

I guess I will move the ValuesEnum down to Fields and FieldsEnum soon. I don't think we should
confuse this with a DocsEnum, since DocsEnum is so closely related to Terms and has an explicit
getter for freq(). DocIdSetIterator seems to be fine for that purpose, while the
AttributeSource could be pulled up.
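
A minimal sketch of what a DocIdSetIterator-style values enum could look like, assuming dense storage with one value per document. The class is hypothetical and does not extend any Lucene class; it only mirrors the docID/nextDoc/advance contract and the NO_MORE_DOCS sentinel, and it deliberately has no freq().

```java
// Hypothetical values enum over dense per-document values; not the patch's ValuesEnum.
final class SketchIntValuesEnum {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // same sentinel DocIdSetIterator uses

    private final long[] values; // assumption: values[docID] holds that document's value
    private int doc = -1;        // -1 means the enum is not positioned yet

    SketchIntValuesEnum(long[] values) { this.values = values; }

    int docID() { return doc; }

    int nextDoc() { return advance(doc + 1); }

    // Moves to the first document >= target that lies beyond the current one.
    int advance(int target) {
        if (doc == NO_MORE_DOCS) {
            return doc; // already exhausted, stay exhausted
        }
        int next = Math.max(target, doc + 1);
        doc = next < values.length ? next : NO_MORE_DOCS;
        return doc;
    }

    // Value of the current document; only valid while positioned on a document.
    long value() {
        if (doc < 0 || doc == NO_MORE_DOCS) {
            throw new IllegalStateException("enum is not positioned on a document");
        }
        return values[doc];
    }
}
```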

> First cut at column-stride fields (index values storage)
> --------------------------------------------------------
>                 Key: LUCENE-2186
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Simon Willnauer
>             Fix For: 4.0
>         Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch,
> I created an initial basic impl for storing "index values" (ie
> column-stride value storage).  This is still a work in progress... but
> the approach looks compelling.  I'm posting my current status/patch
> here to get feedback/iterate, etc.
> The code is standalone now, and lives under new package
> oal.index.values (plus some util changes, refactorings) -- I have yet
> to integrate into Lucene so eg you can mark that a given Field's value
> should be stored into the index values, sorting will use these values
> instead of field cache, etc.
> It handles 3 types of values:
>   * Six variants of byte[] per doc, all combinations of fixed vs
>     variable length, and stored either "straight" (good for eg a
>     "title" field), "deref" (good when many docs share the same value,
>     but you won't do any sorting) or "sorted".
>   * Integers (variable bit precision used as necessary, ie this can
>     store byte/short/int/long, and all precisions in between)
>   * Floats (4 or 8 byte precision)
> String fields are stored as the UTF8 byte[].  This patch adds a
> BytesRef, which does the same thing as flex's TermRef (we should merge
> them).
> This patch also adds basic initial impl of PackedInts (LUCENE-1990);
> we can swap that out if/when we get a better impl.
> This storage is dense (like field cache), so it's appropriate when the
> field occurs in all/most docs.  It's just like field cache, except the
> reading API is a get() method invocation, per document.
> Next step is to do basic integration with Lucene, and then compare
> sort performance of this vs field cache.
> For the "sort by String value" case, I think RAM usage & GC load of
> this index values API should be much better than field cache, since
> it does not create object per document (instead shares big long[] and
> byte[] across all docs), and because the values are stored in RAM as
> their UTF8 bytes.
> There are abstract Writer/Reader classes.  The current reader impls
> are entirely RAM resident (like field cache), but the API is (I think)
> agnostic, ie, one could make an MMAP impl instead.
> I think this is the first baby step towards LUCENE-1231.  Ie, it
> cannot yet update values, and the reading API is fully random-access
> by docID (like field cache), not like a posting list, though I
> do think we should add an iterator() api (to return flex's DocsEnum)
> -- eg I think this would be a good way to track avg doc/field length
> for BM25/lnu.ltc scoring.
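
To illustrate the "variable bit precision" integers mentioned in the description above, here is a small sketch of bit-packing non-negative values into a shared long[]. It is not the PackedInts implementation from the patch or from LUCENE-1990; the class name and layout are assumptions chosen to keep the example short.

```java
// Hypothetical bit-packed array: every value uses only as many bits as the maximum needs.
final class SketchPackedInts {
    private final long[] blocks;    // shared storage, no per-document object
    private final int bitsPerValue; // between 1 and 63
    private final long mask;        // the low bitsPerValue bits set

    SketchPackedInts(int size, long maxValue) {
        if (size < 0 || maxValue < 0) {
            throw new IllegalArgumentException("size and maxValue must be non-negative");
        }
        this.bitsPerValue = bitsRequired(maxValue);
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) size * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
    }

    // Number of bits needed to represent maxValue (at least 1).
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = value & mask;
        blocks[block] = (blocks[block] & ~(mask << shift)) | (v << shift);
        int spill = shift + bitsPerValue - 64; // bits that do not fit into this block
        if (spill > 0) {
            int low = bitsPerValue - spill;    // bits that were stored in this block
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> low)) | (v >>> low);
        }
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            v |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return v & mask;
    }
}
```

The reading side is a plain get(docID) call, which matches the random-access API described in the issue; a value that straddles two longs costs one extra array read.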

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
