lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
Date Thu, 25 Nov 2010 10:41:15 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935713#action_12935713 ]

Simon Willnauer commented on LUCENE-2186:
-----------------------------------------

bq. I think this is very close!!
Heh, I strongly agree!

{quote}
    Using attr source as the way to specify the docValue is nice in
    that we get full extensibility, but, it's also heavyweight
    compared to a dedicated API (ie, .setIntValue, etc.).  So I think
    this means apps that use doc values really must re-use their Field
    instances (if they are using doc values) else indexing performance
    will likely take a good hit.
{quote}
Well, it is a nice way of extending field, but I am not sure if we
should keep it since it is heavyweight. We could get rid of
ValuesAttribute for landing on trunk and work on making field
extensible - which is desperately needed anyway. I was also thinking
that the ValuesEnum doesn't need the ValuesAttribute per se. It would
be more intuitive to have getters on ValuesEnum too. I just really hate
those instanceof checks on fields.

{quote}
   ValuesField is nice sugar on top (of the attr) :) Can you add some
    jdocs to ValuesField? EG it's not stored/indexed.  It's OK to have
    same field name as existing field (hmm... is it)?  Etc.
{quote}
Yeah - so far I haven't done much javadoc, but that is at the top of
the list. I will start adding JavaDoc to the main classes of the API, and
ValuesField is 100% one of them.
BTW, it is ok to have the same name as an existing field.

bq. Did you want to make FieldsConsumer.addValuesField abstract?
That is a leftover - I will remove it.

bq. The javadoc above DocValues.Source is wrong -- Source is not just for ints.
True - see above; that class had a different purpose back in the days
when it was a patch :)

{quote}
You can change jdocs like "This feature is experimental and the
API is free to change in non-backwards-compatible ways." to
 @lucene.experimental :)  (eg in Values.java)
{quote}

yeah - it's good to have stuff like that left!!!!! :) yay!
{quote}
 So, you're not allowed to change the DocValues type for a field
 once you've set it the first time... and, also, segments cannot be
merged if the same field has different value types.  I'm thinking
it's really important now to carry over the same FieldInfos from
the last segment when opening the writer (LUCENE-1737)... because
hitting that IllegalStateExc during merge is a trap.  This would
let us change that IllegalStateExc into an assert (in
SegmentMerger) and also turn the assert back on in FieldsConsumer.
{quote}

I think that should not block us from moving forward and landing on trunk ey?
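
For anyone reading along who has not hit this yet, here is a minimal sketch of the trap described in the quote: a field's value type is fixed the first time it is set, and a conflict only surfaces when two segments are merged. All names here (ValueTypeRegistry, ValueType, register, mergeFrom) are made up for illustration and are not the classes on the branch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: these names are not the Lucene API.
final class ValueTypeRegistry {
    enum ValueType { INTS, FLOATS, BYTES }

    // One value type per field name, fixed at first registration.
    private final Map<String, ValueType> types = new HashMap<>();

    /** Records the type for a field, or throws if the field already has a different type. */
    void register(String field, ValueType type) {
        ValueType previous = types.putIfAbsent(field, type);
        if (previous != null && previous != type) {
            throw new IllegalStateException(
                "field '" + field + "' already has type " + previous
                    + ", cannot change to " + type);
        }
    }

    /** Merging another segment's types runs the same check, so a conflict shows up late. */
    void mergeFrom(ValueTypeRegistry other) {
        for (Map.Entry<String, ValueType> e : other.types.entrySet()) {
            register(e.getKey(), e.getValue());
        }
    }
}
```

If every new segment started from the previous segment's registry, as the quote suggests for FieldInfos, the conflict would be caught at register() time while indexing, and the check inside the merge could in principle become an assert.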

{quote}
Should we rename MissingValues to MissingValue? Ie it holds the single
value for your type that represents "missing"?
{quote}

True, I was also thinking to rename some of the classes like
Values -> DocValueType
PackedIntsImpl -> Ints


bq. We need better names than PagedBytes.fillUsingLengthPrefix,2,3,4

hehe yeah - lemme change the one I added and let's fix the rest on
trunk. I will open an issue once I have a reliable inet connection
again.

{quote}
 It'd be nice to have a more approachable test case that shows the
"simple" way to index doc values, ie using ValuesField instead of
getting the attr, getting the intsRef, setting it, etc.  I think
such an "example" should be very compact right?
{quote}

done on my checkout!

so on my list there are the following topics until landing:

 * missing testcase for addIndexes and a simple one to show how to use the api
 * split up existing tests into smaller tests - they test too much and
they are hard to understand
 * Add JavaDoc to main classes like DocValues, Source, ValuesEnum, ValuesField
 * Document the different types
 * Consistent class naming - see above
 * enable ram usage tracking for all DocValuesProducer to support
flush by RAM usage

That seems very very close to me. Let's see how much I get done on my
flight to Boston :)

> First cut at column-stride fields (index values storage)
> --------------------------------------------------------
>
>                 Key: LUCENE-2186
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2186
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Simon Willnauer
>             Fix For: CSF branch, 4.0
>
>         Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py
>
>
> I created an initial basic impl for storing "index values" (ie
> column-stride value storage).  This is still a work in progress... but
> the approach looks compelling.  I'm posting my current status/patch
> here to get feedback/iterate, etc.
> The code is standalone now, and lives under new package
> oal.index.values (plus some util changes, refactorings) -- I have yet
> to integrate into Lucene so eg you can mark that a given Field's value
> should be stored into the index values, sorting will use these values
> instead of field cache, etc.
> It handles 3 types of values:
>   * Six variants of byte[] per doc, all combinations of fixed vs
>     variable length, and stored either "straight" (good for eg a
>     "title" field), "deref" (good when many docs share the same value,
>     but you won't do any sorting) or "sorted".
>   * Integers (variable bit precision used as necessary, ie this can
>     store byte/short/int/long, and all precisions in between)
>   * Floats (4 or 8 byte precision)
> String fields are stored as the UTF8 byte[].  This patch adds a
> BytesRef, which does the same thing as flex's TermRef (we should merge
> them).
> This patch also adds basic initial impl of PackedInts (LUCENE-1990);
> we can swap that out if/when we get a better impl.
> This storage is dense (like field cache), so it's appropriate when the
> field occurs in all/most docs.  It's just like field cache, except the
> reading API is a get() method invocation, per document.
> Next step is to do basic integration with Lucene, and then compare
> sort performance of this vs field cache.
> For the "sort by String value" case, I think RAM usage & GC load of
> this index values API should be much better than field cache, since
> it does not create object per document (instead shares big long[] and
> byte[] across all docs), and because the values are stored in RAM as
> their UTF8 bytes.
> There are abstract Writer/Reader classes.  The current reader impls
> are entirely RAM resident (like field cache), but the API is (I think)
> agnostic, ie, one could make an MMAP impl instead.
> I think this is the first baby step towards LUCENE-1231.  Ie, it
> cannot yet update values, and the reading API is fully random-access
> by docID (like field cache), not like a posting list, though I
> do think we should add an iterator() api (to return flex's DocsEnum)
> -- eg I think this would be a good way to track avg doc/field length
> for BM25/lnu.ltc scoring.
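
To make the RAM/GC argument in the description above concrete, here is a rough sketch of dense, variable-length "straight" storage: every document's UTF8 bytes live in one shared byte[], and a second array holds the offsets, so no per-document object is kept around. The class and method names (DenseUtf8Column, add, get, compare) are assumptions for illustration, not the classes from the patch, and a plain int[] stands in for the packed offsets the description mentions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only: not the classes from the patch.
final class DenseUtf8Column {
    private byte[] data = new byte[16];   // UTF8 bytes of all docs, back to back
    private int[] offsets = new int[] {0}; // doc i occupies data[offsets[i] .. offsets[i + 1])
    private int docCount = 0;

    /** Appends the value for the next docID (dense: every doc gets a value). */
    void add(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        int start = offsets[docCount];
        int end = start + utf8.length;
        if (end > data.length) {
            data = Arrays.copyOf(data, Math.max(end, data.length * 2));
        }
        System.arraycopy(utf8, 0, data, start, utf8.length);
        if (docCount + 2 > offsets.length) {
            offsets = Arrays.copyOf(offsets, Math.max(docCount + 2, offsets.length * 2));
        }
        offsets[docCount + 1] = end;
        docCount++;
    }

    /** Random access by docID; a String is only created when a caller asks for one. */
    String get(int docID) {
        if (docID < 0 || docID >= docCount) {
            throw new IndexOutOfBoundsException("docID " + docID);
        }
        int start = offsets[docID];
        return new String(data, start, offsets[docID + 1] - start, StandardCharsets.UTF_8);
    }

    /** Compares two docs by unsigned UTF8 byte order without allocating anything. */
    int compare(int docA, int docB) {
        int a = offsets[docA], aEnd = offsets[docA + 1];
        int b = offsets[docB], bEnd = offsets[docB + 1];
        while (a < aEnd && b < bEnd) {
            int diff = (data[a++] & 0xFF) - (data[b++] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return (aEnd - a) - (bEnd - b); // the shorter value sorts first on a shared prefix
    }

    int size() {
        return docCount;
    }
}
```

Sorting can go through compare() and never touch String at all, which is roughly where the GC win over a per-document object array would come from.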

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.