lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3108) Land DocValues on trunk
Date Thu, 19 May 2011 16:38:47 GMT


Michael McCandless commented on LUCENE-3108:

bq. How come codecID changed from String to int on the branch?

Due to DocValues I need to compare the ID against certain fields to see
which field I stored and need to open DocValues for. I always had to parse
the given string, which is kind of odd. I think it's more natural to
have the same datatype on FieldInfo, SegmentCodecs and eventually in
the Codec#files() method. Making a string out of it is way simpler /
less risky than parsing IMO.

OK that sounds great.
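
A minimal sketch of the asymmetry described above, assuming the ID only ever needs to become a string when it is used as part of a name. The class and method names (CodecIdSketch, toSuffix, fromSuffix) are hypothetical and are not taken from the branch:

```java
// Hypothetical illustration only; not code from the DocValues branch.
public final class CodecIdSketch {

    private CodecIdSketch() {}

    // int -> String always succeeds and needs no error handling.
    static String toSuffix(int codecId) {
        return Integer.toString(codecId);
    }

    // String -> int can fail at runtime: Integer.parseInt throws
    // NumberFormatException for any input that is not a valid int.
    static int fromSuffix(String suffix) {
        return Integer.parseInt(suffix);
    }
}
```

With an int as the common type, comparisons are plain `==` checks and the conversion that can fail is never needed on the hot path.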

bq. Can SortField somehow detect whether the needed field was stored in FC vs DV

This is tricky though. You can have a DV field that is indexed too, so it's hard to tell if
we can reliably do it. If we can't make it reliable I think we should not do it at all.

It is tricky... but, eg, when someone does SortField("title",
SortField.STRING), which cache (DV or FC) should we populate?
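
One way to sidestep the detection problem is to make the caller state the source explicitly. The sketch below is purely hypothetical: SortSourceSketch, Source and SortSpec are invented names for illustration and do not correspond to any existing Lucene API:

```java
import java.util.Objects;

// Hypothetical illustration only; none of these types exist in Lucene.
public final class SortSourceSketch {

    private SortSourceSketch() {}

    // Where the per-document sort values should be loaded from.
    enum Source { FIELD_CACHE, DOC_VALUES }

    // A sort request that names its source instead of relying on detection.
    static final class SortSpec {
        final String field;
        final Source source;

        SortSpec(String field, Source source) {
            this.field = Objects.requireNonNull(field, "field");
            this.source = Objects.requireNonNull(source, "source");
        }
    }

    // Human-readable description, e.g. for logging which cache gets populated.
    static String describe(SortSpec spec) {
        return spec.field + " via " + spec.source;
    }
}
```

The trade-off is that the application has to know how each field was indexed, but nothing is guessed.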

bq. Should we rename oal.index.values.Type -> .ValueType?

agreed. I also think we should rename Source but I don't have a good name yet. Any idea?

ValueSource?  (conflicts w/ FQs though) Though, maybe we can just
refer to it as DocValues.Source, then it's clear?

bq. Since we dynamically reserve a value to mean "unset", does that mean there are some datasets
we cannot index?

Again, tricky! The quick answer is yes, but we can't do that anyway since I have not normalized
the range to be 0-based, and PackedInts doesn't allow negative values. So the range we can
store is (2^63)-1. So essentially with the current impl we can store (2^63)-2 and the max
value is Long#MAX_VALUE-1. Currently there is no assert for this, which is needed I think, but
to get around this we need to have a different impl I think, or do I miss something?
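
The limits described above can be sketched as follows. This assumes the reserved "unset" value is chosen as max + 1, which is only one plausible reading of "dynamically reserve a value"; the class and method names are hypothetical and this is not the branch's implementation:

```java
// Hypothetical illustration only; assumes the "unset" sentinel is max + 1.
public final class UnsetSentinelSketch {

    private UnsetSentinelSketch() {}

    // Without 0-based normalization, values must be non-negative, and one
    // value above max must remain free for the sentinel.
    static boolean canStore(long min, long max) {
        return min >= 0 && max < Long.MAX_VALUE;
    }

    // The reserved value meaning "no value for this document".
    static long unsetSentinel(long max) {
        if (max < 0 || max == Long.MAX_VALUE) {
            throw new IllegalArgumentException("no room for a sentinel above " + max);
        }
        return max + 1;
    }

    // Bits needed to represent every value in [0, maxValue], at least 1.
    static int bitsRequired(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("negative values are not supported");
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }
}
```

Under these assumptions a dataset containing Long.MAX_VALUE, or any negative value, cannot be indexed, which is the kind of condition the missing assert would need to check.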

OK, but I think if we make a "straight longs" impl (ie no packed ints at all) then we can
handle all long values?  But in that case we'd require the app to pick a sentinel to mean

> Land DocValues on trunk
> -----------------------
>                 Key: LUCENE-3108
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: core/index, core/search, core/store
>    Affects Versions: CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: 4.0
>         Attachments: LUCENE-3108.patch
> It's time to move another feature from branch to trunk. I want to start this process now
while still a couple of issues remain on the branch. Currently I am down to a single nocommit
(javadocs on and a couple of testing TODOs (explicit multithreaded tests and
unoptimized with deletions) but I think those are not worth separate issues so we can resolve
them as we go. 
> The already created issues (LUCENE-3075 and LUCENE-3074) should not block this process
here IMO, we can fix them once we are on trunk. 
> Here is a quick feature overview of what has been implemented:
>  * DocValues implementations for Ints (based on PackedInts), Float 32 / 64, Bytes (fixed
/ variable size each in sorted, straight and deref variations)
>  * Integration into Flex-API, Codec provides a PerDocConsumer->DocValuesConsumer (write)
/ PerDocValues->DocValues (read) 
>  * Enabled by default in all codecs except PreFlex
>  * Follows other flex-API patterns, e.g. non-segment readers throw UOE, forcing MultiPerDocValues
if on DirReader etc.
>  * Integration into IndexWriter, FieldInfos etc.
>  * Random-testing enabled via RandomIW - injecting random DocValues into documents
>  * Basic checks in CheckIndex (which runs after each test)
>  * FieldComparator for int and float variants (Sorting, currently directly integrated
into SortField, this might go into a separate DocValuesSortField eventually)
>  * Extended TestSort for DocValues
>  * RAM-Resident random access API plus on-disk DocValuesEnum (currently only sequential
access) -> /
>  * Extensible Cache implementation for RAM-Resident DocValues (by-default loaded into
RAM only once and freed once IR is closed) ->
> PS: Currently the RAM resident API is named Source, which seems too generic.
I think we should rename it to RamDocValues or something like that, suggestions welcome!
> Any comments, questions (rants :)) are very much appreciated.
