lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3108) Land DocValues on trunk
Date Tue, 17 May 2011 20:07:47 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035017#comment-13035017 ]

Michael McCandless commented on LUCENE-3108:
--------------------------------------------

This is an awesome change!

Phew, it's been a long time since I looked at this branch!

Some questions on a quick pass -- still need to iterate/dig deeper:

  * We have some stale jdocs that reference .setIntValue methods (they
    are now .setInt)

  * Hmm, do we have byte ordering problems?  Ie, if I write an index
    on a little-endian machine but then try to load the values on a
    big-endian one...?  I think we're OK (we seem to always use
    IndexOutput.writeInt, and we convert float-to-raw-int-bits using
    java's APIs)?

  * Since we dynamically reserve a value to mean "unset", does that
    mean there are some datasets we cannot index?  Or... do we tap
    into the unused bit of a long, ie the sentinel value can be
    negative?  But if the data set spans Long.MIN_VALUE to
    Long.MAX_VALUE, what do we do...?

  * How come codecID changed from String to int on the branch?

  * What are oal.util.Pair and ParallelArray for?

  * FloatsRef should state in the jdocs that it's really slicing a
    double[]?

  * Can SortField somehow detect whether the needed field was stored
    in FC vs DV and pick the right comparator accordingly...?  Kind of
    like how NumericField can detect whether the ints are encoded as
    "plain text" or as NF?  We can open a new issue for this,
    post-landing...

  * It looks like we can sort by int/long/float/double pulled from DV,
    but not by terms?  This is fine for landing... but I think we
    should open a post-landing issue to also make FieldComparators for
    the Terms cases?

  * Should we rename oal.index.values.Type -> .ValueType?  Just
    because... it looks so generic when it's imported & used as "Type"
    somewhere?
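
On the byte-ordering question above, here is a minimal standalone sketch of why that pattern should be portable. It is not code from the branch: the class and method names are hypothetical, and it assumes (as the comment suggests) that ints are always written byte by byte in one fixed order and that floats go through Float.floatToRawIntBits first. Because the bytes are produced with shifts rather than by copying native memory, the on-disk layout does not depend on the CPU that wrote it.

```java
// Hypothetical standalone sketch; names are illustrative, not Lucene APIs.
class ByteOrderSketch {

    // Serialize an int most-significant byte first using shifts only,
    // so the result is the same on little-endian and big-endian hosts.
    static byte[] writeInt(int v) {
        return new byte[] {
            (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v
        };
    }

    // Reassemble the int from the same fixed byte order.
    static int readInt(byte[] b) {
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
    }

    // Floats are converted to their raw IEEE 754 bit pattern, then
    // written exactly like an int.
    static byte[] writeFloat(float f) {
        return writeInt(Float.floatToRawIntBits(f));
    }

    static float readFloat(byte[] b) {
        return Float.intBitsToFloat(readInt(b));
    }

    public static void main(String[] args) {
        byte[] bytes = writeFloat(1.5f);
        System.out.println(readFloat(bytes) == 1.5f); // prints "true"
    }
}
```

Any code path that instead dumps a native-order buffer straight to disk would not have this property, so that is the thing to look for when auditing.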

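The "unset" sentinel question can be made concrete with a small sketch. This is not how the branch picks its sentinel (I have not checked that); pickSentinel is a hypothetical helper that assumes the sentinel must be a long lying outside the observed [min, max] range of the data. It shows the corner case raised above: once the data spans Long.MIN_VALUE to Long.MAX_VALUE, no such value exists.

```java
import java.util.OptionalLong;

// Hypothetical standalone sketch; not taken from the DocValues branch.
class UnsetSentinelSketch {

    // Return a value outside [min, max] that could mark "unset",
    // or empty when the data already covers every possible long.
    static OptionalLong pickSentinel(long min, long max) {
        if (min > Long.MIN_VALUE) {
            return OptionalLong.of(min - 1); // room below the range
        }
        if (max < Long.MAX_VALUE) {
            return OptionalLong.of(max + 1); // room above the range
        }
        return OptionalLong.empty();         // full range: nothing left
    }

    public static void main(String[] args) {
        System.out.println(pickSentinel(0, 100).isPresent());
        System.out.println(pickSentinel(Long.MIN_VALUE, Long.MAX_VALUE).isPresent());
    }
}
```

If the empty case can really occur, one possible way out (again only an assumption) would be to track which documents have a value separately, for example in a bit set, instead of reserving a value.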

> Land DocValues on trunk
> -----------------------
>
>                 Key: LUCENE-3108
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3108
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: core/index, core/search, core/store
>    Affects Versions: CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: 4.0
>
>
> It's time to move another feature from branch to trunk. I want to start this
> process now while a couple of issues still remain on the branch. Currently I
> am down to a single nocommit (javadocs on DocValues.java) and a couple of
> testing TODOs (explicit multithreaded tests and unoptimized with deletions)
> but I think those are not worth separate issues so we can resolve them as we
> go.
> The already created issues (LUCENE-3075 and LUCENE-3074) should not block
> this process here IMO, we can fix them once we are on trunk.
> Here is a quick feature overview of what has been implemented:
>  * DocValues implementations for Ints (based on PackedInts), Float 32 / 64,
>    Bytes (fixed / variable size, each in sorted, straight and deref
>    variations)
>  * Integration into Flex-API: Codec provides a
>    PerDocConsumer->DocValuesConsumer (write) / PerDocValues->DocValues (read)
>  * Enabled by default in all codecs except PreFlex
>  * Follows other flex-API patterns, like non-segment readers throwing UOE,
>    forcing MultiPerDocValues if on DirReader etc.
>  * Integration into IndexWriter, FieldInfos etc.
>  * Random testing enabled via RandomIW - injecting random DocValues into
>    documents
>  * Basic checks in CheckIndex (which runs after each test)
>  * FieldComparator for int and float variants (Sorting, currently directly
>    integrated into SortField, this might go into a separate
>    DocValuesSortField eventually)
>  * Extended TestSort for DocValues
>  * RAM-Resident random access API plus on-disk DocValuesEnum (currently only
>    sequential access) -> Source.java / DocValuesEnum.java
>  * Extensible Cache implementation for RAM-Resident DocValues (by default
>    loaded into RAM only once and freed once IR is closed) -> SourceCache.java
>  
> PS: Currently the RAM resident API is named Source (Source.java) which seems
> too generic. I think we should rename it to RamDocValues or something like
> that, suggestions welcome!
  
> Any comments, questions (rants :)) are very much appreciated.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

