lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4161) Make PackedInts usable by codecs
Date Fri, 29 Jun 2012 11:35:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403843#comment-13403843 ]

Michael McCandless commented on LUCENE-4161:
--------------------------------------------

Wow, this patch is impressive!  Lots of amazing changes... very cool
how you factored out a simple common API for bulk write/read of all
the formats.

Should computeN be a non-static method on the BulkOperation base class?
(It just seems odd to have a static method whose first arg is an instance
of that class anyway...).

Can we find a better name for computeN?  I think n is the number of
blocks we can buffer up given the RAM "budget"?  computeNumBlocks?
computeNumBufferedBlocks?  computeBufferedBlocksCount?  Something
else...?
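
To make the two suggestions above concrete, here is a minimal sketch of an
instance method with a more descriptive name. It is not the patch's code:
the BulkOperation class below is a hypothetical stand-in, bytesPerBlock() is
an assumed accessor, and the RAM accounting is only a guess at what n means.

```java
final class BufferedBlocksSketch {

  /** Hypothetical stand-in for the patch's BulkOperation base class. */
  abstract static class BulkOperation {
    /** Bytes needed to buffer one block (assumed accessor, not the patch's API). */
    abstract int bytesPerBlock();

    /** Instance method: the receiver replaces the former first argument. */
    int computeNumBufferedBlocks(int ramBudgetBytes) {
      // Always buffer at least one block, even under a tiny budget.
      return Math.max(1, ramBudgetBytes / bytesPerBlock());
    }
  }

  /** Toy subclass whose blocks are 64 longs (512 bytes) each. */
  static final class FixedSizeOperation extends BulkOperation {
    @Override
    int bytesPerBlock() {
      return 64 * Long.BYTES;
    }
  }

  public static void main(String[] args) {
    BulkOperation op = new FixedSizeOperation();
    // A 4096-byte budget divided by 512-byte blocks.
    System.out.println(op.computeNumBufferedBlocks(4096));
  }
}
```

Callers would then write op.computeNumBufferedBlocks(budget) rather than
passing op as the first argument to a static helper.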

I suspect that, to use these for codecs, we will want versions that
work on int[] values instead (everything we encode is an int:
docIDs/deltas, term freqs, offsets, positions).
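
As a rough illustration of what an int[]-based variant could look like, the
sketch below packs and unpacks int values at a fixed number of bits per
value into 64-bit blocks. It is a simplified assumption, not the patch's
BulkOperation API: the class name, method names and bit layout (values
stored low bits first) are all made up for this example.

```java
import java.util.Arrays;

final class IntPackingSketch {

  /** Packs values (each fitting in bitsPerValue bits, 1..32) into 64-bit blocks. */
  static long[] pack(int[] values, int bitsPerValue) {
    long totalBits = (long) values.length * bitsPerValue;
    long[] blocks = new long[(int) ((totalBits + 63) >>> 6)];
    long mask = (1L << bitsPerValue) - 1;
    long bitPos = 0;
    for (int v : values) {
      int block = (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      long bits = v & mask;
      blocks[block] |= bits << shift;
      // Number of bits that do not fit in the current block, if any.
      int spill = shift + bitsPerValue - 64;
      if (spill > 0) {
        blocks[block + 1] |= bits >>> (bitsPerValue - spill);
      }
      bitPos += bitsPerValue;
    }
    return blocks;
  }

  /** Reverses pack(): reads valueCount values of bitsPerValue bits each. */
  static int[] unpack(long[] blocks, int valueCount, int bitsPerValue) {
    int[] values = new int[valueCount];
    long mask = (1L << bitsPerValue) - 1;
    long bitPos = 0;
    for (int i = 0; i < valueCount; i++) {
      int block = (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      long bits = blocks[block] >>> shift;
      int spill = shift + bitsPerValue - 64;
      if (spill > 0) {
        // Fetch the high bits that spilled into the next block.
        bits |= blocks[block + 1] << (bitsPerValue - spill);
      }
      values[i] = (int) (bits & mask);
      bitPos += bitsPerValue;
    }
    return values;
  }

  public static void main(String[] args) {
    int[] docIdDeltas = {3, 1, 7, 2, 5}; // every delta fits in 3 bits
    long[] packed = pack(docIdDeltas, 3);
    int[] roundTrip = unpack(packed, docIdDeltas.length, 3);
    System.out.println(Arrays.equals(docIdDeltas, roundTrip));
  }
}
```

Working on int[] directly would avoid widening every docID delta, freq or
position to a long before encoding it.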

Code styling: can we use three lines, ie:
{noformat}
-    if (valueCount > MAX_SIZE) {
-      throw new ArrayIndexOutOfBoundsException("MAX_SIZE exceeded");
-    }
{noformat}
instead of one line:
{noformat}
+    if (valueCount > MAX_SIZE) { throw new ArrayIndexOutOfBoundsException("MAX_SIZE exceeded"); }
{noformat}
in general?

Does this change the on-disk format?  I think not (as long as those
Format.getId()s match)?  If it does change, we need back compat (4.0.0 alpha
has left the station...).

                
> Make PackedInts usable by codecs
> --------------------------------
>
>                 Key: LUCENE-4161
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4161
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-4161.patch
>
>
> Some codecs might be interested in using PackedInts.{Writer,Reader,ReaderIterator} to
> read and write fixed-size values efficiently.
> The problem is that the serialization format is self contained, and always writes the
> name of the codec, its version, its number of bits per value and its format. For example,
> if you want to use packed ints to store your postings list, this is a lot of overhead (at
> least ~60 bytes per term, in case you only use one Writer per term, more otherwise).
> Users should be able to externalize the storage of metadata to save space. For example,
> to use PackedInts to store a postings list, one should be able to store the codec name, its
> version and the number of bits per doc in the header of the terms+postings list instead of
> having to write it once (or more!) per term.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

