lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4161) Make PackedInts usable by codecs
Date Fri, 29 Jun 2012 13:57:43 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403910#comment-13403910 ]

Adrien Grand commented on LUCENE-4161:
--------------------------------------

bq. Can we find a better name for computeN?

The meaning of {{n}} is actually a bit complicated. For every number of bits per value, there
is a minimum number of blocks ({{b}}) and values ({{v}}) you need to write in order to reach
the next block boundary:
 * 16 bits per value -> b=1, v=4
 * 24 bits per value -> b=3, v=8
 * 50 bits per value -> b=25, v=32
 * 63 bits per value -> b=63, v=64
 * ...
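
The pairs above are consistent with 64-bit blocks (longs): with g = gcd(bitsPerValue, 64),
b = bitsPerValue / g and v = 64 / g. The following is only a minimal sketch of that
arithmetic, not code from the patch; the class and method names ({{BlockBoundary}},
{{blocks}}, {{values}}) are hypothetical.

```java
// Hypothetical helper, not part of the LUCENE-4161 patch.
// Assumes values are packed contiguously into 64-bit blocks (longs).
final class BlockBoundary {
    static final int BLOCK_BITS = 64;

    // Greatest common divisor (Euclid's algorithm).
    static int gcd(int a, int b) {
        while (b != 0) {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Minimum number of blocks (b) to write before reaching a block boundary.
    static int blocks(int bitsPerValue) {
        return bitsPerValue / gcd(bitsPerValue, BLOCK_BITS);
    }

    // Minimum number of values (v) to write before reaching a block boundary.
    static int values(int bitsPerValue) {
        return BLOCK_BITS / gcd(bitsPerValue, BLOCK_BITS);
    }
}
```

For example, {{BlockBoundary.blocks(24)}} is 3 and {{BlockBoundary.values(24)}} is 8, since
8 values of 24 bits fill exactly 3 longs (192 bits).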

A bulk read consists of copying the {{n*v}} values contained in {{n*b}} blocks into a
long[] (higher values of {{n}} are likely to yield better throughput). This requires
{{n * (b + v)}} longs in memory, which is why I compute {{n}} as
{{ramBudget / (8 * (b + v))}} (a long is 8 bytes). I called it {{n}} in the method name
because I have no idea what to name it... "iterations", maybe?
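
As a rough illustration of the formula, here is a minimal sketch, not the actual
{{computeN}} from the patch. The class name {{BulkIterations}}, the method name
{{iterations}}, the derivation of b and v via gcd with 64-bit blocks, and the clamp to at
least one iteration are all assumptions made for this example.

```java
// Hypothetical sketch, not the computeN implementation from the patch.
final class BulkIterations {
    // Greatest common divisor (Euclid's algorithm).
    static int gcd(int a, int b) {
        while (b != 0) {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Number of iterations n such that n * (b + v) longs fit in ramBudgetBytes.
    static int iterations(int ramBudgetBytes, int bitsPerValue) {
        int g = gcd(bitsPerValue, 64);
        int b = bitsPerValue / g; // blocks needed to reach a block boundary
        int v = 64 / g;           // values needed to reach a block boundary
        int n = ramBudgetBytes / (8 * (b + v)); // a long is 8 bytes
        // Assumption: always do at least one iteration, even on a tiny budget.
        return Math.max(1, n);
    }
}
```

With a 1024-byte budget and 24 bits per value, b=3 and v=8, so one iteration needs
8 * 11 = 88 bytes and {{iterations(1024, 24)}} is 11.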

bq. I suspect, to use these for codecs, we will want to have versions that work on int[] values
instead (everything we encode are ints: docIDs/deltas, term freqs, offsets, positions).

I hesitated to do this since it would involve some code duplication, but I guess it can't
be avoided if we want this API to be actually used... What additional methods do you think
we need?
  * {{PackedReaderIterator.nextInts(int count)}}
  * others?

bq. [static computeN], [code style]

You are right, I will fix it!

bq. Does this change the on-disk format?

No, it doesn't. I will add unit tests for that...
                
> Make PackedInts usable by codecs
> --------------------------------
>
>                 Key: LUCENE-4161
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4161
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-4161.patch
>
>
> Some codecs might be interested in using PackedInts.{Writer,Reader,ReaderIterator} to
> read and write fixed-size values efficiently.
> The problem is that the serialization format is self contained, and always writes the
> name of the codec, its version, its number of bits per value and its format. For example,
> if you want to use packed ints to store your postings list, this is a lot of overhead (at
> least ~60 bytes per term, in case you only use one Writer per term, more otherwise).
> Users should be able to externalize the storage of metadata to save space. For example,
> to use PackedInts to store a postings list, one should be able to store the codec name, its
> version and the number of bits per doc in the header of the terms+postings list instead of
> having to write it once (or more!) per term.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

