lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4512) Additional memory savings in CompressingStoredFieldsIndex.MEMORY_CHUNK
Date Tue, 30 Oct 2012 19:34:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487152#comment-13487152 ]

Adrien Grand commented on LUCENE-4512:
--------------------------------------

bq. 72 chunks with bpvs of 16-20 (avg 18 i think).

That is good, but I was expecting the distance from the average (128 KB here) to be less than
the chunk size (16 KB), which is clearly not the case. Is there anything in the dataset that
could explain why chunk sizes vary so much? If not, maybe we should just decrease the block
size, or maybe the average is wrongly computed...
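
To make the block-size question concrete, here is a minimal Java sketch (not the code from the patch) that measures how far start pointers drift from their expected position when the average is computed per block. The class and method names are hypothetical, and the zig-zag style sign handling is an assumption made only to count bits for signed deltas.

```java
// Hypothetical sketch: not taken from LUCENE-4512.patch.
final class BlockAverageSketch {

    /** Bits needed for a signed delta in [-maxAbsDelta, maxAbsDelta], assuming zig-zag encoding. */
    static int bitsRequired(long maxAbsDelta) {
        long maxEncoded = 2 * maxAbsDelta; // zig-zag maps [-m, m] onto [0, 2m]
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxEncoded));
    }

    /** Largest |actual start - avg * i| inside the block chunkSizes[from..to). */
    static long maxAbsDelta(int[] chunkSizes, int from, int to) {
        int n = to - from;
        long total = 0;
        for (int i = from; i < to; i++) {
            total += chunkSizes[i];
        }
        long avg = total / n; // integer average of the chunk sizes in this block
        long start = 0;       // start pointer relative to the beginning of the block
        long max = 0;
        for (int i = 0; i < n; i++) {
            max = Math.max(max, Math.abs(start - avg * i));
            start += chunkSizes[from + i];
        }
        return max;
    }

    /** Worst bits per value over all blocks of blockSize chunks. */
    static int worstBitsPerValue(int[] chunkSizes, int blockSize) {
        int worst = 0;
        for (int from = 0; from < chunkSizes.length; from += blockSize) {
            int to = Math.min(from + blockSize, chunkSizes.length);
            worst = Math.max(worst, bitsRequired(maxAbsDelta(chunkSizes, from, to)));
        }
        return worst;
    }
}
```

Calling worstBitsPerValue on the same chunk sizes with two different block sizes shows whether a smaller block keeps the deltas closer to the chunk size on a given dataset.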

bq. do you think it still makes sense to encode the deltas from the previous value

Good question. Encoding deltas currently requires 14 or 15 bits per value (because it can
grow a little larger than the chunk size, which is 2^14), so it is still a little more compact,
and I think it is less prone to worst cases? There is some overhead at read time to build
the packed ints array instead of just deserializing it, but I think this is negligible. If
we manage to make bpvs smaller than 14 on "standard" datasets, then I think it makes sense.
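
The 14-versus-15-bit figure can be checked with a small Java sketch. It is only an illustration: the class and method names are hypothetical, and bitsRequired here is a plain JDK stand-in written for this example rather than a call into Lucene's packed ints code.

```java
// Hypothetical sketch: bits per value when storing deltas from the previous start pointer.
final class PrevDeltaSketch {

    /** Bits needed to represent any unsigned value in [0, maxValue]. */
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    /** Assumes start pointers are non-decreasing, so every delta is >= 0. */
    static int bitsPerValue(long[] startPointers) {
        long maxDelta = 0;
        for (int i = 1; i < startPointers.length; i++) {
            maxDelta = Math.max(maxDelta, startPointers[i] - startPointers[i - 1]);
        }
        return bitsRequired(maxDelta);
    }
}
```

A delta just below 2^14 = 16384 fits in 14 bits, while a delta at or slightly above 2^14 needs 15.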
                
> Additional memory savings in CompressingStoredFieldsIndex.MEMORY_CHUNK
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-4512
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4512
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4512.patch
>
>
> Robert had a great idea to save memory with {{CompressingStoredFieldsIndex.MEMORY_CHUNK}}:
> instead of storing the absolute start pointers we could compute the mean number of bytes per
> chunk of documents and only store the delta between the actual value and the expected value
> (avgChunkBytes * chunkNumber).
> By applying this idea to every n(=1024?) chunks, we would even:
>  - make sure to never hit the worst case (delta ~= maxStartPointer)
>  - reduce memory usage at indexing time.
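
The idea in the issue description can be sketched as a round trip in Java. This is an assumption-laden illustration, not the patch: the class name, method names, and the use of plain long arrays instead of a packed representation are all hypothetical.

```java
// Hypothetical sketch: store only the difference between each start pointer
// and its expected position, avgChunkBytes * chunkNumber.
final class ExpectedPointerCodec {

    /** Returns delta[i] = startPointers[i] - avgChunkBytes * i (may be negative). */
    static long[] encode(long[] startPointers, long avgChunkBytes) {
        long[] deltas = new long[startPointers.length];
        for (int i = 0; i < startPointers.length; i++) {
            deltas[i] = startPointers[i] - avgChunkBytes * i;
        }
        return deltas;
    }

    /** Rebuilds the absolute start pointers from the deltas and the same average. */
    static long[] decode(long[] deltas, long avgChunkBytes) {
        long[] startPointers = new long[deltas.length];
        for (int i = 0; i < deltas.length; i++) {
            startPointers[i] = avgChunkBytes * i + deltas[i];
        }
        return startPointers;
    }
}
```

Applying encode separately to every block of n chunks, each with its own average and base pointer, bounds the deltas by how much a single block drifts rather than by the whole file.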

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


