lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-4512) Additional memory savings in CompressingStoredFieldsIndex.MEMORY_CHUNK
Date Wed, 31 Oct 2012 00:38:12 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4512:
---------------------------------

    Attachment: LUCENE-4512.patch

I did some tests with the 1K docs from the wikipedia dump:
 - always 16 or 17 bits per value (bpv) for start pointers (my intuition was wrong! :-))
 - the CompressingStoredFieldsIndex instance is 185.3KB (measured with RamUsageEstimator) for 1M docs (0.19 bytes per doc, 3.24 bytes per chunk).

I tried some other block sizes:
 - 256 : 189.2KB
 - 4096 : 204.8KB

1024 looks like a good setting.

bq. I was just thinking simpler code in the reader.

Hmm good point. It is true that it is already complex enough... Here is a new patch.

bq. And once you get all this baked in aren't you itching to do the vectors files too?

I started thinking about it but I'm not very familiar with the term vectors file format yet.
There are probably other places that might benefit from compression (terms dictionary?).
                
> Additional memory savings in CompressingStoredFieldsIndex.MEMORY_CHUNK
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-4512
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4512
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4512.patch, LUCENE-4512.patch
>
>
> Robert had a great idea to save memory with {{CompressingStoredFieldsIndex.MEMORY_CHUNK}}:
> instead of storing the absolute start pointers we could compute the mean number of bytes per
> chunk of documents and only store the delta between the actual value and the expected value
> (avgChunkBytes * chunkNumber).
> By applying this idea to every n(=1024?) chunks, we would even:
>  - make sure to never hit the worst case (delta ~= maxStartPointer)
>  - reduce memory usage at indexing time.
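
For illustration, here is a minimal sketch of the idea described above, not the code from the attached patch. The class and method names (DeltaFromAverageSketch, avgChunkBytes, deltas, bitsRequired, decode) are hypothetical, and the zig-zag encoding of signed deltas is an assumption about how negative values could be handled. Within one block of chunks, each start pointer is replaced by its difference from the expected value (block base + avgChunkBytes * index in block), so only a few bits per value are needed.

```java
public class DeltaFromAverageSketch {

    /** Mean number of bytes per chunk over the block [from, to) of start pointers. */
    static long avgChunkBytes(long[] startPointers, int from, int to) {
        int count = to - from;
        if (count <= 1) {
            return 0; // a single chunk has no meaningful average
        }
        return (startPointers[to - 1] - startPointers[from]) / (count - 1);
    }

    /** Deltas between the actual and the expected start pointer for one block. */
    static long[] deltas(long[] startPointers, int from, int to) {
        long avg = avgChunkBytes(startPointers, from, to);
        long base = startPointers[from];
        long[] out = new long[to - from];
        for (int i = from; i < to; i++) {
            long expected = base + avg * (i - from);
            out[i - from] = startPointers[i] - expected;
        }
        return out;
    }

    /** Bits per value needed to store the deltas after zig-zag encoding (assumed encoding). */
    static int bitsRequired(long[] deltas) {
        long or = 0;
        for (long d : deltas) {
            long zigzag = (d << 1) ^ (d >> 63); // maps small negatives to small positives
            or |= zigzag;
        }
        return or == 0 ? 0 : 64 - Long.numberOfLeadingZeros(or);
    }

    /** Rebuilds the absolute start pointer from the block base, the average and the delta. */
    static long decode(long base, long avg, int indexInBlock, long delta) {
        return base + avg * indexInBlock + delta;
    }
}
```

Computing the average per block of n chunks rather than once for the whole file keeps a locally unusual chunk size from inflating every delta, which is how the worst case (delta ~= maxStartPointer) is avoided.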

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

