lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4509) Make CompressingStoredFieldsFormat the new default StoredFieldsFormat impl
Date Thu, 01 Nov 2012 02:01:13 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488418#comment-13488418 ]

Adrien Grand commented on LUCENE-4509:
--------------------------------------

Right, with MEMORY_CHUNK the docBase could be derived from the index, but on the other hand
duplicating the information helps validate that we are at the right place in the fields
data file (there are corruption tests that use this docBase). Since a chunk starts with its
doc base and the number of docs it contains, the header gives the range of documents in the
chunk. The overhead should be very small given that this VInt is repeated at most once every
{compressed size of 16KB}. I have no strong feelings about it, though: if you think we should
remove it, then let's do it.
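
To illustrate the check described above, here is a minimal sketch, not the code from the
patch. It assumes a chunk header made of two VInts (doc base, then number of docs) using the
usual 7-bits-per-byte encoding; the class name, the method names and the use of ByteBuffer
are hypothetical and only keep the example self-contained.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: names and ByteBuffer usage are assumptions, not Lucene's reader code.
final class ChunkHeaderSketch {

  /** Decodes a variable-length int: 7 data bits per byte, a set high bit means more bytes follow. */
  static int readVInt(ByteBuffer in) {
    int value = 0;
    for (int shift = 0; shift < 35; shift += 7) {
      byte b = in.get();
      value |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return value;
      }
    }
    throw new IllegalStateException("VInt is longer than 5 bytes");
  }

  /** Encodes a non-negative int in the same format (used here only to build test data). */
  static void writeVInt(ByteBuffer out, int value) {
    while ((value & ~0x7F) != 0) {
      out.put((byte) ((value & 0x7F) | 0x80));
      value >>>= 7;
    }
    out.put((byte) value);
  }

  /**
   * Reads the assumed chunk header (doc base, number of docs) and verifies that docID falls
   * in [docBase, docBase + chunkDocs). Returns the position of the document inside the chunk.
   */
  static int checkChunk(ByteBuffer chunk, int docID) {
    int docBase = readVInt(chunk);
    int chunkDocs = readVInt(chunk);
    if (docID < docBase || docID >= docBase + chunkDocs) {
      // The index pointed us at a chunk that does not contain the document.
      throw new IllegalStateException("Corrupted: docID=" + docID
          + ", docBase=" + docBase + ", chunkDocs=" + chunkDocs);
    }
    return docID - docBase;
  }
}
```

A VInt for an int takes at most 5 bytes, so under these assumptions the duplicated doc base
costs at most 5 bytes per chunk.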
                
> Make CompressingStoredFieldsFormat the new default StoredFieldsFormat impl
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-4509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4509
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-4509.patch
>
>
> What would you think of making CompressingStoredFieldsFormat the new default StoredFieldsFormat?
> Stored fields compression has many benefits:
>  - it makes the I/O cache work for us,
>  - file-based index replication/backup becomes cheaper.
> Things to know:
>  - even with incompressible data, there is less than 0.5% overhead with LZ4,
>  - LZ4 compression requires ~ 16kB of memory and LZ4 HC compression requires ~ 256kB,
>  - LZ4 uncompression has almost no memory overhead,
>  - on my low-end laptop, the LZ4 impl in Lucene uncompresses at ~ 300 MB/s.
> I think we could use the same default parameters as in CompressingCodec :
>  - LZ4 compression,
>  - in-memory stored fields index that is very memory-efficient (less than 12 bytes per
>    block of compressed docs) and uses binary search to locate documents in the fields data file,
>  - 16 kB blocks (small enough so that there is no major slowdown when the whole index
>    would fit into the I/O cache anyway, and large enough to provide interesting compression
>    ratios; for example Robert got a 0.35 compression ratio with the geonames.org database).
> Any concerns?
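
The binary-search lookup mentioned in the description can be sketched as follows. This is
only an illustration under assumptions: the class and field names are hypothetical, and the
plain int[]/long[] arrays cost exactly 12 bytes per block, whereas the description says the
real index needs less than that (how it does so is not covered here).

```java
// Hypothetical sketch of an in-memory stored fields index; not the implementation in the patch.
final class StoredFieldsIndexSketch {
  private final int[] docBases;       // first doc ID of each block, strictly increasing
  private final long[] startPointers; // offset of each block in the fields data file

  StoredFieldsIndexSketch(int[] docBases, long[] startPointers) {
    if (docBases.length == 0 || docBases.length != startPointers.length) {
      throw new IllegalArgumentException("need one start pointer per non-empty block list");
    }
    this.docBases = docBases;
    this.startPointers = startPointers;
  }

  /** Returns the file offset of the block that contains docID. */
  long startPointer(int docID) {
    if (docID < docBases[0]) {
      throw new IllegalArgumentException("docID " + docID + " is before the first block");
    }
    int lo = 0;
    int hi = docBases.length - 1;
    // Binary search for the last block whose first doc ID is <= docID.
    while (lo < hi) {
      int mid = (lo + hi + 1) >>> 1;
      if (docBases[mid] <= docID) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    return startPointers[lo];
  }
}
```

Once the block offset is known, a reader would seek there and decompress the block to reach
the requested document.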

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

