lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4527) CompressingStoredFieldsFormat: encode numStoredFields more efficiently
Date Sun, 04 Nov 2012 21:48:13 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490303#comment-13490303 ]

Adrien Grand commented on LUCENE-4527:
--------------------------------------

bq. I'm not sure I like 4 vints for min and lengths? If documents (including all fields) are
largish then we might be making it worse.

I hadn't thought much of it. I assume there are 3 main cases:
 1. if document lengths are larger than 16K, there is no problem (when chunkDocs==1, it only
    encodes 2 vints),
 2. if the number of stored fields and the document lengths vary by more than 50%, it can
    waste 3 bytes (given that doc length < 2**14 and assuming numStoredFields < 128),
 3. if the number of stored fields and the document lengths vary by less than 50%, it saves
    at least 2 bits per document, so the savings are (2 * chunkDocs - 3 * 8) bits: if docs
    are 8K each, this can waste 2.5 bytes; if docs are 1K each, this can save 1 byte; if docs
    are 100 bytes each, this can save about 38 bytes.

(I did the math while writing, please correct me if I'm wrong)
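
The arithmetic in case 3 can be checked with a minimal Python sketch. It assumes a chunk size of 16K (16384 bytes), as implied by case 1, and the usual VInt layout of 7 payload bits per byte. The names `vint_bytes` and `savings_bits` are hypothetical helpers for illustration, not Lucene APIs.

```python
CHUNK_SIZE = 16 * 1024  # assumed chunk size in bytes (the "16K" of case 1)


def vint_bytes(value):
    """Bytes needed to store a non-negative int as a VInt (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def savings_bits(doc_size, chunk_size=CHUNK_SIZE):
    """Case 3 estimate: 2 bits saved per document, minus 3 bytes of fixed overhead."""
    chunk_docs = chunk_size // doc_size
    return 2 * chunk_docs - 3 * 8


if __name__ == "__main__":
    for doc_size in (8 * 1024, 1024, 100):
        bits = savings_bits(doc_size)
        print(f"docs of {doc_size} bytes: {bits} bits ({bits / 8} bytes)")
```

With these assumptions the three examples come out to -20 bits (2.5 bytes wasted), 8 bits (1 byte saved) and 302 bits (37.75 bytes, so roughly 38 bytes saved). One possible reading of the 3 bytes in case 2 is `vint_bytes(127) + vint_bytes(2**14 - 1)`, i.e. a 1-byte VInt plus a 2-byte VInt, though the comment does not spell this out.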

Both options have pros and cons, so I'm not sure which one to choose. Maybe that means we
should go for the easiest one (the one that does not encode the min values as VInts)?
                
> CompressingStoredFieldsFormat: encode numStoredFields more efficiently
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-4527
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4527
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4527.patch
>
>
> Another interesting idea from Robert: many applications have a schema and all documents
> are likely to have the same number of stored fields. We could save space by using packed
> ints and the same kind of optimization as {{ForUtil}} (requiring only one VInt if all
> values are equal).
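
The idea in the issue description can be sketched roughly as follows. This is an illustration under stated assumptions, not the format from the attached patch: the byte layout (a leading VInt holding the bits per value, with 0 used as an "all values equal" marker) and the names `write_vint` and `encode_counts` are hypothetical.

```python
def write_vint(out, value):
    """Append a non-negative int as a VInt: 7 payload bits per byte, high bit = continue."""
    if value < 0:
        raise ValueError("value must be non-negative")
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def encode_counts(counts):
    """Encode per-document stored-field counts for one chunk (hypothetical layout)."""
    if not counts:
        raise ValueError("counts must not be empty")
    out = bytearray()
    first = counts[0]
    if all(c == first for c in counts):
        write_vint(out, 0)      # assumed marker: 0 bits per value means "all equal"
        write_vint(out, first)  # the single shared value
        return bytes(out)
    bits = max(counts).bit_length()  # at least 1 here, since the values differ
    write_vint(out, bits)
    acc, pending = 0, 0  # bit accumulator and number of bits currently held in it
    for c in counts:
        acc = (acc << bits) | c
        pending += bits
        while pending >= 8:
            pending -= 8
            out.append((acc >> pending) & 0xFF)
        acc &= (1 << pending) - 1  # keep only the bits not yet written
    if pending:
        out.append((acc << (8 - pending)) & 0xFF)  # pad the last byte with zero bits
    return bytes(out)
```

Under this layout a chunk of 128 documents that all have 3 stored fields costs 2 bytes, while a chunk with mixed counts costs one VInt plus `bits` bits per document, rounded up to a whole byte.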

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

