Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Sun, 4 Nov 2012 21:48:13 +0000 (UTC)
From: "Adrien Grand (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <794544488.66470.1352065693436.JavaMail.jiratomcat@arcas>
In-Reply-To: <1423386203.61189.1351876392734.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (LUCENE-4527) CompressingStoredFieldsFormat:
 encode numStoredFields more efficiently
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-4527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490303#comment-13490303 ] 

Adrien Grand commented on LUCENE-4527:
--------------------------------------

bq. I'm not sure I like 4 vints for min and lengths? If documents (including all fields) are largish then we might be making it worse.

I hadn't thought much about it. I see three main cases:
 1. if documents are larger than 16K, there is no problem (when chunkDocs==1, it only encodes 2 vints),
 2. if the number of stored fields and the document lengths vary by more than 50%, it can waste 3 bytes (given that doc length < 2**14, and assuming numStoredFields < 128),
 3. if the number of stored fields and the document lengths vary by less than 50%, it saves at least 2 bits per document, so the net savings are 2 * chunkDocs - 3 * 8 bits (with 8K docs this wastes 2.5 bytes, with 1K docs it saves 1 byte, with 100-byte docs it saves about 38 bytes).

(I did the math while writing, please correct me if I'm wrong)
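
For anyone who wants to re-check the arithmetic, here is a rough sketch of the estimate for case 3. It assumes a 16K chunk size (inferred from the 16K threshold in case 1), a vint that stores 7 payload bits per byte, and the "2 bits saved per document" figure from above; the function and constant names are made up for illustration and are not part of the patch.

```python
CHUNK_SIZE = 16 * 1024  # assumed chunk size in bytes (inferred from the 16K threshold)


def vint_size(value: int) -> int:
    """Bytes needed for a variable-length int storing 7 payload bits per byte."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def header_overhead_bits(min_num_fields: int, min_doc_length: int) -> int:
    """Extra bits spent per chunk on the two additional min-value vints."""
    return 8 * (vint_size(min_num_fields) + vint_size(min_doc_length))


def saved_bits(chunk_docs: int, overhead_bits: int = 24) -> int:
    """Case 3 estimate: 2 bits saved per document minus the per-chunk overhead."""
    return 2 * chunk_docs - overhead_bits


# Reproduce the three data points from case 3 (hypothetical document sizes in bytes).
for doc_size in (8 * 1024, 1024, 100):
    chunk_docs = CHUNK_SIZE // doc_size
    bits = saved_bits(chunk_docs)
    print(f"{doc_size:>5}-byte docs: {chunk_docs:>3} docs/chunk, {bits / 8:+.2f} bytes")
```

The default overhead of 24 bits is the 3 bytes from case 2: one byte for a min numStoredFields below 128 plus two bytes for a min doc length below 2**14, which is what header_overhead_bits(127, 16383) returns.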

Both options have pros and cons, so I'm not sure which one to choose. Maybe that means we should go for the simplest one (the one that doesn't encode the min values as VInts)?
                
> CompressingStoredFieldsFormat: encode numStoredFields more efficiently
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-4527
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4527
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4527.patch
>
>
> Another interesting idea from Robert: many applications have a schema and all documents are likely to have the same number of stored fields. We could save space by using packed ints and the same kind of optimization as {{ForUtil}} (requiring only one VInt if all values are equal).
