Date: Sun, 4 Nov 2012 21:48:13 +0000 (UTC)
From: "Adrien Grand (JIRA)"
To: dev@lucene.apache.org
Message-ID: <794544488.66470.1352065693436.JavaMail.jiratomcat@arcas>
In-Reply-To: <1423386203.61189.1351876392734.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (LUCENE-4527) CompressingStoredFieldsFormat: encode numStoredFields more efficiently

    [ https://issues.apache.org/jira/browse/LUCENE-4527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490303#comment-13490303 ]

Adrien Grand commented on LUCENE-4527:
--------------------------------------

bq. I'm not sure I like 4 vints for min and lengths? If documents (including all fields) are largish then we might be making it worse.

I hadn't thought much about it. I assume there are 3 main cases:
 1. if document lengths are larger than 16K, there is no problem (when chunkDocs==1, it only encodes 2 vints),
 2. if the number of stored fields and document lengths vary by more than 50%, it can waste 3 bytes (given that doc length < 2**14 and assuming numStoredFields < 128),
 3. if the number of stored fields and document lengths vary by less than 50%, it saves at least 2 bits per document, so the savings are 2 * chunkDocs - 3 * 8 bits (if docs are 8K each, this can waste 2.5 bytes; if docs are 1K each, this can save 1 byte; if docs are 100 bytes each, this can save 38 bytes).

(I did the math while writing, please correct me if I'm wrong)

Both options seem to have pros and cons so I'm not sure which one to choose... Which maybe means we should go for the easiest one? (without encoding the min values as VInts)

> CompressingStoredFieldsFormat: encode numStoredFields more efficiently
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-4527
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4527
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4527.patch
>
>
> Another interesting idea from Robert: many applications have a schema and all documents are likely to have the same number of stored fields. We could save space by using packed ints and the same kind of optimization as {{ForUtil}} (requiring only one VInt if all values are equal).
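
The arithmetic in the three cases of the comment can be checked with a short sketch. This is not code from the patch: the helper names (`vint_len`, `docs_per_chunk`, `savings_bits`, `savings_bytes`) are hypothetical, the 16K chunk size is taken from case 1, and the sketch assumes that a chunk holds ceil(chunkSize / docSize) documents. It also assumes one reading of the 3-byte figure in case 2: a 1-byte VInt for a minimum numStoredFields below 128 plus a 2-byte VInt for a minimum document length below 2**14.

```python
import math

CHUNK_SIZE = 1 << 14  # 16K, the chunk size the comment assumes


def vint_len(value: int) -> int:
    """Bytes a variable-length int needs at 7 payload bits per byte."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def docs_per_chunk(doc_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Assumption: a chunk holds ceil(chunk_size / doc_size) documents."""
    return max(1, math.ceil(chunk_size / doc_size))


def savings_bits(chunk_docs: int, bits_saved_per_doc: int = 2,
                 overhead_bytes: int = 3) -> int:
    """Case 3 formula from the comment: 2 * chunkDocs - 3 * 8 bits."""
    return bits_saved_per_doc * chunk_docs - overhead_bytes * 8


# Assumed reading of the 3-byte overhead: min numStoredFields < 128 fits in
# one VInt byte, min doc length < 2**14 fits in two VInt bytes.
MIN_OVERHEAD_BYTES = vint_len(127) + vint_len((1 << 14) - 1)


def savings_bytes(doc_size: int) -> float:
    """Net bytes saved per chunk for documents of doc_size bytes each."""
    bits = savings_bits(docs_per_chunk(doc_size),
                        overhead_bytes=MIN_OVERHEAD_BYTES)
    return bits / 8
```

Under these assumptions, `savings_bytes(8192)`, `savings_bytes(1024)` and `savings_bytes(100)` give -2.5, 1.0 and 38.0 bytes, which matches the figures quoted in case 3.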