Subject: Re: strange problem of PForDelta decoder
From: Michael McCandless
To: dev@lucene.apache.org
Date: Mon, 20 Dec 2010 10:33:11 -0500

On Mon, Dec 20, 2010 at 5:49 AM, Li Li wrote:
> I think random test is not sufficient.
> for normal situation, some branches are not executed.
> I tested
> http://code.google.com/p/integer-array-compress-kit/ with many random
> int arrays and it works. But when I use it in real indexing, when in
> optimize stage, it corrupted.
> Because PForDelta will choose best numFrameBits and some bit such as
> 31 is hardly generated by random arrays. So I "force" the encoder to
> choose all possible numFrameBits to test all the decode1 ...decode32
> and find some bugs in it.

Good point -- we need to make sure we cover all numFrameBits. A series
of 128 random ints in a row will heavily bias toward the high numBits
cases.

Maybe we could do a better job with the random source, trying to
target all numBits with varying numbers of exceptions, etc. I'll put
a nocommit for this.

> what's pfor2? using s9/s16 to encode exception and offset?

Yeah, I just committed pfor2 this morning on the bulk branch. You can
check it out from
https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings

pfor2 came from the patch attached to
https://issues.apache.org/jira/browse/LUCENE-1410 by Hao Yan
(thanks!). It uses s16 for the exceptions (though there's a bug
somewhere, because it fails the random test), and it takes a
different approach to encoding exceptions.

> In http://code.google.com/p/integer-array-compress-kit/ it's s9
> for NewPForDelta also have many bugs and also need test each branch to
> ensure it works well.

OK, we should still have a look at that one.

We need to converge on a good default codec for 4.0. Fortunately it's
trivial to take any int block encoder (fixed or variable block) and
make a Lucene codec out of it!

Mike
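
The idea of targeting every numFrameBits value with a varying number of
exceptions could be sketched roughly as below. This is only an
illustration: FrameBitsTestData, blockForBits, randomValue and
bitsRequired are hypothetical names, not part of Lucene or of
integer-array-compress-kit, and the sketch assumes non-negative values
and a block size of 128. It only produces data whose natural width is
the requested one; whether a given encoder then actually picks that
frame width depends on its own cost model, so a real test may still
need a hook to force the width.

```java
import java.util.Random;

// Hypothetical test-data helper; not an existing Lucene class.
final class FrameBitsTestData {
    static final int BLOCK_SIZE = 128; // assumed block size

    /** Bits needed to store a non-negative value (at least 1). */
    static int bitsRequired(int value) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(value));
    }

    /** Random non-negative value that fits in numBits bits (1..31). */
    static int randomValue(Random random, int numBits) {
        // Keep only the top numBits bits of a random 32-bit int.
        return random.nextInt() >>> (32 - numBits);
    }

    /**
     * Builds a block whose regular values fit in numBits bits, with at
     * most numExceptions values that need more bits than that.
     */
    static int[] blockForBits(Random random, int numBits, int numExceptions) {
        if (numBits < 1 || numBits > 31) {
            throw new IllegalArgumentException("numBits must be in 1..31");
        }
        if (numExceptions < 0 || numExceptions >= BLOCK_SIZE) {
            throw new IllegalArgumentException("bad numExceptions");
        }
        int[] block = new int[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            block[i] = randomValue(random, numBits);
        }
        // Make sure at least one regular value uses the full width.
        block[0] |= 1 << (numBits - 1);
        // No wider non-negative value exists when numBits is 31.
        if (numBits < 31) {
            for (int e = 0; e < numExceptions; e++) {
                // Positions may repeat, so fewer exceptions can result.
                int pos = 1 + random.nextInt(BLOCK_SIZE - 1);
                int excBits = numBits + 1 + random.nextInt(31 - numBits);
                block[pos] = randomValue(random, excBits) | (1 << (excBits - 1));
            }
        }
        return block;
    }
}
```

Looping numBits over 1..31 and numExceptions over a few values gives
each width a chance to be exercised. By contrast, a block filled with
plain random.nextInt() values almost always contains a value that needs
all 32 bits, which is the bias described above.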